ABSTRACT Title of dissertation: STATISTICAL KNOWLEDGE AND LEARNING IN PHONOLOGY Ewan Michael Dunbar, Doctor of Philosophy, 2013 Dissertation directed by: Professor William Idsardi Department of Linguistics This dissertation deals with the theory of the phonetic component of grammar in a formal probabilistic inference framework: (1) it has been recognized since the beginning of generative phonology that some language-specific phonetic implementation is actually context-dependent, and thus it can be said that there are gradient ?phonetic processes? in grammar in addition to categorical ?phonological processes.? However, no explicit theory has been developed to characterize these processes. Meanwhile, (2) it is understood that language acquisition and perception are both really informed guesswork: the result of both types of inference can be reasonably thought to be a less-than-perfect committment, with multiple candidate grammars or parses considered and each associated with some degree of credence. Previous research has used probability theory to formalize these inferences in implemented computational models, especially in phonetics and phonology. In this role, computational models serve to demonstrate the existence of working learning/per- ception/parsing systems assuming a faithful implementation of one particular theory of human language, and are not intended to adjudicate whether that theory is correct. The current dissertation (1) develops a theory of the phonetic component of grammar and how it relates to the greater phonological system and (2) uses a formal Bayesian treatment of learning to evaluate this theory of the phonological architecture and for making predic- tions about how the resulting grammars will be organized. The coarse description of the consequence for linguistic theory is that the processes we think of as ?allophonic? are actually language-specific, gradient phonetic processes, assigned to the phonetic compo- nent of grammar; strict allophones have no representation in the output of the categorical phonological grammar. STATISTICAL KNOWLEDGE AND LEARNING IN PHONOLOGY by Ewan Michael Dunbar Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2013 Advisory Committee: Professor William Idsardi, Chair, Advisor Professor Hal Daum? III Professor Naomi Feldman, Co-Advisor Professor Norbert Hornstein Professor Jeff Lidz Professor Rochelle Newman, Dean?s Representative ? Copyright by Ewan Michael Dunbar 2013 Dedication For Chanelle. ii Acknowledgments Tell me that it?s a wonder, so that I may sleep when all I see in the night is this place. ?Ambassador Delenn, ?A Voice in the Wilderness, Part 2? The last five years of my life have been important. I wish I had the space to write the human story of this dissertation. If I do write that story some day, I hope it will carry a lesson I could have benefitted from knowing earlier: the ?exercise? model of intellectual work and development is dead wrong. Those who tell you to ?push yourself harder? to ?get the job done? are sending you the wrong message, unless they are simply trying to make you quit. If this message works, it works largely by accident. This is not to say that you should not find ways to get the job done effectively. But over the last five years, I have flourished whenever I have removed barriers to the contented flow of work, not pushed against them, and the most effective way of doing that has been to remember that I am doing this because it is a part of my life. That story ends with a line something like, ?You are more than your work, and your work will never be more than you?do not let unhappy and confused people tell you otherwise.? I may not be the person to write it. Still, some of the people I owe the greatest debt to are the ones who have helped me to understand this. Intellectually, this work began to coagulate back in 2008 around an insight of Brian Dillon?s, to whose creativity and vision and ability to intuitively grasp technical problems I owe a great debt. My advisor, Bill Idsardi, was key in taking an insight, turning it into a project, and then guiding the project into coherent ideas. But ?coherent ideas? is understating it with Bill. Having a clear vision of the whole field?theory from soup to nuts, integrated with a real theory of inference?is more like it, and is rare enough. The persistence and the audacity with which you reminded me that it can actually be executed is of a higher order still. This has been one of the major forces motivating me over the last five years. And, although the official style sheet indicates that committee member names after the chair must be listed in alphabetical order, Naomi Feldman has been co-advisor since she arrived, to the extent that my unpredictable ventures from the cave to receive advice have supported it. Thanks for always seeing my quest for accuracy and raising me common sense?and for your reminders to stop working. Jeff Lidz and Hal Daum? have, in addition to serving as committee members, always reminded me that if you cannot explain it simply, you have not understood it properly. And, as is often the case, I would like to thank Norbert Hornstein for general inspiration?but, in this case, also for volunteering as an eleventh hour replacement committee member. Thanks also to Amy Weinberg and Rochelle Newman for their roles on the committee. The broader Linguistics Department/CNL Lab group are irreplaceable, and to all those who built the spark and spirit of this place?and it is worth mentioning that it is clear even without turning back time that Colin Phillips has always been instrumental in driving this institutional vision?no thanks will ever be enough. It has a life of its own now. Alexis Wellwood, in addition to being a wonderful friend, stands out as someone who was evidently drawn here not only because she saw the spark, but also because she is a vessel through which this spirit is eager to flow. You have taught me a lot, and you would be surprised at how much of this thesis is made out of the material we have started iii to construct together. To our year?Shevaun Lewis, Brad Larson, Dave Kush, Wing-Yee Chow, Terje Lohndal, Micha?l Gagnon, Chris LaTerza?listing the individual contribu- tions to the intellectual and interpersonal tenor of the department and to the ideas and work in here?I think this burden goes mainly in the human story. However, suffice it to say, I am better off for having met you all, not only because you are all very smart. Mr Shepard would of course have to learn to draw both Dave and Brad extra well in the human story. Dave, thank you for co-pontificating on matters of all kinds. Within the department, Howard Lasnik also deserves one special, general thanks: your attention to detail is more than a beneficial skill?it is a method and a vision and a way of thinking that has profoundly influenced my work and my teaching. Thanks also to Josh Falk, who is, after a semester of helping me gut it of a lot of cruft and dead ends, a co-author of the Java code used in Chapter 3. Thanks to Jesse Shawl, whose Honors thesis work did not make it to being mentioned in the main text, but was a set of corpus and modelling results that formed the testbed for other versions of the gender experiments. Special thanks to Jorge Tartamudeo, the Basement Lab manager. Innumerable ?outside? people have made useful comments and suggestions relating to this work over the years, ranging from one-offs to extended discussions. I did not take down all their names. Some of the people whose comments, however trivial-seeming some of them might have been to them at the time, were useful, are: Jeff Heinz; Elan Dresher; John Kingston; Andrew Nevins; Jordan Boyd-Graber; John McCarthy; Joe Pa- ter; Adam Albright; Kathleen Currie-Hall; Jason Eisner; and Alan Yu. Special thanks, as always, go to Alana Johns, Derek Denis, and Mark Pollard, for providing Inukitut data. Thanks to Anja Lindelof at the Greenlandic Broadcasting Corporation for providing Kalaallisut recordings and transcripts. At the risk of drawing out the articulated web of influence too far, thanks also goes to all those who gave me a good undergraduate and Master?s education at the University of Toronto. In this particular context, the reader is invited to search for the influence of the late Ed Burstynsky, who was the undergraduate advisor when I arrived. He was, I take it, a large part of what built the collegial atmosphere in the Linguistics department that all of us took for granted. Undergrads and grads were all colleagues and friends, and this was a particularly nice oasis to have given the social desert that U of T can be otherwise. His direct influence here is limited to one of his famous lines from LIN100?one that virtually everyone in the department found some reason to repeat?on the topic of phonetic grounding: ?We?ve been writing it =p=, but it could have been ?maple leaf.?? A tremendous thanks goes to Farah Foss?, who helped the Crittenden house fight an illegal eviction attempt and thereby played a key role in saving me from being homeless during part of the writing of this thesis. You and the offending landlord could also go in the special soul-savers section of these acknowledgments, (see below), but you did not mean to touch me. Nevertheless, you did: Craig by putting up walls of self-serving falsehoods, and Farah by calmly and patiently providing us with the truth. Together, you showed me the power of truth, reason, and independent thought: facts matter. I now have no interest in staying quiet when people are hiding behind being ?partly right,? (the most frequent case, unfortunately), and a much clearer head about speaking up. Thanks also to Molly McCullagh and Dave Kush for offering to help. My close personal relationships with many people have been instrumental in my iv work and general survival over the past five years. Some of them have been mentioned already, but many, many others remain. A list, in no particular order, of people who have at least made some small home in my heart within the last five years, if not set up shop there, or come home briefly to resume a long-standing tenancy: Alex Essoe, Dave Kush, Nathan Rolleman, Christina Bjorndahl, Ailis Cournane, Smiljka Tasi?, Fernanda Queiros, Sophie Maudslay, Alex ?Ace? Carruthers, Emily Oppenheimer, Graham Van Pelt, Nika Mistruzzi, Suzanne Freynik, Amanda Brazeau, Corey Simpson, Kate Borowec, Jack Dylan, and Annie Gagliardi. There are others, of course. My immediate family, Joan Smeaton, Earl Dunbar, and Emily Dunbar, and my extended family, especially my Grandma, Blanche, and late Grandpa, Bill Dunbar, obviously drove who I have become, but also let themselves come to be known to me in new and important ways in the last five years, ways which have shaped me substantially. Special thanks to Callie Wright for keeping me sane during the crunch and to Brock Rough and Darryl McAdams for helping me get my license (all remaining errors in driving are mine). Finally, to those of you who have done things particular and deliberate which changed me profoundly from the inside during the last five years?at my behest or not?sorry to embarrass, but the deep contributions you made to this dissertation are the underlying con- tent, the rest is details. Normally these special influences and relationships are entered in the acknowledgments with a lot of words, but I can?t do that without turning this into the human story. I separate out the following list of people from those above at great risk in case it looks like the words and hearts of the people I just mentioned have never punched me in the gut. Of course they have, some many times, some for better, some for worse (but ultimately always for better). But, Romy Lassotta, Chiara Frigeni, Laura Broadbent, and Ann Collins, and Sol Lago: I would have fallen apart without you. Well, okay?it might be fairer to say that I did fall apart without you, and you are the reasons Humpty Dumpty is still on the wall and not a thick powdery paste of shell and yolk. Sol had the patience and respect to watch this process with equanimity many, many times, and is perhaps the most sensible person I know. Finally, to Lucile Marteel: I am still waiting on that report. v Contents List of Tables viii List of Figures ix 1 Introduction 1 1.1 Bayesian inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2 Phonology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2 Simplicity 24 2.1 The poverty of the stimulus: what is to be done? . . . . . . . . . . . . . . 24 2.2 Bayesian models of cognition: why should we care? . . . . . . . . . . . . 33 2.3 The syntactic acquisition model of Perfors et al. . . . . . . . . . . . . . . 42 2.4 Evaluation measures: restrictiveness and simplicity . . . . . . . . . . . . 53 2.5 Bayesian Occam?s Razor . . . . . . . . . . . . . . . . . . . . . . . . . . 63 2.5.1 Maximum likelihood and restrictiveness . . . . . . . . . . . . . . 63 2.5.2 Model evaluation in statistics . . . . . . . . . . . . . . . . . . . . 67 2.5.3 Bayesian inference and model evaluation . . . . . . . . . . . . . 74 2.5.4 Conditions for a Bayesian Occam?s Razor . . . . . . . . . . . . . 84 2.6 The Optimal Measure Principle . . . . . . . . . . . . . . . . . . . . . . . 88 2.6.1 Formalizing grammars preliminary: Transparency and structure . 89 2.6.2 Notation and the structure of a grammar . . . . . . . . . . . . . . 96 2.6.3 Relating grammars to priors in an optimal way . . . . . . . . . . 109 2.6.4 Example: deriving a symbol-counting evaluation measure . . . . 111 2.7 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 2.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 3 Modelling allophone learning 118 3.1 Categories and transformations . . . . . . . . . . . . . . . . . . . . . . . 118 3.1.1 Empirical review . . . . . . . . . . . . . . . . . . . . . . . . . . 118 3.1.2 Computational and mathematical models . . . . . . . . . . . . . 124 3.1.3 Phonetic transform hypothesis . . . . . . . . . . . . . . . . . . . 133 3.2 A computational model: Dillon, Dunbar and Idsardi (2013) . . . . . . . . 144 3.2.1 Mixture of linear models . . . . . . . . . . . . . . . . . . . . . . 144 3.2.2 Summary of Inuktitut experiments . . . . . . . . . . . . . . . . . 151 3.3 Selecting transform environments . . . . . . . . . . . . . . . . . . . . . 158 3.3.1 Mixture of linear models with variable selection . . . . . . . . . 158 vi 3.3.2 Experiment: Inuktitut revisited . . . . . . . . . . . . . . . . . . . 162 3.3.2.1 List of sub-experiments . . . . . . . . . . . . . . . . . 163 3.3.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . 164 3.3.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . 168 3.3.3 Experiment: sex and gender differences . . . . . . . . . . . . . . 171 3.4 Proposed model: learning with features . . . . . . . . . . . . . . . . . . 175 3.4.1 Background: features, geometries, and the contrastive hierarchy . 175 3.4.2 Background: Bayesian category models with features . . . . . . . 187 3.4.3 Feature-based phonetic category models: goals for future research 194 3.5 Further issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202 3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 4 Phonetic transforms I: The cognitive architecture 211 4.1 The phonetic surface . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211 4.1.1 Background: Surface representations . . . . . . . . . . . . . . . 216 4.1.2 Status of surface representations under a phonetic transform hy- pothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224 4.1.3 Problematic and unproblematic appeals to surface representations in phonological theory . . . . . . . . . . . . . . . . . . . . . . . 231 4.1.3.1 No problems with AC-representations . . . . . . . . . 232 4.1.3.2 Historical changes . . . . . . . . . . . . . . . . . . . . 238 4.1.3.3 Opaque allophony . . . . . . . . . . . . . . . . . . . . 251 4.2 The Lateness of Allophony . . . . . . . . . . . . . . . . . . . . . . . . . 256 4.2.1 Background: Structure-preservation and the cycle . . . . . . . . . 256 4.2.2 Structure-preservation and phonetic transforms . . . . . . . . . . 263 4.2.3 Issues with phonetic EOD blocks in HOCD theories . . . . . . . 268 4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275 5 Phonetic transforms II: Linguistic phenomena 276 5.1 Incomplete neutralization . . . . . . . . . . . . . . . . . . . . . . . . . . 277 5.1.1 Empirical predictions . . . . . . . . . . . . . . . . . . . . . . . . 281 5.1.1.1 One category, one process . . . . . . . . . . . . . . . . 281 5.1.1.2 Two categories, one process . . . . . . . . . . . . . . . 286 5.1.1.3 Two categories, one categorical process . . . . . . . . . 289 5.1.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292 5.2 Phonetic process interactions . . . . . . . . . . . . . . . . . . . . . . . . 293 5.2.1 Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293 5.2.2 Possible counterexamples . . . . . . . . . . . . . . . . . . . . . 297 5.3 Statistics in linguistics . . . . . . . . . . . . . . . . . . . . . . . . . . . 311 5.4 Main findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315 vii List of Tables 2.1 Made-up heights of ten adult Dutch females in centimetres. . . . . . . . . 68 3.1 Summary of model evaluations from Dillon, Dunbar & Idsardi 2013 . . . 156 3.2 Complementarity of context distributions of categories estimated by mix- ture of Gaussians Inuktitut models in Dillon, Dunbar & Idsardi 2013 . . . 157 3.3 Quantitative summary of Experiments 4?11 . . . . . . . . . . . . . . . . 165 3.4 Difference between the Inuktitut vowel mean conditional on a particular following consonant place and the mean elsewhere, for the four different places of articulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 3.5 Quantitative summary of Experiments 12?13 . . . . . . . . . . . . . . . 172 4.1 The consonant and vowel inventory of Russian . . . . . . . . . . . . . . 238 viii List of Figures 2.1 Gaussians of differing variance illustrating restrictiveness in the use of the likelihood-based inference . . . . . . . . . . . . . . . . . . . . . . . . . 63 2.2 Derivations for the string aaabbb following a context-free grammar and a corresponding categorial grammar . . . . . . . . . . . . . . . . . . . . 98 2.3 Diagram of subpart relations between three CFGs. . . . . . . . . . . . . . 100 2.4 Diagram of subpart relations between four CFGs. . . . . . . . . . . . . . 101 3.1 Mixture of two-dimensional Gaussian distributions . . . . . . . . . . . . 127 3.2 Illustrations of conventional mixture of Gaussians models and mixture of linear models as phonetic category systems . . . . . . . . . . . . . . . . 147 3.3 Second and first formant values for Inuktitut vowel tokens . . . . . . . . 152 3.4 Example fitted models from Experiments 1 and 3 of Dillon, Dunbar & Idsardi 2013 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 3.5 First and second formant values for English corner vowels . . . . . . . . 172 3.6 Example of mixture of linear models fit to English corner vowel data (Ex- periment 13) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 3.7 Three different ways that the clusters in a three-vowel system might be decomposed if the likelihood function in a latent feature model simply adds Gaussian means . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192 3.8 One way that the clusters in a three-vowel system might be decomposed if the likelihood function in a latent feature model with feature values +1 and 1 adds Gaussian means . . . . . . . . . . . . . . . . . . . . . . . . 200 3.9 Second and first formant values for Inuktitut vowel tokens . . . . . . . . 202 3.10 Second and first formant values for Kalaallisut vowels. . . . . . . . . . . 205 4.1 Diagram of current versus conventional phonological architectures . . . . 225 5.1 Duration of voicing and aspiration measurements for German stops in in- tervocalic, final, and word-initial position (data taken from Jessen 1998 and Port & O?Dell 1986) . . . . . . . . . . . . . . . . . . . . . . . . . . 278 ix Chapter 1: Introduction Mind swept clean like arctic sand. ?Bruce Cockburn, ?Nanzen Ji? The purpose of this dissertation is to entrench statistical theory in linguistic theory. I use phonology, and, in particular, a part of linguistic cognition called the ?phonetics? phonology interface,? as a model system to illustrate how statistical theory should be used in linguistics. Statistical theory is about inference, and so is linguistic theory. In Chapter 2, I spell out one particular connection between statistical inference and grammatical inference in general terms, not tied to any particular learning problem. The Bayesian statistical ap- proach to inference corresponds to an ?evaluation measure? approach to the language ac- quisition problem (Chomsky 1965). Given some data and a statement of the set of possible grammars, or more generally the set of possible learnable states, the approach is to state the learner?s relative preferences over those grammars, but to make no commitment to the process by which the change from initial to adult state takes place. The analysis in terms of relative preferences for final states lends itself to general proofs, and I lay out a general version of the Bayesian Occam?s Razor principle, whereby the axioms of Bayesian in- ference lead inevitably to a preference for grammars/final states which are simpler under weak conditions of hierarchical structure in the set of grammars. This not only derives 1 the explicit simplicity bias which has been proposed by linguists in evaluation measures for variable-length grammars, but also the bias for analyses that unify lexical items that I argue linguists almost universally and crucially attribute implicitly to the learner. Further- more, since Bayesian inference can be derived from general axioms of rational inference, I put forward that the predicted biases can and should be used as a tool for empirically evaluating grammatical theories, a tool which is as well or poorly supported as the con- jecture that Bayesian statistical theory is sufficient to explain learners? relative preferences for final states. The rest of the dissertation is an extended example of how to use statistical theory as a theory of learners? preferences and to make it do work in elaborating and reasoning about the empirical consequences of a particular theoretical linguistic proposal. In Chapter 3, I reintroduce the hypothesis that allophony is the result of context-dependent phonetic processes, entertained briefly in the 1980s, but never fully explored. Chapter 3 sets up a theoretical framework for phonetic grammar (but leaves the details of precisely articulat- ing the notion of ?allophony? on which the hypothesis turns for Chapter 5). Chapter 3 then presents implemented statistical models for doing inference over phonetic category and process systems, and provides a piece of positive evidence suggesting that the phonetic process treatment of allophonic patterns is better than the usual understanding. The piece of evidence is that the implemented learner does better at finding systems of allophones and from real phonetic data than a minimally different statistical model where allophonic patterns are stated over discrete categories, as in the usual conception. The basic model is re-presented from Dillon, Dunbar, and Idsardi, 2013, and I then present several variants. I present these variants to demonstrate the practice of exploring theoretical proposals by 2 embedding them in formal statistical inference. Then I turn to face the consequences of the phonetic theory of allophony. Chap- ter 4 continues to bracket the details of what exactly constitutes allophony and how ex- actly it is connected to phonetic processes, but makes explicit the principal architectural consequence if it is true, which may be summed up under the oversimplified slogan, ?There is no surface representation.? This is oversimplified because the architecture is still split into two components?a categorical phonological component, and a later pho- netic component?and so there is still a ?surface? representation, in the sense of ?output of the phonological component??but it is quite far from the ?surface.? I distinguish this ?abstract category (AC) representation? of the surface from the ?surface category (SC) representation? that is usually assumed, and I show what kinds of analyses that make cru- cial reference to the phonological output are still acceptable, and what kinds of analyses need to be reformulated. One such case, a historical change from East Slavic, (the post- velar fronting), is somewhat complex. Previous analyses not only make crucial reference to SC-representations, but also posit a kind of ?instantaneous change? in the language, at- tributed to learners. Any such radical change from a grammar faithful to the input to one generating an inconsistent pattern is easily shown to be a marked case once we consider the Bayesian formulation of the learning problem: learners quantitatively balance faith- fulness to the input (fit, likelihood) and markedness of final states (bias, prior probability). I sketch a new analysis in terms of phonetic enhancement, with phonetic processes as a mechanism, which still attributes faithfulness to the learner. In the second part of Chapter 4, I show how the shape of the architecture, ?phonol- ogy feeds phonetics,? can be deduced from the particular representational commitments 3 of the architecture by deducing a kind of non-recoverability principle. Combined with the attribution of allophony to the phonetic component, this explains the well-known em- pirical generalization that allophonic rules are ?late.? I note, however, that level-ordered architectures, (which are usually the vocabulary under which the ?late allophony? gener- alization is stated) do not interact well with this assumption, under the interpretation that each ?level? is divided into an early phonological and a late phonetic component. I show how this could be maintained if non-allophonic information can be still be recovered from the output of the phonetic component. Finally, in Chapter 5, I turn to focus principally on two empirical patterns, one of which (the existence of incomplete neutralization) is known but was sometimes previously thought to be problematic, and the other of which (simultaneous application of allophonic processes) is a new prediction. I use incomplete neutralization to articulate the association between allophone-like patterning and phonetic processes. The link is not a hard one, but is by way of the learner?s preferences: the absence of a phonetic process implies over- lapping distribution, this interacts with the learner?s (simplicity) biases to give detailed predictions about the kinds of corpora that should lead the learner to phonetic transforms, rather than positing separate categories. The use of a formal theory of inference allows ?structure-preservation? to be meaningfully reinterpreted with respect to restrictions on outputs (AC-representation) rather than inputs. In the second part of Chapter 5, I explore the prediction that sets of allophonic pro- cesses should show the simultaneous application pattern. The canonical case is the Cana- dian Raising pattern (counterbleeding; counterfeeding is equally well predicted). This is not problematic the way it is in Optimality Theory, where the relevant grammatical 4 constraints can only be stated with respect to their outputs. Here, the environments for allophony can and must be stated with respect to the inputs. The theory also deviates from serial approaches since the 1960s, which have uniformly rejected simultaneous ap- plication. I argue that the restriction of simultaneous application to allophony solves the undergeneration problem pointed out by Chomsky and Halle, (and predicts that the em- pirical gap Chomsky and Halle point to in the predictions of simultaneous application is not systematic but accidental). I then explore two closely related problematic cases for this prediction: Polish and Dutch voicing assimilation, which appear to interact with fi- nal devoicing in a way inconsistent with simultaneous application. I propose that these involve a late feature deletion operation distinct from the phonetic transforms responsible for the other type of phonetic transforms discussed. This remains outside of the scope of the detailed theory, because the theory does not make use of phonetic features in category representations, but I set the bounds for how it is to interact with phonetic transforms. I leave a worked out theory for future research. I end by summarizing this and other in- teresting future research projects raised in the course of the dissertation, and I discuss the proper role of statistics in linguistic theory. For the sake of the linguist reader, I will use the rest of this section to outline the basics of Bayesian statistical inference. For the sake of the computational or mathematical reader, I will use the rest of this section to give the basic empirical outline of the cognitive problem that this dissertation is about (learning phonological systems), and to fill in some basic theory. 5 1.1 Bayesian inference Although statistics is often understood to be intimately related with quantitative ob- servations, statistical inference is actually more general than that. Under the ?subjectivist? view of probability, what is crucially quantitative is an internal, subdoxastic state of the observing agent we call a ?degree of belief? or ?degree of (rational) credence.? This may be surprising, as many linguists look upon the pushers of statistics with deep suspicion and hostility, and perceive them to be substituting numbers and data for common sense. However, statistics can be pursued as a branch of cognitive science, in which case there is no conflict between statistical inference and standard linguistic theory. Statistics refers to a family of techniques that make use of probability theory to do inference and make predictions about how things work, based on observations. Statistical inference requires that those observations be delimited very broadly into ?events,? but it does not require that the observations be quantitative, or that they ever be directly tabu- lated as ?relative frequencies.? Furthermore, one particular type of statistical inference, Bayesian inference, starts from the assumption that probabilities are in fact in no way grounded in relative frequencies in any finite or hypothetical infinite ?population.? The probability of something is, rather, a ?degree of belief? attributed to a hypothetical ratio- nal agent making predictions about future events, (which will happen to be well-matched by whatever relative frequencies the events turn out to have, under certain general condi- tions), or adjusting its confidence in particular hypothesized explanations for those events. Applied Bayesian statistics is used for scientific inference, and keeps the rational agent strictly hypothetical?some idealized agent who can tell us what to make of the data in 6 front of us. Rational models of cognition, on the other hand, make the assumption of a rational agent an empirical conjecture about how the human mind works. The result of taking this approach?whether it is in speech perception, vision, or learning in some particular domain?is inevitably a kind of minimalist/reductionist argument: ?Use the principles of Bayesian inference, and you will find that the behavior of humans on Cog- nitive Problem X follows from the assumption of rationality, and does not require the ad hoc mechanisms previously proposed.? What does it mean to be rational? Jaynes 2003, roughly following Cox 1946, lays out the following desiderata for a generalization of Aristotelian logic to the case of degrees of belief: (1) Degrees of belief are respresented by real numbers. (2) If some new information C arises and makes a proposition A more plausible, but knowing A^C has no effect on the plausibility of B, then A^B must never go down in degree of belief. (3) If some new information C arises and makes a proposition A more plausible, then it must also make :A less plausible. (4) If a conclusion can be reasoned out in more than one way, then every possible way must lead to the same result. From this starting point, Jaynes shows that the following basic axioms hold, where C 7 stands for ?credence?: C [A^BjC] =C [AjB;C]C [BjC](5) =C [BjA;C]C [AjC] C [A_BjC] =C [AjC]+C [BjC] C [A^BjC](6) N i=1 C [Ai] = 1; for mutually exclusive and exhaustive Ai(7) C [A] >0; for all A(8) These in fact constitute one set of axioms for the probability calculus; the alternates all give rise to the same system (although the third is usually extended to countable sets). From now on, I will replaceC with Pr, for ?probability.? The critical consequence of this is Bayes? Rule: Pr [AjB;C] = Pr [BjA;C]Pr [AjC] Pr [BjC] (9) Or, put in terms that are meaningful for learning, perception, and other kinds of inference: Pr [HypothesisjData] = Pr [DatajHypothesis]Pr [Hypothesis] Pr [Data] (10) a.k.a. Posterior probability= Likelihood Prior probability H Pr [Datajh]Pr [h] dh The power of Bayes? Rule may not be immediately apparent. The posterior proba- 8 bility of a particular hypothesis about the correct grammar or model or face or whatever is to be inferred from the input is the degree of belief in the proposition that it was that particular one (the one that would be above labelled A). This is the goal of the computa- tion. The prior probability is the organism?s inherent bias for one solution over another, stated as probabilities (degree of ?prior belief?). The likelihood is a gradient assessment of the degree to which the data is predicted or surprising given the particular hypothesis selected. The denominator is just a normalizing constant that forces the probabilities to sum up to one. (Sometimes we will leave it off and just write , ?proportional to,? when it does not matter.) The power is that organisms surely have some inherent relative prefer- ences, (even if these are totally contentless), and are assessing how consistent or surprising some information is all the time. The general principles of reasoning set out by Jaynes as ?desiderata? can now be shown to have as a consequence a definite rule about the rational way to update one?s beliefs in the face of new information, just on the basis of these two basic functions. The only demand we place on these two functions is that they follow the laws of probability, which is not hard to achieve (though not guaranteed). The Bayesian decision theory that comes out of this goes like this: pick a loss func- tion, L. This function tells me, if I know the ?correct? answer, how ?bad? my solution is. This is different from the prior, which just states my inherent preferences for one solution over another, irrespective of the problem being solved. Then, find a method for using any given data set to construct a solution?call this method q (x)?that minimizes the 9 expected loss, given the available information:1 (11) Q X L [q ; q (x)] f (q jx) dx d q One such decision rule is the one that simply says ?choose the option that has the highest posterior probability.? This option happens to minimize the expected loss if the loss is just wrong = 1, right = 0 (the 0-1 loss). Other loss functions license other strategies (for example, ?when the solution is complex, consisting of multiple components, pick the locally best component for each, not the global best?; or ?pick the expected value (the average solution) under the posterior?). I will ignore the choice of loss function and continue to just assume the MAP strategy, only because anything else would leave the loss or decision rule as a free parameter?this is what Bayesian cognitive scientists usually do, albeit with no particular justification. In fact, a ?full Bayesian? would not in fact ?decide on a grammar? at all (in this dissertation, the most frequent case will be ?decide on a phonetic category system?). Rather, the rational thing to do, taking into account all possible information, would be, on any given utterance, to infer what was said by averaging the recovered semantic representation over all possible grammars, (weighted, of course, by the posterior distribution given all the utterance previously heard) and indeed averaging over every possible analysis of the utterance within each particular grammar to obtain a single ?recovered message??and weighting each analysis by how likely each 1The word ?expected? just means ?averaged using some probabilities as weights, rather than by taking the usual arithmetic mean?; when working with continuous data or hypotheses we use integrals instead of sums and a function f called the probability density instead of probabilities?this will be touched on briefly in Chapter 2, but can be largely ignored?the point is that we are minimizing the loss averaged over all possible hypotheses and observations. 10 of the logically possible grammars predicts it to be. In fact, why pick a single ?intended message? at all? The ?fully Bayesian? result of perception is simply a distribution over possible phonological, syntactic, and semantic representations, as one need only ?take an action? when forced to (for example, when one has to generate an utterance). There need not be any competition, however, about what is the ?most Bayesian? way of doing things: some part of the brain state may be (isomorphic to) a distribution over states that gets updated following Bayes? Rule; while some other part of the brain state may not be subject to uncertainty, but may still be updated in a way that is sensitive to a posterior distribution (say, if the ?grammar in use? is the best grammar according to the posterior over ?candidate grammars?). The question is empirical. For the purposes of this dissertation I hew to the view that the learner does indeed select a single grammar, and that it is the MAP grammar. This is an assumption of convenience. Bayesian inference provides only a computational-level specification of the problem of learning (or perception, or selecting actions, etc), in the sense of Marr 1982. Bayes? Rule says what should be computed as the best solution to a given problem?but not anything about how. It stands to reason that this is all we could conclude: Bayes? Rule simply states an equality as a static fact. Criticisms have arisen of the extreme computational lengths one needs to go to to do even the most coarse and approximate inference using certain types of interesting Bayesian models (Stevens 2011, Berwick, Pietroski, Yankama & Chomsky 2011). These criticisms are misleading, because they place a demand on the Bayesian models that they were never intended to fulfil. The reductionist reasoning that motivates the use of Bayesian models is that human behavior/learning/inference is deducible from general principles of optimization?that is, just from the statement of the 11 problem (including the way that the learner represents its input from the outside world, the learner?s ?intake?: Gagliardi 2012) plus Bayes? Rule is sufficient to determine what learners will do (modulo the questions just raised). Now, it is true that, if it is obvious that the full optimization required is intractable, then we risk raising more questions than we answer. Nevertheless, to carry out Marr?s goal of explaining cognition, one must investigate both of the computational and the algorith- mic levels, and then go on to connect them. The top-down approach asks whether human behavior in, say, learning falls out as an ?optimal solution? to the problem of balancing a ?degree of faithfulness to the data? with a ?lack of markedness of the solution?; the bottom-up approach specifies an algorithm and asks whether it matches human behavior. However, such an algorithm will generally be consistent with some implied assessment of fit and relative biases. To the extent that, for example, perception can independently seen as implying the same fit-to-data function, and to the extent that the biases in learning reflect preferences that can be seen in non-learning tasks, we also need to explain why it is that the learning algorithm appears to optimally balance these two independently mo- tivated forces; and, if it does not, why it does not. The most interesting cases are the ones where the optimal solution appears to be incorrect with respect to human behavior (Pearl, Goldwater & Steyvers 2010, Gagliardi, Bennett, Lidz & Feldman 2012, Stevens, Trueswell, Yang & Gleitman 2013, Gagliardi & Lidz In press)?but we cannot show this without knowing what the optimal solution would be. Accusations of unfalsifiability have been levelled at the Bayesian program because it occupies the computational level (most recently by Jones & Love 2011). The best re- sponse to this is essentially what we just said: since prior distributions and decision rules 12 in cognitive applications constitute empirical hypotheses, they should ideally make some independent predictions. However, another (top-down) approach to constraining the prob- lem would be to develop a theory of ?natural? or ?automatic? prior distributions following in some way from the structure of the problem; while it is well known that there is no universal criterion for picking the unique ?natural? prior distribution for a given problem, (van Fraassen 1989), independent of how the problem is parameterized, I will make a sug- gestion towards some weak constraints of this kind in grammatical inference in Chapter 2. This is enough about Bayesian inference for now. As the reader has already been encouraged to imagine, it has the potential to be a powerful tool for linguistic theorizing, because the goal of linguistic theory is to explain how it is that human beings come to knowledge of language from primary linguistic data. Although linguists are very good at formulating hypotheses about what adult states are possible and impossible, we do not very often fill in the crucial blanks about the relation between the data and the final state (Viau & Lidz 2011, Pearl & Lidz 2009). Chapters 2?5 are intended as an illustration of how to make good on the tantalizing promise that Bayesian inference can step in to fill in these blanks in a principled way. 1.2 Phonology Human language is a complicated system. However it is that human beings actually turn out to work, we as researchers at least generally decompose the system into several different mappings. One mapping, the semantics, can be broadly construed as exchanging 13 information from the whole host of cognitive systems that furnish us with our ?thoughts? (vision, event tracking, theory of mind, and perhaps a system of ?general? concepts) with structured ?meaning? representations, with help of one kind or another from pragmatics, the system by which one reasons about the intentions of the interlocutor. The semantic representation is often thought to be exchanged with a representation of more abstract ?dependencies? by the syntax, a mapping with its own principles, whose representations are then exchangeable for either an abstract perceptual representation of an utterance with that particular meaning, and/or instructions for using the motor systems to generate one. This latter mapping is called phonology. Phonological information is organized into minimal units called morphemes; a scaf- folding for these elements, or for slots for these elements, is provided by the syntactic structure, which is often thought to be a tree. The morphological structure, that structure which holds within a narrower and still incompletely understood domain called a ?word,? which organizes morphemes but which itself sits in the syntactic scaffolding, is some- times thought to follow different principles from the syntax (and sometimes not, in which case syntax is said to deal in morphosyntactic structures). Morphemes associate the in- formation in the syntactic representation and the corresponding semantic representation with arbitrary ?chunks of pronunciation.? Whatever the ?meaning? information is that is associated with the concept dog is associated with the essential information key to recog- nizing or producing the word dog. Some morphemes are also thought to be associated with parts of the syntactic structure that are ?strictly grammatical,? and which bear only some indirect relation to the semantic representation, such the English passive marker, which is often thought to work by playing some formal tricks forcing noun phrases into one 14 position rather than another (associated with a pair of morphemes we write be+en, as in John was eaten by a lion). The morphemes, at any rate, along with their morphosyntactic scaffolding, are where our story begins. The phonological computation is usually divided into two parts, the phonological grammar and the phonetics?phonology interface. Morphemes, on the morphosyntactic end, are finite sequences of elements called segments. The other two ends could be thought of as sequences, too: under a simple view of phonetics, the perceptual representation is just a sequence giving the perceptual system a map with which to recognize each segment as it comes in: d-o-g, and so on; and, under such a simple view, the production systems are also furnished with finite sequences of motor instructions. The job of the phonetics?phonology interface is to exchange these perceptual or production representations for the types of sequences found in the morphemes themselves. The phonological grammar, on the other hand, does some manipulations which actually modify the content of the representations. In English, for example, the word alternation seems to be alternate+ion, or some- thing like this, in terms of its morphological decomposition. However, the pronunciation of alternate is with a final [t], but the corresponding sound in alternation is []. This is a general pattern: deletion, elation, emendation, and so on; it also seems to be related to the pattern [d]/[] found in corrode/corrosion, erode/erosion, and so on. This pattern is called an alternation if we want to focus on the generalization that sometimes [] appears appar- ently in place of [t] ([] ?alternates with? [t]), and it is said by way of explanation that the phonological grammar executes a rule which changes the [t] to [] in some particular en- vironment, or that the phonological grammar at least gives rise to a process by which [t] changes to [] in that environment. 15 Similarly, in Hungarian, we find that ?our windows? is ablak+unk, [blkunk] and ?our glucoses? is gluk?z+unk, [lukozunk], but ?our cauldrons? is ?st+?nk, [ytynk], and ?our chauffeurs? is sof?r+?nk, [of?rynk]. The alternation is between [u] and the front rounded vowel [y]. In fact, it is a much more general alternation, whereby (simplified slightly) suffix vowels mutate into other vowels that ?correspond? in some sense to the last vowel in the stem. (They need to match that vowel in the front vowel/back vowel dimension, but everything else about them stays fixed: [u] corresponds to [y] in the back vowels because they are both high; but here we see the alternation being triggered by the non-high vowels [], [o], and [?].) This is a process called Vowel Harmony. In Spanish, we see [b] alternating with the fricative [?], as in v?mos, [bamos], cabra, [ka?ra], as well as [d] alternating with the fricative [?], as in d?melo, [damelo], cada, [ka?a]. A simple version of the generalization for this rule (Spirantization) is that stops change to fricatives after vowels. Finally, in English, the Deaspiration alternation is virtually always the first rule that is given to linguistics students: top, [thp], stop, [stp]. The puff of air (aspiration) which is ordinarily pronounced after [t], [p], and [k] seems to disappear when [s] appears immediately following in the same syllable. At least some of these alternations are morphologically active?they actually show evidence, via their morphological composition, that one particular segment changed into another one. These are cases where we can also do wug tests (Berko 1958) and find that speakers really do change things on command: we make up a word like wug and we ask English speakers to inflect it in the plural form. We get [wz] for the plural of wug, but [wks] for the plural of wuck, following another general rule of English (Voicing Assimilation). Patterns like English Deaspiration are also often treated as active processes by way of 16 having a particular hypothesis about how they arise and what speakers know about them, but they do not show direct evidence of a ?change? like this (see Chapter 3 for discussion of some relevant speech perception data for cases like this, however). A still more important point to stress about these alternations is that they seem to show categoricity. This fact is fundamental to phonology. The idea is that, in memory, the phonetic details are not (all) stored with a segment; the phonemes in the lexicon pick out a subset of the set of all phonetically possible segments. We evidently need to make more fine-grained distinctions than this when we are specifying exactly what the motor systems should do, or what the perceptual systems should attend to (if we could not do this then we would not be able to explain all the minute differences in pronunciation that we see from one language to the next). This is why we say that the phonetics?phonology interface engages in a translation, to fill in this gradient detail. Apart from categoricity, the other non-trivial property of lexical representations is that they are segmented. This just means that the categories are defined over rather coarse chunks temporally (consonant or vowel-width chunks). The motivation for this is as simple as that: there are some tempo- ral chunks within which the phonetic information falls into equivalence classes; none of the narrower temporal distinctions seem to need to be made in order to state phonological processes. There is also some evidence for segments from speech errors (see Chapter 3) and resyllabification (see Chapter 4); these patterns not only show that the chunks have fairly substantial width, but also that the chunks are smaller than syllables. A third prop- erty of lexical representations is that they are decomposed into cross-classifying features, but I will save the discussion of this for Chapter 3. Processes for which the result is phonetically identical to or indistinguishable from 17 the pronunciation of some other phoneme, or from the output of some other process? are called neutralizing processes.2 (Strict) allophony is another type of pattern, which provides a weaker kind of evidence for categoricity: the results of the processes do not correspond phonetically to any other segment, but the results are at least internally consis- tent. In Spanish, the [?] and [?] sounds only appear as the result of Spirantization, (with the result that they only occur in one, very restricted set of contexts), but it seems that [?]?s are at least fairly phonetically similar to each other when they are pronounced, and do not just appear as pronunciations of totally random sounds, or track finite details of the the pro- nunciation of the preceding triggering vowel (at least this latter point is usually assumed, but the issue has not been thoroughly investigated). Cases where a segment only arises due to some process, but two different underlying segments can generate it, are generally called allophony too, because they generate segments which are (by hypothesis) not found in the lexicon. Another term for this is non-structure-preserving. These phenomena will be the subject of much discussion in this dissertation. Finally, a few details are in order about phonological theory. The classical deriva- tional theory of phonological grammar was first fully articulated in The Sound Pattern of English (SPE; Chomsky & Halle 1968). A sample computation for the (simplified) Hungarian Vowel Harmony pattern is shown here: 2Although in Chapter 5 I cite two examples of published experimental studies which find results like this?total phonetic neutralization by a phonological alternation, into the same equivalence class?and an- other in passing in Chapter 3, the results are easy enough to recreate using phonetic corpora, which are now widely available. The interested reader is invited to create graphs like the ones I provide for the ?incomplete? neutralization cases in Chapter 5, only for the English alternation alternation (try the Buckeye corpus: Pitt, Johnson, Hume, Kiesling & Raymond 2005) and Turkish Vowel Harmony (use the METU corpus: Salor, Pellom, Ciloglu & Demirekler 2007). The result will contrast sharply with the graphs in Chapter 5. 18 (12) lukoz+ynk V! [a back] , 2 6 6 4 V a back 3 7 7 5 C0 +C0? lukoz+unk lukoz+unk The top row shows the underlying lexical representation; the bottom row shows the output of the phonological grammar (only explicitly specified in the ?forward? direction). The notation in the left column shows the rule that executes Vowel Harmony: vowels change to match in the feature [back] with a preceding vowel when zero or more consonants, then a morpheme boundary, then zero or more consonants, intervene. The dash indicates the position of the target segment. Two crucial properties of this theory are (i) that the grammar can contain arbitrarily many rules, and they compose in some way that is also specified as part of the grammar (here only one is shown); (ii) the learner?s inference is over these individual rules. This contrasts sharply with Optimality Theory (OT; Prince & Smolensky 2004). A tableau representing an OT computation is shown here (analysis from Ringen & Vago 1998): 19 (13) =luk o z y nk= j j +b b Ident-IOst[bk] Align-R[bk] Ident-IO[bk] R [luk o zynk] j = +b * [luk o z y nk] j j +b b *! [luk o zynk] j = b *! * [luk o z y nk] j j +b +b *! * [luk o z y nk] j j b b *! * ** This theory works by generating every output (actually, every possible input?output map- ping) for a given input, then filtering these until only one remains. Only a few crucial candidates are shown in any given tableau. There is a universal set of constraints which assign violations (* marks) to candidates. These constraints are given a language-specific ranking (highest-to-lowest: read left-to-right). The computation descends down the list of constraints and only the best candidates?the ones with the fewest violations of the current constraint?are allowed to pass. The rest are excluded (an ! is marked after the first fatal *). Only the best survives (marked withR). This analysis also assumes au- tosegmental theory (which is a theory of how segments are represented and of the basic operations of grammar that is orthogonal to OT; Goldsmith 1976). In that theory, seg- ments can share features; these features are usually shown like this, dangling from their associated segments, to indicate that they are independent of each other in various ways (see Chapters 4?5 for brief discussion). The relevant constraints shown in this tableau are: Ident(ity)-I(nput)O(utput), in two flavors: one restricted to the st(em), and another 20 working across the board. This type of constraint examines the correspondence between segments in the underlying representation and their yields in the output. In this case, the value of the b(ac)k feature needs to be identical, and a violation is assigned for each instance where this is not true. The Align-R(ight edge) constraint, keyed to the [back] feature again, assigns a violation for each instance of a vowel feature intervening between the right edge of the rightmost [back] feature and the right edge of the word. The key features of OT are that (i) for the purposes of the grammatical computation, there is only one compositional ?step? (it is monostratal); the constraints only get to make reference to one single input representation (the original) and one single output represen- tation (the final one) throughout the process of computation; (ii) the basic elements of the grammar are not the actual steps taken to modify an input but the constraints on what mod- ifications are allowed?or, rather, a ranking of these constraints; and (iii) these constraints are ?output-oriented,? and never make reference to just the input (although input?output correspondence constraints are fine). Some points to watch for below are: (i) the grammar should actually have at least two steps, because some of what used to be in the phonolog- ical grammar will move to the phonetics?phonology interface (somewhat reminiscent of Stratal OT). This is found in Chapters 3?5. For (ii), it has never really been clear what exactly the ?representation of the grammar? (in this case, as filters versus instructions to make changes) is supposed to correspond to cognitively. If it computes the same function as some other type of grammar, what is the difference? In this case one could perhaps point to some difference between performing many operations to generate many candi- dates, and performing one operation to generate one candidate, but it is never clear how many of the intermediate data structures in a grammatical computation one is supposed to 21 take seriously, or how exactly they are supposed to stand in correspondence with exactly what brain-instantiated features of the computation. If there is no such ?data structure? interpretation, then all that remains is (ii): the grammar is stated in one way rather than another. I suggest that the only agreed-upon locus for this ?structure of a grammar? is in acquisition; in the Bayesian framework, this structure must somehow be reflected in the prior probability distribution on grammars. This is found in Chapter 2. Finally, for (iii), I propose that, in the phonetics?phonology interface component, grammatical statements are actually crucially input-oriented in a way that does not comport with either OT or SPE-type theories. It is also worth noting briefly that OT analyses generally take non-alternating cases of allophony not to actually be allophony, but rather just knowledge of a combinatorial restriction on segments?although this not an absolutely crucial part of the theory. They can afford to do this with some elegance, because the alternation and the statement of the static generalization do not need to be stated separately: they both arise out of the filtering process. This is fine in principle: different descriptive generalizations can be drawn out of the same finite data set, neither being more ?empirical? than the other, and they lead to different causal theories. However, the description of allophony, as I will argue in this dissertation, has always been far more contingent than just a simple debate over whether a pattern constitutes a ?static? or ?active? alternation. Implicitly, all the standard analyses assume a transcription has been provided, which neatly classifies things into allophones; but, as any beginning field worker or introductory phonetics student knows from their difficulties in coming up wiht a classification for phones, this is a tremendous and risky simplification. The peril of this classical approach for the learner is the subject of Chap- 22 ter 3, and the theory?that allophones are in fact a part of a phonetic module far more respectable than it is usually given credit for?carries through to the end. This concludes my brief overview of the basic material of this dissertation. 23 Chapter 2: Simplicity Mystery. You?re always surrounded by them. But if you tried to solve them all, you?d never get the machine fixed. ?Robert Pirsig, Zen and the Art of Motorcycle Maintenance 2.1 The poverty of the stimulus: what is to be done? This chapter develops the theory of language acquisition. To outline how arguments about language acquisition go, and to set the stage for a particular proposal about acqui- sition, let us begin with a family of informal arguments about acquisition called ?poverty of the stimulus? arguments. From Chomsky 1975: Suppose ? the child has learned to form such questions as [is the man tall?, is the book on the table?, etc.], corresponding to the associated declaratives [the man is tall, the book is on the table]. ? [T]he scientist might arrive at the following tentative hypothesis as to what the child is doing ?: Hypothesis 1: The child processes the declarative sentence from its first word (i.e., from ?left to right?), continuing until he reaches the first occurrence of the word ?is? (or others like it: ?may,? ?will,? 24 etc.); he then preposes this occurrence of ?is,? producing the cor- responding question ?. This hypothesis ? is false, as we learn from such examples as [the man who is tall is in the room?is the man who is tall in the room?, the man who is tall is in the room?*is the man who tall is in the room?]. ? Children make many mistakes in language learning, but never mistakes such as [*is the man who tall is in the room?]. ? The correct hypothesis is the following ?: Hypothesis 2: The child analyzes the declarative sentence into ab- stract phrases; he then locates the first occurrence of ?is? (etc.) that follows the first noun phrase; he then preposes this occurrence of ?is,? forming the corresponding question. ? Hypothesis 2 holds that the child is employing a ?structure-dependent rule,? a rule that involves analysis into words and phrases, and the property ?earliest? defined on sequences of words analyzed into abstract phrases. ? [T]he scientist must ask why it is that the child unerringly makes use of the structure-dependent rule postulated in hypothesis 2, rather than the simpler structure-independent rule of hypothesis 1. ? A person may go through a considerable part of his life without ever facing relevant evidence, but he will have no hesitation in using the structure-dependent rule, even if all of his ex- perience is consistent with hypothesis 1. The only reasonable conclusion is that UG contains the principle that all such rules must be structure-dependent. (31?32) 25 A more recent description of this particular case?cleaned up slightly to avoid the mis- leading procedural metaphors about grammatical knowledge?is to be found in Berwick, Pietroski, Yankama& Chomsky 2011: the interpretation of auxiliary verb-initial sentences like Can eagles that fly eat? obeys a ?structure-dependence? restriction. This restriction has the effect of ruling out ?is it the case that eagles that can fly eat?? as a possible interpre- tation, evidently because the initial auxiliary verb must always be interpreted as modifying the nearest verb which is at the same structural ?level,? (that is, Can [eagles that fly] eat), and never a closer verb with respect to the linear order of items in the string, but which is at a lower structural level (Can [eagles that fly] eat). The relevant principle might be stated in this case as ?interpret the fronted auxiliary as modifying the nearest verb at the same level of bracketing in the grammatical structure??call it Principle S (corresponding to Hypothesis 2). The pieces of the argument from learning go as follows: (14) Here are some logically possible hypotheses the learner could entertain (about how a linguistic system might work) (15) Here is the input available to an inference mechanism operating over these hypotheses (16) Conclude: If the actual learning outcome is one thing, but something else was logically possible, then evidently that logically possible hypothesis is not actually possible Much of the controversy surrounding these arguments?and the counterarguments?stems from the fact that it is rare for either to contain all the logically necessary details. Here are what the pieces should be: 26 (17) Delimit a full range of possible hypotheses (18) Say what the input is (19) Specify the behavior of an inference mechanism (20) Conclude: (assuming the specification of the input is correct) If the actual learning outcome differs from the prediction, the proposed set of hypotheses or the inference mechanism must be different from what was proposed More specifically, if we conclude that the proposed inference mechanism would be unable to decide between the actually correct hypothesis and some other, or would arrive at some incorrect hypothesis, then one obvious move is to propose that the set of hypotheses is actually restricted and does not include the incorrect hypothesis. Apart from that, this all seems quite clear, but it is also clear that the path through a debate about any of this is fraught with danger if we mistake the simplified argument in (14)?(16) for a fully worked out argument like (17)?(20). For example, one type of debate has looked at the facts about the input and the learning outcomes. Do children really not hear enough examples of sentences like Is the man who is in the room tall and Can eagles that fly eat to prefer Hypothesis 2 over Hypothesis 1? And do they really not show evidence of ever preferring Hypothesis 1? (Crain & Nakayama 1987; Sampson 2002; Legate & Yang 2002.) But all this empirical work is beside the point if it only addresses these two possible hypotheses. As Lasnik & Uriagereka 2002 point out, whatever mistakes learners turn out to make or never make, whatever type of evidence turns out to be absent, we will probably still be able to adduce vastly or infinitely many contradictory ways for them to characterize it which they seem to 27 consistently avoid. Consider some hypothetical alternatives to Principle S, adapted from Lasnik and Uriagereka: (21) Interpret a fronted auxiliary as modifying any verb (22) Interpret a fronted auxiliary as modifying the first verb that comes after a complete phrase-structural constituent Both of these new alternatives are consistent with all the data consistent with Chomsky?s Hypothesis 1, but also with a lot of data which would be be good enough to rule it out (like hearing and understanding Can eagles that fly eat?). However, both are still radically incorrect, as there is in fact never any ambiguity as to the verb being modified in these cases, and because there are still more complex sentences (such as Can eagles that fly and swallows that sing eat?) which would be incorrectly interpreted by the second principle.1 The leap from a single example of a wrong hypothesis to a larger set of vastly or in- finitely many logical possibilities is actually left implicit in many of the classic arguments about induction, not just this one from Chomsky: Goodman 1955 probes our intuitions as to whether a scientist or other rational observer would feel that a set of green emeralds, all observed before time t, could ever confirm the hypothesis that all emeralds are ?grue,? where ?grue? means ?green if examined before time t and blue otherwise?; we think not, of course, and he says that this just goes to show that ?lawlikeness? is a non-trivial concept in need of further investigation?and that such an understanding is furthermore crucial to understanding induction. Similarly, Quine 1960 concludes that correct translation is im- 1Note that the second principle makes crucial reference to phrase structure but is still wrong: although the principle playing the role of Principle S in versions of this argument is often given the convenient label of ?structure-dependence,? that label is by no means sufficient to characterize the principle?what is proposed is that the interpretation of sentences makes use of structure in a very particular way, namely, Principle S. 28 possible in many cases, after imagining a linguist probing a consultant for the meaning of the utterance ?gavagai? in the presence of a rabbit. The linguist could not decide on the basis of an affirmation by the consultant of the appropriateness of the utterance in the presence of any number of other rabbits whether it referred to a rabbit or a collection of undetached parts of a rabbit, or a collection of temporal slices of a rabbit (the last two both include actual rabbits as particular cases), and so on. Even if the linguist were to find the right words to ask, he would face similar problems with the meanings of those words, and so on. In each case the argument made to convince the reader of the existence of a problem of induction is presented by way of one or two particular examples, followed (implicitly or explicitly) with ?and so on.? It is fair to say that it is really the ?and so on? on which such informal arguments rest, for better or for worse: this discrepancy between (14) and (17) makes a huge difference. The absence of (19) in the simplified version is also quite important. But something needs to be said about this before we can say anything at all. The only allusion to a mechanism that connects hypotheses and data in the quote from Chomsky is the allusion to the relative ?simplicity? of Hypothesis 1: if there is some measure of simplicity of hypotheses that guides the learner, then we can use that to make predictions about what ought to be learned given a particular set of data and hypotheses. However, once some details of the inference mechanism are spelled out (more than this), we realize that that mechanism, and not only the set of alternative hypotheses, is also a possible candidate for explaining the discrepancy between the ?logically possible? and the ?actually learned.? Within linguistics, much of the reason that (14)?(16) has subbed in for (17)?(20) has been the understanding that, once the set of hypotheses was correctly specified, nothing 29 more needed to be said about the inference mechanism?that it was trivial. This idea was developed in the 1980s in generative grammar, largely following the publication of Chom- sky 1981. Pursuing a conception of language that implies that the set of ?core? syntactic grammars is actually finite, (what became known as the ?Principles and Parameters,? or ?P&P,? approach), Chomsky argued that ?it is quite possible? that one could restrict atten- tion to a proscribed subset, S, of bounded-length sentences, and then find, for each possible grammar, ?decision procedure ? that enables the n grammars to be differentiated in S? (11). A decision procedure in this context is simply a function that returns true or false for any given sentence in S, telling us whether the sentence is permitted under grammar i. The idea of ?differentiating the grammars in S? reduces, in this context, to the further conjec- ture that S exists such that no two grammars share exactly the same extension, restricted to S. Then, given a large enough subset of S, eventually we can identify the grammar. This is an argument against immediately ruling out as ?unlearnable? any grammar for whose language a decision procedure does not exist (e.g., a grammar generating all and only positive instances of the Halting Problem, the Post Correspondence Problem, etc). The idea is that the hardness of the decision problem associated with a particular grammar says nothing about how hard it would be to successfully learn grammars from some set which happens to contain that grammar. The fallacious conflation of these two different kinds of hardness is based on the mistaken idea that the inclusion in the hypothesis space of a grammar that is hard in the decision sense immediately implies that all members of the ?usual? class to which that grammar is considered to belong?in this case, the whole set of recursively enumerable functions, which is to say, any function, computable or not?is therefore included in the hypothesis space too. Obviously, if the problem space for the 30 learner was, ?pick some grammar, about which I have told you nothing except that it is some function you might be able to dream up,? the learner would be faced with a very hard problem! But the division of functions (grammars) into classes for the purposes of analyzing their computability or complexity, although illuminating, is not equal to the division we care about for learning, namely, which grammars is the learner able to learn, and which grammars is the learner not able to learn. That is the point, and this is a good argument for that point.2 However, from this point on in LGB, Chomsky seems to largely take for granted the point that, given that the set of grammars is finite, a practical and trivial learning proce- dure exists that can recover the grammar from data. The logic just outlined is not a good argument for that point. In fact, it is not an argument for that point at all. The procedure whose existence is conjectured (non-constructively) is a counterexample provided against one particular fallacious line of reasoning, as just outlined?nothing more. Subsequent years of research in P&P confirmed that, indeed, the problem of language acquisition is in no way trivial just because the set of grammars is finite (Dresher & Kaye 1990; Niyogi & Berwick 1996; Yang 2002; Pearl 2007). 2 Heinz (2013) reiterates this point quite clearly, but, in spite of this, the misapprehension persists in the field that the argument goes further, and makes irrelevant all and any use of classes of functions established in mathematics or computer science, including the Chomsky hierarchy, in arguments about the human language faculty. This is incorrect; many well-established facts with sweeping empirical implications for the theory of human language, such as the fact that phonological grammars are uniformly sub-regular while some syntactic grammars fall outside the context-free class (Kaplan & Kay 1994, Shieber 1985, Heinz & Idsardi 2011) would not even be stateable without relying on this body of research. As with anything in linguistics, a result will be interesting to the extent that it helps us narrow in on the question of how exactly knowledge of language is acquired, and what that constitutes, a key part of which is narrowing in on the range of possible grammars using whatever tools. Chomsky?s argument can be paraphrased as follows: understanding the distinction between recursive and recursively enumerable functions is not sufficient to solve the problem of language acquisition; the incorrect interpretation of the argument is, at its most pernicious, that analyzing grammars in terms of classes of functions well-established in mathematics is not necessary to solve the problem of language acquisition?an obvious delusion. 31 Thus there is no escaping the fact that the familiar ?poverty of the stimulus? argu- ment, (14)?(16), is not a substituted for a fully worked out argument of the form (17)?(20). What is to be done? Fill in the details. There are plenty of interesting things to be learned about the set of actually possible hypotheses, and about the inference mechanism. How- ever, in order to draw any conclusions about what they are, we must first state clearly what exactly the properties are that are at issue. An informal argument predicated on the very narrow property ?contains one very particular hypothesis X? might help convince us that neither the set of hypotheses nor the inference mechanism turn out to be trivial or obvious. But it would be misguided to keep pursuing that question as a way to convince a skeptic of whatever stripe, because that property is neither interesting nor easy to reach firm conclu- sions about without filling in virtually all the details about the inference mechanism. We need to state precisely what it is that we are really trying to prove or disprove about the set of actually possible hypotheses for the learner, and firmly nail down enough assumptions about the inference mechanism to allow us to reason about whether those properties hold or not; similarly, mutatis mutandis, for any property of the inference mechanism?it needs to be general but precisely stated, and we need to be able to fix some assumptions about the set of hypotheses, before we can draw any conclusions about it. In this chapter I discuss the inference mechanisms; I will say things about the set of hypotheses only in the interest of making a point about these inference mechanisms. The ones I will focus on in this chapter are any which comply with Bayesian inference; I will argue that such mechanisms are indeed a source of explanation. The goal is to establish a particular ?law of inference? which holds in a Bayesian learner, preferring intuitively ?simpler? hypotheses (the Bayesian Occam?s Razor). In particular, the goal will be to 32 tighten up previous accounts of this phenomenon to lay out, given a Bayesian inference mechanism, how to characterize the sets of possible hypotheses and data sets that will lead to such a preference. As Bayesian inference is a class of mechanisms with the prior distribution as a free parameter, I also give a condition on this prior that will guarantee that we get the law to hold (I introduce something called the Framework Consistency Principle that priors need to follow). I finally show that there is a way in which we can see the application of this principle for generating prior distributions as the application of a plausibly domain general ?optimal inference? principle to acquisition, and I spell out what this would mean for learning grammars for human languages. In the next section I give a brief introduction to Bayesian inference as a language acquisition mechanism, and review one recent and prominent debate in the literature in- volving Bayesian inference but which in my view fails to sufficiently engage with certain relevant issues crucially tied to Bayesian inference, namely, the Bayesian Occam?s Razor; the rest of the chapter is in a sense intended to fill in these important and interesting details in that argument. 2.2 Bayesian models of cognition: why should we care? Bayesian models of cognition have flourished in recent years. Bayesian models of cognition use probabilities to track relative ?preferences? for different cognitive states? in the case of language acquisition, for different possible grammars. They are specified in two parts: a likelihood function, a mapping from grammar?data pairs to probabilities, representing a ?fit? score assigned to the data by the grammar; and a prior distribution, 33 a mapping from grammars to probabilities, representing the a priori preferences of the learner. The posterior distribution is a mapping from grammar?data pairs to probabili- ties, representing the learner?s scoring of a particular grammar as a model of a given data set. The posterior distribution can be derived mechanically as a function of the prior dis- tribution and the likelihood function, which is a key benefit of Bayesian inference with broad-reaching practical implications for constructing complex models. Computationally intensive methods for stochastic search are often used to obtain a representative sample of relatively high-posterior grammars when the exact grammars of interest cannot be practi- cally specified for comparison in advance; however, with few exceptions, it is the poste- rior, and not these purely instrumental algorithms, which constitute the model (see Pearl, Goldwater & Steyvers 2010, Phillips & Pearl 2012 for attempts at constraining inference in psychologically meaningful ways). In abstracting away from the search procedure to focus on relative preferences, Bayesian models take what is in linguistics called the evalu- ation measure research strategy: do not specify how the learner works, but merely specify the ?learning relation,? from input internals ! learning outcomes, as it would apply under some idealized circumstances (Chomsky 1964, Chomsky 1965, Chomsky & Halle 1965, Chomsky & Halle 1968). The evaluation measure approach to linguistic theory traditionally applied a function from grammars (not grammar?data pairs) to values or costs which would, minimally, al- low the learner to select a unique high-valued grammar in case two grammars were equally consistent with the data. The received understanding was that this was the only case that the evaluation measure would need to handle, but this rested on the assumption that ?con- sistency with the data? was a binary distinction (fully consistent or fully inconsistent), an 34 obvious simplification (Chomsky & Halle 1968: an exception is LSLT, Chomsky 1975, which attempted the more ambitious goal). This simplification was in part a consequence of the common simplification of the notion of grammaticality to a two-way distinction: if an utterance is observed, and it is assumed for the purposes of learning that what is ob- served is necessarily grammatical, then the appropriate notion of ?consistent with the data? is obviously ?predicts that all of the observed data is grammatical,? and ?inconsistent,? conversely, ?predicts that at least some of the observed data is ungrammatical.? Making the assumption that all of the individual observations must be jointly consistent in order to have any degree of consistency with the data at all is also crucial to maintaining the binary notion of consistency. Thus the binary distinction of consistency is clearly a simplifying assumption. That it is a simplification suggests that we might want the evaluation measure to bear on the learner?s behaviour not only when the degree of consistency with a given data set is equal across two hypotheses, but also when it is different. The Bayesian approach to language acquisition asserts that the consistency function (which takes as input grammar?data pairs) and the evaluation measure (which takes as in- put simply grammars) are both probability measures, conventionally called the likelihood function (probabilities over data) and the prior measure respectively (probabilities over grammars); probability measures output real numbers between zero and one, where zero is a minimum and one is a maximum for both consistency with the data and value under the evaluation measure. Crucially, it also asserts that (almost) all that needs to be known about the learner?s final state given some data can be summarized by a probability measure as well, called the posterior measure, (giving probabilities over grammars) which can be 35 derived from the prior and likelihood as follows: (23) Posterior [G;X ] Likelihood [G;X ] Prior [G] In (23), G represents a grammar and X represents some observed data. Although the constraint imposed by (23) alone specifies a family of measures proportional to each other, the assumption of a unit measure (a measure that integrates to one) means that the posterior is unique (the scaling constant must always be the reciprocal of the integral of Likelihood [ ;X ] over the set of grammars G with respect to the measure Prior [ ]). If the posterior is thought of as the conditional probability of a grammar G given data X and the likelihood as the conditional probability of data given grammar, then (23) amounts to an application of Bayes? Rule (conditional probability will be discussed in more detail be- low). In the context of language acquisition, we may call the derived function the posterior evaluation measure, and, when the distinction is crucial, I will call what is traditionally referred to as simply the evaluation measure the prior evaluation measure. The idea is that the posterior evaluation measure is sufficient to guide the the choices of the learner. If we believe that the learner eventually stops learning and selects a single grammar, then this might be the highest posterior valued grammar; on the other hand, if we believe that the adult state is one where there is still some lingering uncertainty as to what the correct grammar for the ambient language is (as suggested, for example, by the approach of Yang 2002), then the full posterior evaluation measure is needed to characterize it.3 3The wrinkle responsible for the ?almost? above is that, in the broader Bayesian setting, posterior prob- ability measures are usually combined with additional apparatus (a loss function and a decision rule) which 36 Much recent research in language acquisition has made use of the assumption that learning outcomes can be understood as falling out from Bayesian reasoning about gram- mars. This research strategy is useful because it reduces the burden on the next researcher to come along?the one who needs to specify the actual algorithm and/or implementation by which the learner actual comes to the right grammar?reducing it to the problems of computing the posterior distribution from the bias and goodness-of-fit functions (basically trivial using Bayes? Rule) and optimizing over that function?a massive gain, as evidenced by the diverse range of optimization technology and criteria that have been thrown at the language acquisition problem (Dresher & Kaye 1990, Clark & Roberts 1993, Niyogi & Berwick 1996, Yang 2002). Even without doing any search, or attempting to examine the full posterior, it can be very illuminating to look at the shape of the posterior distribution: if the grammar(s) that matches human behavior is worse, according to the posterior dis- tribution, than ones that do not, then, either the theory of grammar (i.e. the combination of bias/prior and grammaticality/likelihood functions), or else the Minimalist assumption that the learner selects the ?best? grammar in the rational sense (that is, following the evaluation measure that follows from its bias and grammaticality functions), is incorrect. Apart from being an interesting empirical conjecture, this assumption has several practical benefits. Two of these were raised directly in the exchange between Perfors, Tenenbaum & Regier 2011 (henceforth PTR) and Berwick, Pietroski, Yankama & Chom- sky 2011 (henceforth BPYC). First, Bayesian statistical approaches do not burden the complete the characterization of the behavior of the actor, (in this case the learner), of which the posterior measure represents only a part (the internal belief state). Like most other cognitive scientists, I make the assumption that we can just think of some trivial apparatus here, with the understood caveat that working out the full decision-making apparatus is not optional in the end if we believe a single grammar is ?selected.? See Chapter 1. 37 researcher with irrelevant representational choices in the way that, for example, connec- tionist models do: neural network models demand real-vector valued inputs, outputs, and internal representations, which requires that researchers make substantive choices in or- der to convert back and forth between these representations and symbolic representations that are really already at the limits of our detailed knowledge about the mental encoding of language. This conversion can sometimes be awkward (as in the case of Rumelhart & McClelland 1986?s famous Wickelphone representation) and can sometimes represent a major research undertaking in and of itself (for example, the representation of hierarchical structure in neural network models still has no agreed-upon general solution). In contrast, the Bayesian approach allows the researcher to simply program whatever data structures seem appropriate as representations, because it is a method for associating numbers (pos- terior values) with these structures and nothing more. In this way, researchers can avoid unintended consequences of what are usually arbitrary choices. Second, Bayesian statisti- cal approaches to inference obey various general and interesting laws, which the Bayesian approach to cognition inherits as laws of reasoning. For PTR, both the easing of the burden of representational choice, and the fact that certain laws of inference hold, are factors that make the Bayesian approach useful in study- ing language acquisition. They present a Bayesian evaluation measure for syntactic gram- matical knowledge which they apply to real (but simplified) child-directed utterances. On the first issue, they write, in their conclusion, that: [W]e have offered a positive and plausible ?in principle? alternative to the negative ?in principle? poverty-of-stimulus arguments for innate knowledge 38 of hierarchical phrase structure in syntax. ? By working with sophisticated statistical inference mechanisms that can operate over structured represen- tations of knowledge such as generative grammars, ? we can more rigor- ously explore a relatively uncharted region of the theoretical landscape: the possibility that genuinely structured knowledge can be genuinely learned, as opposed to the classic positions of nativism (structured but unlearned knowl- edge) or empiricism (learned but unstructured knowledge, where apparent structure is merely implicit or emergent). (331?332) In other words, although learning arguments yield conclusions about inductive biases, the issue has been dealt with as if it were also about what types of representations the mind uses?the question of structured representations, and related questions?even when those inductive biases could be cashed out either way. This conflation is at least in part an artefact of the types of acquisition models used previously, so, using Bayesian methods, one can now more easily refute a claim about the necessity of some strong inductive bias without smuggling in irrelevant representational claims. Such an example makes it clearer that learned/not-learned is a question that is orthogonal to the question of structured/not structured. On the issue of laws of inference, they allude to one particular law of inference that comes out in many Bayesian systems, writing that: These results emerge because an ideal learner must trade off simplicity and goodness-of-fit in evaluating hypotheses. The notion that inductive learning should be constrained by a preference for simplicity is widely shared among 39 scientists, philosophers of science, and linguists. ? The tradeoff between simplicity and goodness-of-fit can be understood in domain-general terms. (313) They are referring to the Bayesian Occam?s Razor. The idea that there is something ?domain-general? about the prior evaluation measure PTR present comes up repeatedly in their paper, and, although it is clear that the issue of domain-specificity is supposed to be somehow tied up in the learning argument being made, it is never clear precisely how. As discussed above, there are two basic approaches to explaining facts about ac- quisition: adjust the assumptions about the hypotheses available to the learner, and adjust the assumptions about the inference mechanism. The Bayesian Occam?s Razor is a fact about the inference mechanism (it says that a large class of prior measures are forced into being biased toward simpler hypotheses by virtue of the basic facts of probability theory being embedded in the inference mechanism). On the other hand, a choice of possible hypotheses is implicit in any choice of prior (or indeed in any discussion of learning). One or both of these things is a plausible candidate for which part of their model is meant to be ?domain-general.? As I argue below, although PTR do not say (or believe) it, the use of Bayes Rule and, more interestingly, the Bayesian Occam?s Razor embedded in the prior, are really the only plausible candidates for something ?domain-general? about their acquisition model. Before proceeding, however, it is important to stress what the limits of this type of research are and in particular how it relates to ?innateness.? The conclusions above, in (16) and (20), are about something inside the human mind and that is the only sense in 40 which they can be said to tell us anything about ?innateness.? In particular, for some au- thors (including, but by no means limited to, PTR) the term ?innate,? as applied to (either) a set of possible hypotheses or an inference mechanism, means or is at least perfectly cor- related with the property ?domain-specific.? This sense of ?innate? (mechanisms specific to language) goes far beyond the property ?biologically fixed? and even farther beyond the property ?mind-internal,? two reasonable alternate meanings for this massively over- loaded term. But the conclusion that the set of possible internal systems that are possible outcomes of learning (the ?actually possible hypotheses?) is restricted in some particular way cannot by itself tell us anything about whether that restriction is an idiosyncracy of language or vision or whatever system the learning is taking place in; similarly for conclu- sions about the inference mechanism (which, as pointed out by Pearl & Lidz (2009), must be assessed separately for its domain-generality or domain-specificity; similarly for the encoding in which the input comes in, which is distinct from the ?grammar? in the narrow sense). To reach such further conclusions, one would need to examine different cognitive systems and compare them, and then hypothesize a relation between the two systems that says in what way they are related. Arguments of the form (17)?(20) do not do this, and more generally neither does any argument which simply starts with some assumptions and assesses whether they support acquisition. We will need to keep this in mind as we review PTR?s argument and BPYC?s reply in the next section: although PTR say they are doing something having to do with assessing the domain-specificity of some representational capacity in syntax, this is wrong; they are not. 41 2.3 The syntactic acquisition model of Perfors et al. PTR present a prior for simple syntactic grammars and compute the posterior for some simplified corpus data (strings of part-of-speech tags representing utterances from CHILDES: MacWhinney 2000). The prior is over context-free grammars generating strings of these tags, and the question is whether the learner will prefer context-free gram- mars which are in or out of the set of right-regular (strictly right-branching) grammars? actually is a bit more complicated than this, but we will leave it at this for the moment. This is interesting, according to PTR, because the properly context-free grammars have true ?hierarchical structure,? whereas the right-regular grammars, the subset of the context-free grammars which are strictly right-branching, are not much more than glorified lists (clarification for the syntactician: the c-command relation is equal to the precedence relation, except for the bottom node; clarification for the computer programmer: working with Lisp should make it clear in what sense linked-lists are ?right-branching structures?; obligatory note for the computer scientist rusty on formal language theory: right-regular grammars generate regular sets, hence the name). They show that the induced posterior measure does indeed favor context-free grammars over right regular grammars. PTR claim that this research is related to the classic example poverty of the stimu- lus argument given above, having to do with the structure-dependence of syntactic rules. BPYC question this claim. In this section, I will review BPYC?s argument for why this is incorrect, and elaborate on their discussion of why flaws in PTR?s logic mean that the pa- per is actually irrelevant to any interesting questions about how restrictive or unrestricted the set of possible syntactic grammars could be, and still support learning. I will also go 42 over again why, even if they did have something to say about this, it would be irrelevant to questions about the domain-specificity of syntactic mechanisms. Finally, I will point to the fact that there actually is something extremely important and interesting about PTR?s paper, but it is buried; this turns out to be the Bayesian Occam?s Razor, which is the topic of this chapter. Let us begin with strictly right-branching grammars. As BPYC point out, PTR?s ter- minology is misleading: the regular grammars they use to stand in for ?non-hierarchical? structure in fact yield hierarchical structures which happen to be strictly right-branching. They are not the same, intensionally, as grammars which generate strings using flat struc- tures, although there is a conversion between the structures yielded. The issue goes beyond terminology, and it goes beyond potentially subtle issues of internal mental representation. The way that almost any linguistic theory of phrase structure works, one is always forced to let in strictly right-branching structures as possible analyses for at least some sentences (it is difficult to find a modern syntax paper without a strictly right-branching analysis for at least one sentence). More importantly, the way that almost any theory of syntax using phrase structure works, one is always forced to let in grammars that could only ever gener- ate right-branching structures! This is very problematic for PTR. The implication in their paper is that ?hierarchical??i.e., for them, not strictly right-branching?grammars are ?really? language-like, and ?non-hierarchical??i.e., strictly right-branching?grammars are not. This, for them, is evidently why it is important to ask whether it can be learned that the best grammar for a given corpus is ?hierarchical.? But, in fact, strictly right-branching grammars are perfectly language-like, in that they are predicted to be possible in most syntactic theories. 43 Of course, it is difficult to imagine how a learner could assign a strictly right- branching structure to sentences like The man likes John, where the man is obviously a complex constituent, a noun phrase?let alone prefer a grammar that gives a strictly right-branching analysis to every possible sentence. The fact that it is difficult to imagine should suggest why it is not particularly surprising that the learner is able to rule these analyses out, but it is important to see that such a ?defective? analysis is possible, be- cause this is exactly the sort of analysis of the corpus that PTR?s learner comes to rule out (whether that is as significant as they claim or not). Some informal reasoning about how acquisition might take place will make clear what this kind of strange acquired grammar would look like. Intuitively, one might think that a learner hearing only the sentences The man likes John and John likes the man would be compelled, minimally, to at least the following conclusions: since John and the man both occur after likes, but John likes the never occurs, the constraint on what follows likes must treat John and the man as of a kind and say that this class of things (of ?constituents?) can follow likes, as opposed to asserting that the can follow likes and man can follow the independently; the learner has independent evidence about the fact that the man is of a kind with John for the purposes of word order restrictions because Man likes John never occurs, and the reasoning follows in a parallel way; so the only method the learner has to analyze the man is using some (learned) principle of analysis into noun-phrases. To see the (informal) reasoning more clearly, label noun phrases J (for ?John-type?). Given the sort of syntactic representation we are talking about, this means the learner comes to believe in something isomorphic to the simple J! the manjJohn, S! J L, L! likes J, which is not strictly right-branching. This is our intuition about what should happen, an 44 analysis so obvious that it is hard to see that it is not the only one. Actually, there are a lot of assumptions in this informal supposition which do not follow from any formal constraints on grammars. For example, it is not logically necessary that the learner analyze all instances of John or the man in the same way; the learner might instead adopt the right-branching grammar J ! the manjJohn, S! John Ljthe M, M! man L, L! likes J. John likes the man is then analyzed as S[John L[likes J[the man]]], and The man likes John as S[the M[man L[likes J[John]]]. Such an analysis is possible even under the ?merge? theory presented by BPYC, which is a basic version of what has come to be called ?minimalist? syntactic theory (Chomsky 1995), even though there are substantial restrictions on phrase structure implied in that theory. For example, the theory asserts that syntactic constituents give a privileged status to one of their elements (the ?head?), and that there is a label assigned to the whole constituent which is uniquely determined by the head. All combinatorial restrictions are then stated in terms of these constituent labels. In terms of the sorts of rules we have out- lined here, this implies that if there is a rule L! likes J and a rule ?! likes, then ? must either be L, or else there must be two accidentally homophonous lexical items likes. But absent constraints from the semantics, it is easy to make up a right-branching grammar that generates our little corpus that is subject to these restrictions: S! the MjJohnjJohn L, M! man L, L! likes S (that is, ?man? has label M, ?likes? has label L and ?the? and ?John? have label S). Although this seems implausible on the face of it, and of course gen- erates even more unattested sentences such as John and The man likes John likes John, the point is that this analysis cannot be ruled out by formal constraints under this the- ory, or many others like it; the explanation for why must have to do with the inference 45 mechanism, then (perhaps something to do with the fact that the grammar overgenerates). If we introduce enough accidental homophony, we can even do better by the over- generation problem: if the John in The man likes John is different from the John in John likes the man, then we are free to assign them different labels. With such a powerful tool as homophony in hand, it is easy to see how the learner could come up with a (perverse) strictly right branching grammar for just about any corpus?and with homophony, as with the kinds of funny (mis-)labelling required to get the previous grammar to work out, there cannot be an outright ban, so if a grammar like this is not the grammar learners arrive at, then there must be some reason other than a ban, a soft preference?hidden somewhere in what we call the posterior evaluation measure. In sum, right-regular grammars are not impossible or non-syntax-like, and no one ever said they were; the idea that it is somehow contrary to standard linguistic theory to posit an acquisition device that can learn to use ?hierarchical? grammars, choosing them over these supposedly ?unstructured? analyses, is incorrect. Such an acquisition device is, rather, taken for granted. Before explaining why this also has nothing to do with the Principle S issue dis- cussed above, we should clean up the characterization of the PTR prior, which has been somewhat unfaithful. What PTR claim to be doing is not merely learning a grammar for a corpus. In their words, ?the interesting claim ? is not about the rule [or grammar] for producing interrogatives (G) per se; rather, it concerns some more abstract knowledge T . ?T is the knowledge that linguistic rules are defined over hierarchical phrase structures. This knowledge constrains the specific rules of grammar that children may posit and there- fore licenses the inference to G? (311, underlining mine). This is actually a very tricky idea, and it may become clear to the reader only after the discussion of model evaluation 46 below. The way it works is like this: the prior is actually not just over grammars?this is the slight inaccuracy alluded to above. Actually, what is learned is a grammar?indicator pair, hG; w i: G is the grammar, and w is a single bit, whose values I will for convenience map to fr;cg. The role of w is that, if w = r, then G is restricted to being right-regular; when w = c, it is not restricted in this way, it is only restricted to being context-free. Equiv- alently, the acquiendum hG; w i, whereG is not right-regular, is impossible if w = r; ifG is right-regular, then w may be either r or c. The inference about w , then is a ?higher-order? inference, in the sense that what we (the learner) take its value to be actually determines how we do inference about G. We will see in the discussion of model evaluation below that such inferences about what kind of solution we are allowed to infer are useful in prac- tical settings, such as selecting which predictor variables should or should not be included in the analysis of some experimental data we might collect. We will also see, throughout the chapter, that these sorts of nested, or hierarchical, inference problems permeate the problem of language acquisition. To take just one example, consider the problem of learning words: when listening to an utterance, the learner may believe that utterance consists entirely of known words, or may decide that it actually contains a new, unknown word. If the utterance seems to be composed entirely of known words, then the learner may (or may not) take the opportunity to adjust how it thinks the words in its lexicon are pronounced, or what they mean, etc., but the range of in-principle-possible lexicons is just the same as it was before; but if the learner believes that a new word has been uttered, then a whole new ranger of possibilities are opened up, most notably the possibility that the contents of the lexicon?the pronunci- ations, semantics, and grammatical information?are just exactly what the learner already 47 thought they were, except that there is a new lexical item, corresponding to a new tuple of pronunciation/semantic/grammatical information. Thus, conditional on some inference that the learner makes, the learner?s range of possible ?solutions? to the problem ?what are the contents of the lexicon? changes. Going back to the PTR prior, we might charac- terize what is being learned as simultaneously learning about what grammar is ambient and about what grammar is. Hierarchical inferences are always difficult to interpret.4 For a moment, the idea of ?learning what grammar is? might seem to be meaningful or even profound; but, upon reflection, it is actually quite difficult to understand. Put aside the idea that ?discovering? the class of grammars, rather than merely discovering the grammar, is somehow a more ?domain-general? inference: we already dispensed above with the notion that learning arguments of this kind can say anything about domain-specificity. To see what kind of question PTR are asking, consider a more general case. Sup- pose that w actually ranges over the whole non-enumerable set of subsets of the set of all rewrite grammars of every kind, so that one value of w picks out the right-regular grammars, another picks out the context-free grammars, another picks out the context- sensitive grammars, another picks out the (Turing-complete) set of all unrestricted rewrite grammars (Type 0 on the Chomsky hierarchy), another picks out the strange assortment of grammars that would be obtained by constructing some particular mapping between real numbers and grammars and then picking only those grammars paired with prime 4This is why the discussion of simplicity in Bayesian inference continues to be at an impasse in the philosophical literature: although the exegesis would take me too far afield, Forster & Sober 1994, Dowe, Gardner & Oppy 2007, and Henderson, Goodman, Tenenbaum &Woodward 2010 are at cross-purposes and fail to engage with some of the crucial conceptual issues, which ultimately have to do with the hierarchical Bayesian inferences that give rise to the Bayesian Occam?s Razor I discuss in this chapter. 48 numbers?the reader gets the idea. Now contrast this with a prior that lacks w , and only needs to pick out some grammar G from the set of all (unrestricted) rewrite systems. The range of possible grammars (i.e., functions) that the learner can acquire is exactly the same. The potential ?models? of any observed data are the same, although one might say that the ?explanation? can vary under the pair?learner, in a sense that is somewhat difficult to make more precise. Short of spelling out this rather thorny intuition, what can we say is the motivation for the higher-level inference in this case? For example, imagine that the usual arguments from linguistic theory about restric- tions on grammars (or, perhaps more neutrally, ?internal language perception/production models?) are incorrect: there are no hard restrictions on possible grammars; rather, any logically possible mapping is possible. This is what many ?emergent? explanations of language amount to, or aspire to amount to?a claim there is no ?innate knowledge? con- straining the hypothesis space (for language, although again, domain-specificity is actu- ally orthogonal). Put the intension of the mapping aside?the claim is merely that the net result of learning ?could be anything.? What does this have to do with whether the learner simultaneously infers a restriction on the class of mappings it is selecting from? We might expect this to be the case if the learner were trying to determine whether what it was hearing was, say, an animal call ( w = a) or human language ( w = l), and then had some differing biases about what the concomitant structure would be in either case. Or, supposing one were presented with data from many languages, asked to learn gram- mars for all of them and then to form some generalization about those grammars, then there would be an obvious motivation for adding a variable corresponding to ?class of grammars??although, as this is not the situation the language learner is put in, we would 49 surely interpret the higher-level inference as one that a scientist, and not an actual lan- guage learner, would do. But, in any case, without some empirical or conceptual reason for assigning grammars/analyses to the various different classes corresponding to differ- ent value of w , the idea that grammatical inference has this additional structure is merely an interesting conjecture, not motivated by anything. More importantly, it is no more or less like any set of standard assumptions in linguistics or anywhere else to assume, or not assume, that the set of right-regular grammars is specially delimited and elevated to some special status distinguishing it from the set of more general context-free grammars. As we will see, the real difference between a learner with and without the additional w inference is that the learner with the additional inference believes that right-regular grammars are simpler in a sense that the learner without it does not. It is just not clear what the viability of such a learner has to do with any pertinent issues in cognitive science. It remains to explain why PTR are wrong that inference under their prior bears on the Principle S issue discussed above?and they say that it does. Citing Chomsky on the example of subject?aux inversion, they write that ?only by defining syntactic rules ? over hierarchical phrase structure representations is a child likely to be able understand that [Is the boy who is smiling happy?] expresses a certain complex thought while [ Is the boy who smiling is happy?] expresses no well-formed thought. Hence our focus here is on the more basic question of how a learner can come to know that language should be represented in terms of hierarchical phrase structure? (310). I have just argued that, (a), following BPYC, ?hierarchical or not? actually is not well-aligned with the distinc- tion (right-regular or not) that PTR try to map it to, (b), again following BPYC, given that the set of possible grammars that the PTR prior allows is roughly that which would be 50 posited under any linguistic theory, the ?coming to know that language should be repre- sented in terms of hierarchical phrase structure? reduces to the rather more underwhelming prospect of ?learning a grammar,? and (c), adding an additional inference about the class grammars are drawn from in a situation where there is only a single grammar at issue is barely different from ?learning a grammar,? except that it adds a subtle claim about the ?structure? of the prior?an idea which we will see come up throughout the chapter, but which has no clear cognitive interpretation in this case. Even given these points, how- ever, there might be some interest if the questions being addressed bore indirectly on a question, structure-dependence of so-called A-dependencies, which has been the subject of a reasonable amount of previous research (Crain & Nakayama 1987, Legate & Yang 2002). However, although PTR cite such an argument as the main motivation for their study?correctly noting that the availability of a tree-like representation is a prerequi- site for any grammatical computations depending on that representation?the mere avail- ability of such representations says nothing about how they will be used in computation. ?Structure-dependent? is shorthand for ?structure-dependent in a very particular sense?? namely, Principle S?a sense which does not in any way follow from the existence of hierarchical structure. Finally, I return to the question about domain-specificity. PTR write that ?[BPYC] have argued that much recent work ? misses the original intention of the argument. ? Our goal here is to explore what we see as a basic issue at the heart of language acquisition?the origins of hierarchical phrase structure in syntactic representation,? with the question about ?origins? evidently being over whether hierarchical phrase structure ?comes from? language or not: ?the argument about innateness is primarily about the role 51 of domain-specificity,? (307). Innateness has nothing to do with domain-specificity. This is very important. One could indeed imagine the PTR study coming out differently, with the learner always favoring right-regular grammars, even when they were inappropriate, thus suggesting an ?innate? (inborn, biological, natural) bias in favor of them; or, one might imagine that a learner with the higher-order inference over grammar classes is es- pecially drawn to the conclusion that the grammar should be drawn from the narrower class, the right-regular grammars (we will see below why one might have expected such a thing to happen in this model). Given that these are evidently not the grammars peo- ple actually come to, this would suggest a model of the mind in which there is no such higher-order inference. In either case, there is something to be concluded about the ?in- nate? structure of the mind. There is nothing, however, to be concluded about whether that particular innate structure is special to language or is shared when humans learn in some other domain. One way to ask that question would be to apply the same model, or a model with comparable structure, to data from another domain. It is unfortunate that the word ?innate,? whose familiar meaning seems easy enough to apply to the relevant issues in cognitive science, has come to mean ?innate and domain-specific? when applied by cognitive scientists, an obviously detrimental conflation. There is no way to draw any conclusions here about the domain-specificity of the set of hypotheses. Nevertheless, I will argue that there is something plausibly domain-general about the prior, in particular, the automatic Bayesian Occam?s Razor it conceals. Although it would require further research to establish parallels in inference across domains, this is at least a plausible candidate for a domain-general mechanism: it falls out under the combination of Bayesian inference (which has nothing to do with language per se) and a way of asso- 52 ciating hypotheses (grammars) with prior distributions that could be thought of as being ?minimalist? in Chomsky?s (1995) sense of ?following from the structure of the problem in some optimal way.? It would not be surprising, therefore, if under close scrutiny this law of inference were found in other domains and were found to hold for the same reasons. We begin with the necessary background about evaluation measures and about Bayesian statistical theory. 2.4 Evaluation measures: restrictiveness and simplicity As discussed above, an evaluation measure for grammars is a function mapping from grammars to some comparable values, intended as a description of the learner?s relative preferences. Evaluation measures were a subject of interest relatively early in the history of generative grammar. As there will be multiple grammars equally consistent with a given data set even under a restrictive theory, the evaluation measure is motivated as an additional device to constrain the behavior of the learner.5 For example, Chomsky & Halle 1965 presented the following two candidate de- 5Evaluation measures have sometimes also been referred to as ?evaluation procedures? (in certain am- biguous passages in Chomsky 1965 ?evaluation measure? and ?evaluation procedure? seem to be conflated) or ?evaluation metrics? (Bach & Harms 1972). I avoid the former term since it is extremely misleading, the crucial property of evaluation measures being that they imply nothing about any actual procedure; I avoid the latter term, although it is the dominant one, to avoid the misleading implication that evaluation measures are, or need to somehow induce, metrics in the mathematical sense (that is, distance functions); the term ?evaluation measure? suggests that evaluation measures need to obey the axioms of a measure, which may not be true in general, but which is acceptable in the current context, as the evaluation measures developed here are indeed measures in the mathematical sense. 53 scriptions of a set of phonological facts in Western Mono (taken from Lamb 1964): w ! kw h? kw ! qw V1h?V2 (24) w ! 8 >>< >>: qw V1h?V2 kw h? 9 >>= >>; kw ! qw V1h?V2 (25) In (24) and (25), V1 and V2 are two different classes of vowels. When flanked by these vowels, the underlying sequences /hw/ and /hkw/ are both realized as [hqw]; otherwise, they are both realized as [hkw]. If the grammars, crucially the one in (24), are understood as rule systems ordered as given, then both grammars characterize this pattern. In (25), the grammar has been modified to avoid making crucial use of rule ordering. However, presumably, the learner must decide on one representation, and yet both must be available. The evaluation measure proposed by Chomsky & Halle 1968 is that ?the ?value? of a sequence of rules is the reciprocal of the number of symbols in the minimal schema that expands to this sequence? (334). This would lead the learner to grammar (24) in the face of two equally consistent grammars. In general, the evaluation measure was supposed to trade off against a graded evaluation of fit, but specifying the fit quantitatively was ignored in practice.6 As discussed above, this evaluation measure corresponds to a prior, and not 6In Chomsky and Halle?s words: ?We will not concern ourselves here with the nontrivial problem of what it means to say that ? a proposed grammar? is compatible with the data ? . In other words, we make the simplifying and counter-to-fact assumption that all of the primary linguistic data must be accounted for by the grammar and that all must be accepted as ?correct?; we do not here consider the question of deviation 54 a posterior, evaluation measure in a Bayesian setting. This evaluation measure is intuitively guided by ?simplicity?: grammars which are ?simpler? in terms of the number of theoretical objects they contain are more strongly preferred. Apart from simplicity, the other main intuition guiding evaluation measures has been restrictiveness. To take another phonological example, Hale & Reiss 2008 con- sider two proposed versions of a rule from Georgian, which has the standard five vowel inventory fi;e;o;u;g, but no low front vowel [?]: l ! l ? 2 6 6 6 6 6 6 6 6 6 6 4 + ATR back low round 3 7 7 7 7 7 7 7 7 7 7 5 (26) l ! l ? 2 6 6 6 6 6 6 4 + ATR back round 3 7 7 7 7 7 7 5 (27) In Georgian, underlying /l/ is realized as [l] before [i] and [e]. Unlike the Western Mono case discussed above, these two grammars are not extensionally equivalent; (27) predicts that the rule should apply to all [ back] vowels, while (26) predicts that it will apply to only [ back; low] vowels. Thus (27) maps /l?/ to [l?], while (26) maps /l?/ to [l?]. However, given that Georgian has only one [+low] vowel, (and assuming that it is treated as [+back]), both grammars will be equally consistent with the evidence. Contrary from grammaticalness, in its many diverse aspects,? (Chomsky & Halle 1968, 331). 55 to the symbol-counting evaluation measure, which would value (27) more highly, Hale & Reiss 2008 propose what amounts to the opposite: ?The correct statement of a rule ? is the most highly specified representation that subsumes all positive instances of the rule, and subsumes no negative instances of the rule? (103). This principle (which would form the basis for a posterior evaluation measure, in our sense) makes reference to only two possible values, correct and, implicitly, incorrect; it apparently assigns correct if no speci- fication could be added to the rule without making it inconsistent with the data, otherwise incorrect. Although it might appear to be stated, like Chomsky and Halle?s evaluation measure, in terms of notation, Hale and Reiss make it clear that the motivation is not no- tational complexity per se, but, rather, restrictiveness: ?The more specific, that is, more restrictive, rule is the one provided by the [language acquisition device]? (104). Here, ?more specific? refers to the amount of information in the statement of the environment, where each piece of information is treated as a constraint, or restriction; a more specific statement will necessarily be more restrictive. In this case, the notion of restrictiveness aligns with increased notational complexity. Simplicity and restrictiveness together form the basis for most of the general prin- ciples of language acquisition proposed in the literature. They are sometimes at odds, as in the Georgian example (but not in the Western Mono example). In spite of this, they can be combined, as indeed Hale and Reiss seem to do implicitly when they assume that learners form a single rule handling both /le/ and /li/, rather than formulating the gener- alization as two rules rather than one. (This ?minimal generalization? proposal can be found throughout the literature on phonological acquisition: Pinker & Prince 1988, Yip & Sussman 1997, Albright & Hayes 1999, Dunbar 2008). 56 The Principles and Parameters approach to linguistic theory focused attention on restrictiveness, in the form of the Subset Principle. One reason for this is simply an id- iosyncrasy of the facts that P&P focused on. In the case of an obligatory phonological rule, as in the Georgian example above, when restrictiveness is invoked to decide be- tween competing environments, it is not because the set of possible surface strings, or the set of possible underlying?surface mappings, is more restricted under one grammar than another; both these sets will always have one element for each possible underlying form, and will merely change depending on the restrictiveness of the environment for a given rule. On the other hand, the prototypical case to which restrictiveness was applied in P&P involved allowing or barring the optional null subject pro. Allowing it has the result of strictly expanding the set of strings (and of sound?meaning pairs). It is presumably be- cause of this difference that researchers generally agreed that the learner?s response should be to value the more restrictive grammar (the one that generates the subset, rather than the superset): if the alternative strategy is to always prefer the less restrictive grammar, then this strategy will surely fail, as both grammars will be equally consistent with evidence coming from the subset grammar. Another reason for the focus on restrictiveness is that it is unclear what simplicity would mean in the context of P&P. Unlike in rule-based theories, in P&P theory, the core parts of a grammar are represented as a fixed-length sequence of parameter values; being of fixed length, no notion of notational simplicity is available. More recent family-related syntactic theories have reduced ?parameter setting? to selection of lexical items, and the lexicon is of arbitrary length no matter what the theory; but in the case of adding a lexi- cal item which has not been attested (such as pro, given data from the subset language), 57 intuitive simplicity (fewer lexical items) coincides with restrictiveness. Optimality Theory is another linguistic theory in which the principal issue in con- straining acquisition has been taken to be enforcing restrictiveness. There is always a possible analysis of any phonological system in which the lexicon is simply a record of the surface forms, and the grammar is trivial; but real grammars are thought to generalize somewhat. The analysis needs to be restrictive in order to prevent the grammar from be- ing general in a way that generates impossible surface forms for items not in the lexicon. This subset problem arises under any theory, but it attracted attention in OT because of the ?Richness of the Base? theory, under which the learner cannot posit the language-specific morpheme structure constraints (restrictions on possible lexical items) which are often appealed to to block some of these cases of overgeneration. Simplicity was once again sidelined, for the same reason as in P&P theory: target OT grammars consist of a total order over the universal set of constraints; thus, on the face of it, every grammar has the same size (the cardinality of the universal constraint set), and notational simplicity has no obvious place in the grammar. As previously alluded to, however, simplicity is only displaced in these theories, not irrelevant. In both cases, the lexicon must be learned, and the choice of a set of stored forms, and thus the choice of a more or less compact set of stored forms, has profound consequences for the rest of the grammar simply because the two parts of the analysis depend upon each other. In syntactic theory which has developed since P&P, parameter settings are under- stood as lexical choices (Chomsky 1995), and the acquisition of the lexicon is therefore the only kind of learning there is. For example, Bobaljik & Jonas 1996 reduce a complex 58 set of facts about Germanic languages to a single parametric difference in the possibility of the Tense head having a specifier position (differences in the availability of object shift, verb raising, and ?transitive expletive? sentences with both an expletive and an overt sub- ject, such as the grammatical Icelandic equivalent of There have many Christmas trolls eaten pudding). For Bobaljik and Jonas, this parameter is encoded as the presence or ab- sence of a strong Determiner (that is, NP movement-inducing) feature on the Tense head. If this grammatical choice is treated as the presence or absence in the lexicon of a Tense head with a strong D feature, or as the optional presence or absence (via underspecifica- tion) of such a feature, then it is without question a simpler grammar that does not allow a strong Determiner feature on Tense. Since choices about the contents of the lexicon are all that can vary across grammars in this theory, simplicity could hardly be more relevant. Similarly, although Hale and Reiss argue for restrictive learning of phonological alternations, there are many cases in which generalization beyond the data obviously hap- pens, a choice which under many theories would constitute a notationally simpler gram- mar. For example, English speakers extend the voicing alternation seen in [l?fs] versus [dz] to nonnative (thus unseen) segments such as [x], as in [bxs] (Bachs). One possible ac- count of such facts is to simply rule out the more restrictive grammar a priori: if there is no constraint in the universal constraint set that can mark disagreement in voicing for [f], [], and so on, as more problematic than disagreement in voicing for [x], then there is no way of encoding the unattested subset grammar, and the choice disappears. A more involved solution is proposed by Hale and Reiss, who similarly claim that universal feature theory rules out a single feature matrix specifying an environment with [f] and [] but not [x]. They, however, must rule out the possibility of specifying such an environment using a disjunc- 59 tion of feature matrices, the availability of which is generally understood to be necessary for descriptive adequacy in rule-based theory, and they rule this out on the grounds that the learner collapses such disjunctions into a single feature matrix wherever possible (see above). This suggests that these disjunctive environments are excluded or dispreferred on what are intuitively simplicity-driven grounds. Empirical evidence supporting overgener- alization in phonology, whether in early acquisition or in historical change, has often been explained by appealing to a bias toward simpler grammars (Kiparsky 1971, Bach& Harms 1972; see also Goro 2007 for a case of apparent overgeneralization in scope acquisition in Japanese).7 Even in Optimality Theory, there has always been a suggestion that the learner has the capacity to expand the set of applicable constraints in some way, and recent computa- tional models have attempted to induce phonotactic constraints within OT-type grammat- ical frameworks, in particular, Hayes & Wilson 2008, Adriaans & Kager 2010. As soon as the grammar becomes variable in length in this way, questions about the value of sim- plicity become quite relevant, and indeed both of these constraint induction systems rank constraints for the purposes of acquisition by their generality, which happens to be quite similar in effect to notational simplicity given the way these learners compute it. Finally, although overt appeals to simplicity biases are somewhat out of fashion, implicit appeals to simplicity biases are still ubiquitous in linguistic theory: virtually every analysis of any pattern will be guided by the principle that it is better to collapse two 7Morphological examples suggesting historical development in the opposite direction, towards greater notational complexity, were cited by Kiparsky 1985; the interesting speculation that these cases might be explained by learners systematically weighting certain parts of the primary linguistic data over others was put forth by Lahiri & Dresher 1984, who presented evidence that learners sometimes pay special attention to certain forms over others in driving grammatical changes. It would be interesting to develop this idea quantitatively in terms of the influence of the likelihood function. 60 patterns into one wherever possible. For example, Zimmermann 2002 gives the following completely run-of-the-mill description of some facts about German and English: in some languages, distance distributives, like the ?each? in The boys have bought two sausages each, or ?jeweils? in the German translation Die Jungen haben jeweils zwei W?rstchen gekauft, can distribute only over individuals; in others, they can distribute over events. This refers to the following two interpretations for the sausage sentences: (28) ?Each of the boys has bought two sausages? (29) ?The boys have bought two sausages each time? (say, each time that they went to the butcher) The first interpretation is always available, but the other one is possible in some languages (like German) and impossible in others (like English and Dutch). Zimmermann?s thesis is entirely devoted to an account of these distance distributives, and centers around the issue of how it is that jeweils can be interpreted in two different ways in German (but not in English and Dutch). However, such an analysis, and, in fact, even the descriptive general- ization formulated above, presupposes that jeweils is the same lexical item in both cases; if the two meanings are actually due to two different words that are simply accidentally homophonous, then we already have an answer to this question. How to analyze the two obviously family-related meanings is still an interesting question, but there ceases to be any demand for a single meaning common to both. This kind of assumption, that words that look the same are the same, is a fundamen- tal tenet of linguistic analysis; any time there is a way to account for the same facts using an ?uninteresting? appeal to an alternate, differently-behaving lexical item, most linguists 61 will choose to do otherwise. This is why we choose to account for phonological pat- terns rather than adding idiosyncratic lexical markings each time we observe a morpheme behaving differently, or, even more radically still, abandoning morphological analysis en- tirely (although it so happens that simple wug-test experiments tell us that grammars also choose to account for phonological patterns). The idea is that positing one lexical item is a better analysis than positing two. It is crucial to understand that this is not an application of Occam?s Razor, which says to the analyst that ?theoretical entities are not to be multiplied beyond necessity.? Occam?s Razor, in its normal interpretation, is a principle for the scientist, which guides the construction of theories. It says that simpler scientific theories are better. For example, Jackendoff 1977 proposed to do away with general phrase structure rules of the form A!B C, because there actually seemed to be systematic relations between the labels on parent nodes and their children (an NP will always contain an N and so on). Jackendoff?s revision, X-theory, caught on, presumably because it was a simpler theory: fewer phrase structure rules are possible under that theory than were under the previous theory. But the tenet of ?no homophony,? and the more general tenet of ?fewer lexical en- tries,? is not a guide towards simpler linguistic theories; it is a guide towards simpler linguistic analyses, and alternate analyses are not alternate theories about how the linguis- tic system works, but rather conjectures about of which grammar a learner has posited, and a grammar is a ?theory? of a different sort (the internal mental model that has been ?theorized? by the learner to account for the language). When we say that one analysis is impossible or implausible because it is more complex, then we are attributing a choice to the learner; it is not our choice to make, but rather an empirical hypothesis about what 62 choices the language acquisition device makes. The ubiquity of such reasoning in linguis- tic theory represents an implicit scientific claim that demands justification. 2.5 Bayesian Occam?s Razor 2.5.1 Maximum likelihood and restrictiveness This section is about statistics. It ultimately introduces a law of inference called the Bayesian Occam?s Razor, which has to do with simplicity, but in this subsection a different law is presented, which has to do with restrictiveness. The goal is to properly understand both the similarities and the differences between these two inference effects once both have been presented. To relieve the burden of the long explanations of statistics that follow, I begin with an example. Suppose we are given a collection of observations, presented here as a histogram. Which of the three curves in Figure 2.1 is a better model for this data? x Fr eq ue nc y -2 -1 0 1 2 0 1 2 3 4 5 6 x Fr eq ue nc y -2 -1 0 1 2 0 1 2 3 4 5 6 x Fr eq ue nc y -2 -1 0 1 2 0 1 2 3 4 5 6 Figure 2.1: Three Gaussian models for some data. The one in the middle is. Why? First we must make sure we have some familiarity with what these curves mean: they are in a sense pictures of normal or Gaussian probability 63 distributions, and in particular they are density functions,8 which means that the height of the curve is directly related to how probable each of these three models says that various different observations are. All three curves predict that the actually observed data are more probable than the data that was not observed?this is good. But the curve on the left is so tall in the middle that the observations to the two sides are predicted to be somewhat less probable than under the second curve; and the curve on the right is so broad that the observations in the middle are predicted to be a lot less probable than under the second curve. The second curve is just right. This choice follows from something called the maximum likelihood principle. The maximum likelihood principle implies that absence of evidence is evidence of absence. The reason is scarcity. We only have a finite amount of credence, and so we must cling to it and not give it away for free to observations that never showed up?and we do not get points for putting our credence in possible observations, only for actual observations. Although this sounds silly, it is exactly why the maximum likelihood principle says that the curve in the middle is the best. In particular, the likelihood under each of these three hypotheses is the probability of the data given that hypothesis, Pr[X jH].9 The rule for evaluating the likelihood given a collection of data points in a case like this is that we multiply together the probabilities that we get back from the likelihood for each of the 8Except that they have been rescaled. This does not matter?the alignment of the outputs of the density function with the absolute number of observations matters not at all for any evaluation; only the relative values matter. But this rescaling makes the curves look like nice ?models? for the data, on the grounds that the pictures look the same. 9This is simplifying slightly, which will continue to be the case as we continue to ignore the usual use of the density function to compute likelihoods in the continuous case, and not the associated probability measure. For this statement to be actually true, we would need to be assuming that the observations are actually observations of some narrow range of the real line, not single points. Points have probability zero. As this is probably the sensible thing to assume if we think about it long enough anyway, and after all this is what the picture shows, (although it is obviously not what it is supposed to mean), assume this. 64 data points. So, the higher the individual probabilities according to the curve, the higher the likelihood. But, since it is a probability measure, the likelihood function, integrated over the whole real number line of possible observations, needs to integrate to one: the area under the curve needs to be one. (For presentational purposes, these curves are not properly scaled to integrate to one, but they do all integrate to the same constant.) This is because the area under the curve in any region of the number line represents the probability of such an observation, and the area under the whole curve represents the probability of any number at all, which is of course the maximum probability?one?because if we see a number, it must fall into the range ?any number at all.? So, if we add some height from the curve in one place, then we must remove some elsewhere to get the area, or ?probability mass,? to balance out along the number line. Adding to the middle takes away from the sides; adding to the sides takes away from the middle. The sweet spot is the maximum likelihood model. It is one in which we give more credence to things that happened more, and (because of the way probability works) give less credence to things that happened less. This is a generalization of the principle of restrictiveness to the case where we have a gradient notion of ?consistency with the data?: what restrictiveness says is ?pick the model that predicts the observed data and as few other logically possible occurrences as you can?; what maximum likelihood says is ?pick the model that gives the highest probability possible to the data, and (in so doing) the lowest probability possible to other logically possible occurrences.? A relation between maximum likelihood and restrictiveness principles has been pointed out before (Collet, Galves & Lopes 1995; Jarosz 2006; Foraker, Regier, Khetarpal, 65 Perfors & Tenenbaum 2009). The value of the likelihood will be related to the ?goodness? of fit (although it will not necessarily determine it) in virtually any probabilistic inference scheme, so that the mere use of probability to do inference will not only predict restric- tiveness, but explain it (in the sense that it will come automatically). Simplicity, on the other hand, is something rather different. It comes for free with Bayesian inference, but not necessarily with other ways of doing probabilistic inference. It is worth correcting one mistake in Tenenbaum 1999 relating to simplicity, restrictiveness, and the Bayesian Occam?s Razor, before we spend the rest of this section introducing the Bayesian Occam?s Razor alone. Tenenbaum introduces something called the size principle in the context of a simple Bayesian framework for concept learning, where ?concepts? are just collections of integers. If the choice for the learner, unlike in the example above, is between different finite sets of integers, rather than different normal distributions, then the likelihood function will change, but the same principles of maximum likelihood will apply. In particular, if the likelihood function simply says expect any of the numbers that belong to the concept equally, then, multiplying the function through once for each of our observations, we get this composite likelihood for N data points observed under the assumption of a concept H with Size(H) integers in it: (30) Pr [X jH] = 1 Size(H) N Just as with the normal distributions, it is better to pick concepts that do not predict unobserved elements to satisfy maximum likelihood. Thus, if there are different possible concepts that all predict the data should be at least possible, but some are larger than 66 others, then we should pick smaller concepts: here, restrictiveness aligns with simplicity. But this is a coincidence. It is not the same thing, because, in general, restrictiveness does not align with simplicity. Tenenbaum states that the size principle is the same as the Bayesian Occam?s Razor discussed by MacKay 2003; but the Bayesian Occam?s Razor is not due to the likelihood alone, and in fact, it can be thought of as a kind of general law about the behavior of prior distributions. If one were to add a prior of the sort that would give rise to a Bayesian Occam?s Razor effect (we will elaborate on what that would look like later, but in this case an example would be most priors over sets of multiple concepts, not all containingn the same number of concepts) then in fact both laws of inference would separately encourage smaller concepts in Tenenbaum?s framework. 2.5.2 Model evaluation in statistics The Bayesian Occam?s Razor arises in hierarchical Bayesian inference, which is best understood by looking at a common hierarchical practice called model evaluation. Model evaluation, in turn, is best understood in contrast with parameter estimation. Pa- rameter estimation is the most intuitive operation in statistical inference. It is what we are doing with any procedure that takes a collection of data as input and returns a model for that same data. Statistics students are almost always given this concept first, whether they realize it or not. The first exercise in almost every statistics class is estimating the location parameter of a normal distribution. This usually goes under the unfortunate label of ?estimating a population mean??a presentational move which is an attempt to go from a concept which is meaningless to students, ?location of a normal distribution,? to one that 67 is slightly more familiar because it includes the word ?mean,? which students understand as ?average.? This presentation is problematic because ?finding the average? is a very misleading way for students to understand what it is they are doing in the example. To properly understand, students need to see the example from the point of view of scientific reasoning, not a mathematical operation. In what follows I explain parameter estimation using this example; then I contrast parameter estimation with model evaluation. Tineke 175 Ineke 183 Anneke 178 Aaltje 192 Marietje 203 Catharijntje 180 Willemijntje 188 Leentje 183 Maaike 192 Astrid 177 Table 2.1: Made-up heights of ten adult Dutch females in centimetres. Suppose that the heights of ten adult Dutch females are collected: see Table 2.1. The student is told that one could easily assume that this kind of data ?follows a normal distribution?; and that it turns out a good way to estimate this normal distribution?s mean (which is called the ?population mean?) is just to compute the arithmetic mean of these ten numbers (185). For the student to properly understand this, the link between the ten numbers we are given and the function that yields the bell-shaped curve?called the nor- mal density function, but let us simply call it f?first needs to be made crystal clear. What we are attempting to do in using a parametric distribution is to make statements that go beyond the data. The student is being told that there is a number m that can be filled in which will fix the function f at some particular point on the number line; among other 68 things, it is a good guess at where the maximum of the curve might be placed on the line. Here is that function for concreteness: (31) f (x) = 1 p 2 p s e (x m )2 2 s 2 The function f is a function of a height x that can help us go beyond the ten mea- surements, to do two related things: calculate how many times more frequent one height should be than another (the ratio f (x1)f (x2) ); and calculate how frequent a given range of heights should be (the area under the curve from x? to x> gives a number that can be read off as a percentage). The idea behind the word ?should? is that the normal distribution is giving us predictions, which we understand as being somehow hypothetical; or, at least, the normal distribution, including the answers we get out of it, if not hypothetical or theoretical, has some ontological status quite different from our ten data points, or in fact any heights we could actually observe. The student has then been told that the average?which seems quite concrete, some sort of numerical summary of the observed heights?is a good substitute or estimate for m , although it is not really m . Students generally stop understanding around this point, because they have been told nothing of any substance about what the function f actually is supposed to mean; but assigning it some interpretation is crucial for understanding the difference between parameter estimation and model evaluation. For example, one interpretation of f and its fixing-point, or location, m is as empir- ical quantities that cannot be known directly. Perhaps if we looked at the Dutch genome, we could do some calculations that would tell us what the exact adult height of a female 69 would be if the developmental program unfolded under some ideal circumstances; then we could list and model all the circumstances that influence development, and they would turn out to satisfy the requirements for the result following a normal distribution.10 That would give us another number, which is called s and tells us the exact shape of the bell curve, which also needs to be inserted before we can use the function f . Even though the numbers m and s are derived quantities, they are still in this scenario ultimately real quantities about which we can in principle be right or wrong, because the Dutch genome is real and therefore has real propensities to behave in certain ways. This is one answer that we could give the student: the location of the normal distribution is of interest because it is some actual physical quantity. This whole idea of interpreting the parameters objectively is itself problematic, how- ever, if we do not believe in such a thing as ?the Dutch population??suppose there is no firm genetic boundary delimiting Dutchness even at a given instant in history. Then no distribution, normal or otherwise, can ever represent an objective law of (Dutch) nature, knowable or not, because there is no such objective thing as ?Dutch nature,? which in this case means there nothing that Dutch women actually have in common that makes them Dutch. The question is why, in the face of this knowledge, one would continue to study the Dutch at all, and the answer is simply to take the idealization one step further: if one imag- ined that there were such a thing as ?Dutch,? here is what it might be; for any given point in time, one can look at some conveniently defined finite population of Germanic-blooded people (say, the ?Dotch? people) and construct a way of translating between this mathe- 10Put aside that the Central Limit Theorem is an asymptotic result, and the fact that the distribution is surely truncated well above zero; the point is that there would be some distribution that could be computed if we had all these facts about the world. 70 matical fantasy and the statistical reality of this population. It may be easier to model the Dutch than to pay close attention to the Dotch. The imaginary facts about the imaginary Dutch would then be a purely instrumental, as opposed to a realist, model of the actual Dotch: similar enough to something real that they are useful in approximating proportions and relative frequencies. In either case, once we assign the normal distribution some interpretation, the point of the exercise of parameter estimation becomes quite intuitively clear. There is a quantity, somewhere in some corner of the actual world, or of the world of ideas, that is supposed to help us to understand something, to give us a model of something real; but we want do not know what that number is, so we must make a guess, informed by some observations. The concept of model evaluation is very different. In an experiment, we may have two conditions in which we measure something like response time. The process of finding the means of our response times, and then deriving the difference between the two means, or then an ?effect size? summarizing this difference in a standardized way, could be seen as parameter estimation, once we introduce the idea that the reaction times have some ideal- ized distribution that can be interpreted as a model of what is going on in our experimental subjects? behaviour, instrumental or realist. (Until we introduce such a model it is just a convenient numerical summary.) However, we may then wish to do a significance test, in- volving something called a p-value, to see if the two conditions are ?really? different; this is not an act of parameter estimation, but something else. Or we may have a more com- plex experimental design, with multiple ?factors,? each of which we have incorporated into our experimental materials in a particular way in order to test a more complicated hypothesis?say the factors are the frequency of a word, its length, and its morphological 71 status, simple or complex, and we have set up a response-time experiment to investigate if there is really any cognitive reality to morphological status. We start by ?fitting a lin- ear model? in which there are different numerical effects of different factors, and of their interaction. This is parameter estimation?and again, we can interpret the model as we see fit in order to understand what the resulting estimates represent. Then, however, we want to ask something about whether there is ?really? an effect of morphological status; there is a number in the result of the model fit which represents this effect quantitatively, but it is not enough to look at it and see if it is exactly zero. We must do something else. There are also significance tests here, but, to illustrate the range of possibilities, there is also another way of reasoning about similar issues, which is something that often goes un- der the heading of model comparison?fit the model with and without the formal term for morphological status, and then compute a comparison statistic and see if it crosses some threshhold (one familiar to experimenters will be the Bayesian Information Criterion, or BIC, ratio). Although significance testing and model comparison have results that need to be interpreted in very different ways, we can still get the idea that these are sorts of meta-level procedures, which we collectively call model evaluation tools. The general idea is that our parameter estimation procedure only gives reasonable results contingent on some set of modelling assumptions, but part of what we are inter- ested in is actually those assumptions themselves. One strange consequence of this is that there will often be two different ways of asking what seems like the same question: are the two conditions in our experiment the same or not? We could see if the parameter es- timates come out exactly the same in the two different conditions, or we could compute a p-value and see if it falls below a standard threshhold like 0:05; we would in general get 72 two different answers. With the right concrete interpretation of our parameter estimation, this can actually make perfect sense: if we are looking at Dutch and German people, we might want to assume they are two different groups and infer something about the ?ideal height? implied by their respective genomes (which could still be accidentally identical), or we might want to infer something about whether they are actually two different groups genetically, which then allows for two ?discrepant? parameter estimates?the genomes are different as regards the implied ideal height, but if we were to simply assume they were different, we might find the best parameter estimates come out the same; or, they are not different, although our limited data would lead us to believe otherwise (this is the case we usually try to rule out in statistical tests). The gap between the parameter estimates and the true values opens up the possibility of either type of discrepancy; and the possibility that the two groups can be different in some way that we do not have information about and still be accidentally identical with respect to the property the parameter represents opens up another avenue to the first type. Thus it makes intuitive sense that we would want to find some way of evaluating different sets of modelling assumptions statistically. This is how parameter estimation differs from model evaluation. Some terminology: the objects of inference in parameter estimation are parameter values or model instantiations; the objects of inference in model evaluation are competing model frameworks. 73 2.5.3 Bayesian inference and model evaluation Bayesian inference uses the posterior distribution over parameter values to assess the ?value? of those parameter values in the sense discussed earlier. We can talk about this in terms of evaluation measures, and, as discussed above, this lets us derive a relation between the prior evaluation measure and the posterior evaluation measure once we are given the likelihood function, using Bayes? Rule. Remembering that in the context of learning the likelihood function takes grammar?data set pairs and gives back a ?degree of consistency,? we can make this concrete, looking at the example with the Dutch women in this light?swapping out ?grammar? for ?model,? and moving temporarily from the realm of cognition to scientific inference. We would say that the normal density function, seen as a function now of both x and m ?imagine that s is known in advance?is the likelihood function; or, if we knew ten heights, as we did above, we could construct a function of m values paired with sequences of observations, x1; : : : ;x10, by simply multiplying together the normal density function 10 times, once for each height xi. In either case, the likelihood function would be telling us how well some guess at the value of m , whatever we need it to represent for our purposes, would ?fit,? or be consistent with, the observations. To do Bayesian inference, we then need to pick a prior measure?in this case some probability distribution that takes as input m only. In the scientific context this is harder to understand than in the cognition context, but, in either case, it will represent a bias in the general sense. To do parameter estimation using Bayesian inference, one simply decides on a way of using the posterior that is derived from of the likelihood plus the prior to pick the ?best? value of m ?very often the one with the highest posterior probability, but sometimes other 74 things, like the mean value of m according to the posterior. Model evaluation, under the Bayesian approach, is simply an extension of parameter estimation. The Bayesian approach to model evaluation is to add an additional parameter? call it w ?to code for different model frameworks. To distinguish w from more ?basic? parameters like m , it is often called a hyperparameter; however, the same Bayesian meth- ods are used to compare values of w as would be used for any other parameters. To take an example, consider now a common type of statistical model, alluded to briefly above: the linear model. We discussed above the hypothetical example of an ex- periment in which we manipulate word length, frequency, and morphological complexity, and measure response time as a dependent variable. A common and very simple type of parameter estimation for this data?first without the hyperparameter?would be one that operates under the assumption of this model: (32) y e = b 0 + b 1x1 + b 2x2 + b 3x3 Here, y is the dependent variable, and x1, x2, and x3 are the three independent vari- ables; they might be binary, representing different conditions, (coded in some carefully chosen but arbitrary way, like 0?1), or they might be real-valued?for our purposes it does not matter. The values b 0 and b 1?b 3 are the parameters to be estimated, which in this case are called the regression coefficients. The idea in a linear model is that if we could subtract some random error, e , from each observation, we would see the perfect linear relation between independent and (corrected) dependent variables given by the equation. The error e follows a normal distribution with location zero in the standard linear model, 75 which is mathematically convenient and gives certain standard procedures for parameter estimation interesting alternate interpretations (in particular, the ?maximum likelihood? procedure also has the ?minimize the distance to the line? interpretation). Bayesian ap- proaches compute a posterior distribution over the set of possible parameters, in this case the set of real-valued quadruplets hb 0; b 1; b 2; b 3i. To keep things very simple, suppose that we can get away with only considering the effects of word length and morphological complexity, and that each of the influences on the response time can either be exactly numerically equal to the independent variable, or else have no effect at all. In other words, consider this simpler model in which we have removed the term b 0 (the intercept) and one of the coefficients (the one for frequency, by whatever arbitrary convention we use to label the independent variables): (33) y e = b 1x1 + b 2x2 By whatwe have said we have then restricted the parameters (a pair, or two-dimensional parameter vector, hb 1; b 2i) to the set Q= fh0;0i ;h1;0i ;h0;1i ;h1;1ig. The most intuitive way of computing the posterior is to move through all four possibilities. The likelihood would be computed, for each possibility, by filling in the values y, x1, and x2 associated with a particular observation, rearranging and subtracting to obtain e , then applying the normal density function. (Again we are ignoring the parameter s that we would need to actually do this evaluation, but in principle we could evaluate this in a Bayesian way, too, by moving to a set of three-dimensional parameter vectors, with a third for s .) To extend this to multiple data points, as alluded to before, we would simply multiply all the values 76 together for a given hypothetical parameter vector hb 1; b 2i. The validity of this move re- quires the assumption that the data points are ?independent? of each other in a technical sense, and only that assumption; again, this will be elaborated on further below. To derive the posterior, we then multiply in the prior. Letting L(X ; b 1; b 2) be the likelihood as just explained, the posterior for, say, h1;1i is: (34) L(X ;1;1) Pr[h1;1i] hb 1;b 2i2QL(X ; b 1; b 2) Pr [hb 1; b 2i] If there were only one observation, say, an observation of 500 milliseconds (y= 500) as a response time, for a word of length 5 (x1 = 5) in the morphologically complex word condition (x2 = 1), then we would get e = 500 5 1 = 494. The likelihood would be computed using the normal density function, applying it to the number 494. (In addition to the fact that we need to know s to do this, there is also the fact that response times to words?500 ms is a reasonable response time in the ?lexical decision? task?do not appear to be on the same scale as either of our two dependent variables, which one would think ought not to make any difference to whether a model can be made to fit the data; it does in this toy example, and that is why we never actually use this hypothesis space where effects can only be exactly one and there is no intercept term.) In the denominator, we need to sum over all values of this quantity for all possible values of hb 1; b 2i. Suppose the prior were uniform: 14 for each of the four parameter values, which means no a priori bias to any of them. Then we would see that the prior term cancels, and we get the following, 77 where f is the normal density function: (35) f (494) f (500)+ f (499)+ f (495)+ f (494) The posterior probability of h1;1i would be this fraction, and for any other parameter value we can do the same. For the sake of having some numbers, we can plug in s = 100 and compute, and we find that we have a change in the subjective probabilities of some ideal observer?from a uniform distribution with Pr[ ] = 0:25 across the board?to the distribution h0;0i : 0:21, h0;1i : 0:22, h1;0i : 0:27, h1;1i : 0:29. This updating is the basic operation we use to do parameter estimation using Bayesian inference. We can then introduce our hyperparameter w . Remember that model evaluation compares different modelling assumptions. Here we can let w = 1 stand for the model we just examined; in principle w = 2 could be anything at all, but suppose it stood for this interesting model: (36) y e = b 1x1 How do we connect this with Bayesian inference? The idea is to see the selection of models as a restriction on the parameter values. In this case, the restriction is mathe- matically equivalent to setting b 2 to be necessarily zero. For the probabilistic agent, what this means is that, if w = 2, Pr[h0;0i] = Pr[h1;0i] = 0, but if w = 1, this is not the case, and there is some non-zero probability assigned to the parameter vectors with b 2 = 0: the prior on hb 1; b 2i changes depending on the value of w , and changing the prior can have 78 the same effect as making the inference restricted to a particular subset of the hypothe- sis space, by making use of the minimum value for probabilities, 0. No matter what our model comparison, we can always find a way of contriving to make our model comparison a choice of different priors in this way. The formal apparatus is as follows: first, the question we are asking is now different. The posterior of interest is Pr[w = 1jX ], Pr[w = 2jX ]. The prior distribution on hb 1; b 2i we were discussing before is not the prior anymore; it is a conditional probability distri- bution, which will be of use to us, but in a different way. Although we used the uniform distribution as an example, it could have been anything?the point is that this distribu- tion is now treated as Pr [hb 1; b 2i jw = 1], the ?conditional probability of hb 1; b 2i, given w = 1.? The distribution Pr [hb 1; b 2i jw = 2] could also be anything, although in our case it needs to be some distribution with Pr[h0;0i jw = 2] = Pr[h1;0i jw = 2] = 0. This can be used to compute the posterior. Using Bayes Rule to compute the posterior for w = 1 requires that we find a prob- ability distribution proportional to the following: (37) Pr [X jw = 1]Pr [w = 1] The law of total probability (which actually follows from the basic axioms of prob- ability theory) allows us to expand this in terms of hb 1; b 2i. It says (in one simple form): given some conditional probability distribution Pr [AjB], partition the whole space of events B that we can condition on in some way, say fB1; : : : ;BKg. Pr[A] = Ki=1Pr[AjBi]Pr[Bi]. 79 This gives us a way to expand Pr[X jw = 1]: Pr[X jw = 1] = hb 1;b 2i2Q Pr[X j hb 1; b 2i & w = 1] Pr [hb 1; b 2i jw = 1](38) Notice that the distributions we employ remain conditional on w = 1; this is just because the law of total probability applies equally to conditional distributions as to any other. The probability of the data (the likelihood) is mediated only by the model instantia- tion, however, and not the model framework. Thus we can drop that part of the condition, and for the whole expression, we obtain: Pr [X jw = 1] = hb 1;b 2i2Q Pr[X j hb 1; b 2i] Pr [hb 1; b 2i jw = 1](39) What acts as the likelihood is actually an average of the likelihood values for each of the possible parameter vectors, weighted not equally but by the conditional prior prob- ability of each of those parameter vectors. Because we are governed by the law of total probability, we have no choice but to derive the likelihood for w = 1 from the individual likelihoods in this way (which also happens to be quite convenient). To see this again, let us change to a more readable notation: we will write a subscripted l for the likelihood given some data set, where the subscript tells us what the parameter vector we are assum- ing is?so that l 0;0 is the likelihood for some data given the model hb 1; b 2i = h0;0i, for example; and for the conditional prior on hb 1; b 2i we will write p, similarly subscripted, and with a superscript to indicate the model framework, so that p(1)0;0 = Pr [h0;0i jw = 1]. 80 What we are saying is that the term Pr [X jw = 1] expands as follows: (40) l 0;0p(1)0;0 + l 1;0p (1) 1;0 + l 0;1p (1) 0;1 + l 1;1p (1) 1;1 This is what makes model evaluation possible in a Bayesian setting: we add a layer to the inference, another variable upon whose value the model instantiation is contingent? not determined by it, but influenced by it, via a change in its prior distribution. What we have is a hierarchical model, with the model framework w occupying a higher ?level? than the model instantiation hb 1; b 2i. We will finish this section by using this setup as a concrete example of the law of inference we want to derive: prefer simpler model framework?the Bayesian Occam?s Razor. Consider comparing the two model frameworks by taking the ratio of the two pos- terior values, Pr [w = 1jX ], Pr [w = 2jX ]. We could easily check to see how many times more a posteriori probable one framework was than another in this way. The prior Pr[w = 1], Pr[w = 2] is some bias over model frameworks, but suppose the prior on w is uniform. Then it will cancel, and we will be left with this ratio: (41) l 0;0p(2)0;0 + l 1;0p (2) 1;0 + l 0;1p (2) 0;1 + l 1;1p (2) 1;1 l 0;0p(1)0;0 + l 1;0p (1) 1;0 + l 0;1p (1) 0;1 + l 1;1p (1) 1;1 Now, what we know about our restriction in model w = 2 is that it has the effect of removing two terms: (42) l 0;0p(2)0;0 + l 1;0p (2) 1;0 l 0;0p(1)0;0 + l 1;0p (1) 1;0 + l 0;1p (1) 0;1 + l 1;1p (1) 1;1 81 Here is the key intuition which we will carry throughout the chapter: probability distributions, including each of the two individual conditional prior distributions here, must sum to one, and so the prior probabilities in the numerator are ?compressed? with respect to those on the bottom. That is, p(1)0;0 + p (1) 1;0 = 1, but p (2) 0;0 + p (2) 1;0 1. The idea is to leverage this fact in such a way as to make it guaranteed that the model framework in the numerator?the simpler model, the one with the more restricted subset of the hy- pothesis space?takes the same likelihood values, l 0;0 and l 1;0, and weights them by larger numbers than the more complex model framework in the denominator. Then we can rightly say that the simpler model will be necessarily preferred?will have higher posterior probability?all things being equal (what exactly this means will be spelled out momentarily). It turns out that all we need to do is to assume that the change of model framework leaves the relative preferences for different parameter values intact. That is to say, over some shared subset of Q = fh0;0i ;h1;0i ;h0;1i ;h1;1ig, the model frameworks at least agree on how many times more a priori likeli one parameter vector is than another. In this case, we want the two model frameworks to give the same ratio of conditional prior probabilities for the two ?restricted? values, so p(2)0;0 . p(2)1;0 = p (1) 0;0 . p(1)1;0 ; and this, by a bit of algebraic manipulation, is equivalent to saying that p(2)0;0 and p (2) 1;0 are equal to the corresponding values p(1)0;0 and p (1) 1;0, multiplied by 1 p(1)0;0+p (1) 1;0 . The ratio becomes: (43) 1 p(1)0;0 + p (1) 1;0 l 0;0p(1)0;0 + l 1;0p (1) 1;0 l 0;0p(1)0;0 + l 1;0p (1) 1;0 + l 0;1p (1) 0;1 + l 1;1p (1) 1;1 The two conditional prior terms in the numerator are identical to the first two terms 82 of the denominator, up to a scaling factor; the scaling factor is the ratio at left, and, notably, it is necessarily no less than one (because the bottom of the ratio is no more than one). This means that this scaling factor increases the preference for the reduced model; this is the essence of the Bayesian Occam?s Razor. Prior probability distributions, like any prob- ability distributions, must work with the same, finite amount of ?probability mass?; less restricted distributions must spread the probability mass around, while for more restricted distributions, the mass is more concentrated. The scaling factor can be seen as the degree to which the probability is concentrated, and because of the laws of probability theory, the scaling factor must always act to increase the a priori preference for the simpler model. It is not guaranteed that the simpler model will be preferred; the scaling factor is contributed only by the model instantiations shared by the two model frameworks, and so if those parameter vectors do not fit the data very well, then the scaling factor will need to compete with the strong preference for the more complex model brought on by the right-hand ratio. The less the additional parameters made available under the complex model contribute to the value in the denominator, the more the Bayesian Occam?s Razor effect will be seen. This contribution will be small if the additional parameters are a priori considered highly implausible, or they yield poor-fitting model instantiations compared to the narrower set of parameter values. Whether the reduced model will be preferred depends on precisely how large the scaling factor is, and precisely how much the additional parameters contribute to the evaluation. This is the powerful sense in which, all things being equal, Bayesian model evaluation will prefer the simpler model. Here is what we have done: we have constructed an example of Bayesian model evaluation, and we have shown that this construction leads to a bias?a preference in the 83 prior distribution, broadly construed?for simpler models. There is an intuition that this construction is no accident; what got us the result was simply that we found a way to lever- age the fact that, for the simpler model, the prior probability is more highly concentrated on certain values. Although we have not proven it yet, this Bayesian Occam?s Razor is general enough to be considered a law of inference. In the next section, we spell out the details of the law, including the conditions under which it holds. 2.5.4 Conditions for a Bayesian Occam?s Razor We now give a strong condition under which a BOR will emerge which substan- tially generalizes the individual examples of the BOR presented in Jaynes 2003, MacKay 2003. Although it is possible to see analogous effects under more general circumstances, the condition represents a pair of model frameworks which alter the prior bias as little as possible, a property which can be seen as an optimality condition on the evaluation measure. (44) Framework consistency principle (FCP): For any pair of model framework parameter values w 1, w 2, the prior distributions to which they correspond must be similar over some subset of their respective parameter spaces. (45) Framework consistency principle + data (FCPD): For any pair of model framework parameter values w 1, w 2, the prior distributions to which they correspond must be similar over some subset of their respective parameter spaces with respect to the likelihood function. 84 Intuitively: two model frameworks may define radically different types of models; how- ever, they satisfy the FCP if there is some subset of these model instantiations which are similar across the two frameworks. ?Similar? will mean that the relative prior probabili- ties of all possible parameter values, or sets of parameter values, remain the same across the two models?that is, within the similar set, the biases for one model instantiation over another are the same. ?Similar with respect to a likelihood function? will add the addi- tional clause that this subset of the models is the same across the two models with respect to how each of the model instantiations treats the given set of data. In a simple case, such as the one we considered in the previous section, the two sets of parameters will be identical across the two model frameworks. In this case, a very simple version of similarity is: (46) Similarity (preliminary): Two probability distributions Pr1 and Pr2 over the common parameter space Q are similar over some A Q if, for all S A, Pr1 [SjA] = Pr2 [SjA]. Assuming that the likelihood is dependent only on q , and not w , (capturing the fundamen- tal intuition behind model evaluation), this is sufficient to satisfy both descriptions above. In particular, this says that the conditional probability given membership in the similar set A is the same across the two distributions, where this is defined as Pr1 [S]=Pr1 [A] (simi- larly for Pr2). This is the only way to satisfy the condition that the relative probabilities remain the same.11 11The only reasonable relative-difference condition is that Pr1 [S]=Pr1 [T ] = Pr2 [S]=Pr2 [T ], which in- cludes the case where T = A. 85 Generalizing this to the case where the two parameter spaces,Q1 andQ2 are distinct, we obtain: (47) Similarity: Two prior distributions Pr1 and Pr2 over Q1, Q2 are similar for some A1 Q1, A2 Q2 if there is a bijection f : A1! A2 such that, for all S A1, Pr1 [SjA1] = Pr2 [ f (S)jA2]. (48) Similarity with respect to a likelihood: Two prior distributions Pr1 and Pr2 over Q1, Q2 are similar for some A1 Q1, A2 Q2 with respect to a fixed likelihood l (X j ), if there is a bijection f : A1! A2 such that, for all S A1, Pr1 [SjA1] = Pr2 [ f (S)jA2], and l (X jq ) = l (X j f (q )), for all q 2 A1. Since Pr1[A1jA1] = Pr2[A2jA2] = 1, it is for all practical purposes assured that f (A1) = A2 (where f applied to a set means the image of the set under f ). For any pair of model framework parameter values w 1, w 2, Pr [X jw 1] = (Pr [A1jw 1]=Pr [A2jw 2]) Pr [X jw 2]. Ap- plying this to the example of the previous section, we find we need not appeal to the more complex version, however; the scaling result holds immediately. Assuming that FCPD holds, then, more generally, for nested models (A1 =Q1, A2 Q2), we have the following Bayes factor (applying the same notation from the previous section): (49) 1 Pr [A2jw = 2] A2 l [ ] dp (2) [ ] A2 l [ ] dp (2) [ ] + A2 l [ ] dp (2) [ ] As before, the right-hand ratio is the relative weighted fit of the reduced model as compared to the full model (a function of the fits of the individual parameter vectors and 86 their prior probabilities); it can never favor the reduced model and must be at most one. The left-hand ratio is the relative prior probability of the parameter vectors permitted under the reduced model, taken as a set. Since this is by definition one under the reduced model, this ratio can never favor the full model: it must be at least one. In the most general case, where the model frameworks overlap for some similar set, but neither is nested within the other: (50) Pr [A1jw = 1] Pr [A2jw = 2] A2 l [ ] dp (2) [ ] + Pr[A2jw =2] Pr[A1jw =1] A1 l [ ] dp (1) [ ] A2 l [ ] dp (2) [ ] + A2 l [ ] dp (2) [ ] In this case, it is not obvious which model is ?larger,? where larger now means having more prior probability mass assigned to the complement of the similar set; the denominator model is simply the one in terms of which the ratio is written, and has no special status. The numerator now contains an additional term, boosting the weighted fit to account for the parameters which do not correspond to any in the denominator model; meanwhile, the scaling factor is reduced by multiplying in the prior weight of the similar set under w = 1, now no longer necessarily one. In sum, any prior distribution over model frameworks and parameter values which has the FCPD property will show the BOR effect, now generalized to (50). Although it is incoherent to say that the BOR effect shows up without the need to incorporate a bias for either model (since the BOR is a part of the prior, and thus is itself a bias), it is reasonable to say that the bias arises in a non-arbitrary way: the FCP property will hold whenever the prior is assigned according to the general principle that the bias toward different, equivalent model instantiations should not change as a function of the model 87 framework, and the FCPD property will hold whenever this happens to have no effect on the predictions for a particular data set. Note also that none of this tells us which model is ?larger? on the basis of anything apart from the prior measure itself; this is actually rather difficult when two model frame- works both give rise to infinite conditional hypothesis spaces of the same cardinality. We will return to discuss this issue further when we apply the BOR to grammar frameworks, which means much of the next section will be dedicated to being specific enough about what it means to specify a grammar that we can associate the ?subset? of grammars with an independent notion of ?lower complexity?; however, for the time being, we can simply attend to the crucial idea: one model framework must be ?more complex? than another when its prior assigns more probability mass outside a set which is shared (by a bijec- tion) between the two. This leads to the Bayesian Occam?s Razor for the same reason that maximum likelihood principle gives rise to restrictiveness-like effects. 2.6 The Optimal Measure Principle The goal of this analysis, as it continues, is to highlight the circumstances under which Bayesian inference for grammars will give rise to a simplicity bias. We now know a good way to identify cases of BOR: look for hyperparameters which imply a change in the ?size? (in terms of the uncertainty in the prior measure) of the set of possibilities for other parameters in the specification of a grammar; the clear cases will be those for which these other parameters? distributions are independent of the choice of hyperparameter. The point is deeper, though, as we wish to show that such priors are ?natural? given only the 88 statement of ?what a grammar looks like.? In Chapter 1, I raised the question of what it means for a grammar to ?look like? something anyway, raising the example of OT versus SPE type grammars, both of which can be ?compiled out? to finite-state transducers?so aren?t they equivalent? Now, in that case, the three different versions of any given phono- logical mapping will also yield some structure, the ?trace? of the computation, which will not be isomorphic across the three intensions. But the CG versus CFG case does not have this property. The question is, is the way we write out the grammar per se meaningful? The traditional answer in generative grammar was yes, (Chomsky 1957, Chomsky 1965, Chomsky & Halle 1968). Now, it need not be the case that the notation we choose is something we interpret as meaningful. But it does need to be the case, if grammars are non-atomic, that the relevant cognitive systems necessarily implicitly assert something about how to interpret the pieces of a grammar, implying an assertion about what those pieces are (e.g. in P&P, each grammar is a complex of parameter values, and in Aspects grammars, each grammar is a complex of PS and transformational rules)?and thereby implying an assertion of some ?structure? over the set of grammars. That is what the notation should ideally capture, and, once we see this clearly, it will be obvious why the BOR should hold for hierarchically structured sets of grammars. 2.6.1 Formalizing grammars preliminary: Transparency and structure In this section I introduce the idea that, whenever we put forward a theory of how language works cognitively, we are implicitly saying something about a structure over the set of grammars. Before saying anything about what a structure is, what a grammar is, or 89 anything else from first principles, let?s consider an illustrative example. In rule-based phonology, it is assumed that a grammar consists of a collection of rules each of the following form: (51) A! B C?D To review: the phonological representation of a form is a sequence of symbols, and a rule of this form (roughly) replaces a subsequence matching CAD with CBD. For ex- ample, the rule i! e ?k; as the symbols for segments such as i, e, and k are actually complex objects made up of collections of features in this theory, the formalism also al- lows for only particular features to be matched and changed (for example, a single rule might change i and u to e and o respectively by setting A = +high and B = high). With a fully explicit representation of the input and the grammar, this basically reduces to an- other case of the same thing, with some minor added complexity due to the fact that the collection of features in each phoneme must be sequentialized. The rules compose, and the order in which they compose is a part of the grammar too. There are independently motivated constraints on the ways in which rules can apply to their own output which have the salutary effect of preventing any rule system from being super-regular (Kaplan & Kay 1994). The set of all such grammars (?SPE grammars?) is the set of all sequences of such rules. This set is (enumerably) infinite because both A, B,C, and D, on the one hand, and the sequence of rules itself, have unbounded length. Berwick & Weinberg (1984) introduce a notion of ?transparency? for understand- ing how theoretical descriptions of grammars relate to actual performance systems. More 90 generally, we may consider how any two systems are or are not related (whether one is theoretical and the other is a real performance system, or both are actual cognitive sys- tems) by considering what structure they share. A ?structure? is just some collection of mappings which apply to elements of a set. Here are some different structures that could be constructed over the set of SPE grammars (not necessarily mutually consistent in any sensible way): (52) The grammar i! e ?k is a subpart of the grammar i! e ?k ; ! k /?t , and so on. [A single two-place function, Subpart(x;y), mapping to > or ?] (53) The grammar i! e ?k is a subpart of the grammar i! e t?k , and so on. [A single two-place function, Subpart(x;y), possibly different from the previous] (54) The grammar i! e ?k is shorter than the grammar i! e t?k which is shorter than the grammar i! e ?k ; ! k /?t , and so on. [A single two-place function, Shorter(x;y); or perhaps a one-place function, length(x), mapping to a nonnegative integer or a member of some other set with an existing order; or, redundantly, both] (55) The grammars i! e ?k , e! i ?k , and hi! e /t?i are, with some others, members of a set A3; the grammars i! e t?k , e! i t?k , and i! e ?tk are, with some others, members of a set A4; and so on. [A single one-place function, Container(x), mapping either to a single set or to a set of container sets] 91 (56) The grammar i! e ?k has the extension fha;ai ;hik;eki ;hds;dsi ; g, and so on. [A single one-place function, Extension(x), mapping to a set of input?output pairs] We can see a linguistic theory as being correct if the set of grammars shares some, but presumably not all, structure with the human linguistic system. For example, a correct linguistic theory will almost certainly not capture cellular level details of how the brain works, but it will at least be a set of grammars with the correct extensions. Similarly, what we mean by ?correct? in ?correct extensions? is also ?sharing some structure with the brain??in that case with what we conventionally think of as static states representing inputs and outputs, as opposed to computations (possible grammars). We like to think that some meaningful structure is captured in the notation we use for grammars, too: ?choice of notations and other conventions is not an arbitrary or ?merely technical? matter ? . It is, rather, a matter that has immediate and quite drastic empirical consequences? (Chomsky 1965, 45). This obviously does not mean that there is a wax tablet or a paper tape inside the brain on which is written S ! NP VP (or whatever). It means that there is something inside the brain which shares some structure with S! NP VP, structure which in the linguist?s grammar might be considered ?notational.? In general we may think of various different sorts of higher-order properties of the string that codes the grammar as potentially meaningful in this way: not the appearance of the symbols themselves, but their combination, order, number, and so on; and, of course, the grammar will perhaps embed, and definitely imply, certain kinds of input and output representations, which for the linguist are sequences of symbols, but must, as we have 92 already said, share some structure with the brain?s representations. So what does it mean to ?capture? or ?share? structure? Sharing implies two sharers? so, start with two sets, S0 and T0. Again, the structure associated with a grammar could be anything in this very general case where we are just trying to define ?what is common?? length, ordering, a grouping into subsets, a set of pairs. Suppose the set S0 (in our case grammars) has an associated set of mappings f1; : : : ; fn, each potentially mapping to some set other than S0 (S1; : : : ;Sm, for some m n; one here might be nonnegative integers, for example, representing length). These mappings constitute the ?structure.? What we mean by ?share? is that T0 also has n mappings associated with it, and they each ?corre- spond? in some sense to a function of the structure of T0. We will say that they correspond, or are shared, by virtue of a (generalized) homomorphism F which serves to map from S0;S1; : : : ;Sm and f1; : : : ; fn, on the one hand, to some other sets and mappings, on the other. One of these image sets must wind up being T0, of course, and the mappings must ?work the same? as the structure f1; : : : ; fn. In particular: a = b , F(a) = F(b)(57) fi(a) = b , F( fi)(F(a)) = F(b)(58) Some details: F is understood to be ?overloaded? to apply, in the first case, to ele- ments of S0; S1; : : : ; Sm (?atoms?), and in the second case, to mappings as well. In words, (i) F preserves the distinctness of atoms (crucially, it does not collapse two atoms into one), and (ii) F preserves the action of all the mappings fi. It is understood in the state- ment of the second axiom that F( fi) applies to F(a), which means that the domain Dom 93 of F( fi) is always at least the image of F as applied to Dom( fi), which we write as F(A) if fi : A! B; the axiom implies then that the codomain of F( fi) is at least F(B) (and what happens outside F(A) and F(B) is irrelevant so long as the action has been preserved for A and B). So, if fi : S0 ! S j then F( fi) : F(S0)! F(S j); by the first axiom, S0 and S j remain distinct to the extent that they were distinct in the first place. That the action of the mappings is preserved is exactly what is meant by ?structure is shared.? Take a somewhat contrived example and we will see how this works; we will also see that in reality we go beyond demanding ?some structure be shared? to saying exactly what structure and how that structure is shared. Categorial grammars and context-free gram- mars are two different formal devices for specifying sets of strings, and they are ?weakly equivalent.? This means that, given any context-free grammar specifying a particular set of strings (say, for example, the grammar fS! aSb;S! abg, specifying the language anbn = fab;aabb;aaabbb; : : :g) there is a categorial grammar specifying the same set (in this case, one is fa = A;S=B;b = SnA;BnSg), and conversely. Here are the pieces: for each type of grammar, we have a function which takes a given grammar to its string language, say LCFG : GCFG ! L , from context-free grammars to sets of strings, and LCG : GCG!L , from categorial grammars to sets of strings, respectively. We can see each of these functions as augmenting a particular set of grammars with some structure. What a mathematical linguist will generally do in order to show this weak equivalence is to specify a way of converting a context-free grammar into a categorial grammar (and then the other way around), such that, when the language of either grammar is derived, it remains the same (Bar-Hillel, Gaifman & Shamir 1963; Pentus 1993). To say that this ?way of converting? satisfies weak equivalence is to say it is a homomorphism in our 94 sense, and in fact a particular one: when we convert a CFG g into a CG F(g), we require that the structure-defining function LCFG be preserved. When we are given the task of writing the proof, we are given the constraints that F(LCFG) = LCG, which is to say that the ?natural? definition of ?string language? must be preserved, and that F(l) = l for any language inL . So the demand of weak equivalence is one example of a demand for shared structure, and the challenge is to show that some structure (string language) can be preserved, in this case in a fairly exacting way (the strings need to be exactly the same). Other kinds of equivalence are somewhat less stringent. One ?strong equivalence? of CFGs and CGs says that, for a particular subset of the CFGs (Chomsky normal form CFGs), a conversion can be found to CGs that will preserve the derivation tree for any string. This is a stronger result in the sense that it preserves the action of a mapping with richer outputs (sets of string?tree pairs), but it is also weaker, in the sense that we must find a satisfactory way of converting our CFG derivations to CG derivations. The conversion is quite simple in this case, and follows from the conversion of the grammar (relabelling of nonterminals), but it is no longer an identity mapping. Nevertheless, the idea behind the challenge is the same: find a way of converting one grammar to another, such that some particular type of structure associated with the grammar (in this case, the set of sentence derivations) yields, if not precisely the same values, values which are ?equivalent? in some meaningful way that we specify in advance (here, the tree has the same shape but only the corresponding, not the identical, labels on the nodes). We will look at this example in a bit more detail shortly. To bring this back to our first general point: if one of the sets is a set of grammars allowed by some linguistic theory, and the other is the set of possible different linguistic 95 systems as implemented in the brain, then this is a very general way of seeing require- ments that we as linguists place on our theories. The grammars we specify must at least have some structure in common with those that are in the brain. It can also be seen as a way of understanding the relation between different parts of the language faculty: a production system and a comprehension system may not work in the same way at all, and they might not be specifiable using the same information (they might have differ- ent ?grammars,? broadly construed). For example, a speech perception system might require the acquisition of some auditory parameters, while a speech production system might require acquiring motor parameters. They might also operate over very different representations. But in some sense they follow the ?same? grammar?in what sense? In the sense that they have some structure in common, be it an isomorphic yield of lexicon? surface pairs, a strong similarity between perceptual categories and the realization of their production counterparts, or merely some highly abstract similarity between the two per- ception and production inventories. That shared structure is the ?competence? that these ?performance? systems both instantiate. The first point, then: empirical demands on lin- guistic theories are assertions of structure sharing under homomorphism. 2.6.2 Notation and the structure of a grammar Now, there is one very particular type of structure sharing that is relevant in the rest of this chapter. It is based on an idea that has been carried forward in some rough form from the evaluation measures of early generative grammar to the present day, but which has not had a precise sense since those early days: the form of the grammar itself matters, 96 not just the representations it works with, or the input?output mappings it countenances. In the current view, it is not the notation of the grammar per se that matters, but it is certainly something about the grammar itself, and not directly about the mappings it implies. To continue our example in this context, consider a conversion between a Chomsky normal form CFG and a strongly equivalent CG. Here is a CNF grammar for anbn: S ! AB B ! DZ Z ! b(59) D ! AB B ! b A ! a Now here is a CG for anbn. a : S=B b : B(60) D=B BnD The derivation under each grammar for aaabbb is shown in Figure 2.2. They are clearly isomorphic. In general, these two grammars will yield string?derivation pairs for which the derivations are isomorphic for a given string.12 12Precisely what the details of this ?obvious? isomorphism are (what exactly F needs to say when applied to derivations for this correspondence to be interesting, and indeed how derivations are coded) is irrelevant here. 97 S A a B D A a B D A a B b Z b Z b S S=B a B D D=B a B D D=B a B b BnD b BnD b Figure 2.2: Derivations for the string aaabbb following to the CFG (left) and CG (right) given above. The CG proof is flipped vertically from its usual order (conclusions are above their premises) to emphasize the isomorphism. We have already said that such a construction is possible, and in fact always possible in the case of CNF grammars. What we have not said is that there is another non-trivial property that these two grammars share. Cross off all but the first three lines from the CFG and cross off the second and fourth lines from the CG. We get these two new grammars: S! AB(61) A! a B! b a : S=B(62) b : B Both of these grammars specify the singleton language fabg. Thus both of the original grammars ?contain? a grammar for fabg, notationally. Furthermore, for both 98 of the original grammars, the reader can verify that there is no way to remove any lines without yielding a grammar for fabg or a grammar for fg (though in most cases with some redundant information in the form of useless productions or lexical entries; the latter are the grammars that cannot complete a yield/parse for any terminal strings, and/or contain no start symbol, and the minimal such grammar is the empty grammar). Thus both of the original grammars only contain grammars for anbn, fabg, and fg. Since the grammars for fabg also all contain grammars for fg, we can firmly state some relations in this small corner of the landscape of CFGs and CGs. For one thing, Gfg Gfabg Ganbn , where we mean the empty CFG, the CFG in (61), and the CFG in (59), respectively, and by we mean, for now, ?is a subpart of.? And it would seem, at least given these cases, that the CNF?CG conversion procedure respects this hierarchy: F(Gfg) F(Gfabg) F(Ganbn), where is either the same or almost exactly the same relation across CFGs and CGs, depending on exactly how we formalize it. What?smore, each of these three grammars sets up an equivalence class, as each is the grammar yielded under ?reduction? by the removal of useless productions or lexical entries from many others?in fact infinitely many others, and so, if we make it so that evaluates containment not of G1 and G2 directly, but for the reduction of G1 and G2, then we have now carved a path of shared structure through a highly restricted but infinitely large corner of the grammar landscape. The important point here is that we have translated ?different sorts of higher-order properties of the string that codes the grammar [may be] potentially meaningful [cogni- tively]? into the idea that there are separable pieces of meaningful information that make up a grammar, and these stand in a homomorphic relationship with separable pieces of meaningful information learned by the brain as the specification of (a particular part of) 99 the linguistic computation. In this case, we have pointed to a ?subpart? relation as an eaxample, which is intutively easy to relate to the notion of ?length.? Before we break down the ?subpart? relation into the more primitive structures that are directly implied by the statement of the grammar, let us dwell on it briefly to make sure that the notion of ?structure? is clear, staying now on the CFG side. The grammar for fabg given in (61) constitutes an addition of information to the empty grammar, and can also be refined in various ways, including the addition of another rule B! c (this yields the language fab;acg). This is shown in Figure 2.3. S ! AB A ! a B ! b B ! c S ! AB A ! a B ! b fg Figure 2.3: Diagram of subpart relations between three CFGs. There is another grammar apart from our fabg grammar which stands in the same relation with these two other grammars, namely the grammar for facg which has been added to the diagram in Figure 2.4. 100 S ! AB A ! a B ! b B ! c S ! AB A ! a B ! b S ! AB A ! a B ! c fg Figure 2.4: Diagram of subpart relations between four CFGs. What these examples illustrate is that a context-free grammar takes the form of a collection of pieces of information of a particular kind (context-free rules), and this kind of specification gives rise to some structure: one can add or remove these pieces and get back different grammars, which induces a relational structure. If we had specified the same language, with the same derivational trees, in a different way, using ?rules? that expand a tree not only by one level, but instead by substituting a more substantial part of a subtree, then we could have also had a grammar containing a single rule that would generated aabb with the same structure as assigned by this grammar, and the relational structure would have been different. The particular choice of grammar specification asserts a particular way of organizing the information used to specify the mapping, which is at least in part captured in the containment relations we have just illustrated. In fact, there need not be any subpart relations between grammars at all, depending on how we code the information in the grammar. Suppose we annotate every rule in the CFG, and every lexical entry in the CG, with the label entry [k] of a total of [n]. Every rule has such a label attached, and a collection of symbols is not allowed to be a grammar unless it has these labels correctly 101 marked; now, any time we remove lines, we get an illegal grammar?so now there can in fact be no subpart relations of the kind we have talked about. Again, the particular choice of grammar specification asserts a particular way of organizing the information used to specify the mapping. The next point of this section, again: we demand that the organizational scheme implied by a grammar formalism be real, in the sense that it is homomorphic with something in the brain. Let us now spell this out this idea of an organizational scheme, or a structure, in a simple setting that is transparently extensible to any of the three particular examples of grammars we have discussed in this section. Suppose a grammar were a sequence of sequences of lexical items (imagine the right-hand side of a rewrite rule). Here is one way of organizing this information; we will translate it with some formal detail that introduces as few additional substantive assumptions as possible. ?A Grammar contains a List of Lists of items from the Universal Lexicon?(63) 9R:9LR: List(LR;R) ^ 8r:r 2 R!9S:List(r;S) ^ 8s:s 2 S! s 2UL 102 To properly fill out the examples that are to come, we also need the following condition: List(L;S)! 9s:s 2 S^8t::Prec(L; t;s) [there is a head] ^ 9s:s 2 S^8t::Prec(L;s; t) [there is a tail] ^ 9s:8u:[8t::Prec(L; t;u)]! u = s [head unique] ^ 9s:8u:[8t::Prec(L;u; t)]! u = s [tail unique] ^ 8s:9t:[t 2 S^ Prec(L; t;s)]! [8u:Prec(L;u;s)! u = t] [precedents unique] ^ 8s:9t:[t 2 S^ Prec(L;s; t)]! [8u:Prec(L;s;u)! u = t] [successors unique] ^ 8s:8t:Prec(L;s; t)!:Prec(L; t;s) [precedence asymmetric] (64) Simply asserting the sentence (63) by itself is not enough to specify a grammar of this kind. The identity of the lexical items and what exactly the orders (which are associated with particular Lists) are are properties of the objects that satisfy the formula, not of the formula itself. Given a finite collection of objects, the sentence (63) asserts the constraints on a logical structure that makes it into a grammar. In particular, suppose we have some sentence P = 9a1 : : :9ak:F(a1; : : : ;ak), where F is quantifier-free, which has (63) as a logical consequence (under any interpretation that satisfies our sentence, (63) will also be satisfied). Call P a grammatical description. If our collection of objects DP can be used as the domain for some modelMP j= P (with basic relations like elementhood fixed in the model in a standard way), then hP;DPi is a grammar. We may call DP the gram- 103 matical content for the sake of having a label, although there is surely content in P also (after all, there is more than one possible P). In this sense, the sentence (63) induces an ?organizational structure? associated with the set of grammars. We can then assert interesting derived structures. For example, suppose we have some grammatical descriptions S and T . If any grammatical content that satisfies S also satisfies T , then T is a grammatical consequence of S. Now suppose we take a model that satisfies S, calledMS, and remove some objects from the domain or some relations from the model, as well as references to the removed objects among the functions and relations of the model, to obtain a new modelMT . IfMT j= T butMT 6j= S, and if the set of atomic domain elements of the modelMS (here we will take any elements of UL to be atomic) is minimal while both satisfying S and containing the elements ofMT , then we might say that hT;MT i is a subpart of hS;MSi, in a particular sense. (The atomic minimality condition ensures that the difference between the nested models is not such that the elements of the smaller one are no longer useful in the larger model, in which case one could simply add many elements to model the entire new sentence; the condition rests on the notion that in that case we could construct a more complex sentence still which would make use of the newly useless elements.) Here is an example. Consider the grammar [[A;B]; [a]; [b]; [c]] (remember that in this setting, the ?rules? are ordered). Seen as a set of objects, this must be at least (in order to model a grammar description): (65) Objects A, B, a, b, c, which have the property that they are elements of the object UL (a constant whose makeup is fixed); our use of labels here is arbitrary but is 104 meant to make clear that the objects are distinct (66) Objects we might label fA;Bg, fag, fbg, fcg to make clear how the ?element? relation holds with respect to the objects just mentioned (67) Objects we might label [A;B], [a], [b], [c] to make clear that the List property holds of them, and which of the objects just mentioned they are lists over, and to indicate what Prec relations hold (68) An object f[A;B]; [a]; [b]; [c]g (given this label by us as only a mnemonic for the relations it participates in, as before) (69) An object [[A;B]; [a]; [b]; [c]] (as before) A sentence that describes a grammar that could be modelled under this domain is as fol- 105 lows: 9R;LR;r1; : : : ;r4;s1; : : : ;s4; l1; : : : ; l5:(70) List(LR;R)^ r1 2 R^ ^ r4 2 R ^ List(r1;s1)^ ^ List(r4;s4) ^ l1 2 s1^ l1 2UL ^ l2 2 s1^ l2 2UL ^ l3 2 s2^ l3 2UL ^ l4 2 s3^ l4 2UL ^ l5 2 s4^ l5 2UL ^ Prec(LR;r1;r2) ^ Prec(LR;r2;r3) ^ Prec(LR;r3;r4) ^ :Prec(LR;r1;r1) ^ :Prec(LR;r2;r1) ^ :Prec(LR;r3;r1) ^ :Prec(LR;r4;r1) ^ :Prec(LR;r4;r4) ^ :Prec(LR;r4;r1) ^ :Prec(LR;r4;r2) ^ :Prec(LR;r4;r3) ^ :Prec(LR;r2;r2) ^ :Prec(LR;r3;r2) ^ :Prec(LR;r4;r2) ^ :Prec(LR;r1;r3) ^ :Prec(LR;r3;r3) ^ :Prec(LR;r4;r3) ^ :Prec(LR;r1;r4) ^ :Prec(LR;r2;r4) ^ :Prec(LR;r4;r4) ^ Prec(r1; l1; l2) ^ :Prec(r1; l1; l1) ^ :Prec(r1; l2; l1) ^ :Prec(r1; l2; l2) Now, another grammatical description could be satisfied by the same model if it simply failed to introduce require any new distinct objects or inconsistent orderings (both distinct- ness and ordering here are demanded only by Prec relations and the conditions in (63) that 106 govern them). For example, one could remove r4 and all references to it from the sentence; or one could remove l5 and all references to it; or both. In any case, the same model would suffice unchanged to satisfy the new sentence?the absence of any assertions demanding the existence of an element of the domain does not mean that that element does not exist. This would be true for any other model of the larger sentence, not just this one. The new sentence would be a grammatical consequence of the old one. Suppose we remove r4. In a sense, we have a grammatical description of something we might call [[A;B]; [a]; [b]] (regardless of whether we include the object corresponding to [c] in the model). However, this is somewhat misleading, as the description would be satisfied with alternate objects in the domain: we could have just as easily written [[C;D]; [e]; [ f ]], because there is also a model for the same grammatical description that has that object at the top level. If we put the demand on the model that its domain be a proper subset of the one outlined above, however, with none of the relations changed except to remove references to the given elements, then the only way to continue to satisfy all the requirements of the new sentence in a minimal way is to remove the object we called [c] (we have already changed the top-level list to something we would more naturally label [[A;B]; [a]; [b]] by updating the relations in the model to remove references to [c]).13 This comports with our intuition that the resulting grammars are nested, with one being a 13We could remove c and fcg, in fact, and make the model still ?more minimal??not because there are no longer any conjuncts in the sentence that used to be satisfied by virtue of their inclusion (l5 2 s4 is still in the sentence) but because in fact they never needed to be included in the first place. While r4 needs to be distinct from r1; ;r3 because the four have conflicting precedence requirements, l5 and s4 have no precedence requirements and the relevant clauses could be satisfied by other domain objects, such as b and fbg. In fact, only two lexical items are needed in the model which is truly minimal. This seems to demand that either that the universal lexicon be treated as a set of properties, or else that the notion ?subpart? is actually only relevant for this subset of the grammars that are truly minimal. This seems to amount to a technical detail and does not change the basic point that a grammatical framework is a set of constraints on the organization of grammatical intuition, and that these constraints can give rise to subpart relations, so I will not pursue it further mere. 107 subpart of the other.14 In short: we have given a rough outline spelling out the types of constraining infor- mation that are implicit in a grammatical formalism, and shown that these constraints are a structure on collections of information over which further structure can be defined. It should be pointed out that some theories of grammar could be formulated so that every grammar has the same grammatical description, and differs only in content. For example, in standard monostratal Optimality Theory, a grammar consists of an order on a fixed, universal set of constraints. No organization is demanded except to say that the only available relation is something like Prec, or that there is a mapping from constraints to the integers or some other set with a fixed order. Presumably, any grammatical description for such a grammar is a grammatical consequence of any other; similarly for the Principles and Parameters approach. We have already discussed an issue that?as we are about to show?can be reduced to the same thing, namely, the irrelevance of a notion of simplicity for these types of grammars. The conclusion we reached above was that the apparent fixed length and uniform information content of these grammars is more limited than we might think naively, because they need to be supplemented by other kinds of variable-length information that interact with this fixed-length information. In particular, we briefly touched on the problems of picking out a subset of the set of logically possible constraints, and of picking out a subset of the set of logically possible lexical items, both generally understood to be real learning problems (that is, things that 14It should be noted that there are other, different cases of grammatical consequence. For example, T is a grammatical consequence of S for T = 9x:P(x)_Q(x), S = 9x:Q(x), or T = 9x:P(x), S = 9x:P(x)^Q(x). In the first case, as in the subpart case, the consequent sentence tolerates a whole class of models that S does not, but, unlike in the subpart case, this is not due to the addition of an object or relation to the model in S. In the second case, an additional relation, but not an additional object, is demanded in the model by S. 108 need to be specified somewhere). To take a concrete example, the difference between the theory that the set of universal constraints is finite and the theory that it is infinite is that in addition to the order (or List) over the constraint set, the grammar also needs to specify which subset of the constraint set the order is over, and so needs to predicate something of each element of that subset.15 A grammatical description for one such possible grammar would assert the existence of more elements than another, and thus it would no longer be the case that all grammatical descriptions were consequences of all others. That would allow for the kind of higher-order structure we are discussing here to be asserted, and, as we will see, that, in turn, provides a natural way for the Bayesian Occam?s Razor to apply automatically to some inferences in these frameworks. 2.6.3 Relating grammars to priors in an optimal way Here is an idea about how we could construct priors from grammars. It is very weak, but it is enough to get the Bayesian Occam?s Razor to hold: (71) Optimal Measure Principle. If T is a grammatical consequence of S then the prior evaluation measure Pr[ jS] is similar to Pr[ jT ] for M S fM jM j= Sg, M T fM jM j= TgnfM jM j= Sg. 15It is worth emphasizing that, while we have made use of domain elements that have the properties of complex objects such as sets and lists, they were actually contentless placeholders, as they only had these properties in virtue of relations external to themselves; only the atoms (in the above, those domain elements which are part of the universal lexicon) can be thought to have intrinsic content. If we had instead had a set, such as the set of language-specific constraints or lexical items, as a contentful element of the domain, rather than having the elements in the domain and relating them to a placeholder object, then we would be asserting that the grammar is one-dimensional, as opposed to jU j-dimensional (in this case, infinite-dimensional). That is fine as an alternate structure for the grammar, but it needs to be kept in mind that the question is an empirical one if we demand that the grammatical description be shared structure with the brain, which is to say if we take the strong but axiomatic demand of this whole chapter seriously. 109 If the grammatical description stands in for the framework parameter w , this asserts that FCP holds for grammars. In particular, T represents the structure of a grammar which is just ?like? another, described by S, except that it is satisfied with some models (MT nMS) that S is not?in the cases we examined above, T required ?less? information in the sense of a smaller domain. The meaning is that there must be a bijection between some of the ?less? and ?more informative? models so that Pr[ jM S ;S] = Pr[ jM T ;T ]. These conditions on the prior imply that, at least for the shared subsets, the prior distribution on information within those subsets is independent of the choice of structure. We will discuss concrete examples in more detail shortly, but for the moment think of the presence or absence of an additional rule: for at least some such additional rule, the learner?s biases on the contents of the rest of the grammar do not change simply because that rule is present or absent.16 This condition is extremely weak, however, and, although grammatical consequence clearly gives some structure across which we can now posit mappings, and furthermore clearly gives rise to at least some mappings betweenM T andM S for which the model from the first class represents a ?less complex? grammar than the grammar from the second, it is far from clear that all such cases will be like this. We can place a stronger condition by adding additional structure preservation to the condition for similarity: (72) Asymmetric similarity over subpart models: Two prior distributions Pr1 and Pr2 over two sets of models Q1, Q2 for sentences T ,S respectively are similar for some A1 Q1, A2 Q2 if there is a bijection f : A1! A2 such that, for all sM A1, 16Actually, it is even slightly weaker than this: we could permute the possible ?smaller? grammars until their prior aligns with the prior on that part of the grammar in the ?larger? structure. There is no guarantee that the priors can ever be made to align; but what it does show is that the BOR will hold under a wide range of conditions. 110 Pr1 [sMjA1] = Pr2 [ f (sM)jA2], each domain element and relation of sM is also present in f (sM), and f (sM) is atom-minimal for a domain element which both satisfies S and contain sM. Now it is clear that, if a model for the consequent sentence does not satisfy the antecedent, it is due to some missing element, because the presence of this element on the other side of the bijection is enough to guarantee satisfaction of the antecedent. In the case where A1 is the entirety ofMT nMS, this stronger notion of similarity is enough to guarantee that the failure of certain models in MT nMS to model S is entirely due to this difference in complexity; in this case, the BOR must hold in favor of the sentence T , all things being equal. Although this condition is still fairly weak, to the extent that it holds it falls intu- itively into the class of ?optimal? principles sought out under the Minimalist program of Chomsky 1995: although the sense of ?optimal? is somewhat vague, the intuition about the OMP, where it holds, is that the information in the prior ?follows from the structure? of a particular grammar to a large extent, because non-changes to some sub-part of the structure of the grammar track non-changes in the prior over the contents of that substruc- ture. 2.6.4 Example: deriving a symbol-counting evaluation measure Recall the prior evaluation measure of Chomsky& Halle 1968 discussed above: The ?value? of a sequence of rules is the reciprocal of the number of symbols in its minimal representation. Putting aside the notion of ?minimal representation? (which allows for the 111 collapsing of environments using brackets and so forth), we can simply take the measure to evaluate a particular representation of a particular grammar. First note that this will never sum to one, nor any other finite quantity (the harmonic series is not convergent), and thus cannot be a probability measure. However, it can be weakened to merely some function decreasing in the number of symbols in the grammar, with no real consequence for any of its uses in the literature. Although it is now clear that the OMP will guarantee such preferences in the prior in the fairly general case where the two grammars stand in a consequence relation for which the antecedent grammar description can be satisfied by adding to some model for the consequent one?in which case the antecedent grammar is clearly ?more complex??the full import of the similarity condition and the effect of the likelihood has not been explored intuitively in the grammatical context. Begin with the similarity condition: this says that the addition of some clause neces- sitating an additional domain element or relation in the model (which would presumably be represented as additional symbols in a meaningful grammar notation) does not change the relative preferences for existing elements of the grammar, and under the strong version just outlined, this must be the case across the entirety of the set of possible models. Suppose we add a rule to the grammar A! B=C?D, which we would perhaps spell out formally as a List, following the above, [[[A]; [B]; [C]; [D]]], now [[[A]; [B]; [C]; [D]], [[E]; [F ]; [G]; [H]]]. The other possible models satisfying the description of the first grammar contain alternate lexical items, and may also contain extraneous elements which would nevertheless not serve to satisfy the description of the second grammar. Focusing on the choice of lexical items, the distribution over the choice of items in A! B=C?D must be the same regard- 112 less of the presence or absence of the other rule. (Recall that this is the prior distribution, so no consideration of consistencey with the data is needed.) The only thing worth pointing out about the condition on the likelihood is that it may be gradient, unlike the classical evaluation measure, handling, for example, the aggregate consistency with all the available data. The precise specification of the prior will yield a precise trading relation between the goodness of fit made available by the various different models across the two grammatical descriptions, on the one hand, and the BOR scaling factor, on the other. 2.7 Discussion The goal of this analysis has been to highlight the circumstances under which Bayesian inference for grammars will give rise to a simplicity bias. We return now to questions about the paper by PTR, in light of these new tools. Why does the PTR model work? Does it have to do with the Bayesian Occam?s Razor? Does the PTR prior obey the OMP? We now know a good way to identify cases of BOR: look for hyperparameters which imply a change in the ?size? (in terms of the uncertainty in the prior measure) of the set of possibilities for other parameters in the specification of a grammar; the clear cases will be those for which these other parameters? distributions are independent of the choice of hyperparameter. The PTR prior is riddled with these: there is a (trivial) distribution on the selection ?regular-only versus CFG?; there is a distribution on the total number of nonter- minals; there is a distribution on the number of productions; and there is a distribution on the number of items on the right-hand side, for a given rule. In each case, the choice does 113 not affect the prior on the dependent parameters at all. Note that it is independently true that these hyperparameter distributions themselves contain biases for smaller grammars (chiefly because they are distributions over the positive integers and decay as they go to infinity); but this is actually unrelated to the BOR. There would be a preference for smaller grammars regardless, simply because of the structure of the prior. Notice, however, that in the case of ?regular-only versus CFG,? the ?smaller? set in our ?less diffuse? sense is clearly the set of right-regular grammars. The model prefers proper CFGs in spite of, not because of, the BOR in this case. As discussed earlier, and as pointed out by PTR, the problem with right-regular grammars is that they must yield bad analyses for many sentences, even where they fit well. The analyses might be bad because they need to make use of productions which give away too much probability to unattested sentences (where the probabilities in the grammars have a high degree of uncertainty about which productions should apply), thus dispreferred by restrictiveness; or they might be bad because they are too large, which would lead to problems with simplicity and restrictiveness both. PTR?s results are that the likelihood prefers regular grammars, while the posterior prefers CFGs, thus pointing to some combination of the BOR and the other biases in their prior. The increase in size (which would be dispreferred by both types of biases) is in the number of rules, which must increase to obtain a similar fit in the regular grammars, as PTR point out. PTR?s result thus really does follow from a plausibly ?domain-general? effect, that of the very general law of inference we call the BOR, at least to some degree. However, as pointed out above, the suggestion that particular choice of hypothesis space is one that stands in well for some plausible set of hypotheses available to ?general cognition? any 114 better than it stands in for one which is specific to language has no real basis. What is to be done with the Bayesian Occam?s Razor? Of course, it will be em- bedded in most hierarchical Bayesian models we use to account for language acquisition; however, there is more that can be said about this. In particular, with the knowledge that ?flat? grammars (classical OT and P&P) need not be equipped with a notion of simplic- ity, while structured grammars must, because of BOR, it would be reasonable to expect that flat and structured grammars ought to make systematically different predictions about acquisition and historical change, which we should be able to extract reasonably clearly just by sketching out the predictions in particular cases informally. A thorough review of attributions of various historical changes to simplicity throughout the literature would be a reasonable place to start. With such evidence in hand, one can then use Bayesian statistics not only as a theory of language acquisition per se, but also as a meta-theory: a theory about what linguistic theories do and do not predict, which can be used to derive predictions about those theories (in this case, about the behavior of a learner), and then compare them. Finally, it is worth discussing the ontological status of the sorts of formal constraints we used to translate the notation of grammars into the organization of certain pieces of in- formation. In particular, while it seems quite clear that some of what learners do inference over is indeed ?organizational,? which is to say, referring to the structure of the grammar (its size and shape), the utility of such information outside acquisition is somewhat diffi- cult to understand. If, say, parsing makes reference only one grammar at a time?which is to say, one specification of the necessary information necessary to get a human parser to work?then why should that information be structured in any way that resembles the 115 organization of that information for the learner? This is somewhat relevant, because the more ?natural? the organizational principles of grammar, the more ?optimal? the OMP seems to be. There is one reason that I can think of, which is that, if the length of grammars is un- bounded, then in fact there will be some possible grammars for which only certain subsets of the information they contain will be able to be accessed at a given time by any device whatsoever with a finite memory. This is definitely true for the lexicon anyway?it would surely be impossible to make simultaneous use of information about large numbers of lex- ical items?and it would therefore be quite interesting if some organizational structure of the lexicon (say, as uncovered by lexical access times) were a place the BOR could hang; we could then investigate learners? preferences to see whether they did indeed track this structure. 2.8 Conclusion In this chapter, I have presented a detailed explanation of what it means to talk about the Bayesian Occam?s Razor which is sometimes referred to in the Bayesian literature. I have outlined the action of this law of inference, and certain general conditions under which it will hold. I have then discussed what it means for a grammar to itself have struc- ture (rather than merely assign structure), and claimed that this structure is what is really being referred to when the ?notation? of a grammar is taken to embed some empirical claims. Using a relatively neutral example of how such structure could be spelled out, I have developed what it would mean to apply the Bayesian Occam?s Razor to a structured 116 grammar. Finally, I have argued that the BOR makes Bayesian statistics a particularly use- ful tool for investigating various theoretical and theory-comparison questions from new angles. 117 Chapter 3: Modelling allophone learning The first question I ask myself when something doesn?t seem to be beautiful is why do I think it?s not beautiful. And very shortly you discover that there is no reason. ?Attributed to John Cage 3.1 Categories and transformations This chapter proposes Bayesian models for learning segmental categories and al- lophonic processes, two crucial parts of linguistic cognition studied in phonology. The models are based on a new idea about what it means to be a context-dependent ?phonetic rule,? namely that there is an ?addition? operation in a gradient phonetic space, and each separate effect of context is its own addition. The conjecture is furthermore put forth that all cases of allophony are phonetic rules in this sense. I use the learning models to argue that such a model is feasible, and that something like this might even be crucial to learning phonetic categories. 3.1.1 Empirical review Phonology investigates the cognitive systems involved in producing and recogniz- ing speech: the auditory system as it applies to speech, the motor system, the system of lex- 118 ical memory that underlies the ability to store and recall the forms of words. There is also a connection to traditional grammar, which, among other things, tries to describe patterns in how different sounds are pronounced in different contexts in a particular language. Once we know what these patterns are, certain crucial facts about them constrain our under- standing of the cognitive mapping that converts lexical representations to pronunciations. In Chapter 1, I reviewed the standard assumptions about what is in the lexical memory system: lexical memory for a form consists of a finite sequence of segments reflecting the sequence of sounds, and each segment can be classified as a member of some discrete set (the inventory). Well-understood patterns in pronunciation seem to respect segments and inventories, and we use the fact that they do to support theories about what information is in the lexicon. The linguistic patterns in question are the processes discussed in Chapter 1 (both neutralizing and strictly allophonic). The idea behind saying that processes respect segmentation is that processes do not have effects that arbitrarily subdivide words; the idea behind saying that processes respect the discrete-valued nature of segments is that, when a segment changes from one to another, the resulting segment is pronounced just like other instances of that segment which are coded lexically, or which result from other processes (that is, the resulting phonetic realizations are statistically the same as for some other category, all other things being equal). If the patterns are understood as changes in pronunciation that are carried out in a mapping from lexical to pronounced form called the phonological grammar, then discrete-valued segments can be seen as crucial parts of the computation of this mapping, and segments and must have some cognitive status in the lexical representations and the grammar that manipulates them. The goal of research in phonology has been to find a way of formulating these mappings that satisfies the usual 119 requirements for a linguistic theory: they must share some structure with the way the brain does it. Beyond the fact that it gives rise to linguistic patterns in pronunciation which gen- eralize in particular ways, there are other sources of evidence about this division of lexical representations into segments, and the classification of segments into discrete categories. Speech perception experiments often ask speakers to identify many slight variations on a sound, along some acoustic continuum, (?is it ee as in bee or ih as in bit??), and then test the ability of listeners to discriminate between these small changes in pronunciation. Discrimination ability tracks the identification curves, meaning that, as judgments become clearer about which segmental category a sound belongs to, people?s ability to tell small differences apart gets worse, a phenomenon which is referred to as categorical percep- tion or the perceptual magnet effect (Abramson & Lisker 1970, Pisoni 1973, Kuhl 1991). This suggests that discrete classes of segments have some special cognitive status apart from just being convenient ways of labelling the stimuli in the task; this idea that there are categories?equivalence classes of sounds?follows the the usual understanding of how the phonological system works, in particular, the idea that phonemes form categories. The reasoning is not airtight, as the association of categorical perception with discrete cogni- tive categories is only one explanation. Another would be that perception is warped in a way that simply happens to align with the identification curves, perhaps because the prototypical pronunciations relatively dense statistical clusters in the input. Nevertheless, one does not need to do an experiment to see that processes that implicate phonological representations have categorical effects: when listeners hear a sequence that is either pit or bit, they surely either believe it is one word or the other?or assign some probability 120 to either?but not some interpolated word; so at some point in the process of recognizing speech, the encoding of a sound needs to have some information about whether it is codes for one of these words or another. The idea that there are segments in the first place says that words are broken down into sequences?it is not just a matter of recognizing as a sound as coding either bat or bad, rather, that decision is made up of smaller decisions about b?a?t or b?a?d. Again, linguistic patterns seem to operate over temporal sub-syllabic chunks, but a skeptic could argue that this does not require that these temporal chunks have any cognitive status. Inde- pendent evidence for such chunks comes from speech error research: many speech errors seem to be substitutions of entire segments, like [blejkfrud] for [brejkflud], ?brake fluid,? and [frto] for [frto] ?fish grotto? (Fromkin 1973, Dell 1986, Frisch & Wright 2002). Wan & Jaeger 1998 report that this is true even for Mandarin speakers in Taiwan, where, at least at the time their experiment was done, the phonics instruction in school was done entirely using a quasisyllabic orthography called Zhuyin (bopomofo), not the alphabetic pinyin system used on the mainland, meaning that there is no chance that the speakers were relying on some extra-linguistic representation of the written form, (as has some- times been claimed to be responsible for segment effects in other languages), because speakers sometimes make segmental errors within what would be single Zhuyin symbols. Thus it appears that the division of lexical stored forms into segments, and the discrete classification of those chunks into phonemes, shows some cognitive effects apart from the fact that there are phonological alternations that obey this discretization. As for the processes themselves, the experimental literature focuses on the strictly allophonic process discussed in Chapter 1. Strictly allophonic processes output segments 121 that do not appear anywhere apart from in the output of these processes, like the exam- ple discussed there of Spanish [bota]/[la?ota]. The traditional understanding is that the distinction is not coded in the lexicon, because it does not need to be, and it seems to be the case that listeners are worse at perceiving the difference (for Spanish, see Boomer- shine, Hall, Hume & Johnson 2008). There are two types of results: the first shows that strictly allophonic pairs of sounds are not distinguishable or less distinguishable in per- ception; the other shows that, while two sounds might be distinguishable sometimes, in a form where an allophonic process applies, listeners are worse at telling the resulting allophonic pronunciation of the sound apart from the other one. Each may true for differ- ent allophonic pairs, depending on the status of those sounds in the language. The first approach tries to separate out the effects of allophony on the perception of segmental cat- egories per se from the effects of perceiving in the presence of an allophonic alternation; Kazanina, Phillips & Idsardi 2006 showed that, even outside of the environment where the allophonic change occurs, Korean speakers, for whom [d] and [t] are strict allophones, cannot tell these sounds apart (between sonorants the voicing on stops changes, but the result, a sound like [d], is not a kind of voicing that can ever code for a distinct word from the one with a corresponding [t]; it is in other languages, like Russian, and Russian speak- ers, naturally, discriminate these sounds perfectly well). The second shows that, even if the segments themselves are in some cases distinguishable, allophonic processes still have a detectable impact on perception: Peperkamp, Pettinato & Dupoux 2002 showed that French speakers had no problem discriminating the voiced?voiceless pair [a]/[a?] in isolation, but found it very difficult to discriminate [azo]/[a?zo], (they told them that they were hearing a new language and the two syllables were separate words), because there is 122 a process of voicing assimilation in this environment; there is no lexical contrast between []/[?] at all in French, so the process is truly allophonic, but voicing is contrastive for other fricatives in French, which presumably explains the results in isolation. The facts about perception do support some cognitive status for allophony and the lack of contrast it induces, and the traditional approach, documented in Chapter 1, is to say that the stored forms do not code the allophonic distinction, (for example, they code allophonic [?] as if it were just another []), and then it gets added by the phonological grammar. However, there is another approach which says that allophonic processes are not true processes, in the sense of changes from lexical form to pronounced form, carried out by the grammar. This has become standard in much recent phonological theory: a principle called lexicon optimization is thought to determine what happens when lexical forms need to be stored in memory, and this principle essentially says that that learners will store underlying forms in a way that makes them as similar as possible to what was perceived (see Prince & Smolensky 2004 for an explanation of why the way Optimality Theory works makes it tempting to invoke such a principle). This means that, unless there are independent examples of a morpheme appearing in its pre-allophonic form?unless a morpheme is pronounced as [a] in one form, but the pronunciation of that same morpheme changes when some ending is added, to make something like [a?so]?then the lexical form will be just like the perceived pronunciation, and so the stored form is usually thought to code all the non-contrastive allophonic segments in many cases. The result is that allo- phonic ?processes? are no longer grammatical changes in this theory, but merely static grammatical knowledge of cooccurrence restrictions (phonotactics: see below). By itself, this does not predict that listeners? perception should be affected by allophonic patterns, 123 although it does not rule it out; however, the results from speech perception studies are somewhat troubling for this perspective. After all, if listeners do not reliably discriminate allophonically related segments, does this also mean that the distinction is not made any- where up the processing stream?and, crucially, is it visible to the phonological grammar? If not, then the right way of thinking of the cognitive status of allophonic processes is not as co-occurrence restrictions, because in that case there would simply be no way to state such a grammatical restriction. In any case, listeners unpack the allophonic processes and correctly recognize the words they affect, and learn, as speakers, to reproduce the allophonic pattern. Thus, some- thing must be said not only about how phonetic categories are learned, but also about how the allophonic patterns themselves are learned (along with everything else). I will save the discussion of why allophonic processes deserve this special attention for Chapters 4 and 5. I turn now to some learning models for each of these things, before introducing a new hypothesis about how allophony works, and reviewing a new learning model for both categories and allophonic processes. 3.1.2 Computational and mathematical models Imagine the following procedure: 1. Choose one of a finite number of items stochastically (maybe with some bias) 2. The items are each associated with a probability distribution over observables? say, sets of vowel formant measurements, or anything at all, like colors on a color wheel?so choose a particular observable in proportion to how likely the selected 124 probability distribution says it ought to be We start with a probability distribution over categories, (items), a finite set of possible values for some variable we could call z; we select from that distribution; we use the outcome to change how we (stochastically) select something else, call it y, an observable. This is a generative model for what is called a mixture model. It is called this because if we were to ask what the relative probabilities of different observables are, we would not be able to give just one answer: we would have to outline all the possibilities, one for each of the items we might have drawn?a mixture of different distributions. Now imagine doing the process in reverse: given some observable, choose one of the items. The criterion is to choose the ?best.? In our simple Bayesian formulation, we will say that what this means is to maximize the posterior probability of z given y; we will use Bayes? Rule to compute this (see Chapters 1 and 2); but this is not the only way of making this decision. There are many ways of doing inference back to the category selection in this way?but a phonetic category system always consists of a set of categories, each of which gives rise to a different perceptual map, and so the problem of perceiving speech sounds can necessarily be seen as having the same abstract structure as inverting a generative mixture model. Furthermore, the problem of learning the categories necessarily has the following structure: learn the phonetic maps associated with each possible phonetic map; learn the perceptual bias for different categories, if there is any. Then, so long as the preferences in the phonetic map and in the perceptual map follow probability theory, the phonetic category learning problem must be the problem of learning a mixture model. 125 Previous research on phonetic category learning often uses standard statistical meth- ods for fitting mixture models. The perceptual maps are generally multivariate Gaussian distributions (out of convenience), which, in our figures, will appear as in Figure 3.1. Figure 3.1 shows a mixture of three two-dimensional Gaussian distributions. Multidi- mensional Gaussian distributions generalize the single-variable Gaussian distribution?a symmetrical probability distribution based on a ?sum of squared error? computation, with a location parameter m setting the center, and a scale parameter s 2, the variance, setting how quickly the probability falls off away from the center?to multiple dimensions. The center becomes a p-dimensional vector, and the scale becomes a p-by-p matrix, listing not only the variance on the p dimensions, but also their covariance (unscaled correla- tion). The set of all observations with probability no less than some fixed p is an ellipsoid which is aligned with the axes if the variables on the p dimensions are not correlated, and otherwise has some rotation in proportion to the degree of correlation; it is the covariance matrix that sets this size and shape, while the location parameter is responsible for where the ellipse is centered. Previous phonetic category learning models employing Gaussian mixtures include: de Boer & Kuhl 2003 (long English corner vowels, =i=, ==, =u=); Vallabha, McClelland, Pons, Werker & Amano 2007 (four English and four Japanese vowels, =i= and =e= and their short?for English, lax?counterparts; their data was resampled from Gaussians; they also used a mixture of nonparametric category distributions in place of Gaussians in a second experiments); McMurray, Aslin & Toscano 2009 (voice onset times for English stops; data was resampled from Gaussians); Feldman, Griffiths & Morgan 2009 (all En- glish vowels, taken from Hillenbrand, Getty, Clark & Wheeler 1995; data was resampled 126 Figure 3.1: A mixture of two-dimensional Gaussian distributions. Multidimensional Gaussian distributions generalize the single-variable Gaussian distribution?a symmet- rical probability distribution based on a ?sum of squared error? computation, with a loca- tion parameter m setting the center, and a scale parameter s 2, the variance, setting how quickly the probability falls off away from the center?to multiple dimensions. The cen- ter becomes a p-length vector, and the scale becomes a p-by-p matrix, listing not only the variance on the p dimensions, but also their covariance (unscaled correlation). The set of points with total probability p is an ellipsoid which is aligned with the axes if the variables on the p dimensions are not correlated, and otherwise has some rotation in proportion to the degree of correlation. This mixture has three categories, all with the same covariance matrix. Each ellipse is a 66% confidence region: in any direction, the probability of the deviating farther than that from the center is the same as we go around the edge of the ellipse, and the total probability of all the points in the ellipse is 0.66. from Gaussians; the model included a second layer implementing a lexicon where the to- kens were assumed to be organized into sequences). Another clustering model that has been applied to phonetic category learning (the output of which can be interpreted as a non-Gaussian mixture model) is k-means (Hall & Smith 2006). Crucially, fitting all these models is done unsupervised: the learner is given a set of observations, and does not know any information about which tokens came from which categories; they must induce some 127 mixture model that fits the data well. Fitting a mixture model, as we said, involves selecting category maps, which, given that those category maps need to be specified in some way or another, means finding some parameter values for each. The most common approach today to fitting a mixture model in a Bayesian way is to put a nonparametric Bayesian prior on the parameter values, which is a prior on the set of distributions over the parameter values; the most popular is the Dirichlet process prior. To understand the idea of choosing a prior?which is a probability distribution itself?on the set of probability distributions over parameter values, and how this relates to mixture models, consider a concrete example, a mixture of multivariate Gaussians. Think of it first as a set: if there are three categories, the set should have three pairs of parameter values (or, in our earlier version, three items, each associated with a single set of parameter values): f D m 1 ;S1 E ; D m 2 ;S2 E ; D m 3 ;S3 E g. Each S gives the shape and orientation of a different set of ellipsoids, and each m gives the center for them. Now, since we also need, as part of the process of generating an observation in the generative model, some probabilities for selecting each of the different categories (the bias we talked about?even if it is implicit: 13 , 1 3 , 1 3)?and since any probability distribution has got to be stated over some set anyway?we can just say that, when we are choosing a particular mixture model, we are actually choosing a probability distribution on parameter values?a distribution where it just so happens that all but a finite subset of the parameter values (in this case exactly three) have zero probability. When we are choosing a mixture model, we are looking for distribution over a set. Thus, to use Bayesian inference to decide on a mixture model, we need a prior distribution over probability distributions. 128 A Dirichlet process (Ferguson 1973) is such a distribution. When we draw a sample from a Dirichlet process, we get back a discrete distribution on the set of parameters, meaning that that distribution puts non-zero probabilities on a countable subset of the parameters; this is exactly the sort of distribution we want to characterize a mixture model, as it captures the idea that the ?select a category? step is the selection of one of a collection of distinct items (there is not a continuous range of vowel phonemes in English, there are distinct equivalence classes). Although it might seem surprising that we have not said that the draws are distributions that put probability on finite sets of parameters, this is actually important, as it allows us to capture another fact, that of ?held-out probability?: there is always an outside chance that we observe a new and totally unfamiliar category. This might not seem important or even correct for an adult speaker of a language, but it is crucially important in learning: the learner needs to be able to posit categories it may not have heard before while keeping high probability on its old posited category structure, and in order to do this coherently, mixtures with high probability after some learning has taken place need to assign some some probability to the new category?otherwise, the mixture?s existing high probability could not influence its future posterior probability. This approach is not unreasonable for adult speech perception either: for an adult perceiver it might not be impossible to perceive a sound as belonging to a new category, even if the way speech perception works makes it unlikely. When we use a distribution drawn from a Dirichlet process prior as our way of selecting a category, we are said to have an infinite mixture model, because the model actually has infinitely many categories. (The word ?process? applied to a probability distribution just means that draws are of infinite dimension.) Of course, the number of categories that will actually be assigned in a finite sample must be 129 finite!?and if this number were not generally substantially smaller than the sample size the Dirichlet process prior would not be very useful as a mixture prior?it is, although we can adjust how many categories we expect by changing the hyperparameters of the Dirichlet process. To do this Bayesian inference?given some data and a prior on mixture models, find the parameters of a high-posterior mixture model?we would ideally be able to analyti- cally solve for the maximum-posterior mixture. We cannot do this in this case, unfortu- nately, so we might then appeal to the idea of drawing a sample of many possible mixture models from the posterior distribution and then finding one that has very high posterior probability. We can almost do this?we use a type of Markov chain Monte Carlo, which is a way of drawing samples that have the posterior distribution as a limiting distribution. The MCMC technique we use is called Gibbs sampling, (Geman & Geman 1984), which walks over each of the different variables in a model, one by one, replacing the current value with a sample taken from the posterior distribution of that variable conditional on the current values of all the others. We will see that the crucial variables in our model are the parameter values, which need to be assigned, point by point, to the data, giving rise, implicitly, to the mixture. We Gibbs sample one parameter value assignment conditional on all the rest of them. The models will get more complicated, because we will add hyper- parameters, but the essence of Gibbs sampling stays the same?given some initialization: (73) Gibbs sampling (one step). 1. a a sample from the distribution of AjB = b;C = c;D = d 2. b a sample from the distribution of BjA = a;C = c;D = d 3. c a sample from the distribution of CjA = a;B = b;D = d 130 4. d a sample from the distribution of DjA = a;B = b;C = c We do this for some number of steps, taking the ha;b;c;di tuple we are left with at the end of each iteration as a single sample. Usually we first run for a large number of iterations (?burn-in?), and use various rules of thumb to see whether the variables have actually con- verged to the posterior; after this we collect samples, throwing out a few in between each to correct for the fact that Gibbs samples are highly time-correlated (?lag?). If it is reason- able to calculate the value of the posterior for each sample we draw (which does not mean it was ever possible to maximize it analytically!) then we can pick the highest-posterior sample; we can also take an average where that is appropriate or possible (although it is not for our models). We have discussed mixture models, and we have outlined how we will be estimating them from data. Since we have said that mixture models capture systems of phonetic categories, but we will be studying systems of phonetic category and allophony, it is worth reviewing some of the approaches that have been taken in the computational literature to learning allophony. There are basically two sorts. First, there are systems that directly learn phonological grammars of one kind or another. Second, there are systems that try and come up with criteria for detecting allophony, without giving a full account of how the grammar would actually be learned (actually, there is only one system of note, but it is used in a few different papers). We will put all of the supervised phonological grammar learning models aside, which is to say, models that need to be provided explicitly with the correct underlying lexical representation as part of the input for each observed form. This leaves basically two kinds of learners that deal with allophony: learners of phonotactic grammars, 131 and learners that follow the heuristic of Peperkamp, Le Calvez, Nadal & Dupoux 2006. Phonotactic learning means learning sequencing restrictions on segments in a lan- guage. For example, in English, blick is a possible word, but bnick sounds very strange. That English speakers have this judgment reveals their knowledge of phonotactic restric- tions, many of which are language-specific and learned. As was discussed briefly above, standard Optimality Theory enforces a constraint that underlying forms should be just like surface forms unless there is an alternation to support a discrepancy, so much of allophony is treated as constraints on surface representations, not as changes that take place in the grammatical mapping. These constraints will at any rate be found by surface phonotac- tic learners, such as the Hayes & Wilson 2008 learner and the phonotactic component of the word segmentation model of Blanchard & Heinz 2008. These learners adjust their ex- pectations for phone sequences on the basis of observed sequences, so that, for example, in English, they would come to disprefer [#sth] sequences, with an aspirated stop out of position?strongly, presumably, because these will never occur in the pre-processed data that is usually presented to these learners (they could occur in a larger system subject to misperception). The heuristic of Peperkamp et al. is aimed at looking in finer detail at these sequence distributions with the explicit goal of constructing a rule relating two segments. The idea is to make pairwise comparisons between segments on the basis of their immediate left and right contexts?for example, [t] will not occur after # in English. The full context- distributional profile of [t] reveals that it is in complementary distribution with [th], and, as undergraduates learning phonology are instructed to do, a good learner is expected to take this as a cue to posit a rule relating the two in order to explain this fact. The Peperkamp et 132 al. heuristic is to quantify the discrepancy between the two context distributions in a way that is graded (using symmetrized KL-divergence), so if some illicit sequences actually do occur, the learner will still be able to detect that the distributional profiles of the allophoni- cally related elements are quite different. If a pattern is truly allophonic, then it does seem clear that relatively high KL-divergence is expected. Noise or other patterns will obscure this to some degree. The effect of this heuristic should be predicted, in some form, by any system that evaluates predictions on the basis of proposed allophonic processes, because these hypotheses will themselves predict high KL-divergence. Neither type of learning system is of immediate interest to us here. The proposal that I make in the next section takes allophony in a very different direction. In particular, it implies that there are no categorical surface representations (sequences of segments) that could be used to learn from the way these learners do. This will be discussed at greater length in Chapter 4 and defended empirically in Chapters 4 and 5. We move to that proposal now. 3.1.3 Phonetic transform hypothesis We have discussed the basic architecture of the phonological system: there are lex- ical representations that are sequences of discrete-valued segments. There is a mapping that converts between these kinds of representations, on the one hand, and the kinds of rep- resentations that are used in receptive and productive systems, ?phonetic representations,? a mapping called the phonological grammar. (Occasionally it is suggested that there are two different grammars, one for perception, and one for production, but it is usually un- 133 derstood that the two are not just mutually consistent but in some sense the same.) The phonological grammar gives rise to phonological alternations, which are changes to seg- ments that the grammar makes depending on the context they appear in, like the Spanish [b]/[?] alternation, or the English alternate/alternation alternation, or the Hungarian vowel harmony alternation, discussed in Chapter 1. Almost universally, the grammars that pho- nologists provide treat these alternations as changes from one discrete-valued sequence to another discrete-valued sequence of segments. On one end of the phonological grammar is the lexicon, and on the phonetic end of the phonological grammar the representations need to be consistent with ?phonetic interpretation,? which is to say that they need to be the kinds of representations that the cognitive systems in perception and production work with. The receptive system needs to take perceptual representations and map them to lexical representations via the grammar, and the production system needs to take the output of the grammar and pronounce it. It has long been recognized, however, that phonetic cognition is not simply a matter of static ?interpretation.? Rather, there also learned, context-dependent alternations that happen as part of a ?phonetic grammar.? The reasoning is simply that there are some alternations that do not seem to be discrete-valued. The idea is that the phonetic interpretation component also has the ability to make changes based on the context, and these can be learned and language-specific (this does not do justice to the reasoning: we will discuss why something like this has to follow from this premise in Chapter 4). Claims of context-sensitive gradient phonetic changes under architectures that clearly also support discrete-valued alternations as well are to be found in Sledd 1966, Liberman & Pierrehumbert 1984, Port & O?Dell 1986, Cohn 1990. 134 For example, Liberman & Pierrehumbert 1984 contrast (1) the alternation between the two types of pronunciations of the English indefinite article, [ej]/[?] versus [?n]/[?n], which depends (only) on whether the following word starts with a vowel or a consonant; and (2) the insertion of a short closure between [n] and [s] at the end of a syllable in Ameri- can English, so that tense is pronounced something like [tnts]. For the first, they write that ?the observed sound pattern is exactly what is expected if the phonological [category] rep- resentation contains an =n= in one case and not in the other?; but, for the second, although one proposal might be a discrete-valued rule inserting the segment [t], yielding [tnts] at the surface representation, there are systematic differences between the ?[t]? inserted in tense and underlying lexical [t] in the same environment, with the latter being systematically longer in duration. The crucial step in the reasoning is that this process does not reflect discrete categories, contrary to the usual understanding of what phonological rules do; but the first process does. The rule is gradient, thus it is understood to be part of a different, phonetic component of grammar. (If it seems unclear how this is distinguishable from an argument for a strict allophonic versus a neutralizing process, it is; see Chapter 4.) To make this serious, we need a theory of context-sensitive phonetic operations. I will propose the outlines of one. I will assume a framework where the grammar specifies the operations directly (a ?derivational? theory), rather than a framework, like Optimality Theory, where the grammar is represented as a set of constraints on the mapping, and then deduces what it needs to do for a given input. I believe this is an important step regardless of whether we think that phonetic grammars should be specified positively: we still need to understand what the ?compiled-out? grammars that we could obtain can and cannot look like. Here is a strong hypothesis about the phonetic transforms that form the basic 135 operations of context-dependent phonetic grammar: (74) Linear additive phonetic transform hypothesis (LPT). Phonetic transforms are additive and they are given by linear functions of the context. There is a lot embedded in this statement. Let us go over it, point by point, starting from the idea that phonetic representations are ?gradient,? in the sense that they can make much finer-grained distinctions than the discrete-valued categories we attribute to the lexicon. Phonetic representations Call the set of all possible phonetic representations S. Addition of phonetic representations There is an operation we call that applies to pairs of phonetic representations to give new ones. We will put aside for the moment which of the properties associated with addition we will attribute to . Before saying what it means to be ?linear,? we need to say what it means for a transformation to be a ?function of the context.? What this means is that contexts have representations too (of course). Context representations Call the set of all possible context representations A. Phonetic transformations Call the set of all possible phonetic transformations T . Trans- formations T 2T are mappings T : A! (V !V ). Additivity For any context a, a transformation T gives rise to a unique phonetic repre- sentation t(T (a)) such that T (a)(r) = r+ t(T (a)) for any phonetic representation r. Now we see how transformations can be functions of context. For them to be linear func- tions, we need to say something else about context representations. 136 Addition of contexts There is an operation we call that applies to pairs of contexts to give new ones. Context representations could either be phonetic representations or lexical-type (?phono- logical?) representations, or they could be qualitatively different from either, but they need to be recoverable from and convertible to the phonological representation output by the phonological grammar. If we make a firm statement here about their being exactly identi- cal to one or the other, we would be placing constraints on what phonological or phonetic representations need to look like, and that is not the goal at present. However, the crucial assumption in everything that follows is that they at least can be phonological. We now say what it means for transformations to be linear in the context. Linearity A transformation T is linear in A if t(T (a b)) = t(T (a)) t(T (b)). The usual is additional requirement added here is that we be able to ?scalar multiply? the context representation and get a transformation that has a scalar multiple of the effect. Since we have not said anything about scalar multiplication existing for context represen- tations, we will put this aside for the time being. Putting this together with additivity, we get the following. Linear additive transformations For all contexts a, b, T (a b)(r)= r (t(T (a)) t(T (b))). As we said, we do not want to say a lot about these two ?addition? operations here and let this lead us to a lot of new conclusions about phonetic and phonological representations (up to now we have not even said that they are addition-like in any way). Rather, we want to take what we know about phonetic and phonological representations to constrain the 137 operations, so that any theoretical conclusions we draw are based primarily on this one crucial premise. Thus we would like to constrain both of the combinators based on some basic facts about what it would mean for a context to affect a phonetic representation. To preview: we will resort to using real space for both, out of convenience, so that we can continue to use mixture-of-Gaussian type models, because, once we appeal to the use of Gaussians, we are going to need to be working with real numbers; as a result we will just think of these two operators as regular addition. However, we would like for only the simulation arguments in this chapter to rest on that assumption, (out of technical necessity), and none of the theoretical arguments in the following two chapters. Starting with , the idea is that two different contexts can have effects that will combine in some way (according to ) whenever the context is really a combination (according to ) of the two contexts. Without saying anything about how contexts are represented, the only circumstance in which two separate contexts should affect the same phonetic representation is if they are both present in the environment: an effect of pre- ceding coronals and an effect of following uvulars should only be combined on a vowel when there is a coronal preceding it and a uvular following it. In our version, we will use discrete phonological representations, (for the most part), which are binary valued, and we will represent them as 0/1-valued vectors, one bit for each feature. That is why we will be able to interpret as +: if we add a vector that is all zeroes except for the ?preceding coronal? position to a vector that is all zeroes except for the ?following uvular,? position, we will have a vector that is all zeroes except for the ?preceding coronal? position and the ?following uvular? position, interpreted as a conjunction of the propositions ?has a pre- ceding coronal? and ?has a following uvular.? Real addition would also have a sensible 138 meaning if we valued the context dimensions as 1 and +1 (which we will do when we talk about feature-based learning models later in the chapter). Then, if we add+coronal to coronal, we obtain 0 coronal, which we could interpret as removing the coronal feature from the representation, and thus, in this context, from the set of things that will affect the phonetic output. Moving on to , the idea of ?combining? phonetic representations needs to be taken in the context of what those phonetic representations give us. In terms of perception, we have been talking about the ?phonetic maps? for individual segments. These are the recog- nition models that tell the listener which phonetic (in this case perceptual) tokens to expect, given that a particular category is being uttered. The representation we said was used for a phonetic map above was a Gaussian (which is probabilistic, so in fact it says more than which tokens to expect, but how much to expect it). This sets up a bit of a tension. When we say ?phonetic tokens,? we really mean ?phonetic representations? in some sense. How- ever, when we talk about the ?phonetic representation? of a category being learned, we mean that some Gaussian distribution is learned. These two things are in conflict. It is true that we can specify a Gaussian by giving only one ?possible token,? standardly the mode (?center?) of the distribution. However, we cannot specify all Gaussians in this way; we can only specify Gaussians with some fixed size and shape. In general, we also need to specify these, via the covariance matrix, and, given the general acoustic shape of phonetic categories (Figure 3.1 above is typical), it seems like a good working assumption that the learner needs to do this too. The problem is that, to map out the category assignment of perceptual representations in a particular region, one needs to specify that region, which seems to be a different kind of information than just a single perceptual representation?at 139 the very least we need to provide more than one representation?so which type of repre- sentation is a ?phonetic representation?? The information associated with a category about how to recognize it, or the information being recognized? The problem when we start to talk about combining phonetic representations is, therefore, what we are adding to what. The usual Gaussian representation in terms of a location and scale gives us some helpful guidance as to how to resolve this. Suppose a Gaussian phonetic map of a single category really is what we want to call a ?phonetic representation.? Then real-vector addition of two such representations moves the center of the phonetic map to the sum of the two centers, which may be outside the interpretable range for another category (think of adding two front vowels, both with a second formant of about 3000 Hz); it also changes the shape in a way that does not make very much sense: adding two covariance matrices gives a family of ellipses for which the principal axes have been summed both in their orientation (add the eigenvectors) and their squared lengths. Thus, adding a category to another category will not give a very good result. However, there is no phonological pattern that we wanted to treat as adding two categories anyway; rather, what we want to add are phonetic representations that may not be interpretable as categories. In the model presented here, the type of transform we will use just changes the location. If we maintain that the specification of the Gaussian is a ?phonetic representation,? (which it must be in some sense), then to get the result that the location changes under vector addition, we need the transforms to be vectors containing some amount that will be added to the mean, and all zeroes where the covariance matrix would be affected. Thus the transforms we use have phonetic representations that could never really be categories, but we still consider them all phonetic representations, subject to (in this case) real-valued addition. 140 Tosum up: we have put forward a hypothesis about what the basic context-dependent phonetic operations are, namely, that they are ?additions? which combine in a way that preserves the ?addition? structure of the contexts that generate them. We have used this to put some constraints on what the addition of contexts should look like; we have spelled out the fact that the structure of the problem requires phonetic representations that serve two different purposes, categories and transforms, to look somewhat different. But what is the value of the LPT if we made the ?addition of contexts? operation to order and know nothing about what kind of ?addition? might preserve this structure, except that we are required talk, unnaturally, about ?adding? two types of phonetic representations that seem fundamentally different? The key is in the notion that could only be acting to combine contexts. We can continue our reasoning: if two different contexts are present, then we should be able to add them to get a context representation that states that both contexts are present (similarly for two absent contexts, the first present and the second absent, and so on). This is simply an accumulation of independent pieces of information, and, as such , like addition, should be commutative and associative: Associativity of (a+b)+ c = a+(b+ c) Commutativity of a+b = b+a This has consequences: must also be associative and commutative, at least over the 141 phonetic representations that we get as the effects of transforms: r (t(T (a)) (t(T (b)) t(T (c)))) = r ((t(T (a)) t(T (b))) t(T (c)))(75) r (t(T (a)) t(T (b))) = r (t(T (b)) t(T (a))) This is different from the types of changes we see in phonological grammars: changes in particular environments from one segment to another, or additions/deletions. These are associative (if we have three processes and want to convert the action of two of them a single process, it does not matter which two we pick), but they are not commutative: changing A! B and then B! C does not give the same result as changing B! C and then A! B (see Chapter 1). This has nothing to do with whether phonological gram- mars are monostratal or derivational; it is just a fact about the kinds of changes we see on strings, namely, that one segment in one environment will change in one way, while another will change in a different way, and this breaks commutativity of composition in general. If it never happened that grammars needed to be stated as compositions of multi- ple operations, and we therefore never had to face this fact, it would still be a fact, and the fact would therefore remain that, under this view, phonetic grammars are fundamentally different from phonological grammars. Now, there are some reasons we might think that might not be commutative: suppose that contextual information (assuming that it is discrete) is organized into feature hierarchies, as some phonologists have suggested. Then it might not make sense to ?add? a feature to contextual information without first adding its parent in the tree. Then the effect on the operation of would be different, but still predictable, namely, that the order of 142 composition would necessarily track the hierarchical organization of features. In any case, LPT has the effect of constraining the operation of phonetic transformation according to the means by which contextual information can be combined. This will be cashed out in the model presented in the next section as follows: since, under our assumptions of convenience about how to interpret all of this as real-valued Gaussian phonetic maps, we have decided that we are only going to allow phonetic trans- forms to affect the location of the Gaussian (so, they always add zero to the covariance matrix), we can state a phonetic category as a Gaussian linear model: the context is a vector for us, so separate all the different dimensions out into components x1, x2, and so on, up to xh 1. Then the transformed location of a phonetic category map that starts at a0, in context, is: T (a0) = a0 + x1a1 + + xh 1ah 1(76) If x is an augmented context vector, h1;x1; : : : ;xh 1i, then we can write this as a matrix multiplication: T (a0) = A T x(77) ?where A is an h p matrix (p being the number of dimensions of the phonetic location vector) where the first row is a0, the second is a1, and so on. This matrix will represent the category A, along with all the transformations that can apply to it. I finish this section with the following conjecture, which will be explored (and ex- 143 plained) in more detail in Chapters 4 and 5: allophony is phonetic grammar. This says that the output of the phonological grammar does not contain any of the information that is phonetically present but not contrastive lexically, like the epenthetic stop found after the [n] in English tense. It does contain the result of other processes, like the [] before the -ion in English alternation. The details of what exactly should count as an ?allophone??apart from saying that it is a pronunciation that can only ever occur as the result of a particular context-dependent process, and not lexically?will be left for Chapter 4. The conjecture is not new: ?It has been our experience,? write Liberman and Pierre- humbert, ?that cases of ?allophonic variation? often turn out to have properties like those of [the English tense phenomenon]. This leads us to suspect that a correct division of labor between phonological representation and phonetic implementation will leave the output of the phonology rather more abstract than it is usually assumed to be? (228?229); Kiparsky 1985 pointed to this suggestion as being of potential interest to phonologists. However, as far as I know, this dissertation is the first place the consequences for phonological theory or for phonological acquisition have been worked out. 3.2 A computational model: Dillon, Dunbar and Idsardi (2013) 3.2.1 Mixture of linear models Mixture models over some set of observables Y can all be divided into three parts: 144 Q a set of possible parameter values(78) P a discrete distribution over Q F jq a distribution over Y The distribution F jq is a ?category model?: each different value of q gives rise to a different set of expectations about the observables, and we call such a set of expectations a category model. The distribution F jq might be specified as a Gaussian distribution on Y with location parameter q if Q were real numbers that could act as the location; or as a Gaussian distribution on X with parameters q = h m ; s i if Q were made up of pairs of real numbers with positive real numbers that could act as the location and scale respectively; or as some other distribution on Y that somehow depends on q . The point is that different possible values of q specify different categories: for each, there is a different distribution F jq over the observables, which in our case are percepts. The distribution P gives our expectations about which of these categories will be realized, and it is discrete, which means that there are countably many categories. In all the cases we care about, the number of categories is really finite, but, as discussed above, moving to the countable case allows us to handle what it means for a percept to belong to a ?previously unseen category? neatly. Under the phonetic transform hypothesis, we assume that this phonetic model for a category?a single segment in the perceptual inventory?changes depending on the ?en- vironment,? which means some temporal window of context around the segment; fur- thermore, we assume that this change takes the form of a simple vector addition to the 145 location of the category which is a linear combination of all the pieces of information available about the context (?linear transform hypothesis?). In this model, we assume that a system of categories is actually a system of compos- ite objects: the mixture model for such a system is a mixture of linear models. From now on, we will illustrate these as in the rightmost picture in Figure 3.2. Two conventional mixture models are shown in the other two pictures, to draw the contrast. Each category in the conventional mixture models is drawn as an ellipse (in this case, the ellipses de- limit the fifty percent confidence regions of two-dimensional Gaussians). However, in the mixture of linear models, there is a shift in the category which depends on the context. In this case, we have shown a model sensitive to one environment, and as the context information was assumed to be a simple indicator (zero or one) in this particular system, we have illustrated the context-dependence by drawing two ellipses, one solid and one dotted, for each category. The solid lines show the categories in the environment repre- sented as 1 0 T and the dotted lines show the categories in the environment coded as 1 1 T . See above for more details. The explicit representation of a single category in a mixture model is a grammar for that category, which we hypothesize to share structure with the brain (in this case, the perceptual system) under some reasonably strong homomorphism. It specifies the cate- gory. In a mixture of linear models, it is a particular parameter matrix A, which contains the intercepts and the effects of different contextual features. The locations of the ellipses shown in the third panel in Figure 3.2, on the other hand, are not explicitly represented, except for the intercept. Rather, these ellipses are epiphenomenal, in the sense that they 146 Figure 3.2: Illustrations of two different conventional mixture models (left and center) and of a mixture of linear models (right). In the conventional mixture model, each category corresponds to a single distribution over the observables. In the mixture of linear models, each category is complex, and gives rise to a family of possible distributions, with different locations. In particular, the location is a linear combination of the contextual information. In this case, only two ellipses are shown, because this model shows a simple case where there is one piece of contextual information, which is a single bit. constitute derived structure. We make no claim about their cognitive reality, and the strong claim is that they have none. If we start with a vector Gaussian category model on p dimensions, assuming a mixture of linear models means that we move from F jq being a distribution specified by this density function? f (yj m ;S) = 1 (2p ) p 2 jSj 1 2 exp 1 2 (y m )TS 1(y m ) (79) ?to this, where x is again the vector coding all the information about the context, where the first element is always 1, ?member of the category in question?: f (yjA;S;x) = 1 (2 p ) p 2 jSj 1 2 exp 1 2 (y AT x)TS 1(y AT x) (80) 147 Given this, a Bayesian mixture model under a Dirichlet process prior is as follows: G DP(a G0)(81) Ai;Si G y i jAi;Si N (ATi xi;Si) This turns out to be a special case of the dependent Dirichlet process prior (MacEachern 1999). The conjugate prior on A;S is the inverse Wishart distribution compounded with the matrix normal distribution (Dawid 1981). This is a generalization of the conjugate prior in the case where the mean follows a multivariate (rather than a matrix) normal distribution, the normal-scaled-inverse-Wishart prior. In that prior, the distribution on the mean has covariance S (the data covariance) except by a constant w : if we use a wholly different covariance matrix we break conjugacy. Here we replace w with a full covariance matrixW, which specifies the variance on each row ofA as well as the covariances between rows. This is the matrix normal density, where h is total length of the context vector x: f (AjA0;S;W) = 1 (2p ) hp 2 jSj h2 jWj p 2 exp 1 2 tr n S 1 [A A0]T W 1 [A A0] o (82) When A follows a matrix normal distribution we will write A N (AjA0;S;W). Adding 148 several hyperparameters, we thus obtain the following model: a Gam(a ja;b)(83) A0 N (A0jM0;S0;Ih) W IW (WjF; l ) G0 =N (AjA0;S;W) IW (SjY; k ) G DP(a G0) Ai;Si G y i jAi;Si N (ATi xi;Si) One Gibbs sampler (the ?Chinese restaurant process? construction) works like this: we first add an arbitrary index zi = z j, hAi;Sii = A j;S j drawn from some other set. We then sample as follows, integrating out G: 1 1Note that the normalizing constant for this distribution over possible values of zi is not the one that makes the integral in this second clause come to one but rather one that makes the probabilities over all the logically possible zi values come to one, which is equal to the sum of all the different probabilities for existing z j, as given in the first clause, plus the probability of a different value, given in the second clause. The integral in the second clause (the ?likelihood integral?) does not come out to one, but since we know the form of each of the three terms, and since we know the form of a function that does integrate to one and differs from the integral only by a constant (namely, f (A;Sjy i ;xi), a normal inverse Wishart density again, by conjugacy), the integral can easily be computed by pulling out all the constant terms to yield w s, doing the same with the posterior density to get z s, and then noting that wz 1z s = wz 1. This is how to obtain the ratio given in the text. 149 (84) Algorithm 1. 1. For i in 1; : : : ;N (where N is the total number of data points): (a) zi is resampled from the distribution: P(zi = z) Nz fN (yijATj xi;S j), where Nz is the number of z j = z and fN is the normal density, when z = z j for some j 6= i P(zi = z) a A;S fN(yijA T xi;S) fN (AjA0;S;W) fIW (SjY; k ) dA;S, where fIW is the inverse Wishart density, when z is different from all z j; j 6=i. The integral works out to Gp k 0 2 jYj k2 . p p 2 Gp( k2 )jIh +Wxix T i j p 2 jY0j k 0 2 , where Y0 =Y+AT0W 1A0 + yiy T i h xiyTi +W 1A0 iT W 1 + xixTi 1 h xiyTi +W 1A0 i and k 0 = k +1. (b) If zi is distinct from all z j; j 6=i, hAi;Sii is resampled from PrA;S jyi , that is, the normal?inverse Wishart prior on A;S, multiplied by fN (yi jA T xi ;S) A;S fN (yi jA T xi ;S) fN (AjA0 ;S;W) fIW (SjY;k ) dA;S . This is another normal?inverseWishart distribution, with parameters: A00 = W 1 + xixTi 1 h xiy T i +W 1A0 i ;W0 = W 1 + xixTi 1 ;Y0; k 0 as above 2. For each distinct z, let all Si such that zi = z be a sample Sz from an inverse Wishart distribution with parameters: Y0z = Y+YzY Tz +AT0W 1A0 XzY T z +W 1A0 T W 1 +XzXTz 1 XzY T z +W 1A0 k 0z = k +Nz where Nz is the number of zi = z, Xz is an h Nz matrix consisting of all xi for which zi = z, Yz is the p Nz matrix consisting of all y i for which zi = z corresponding to Xz 3. For each distinct z, let all Ai such that zi = z be a sample Az from a normal distribution with parameters: A00;z = W 1 +XzXTz 1 XzY T z +W 1A0 S0z = the Si shared by the zi = z W0z = W 1 +XzXTz 1 4. Sample W from an inverse Wishart distribution with parameters: F0 =F+ z2Z (Az A0) TS 1(Az A0); l 0 = l + jZjp where Z is the set of all distinct zi 5. Sample vec(A0), the hp-dimensional column vector obtained by concatenating the columns of A0, from a normal distribution with parameters: location = 2 6 4[S0 Ih] 1 + 2 4 z2Z S 1z ! 1 W 3 5 1 3 7 5 1" [S0 Ih] 1 vec(M0)+ z2Z [Sz W] 1 vec(Ak) # scale = 2 6 4[S0 Ih] 1 + 2 4 z2Z S 1z ! 1 W 3 5 1 3 7 5 1 6. Sample a value x from a beta distribution with parameters a +1, N. Then sample a from a gamma distribution: with parameters a+ jZj, b logx, with probability proportional to (a+ jZ ? 1) with parameters a+ jZ ? 1, b logx, with probability proportional to N(b logx) The addition of the index variable allows us to modify the parameter values without changing the clustering (meaning the ?association? of yi with particular hAi;Sii); this al- lows us to capitalize on the fact that the clustering is often largely correct even when the 150 parameters are slightly off, which makes the sampler more efficient (Bush & MacEachern 1996).2 3.2.2 Summary of Inuktitut experiments In Dillon, Dunbar & Idsardi 2013, we reported a number of statistical learning ex- periments using data from Inuktitut, an Eskimo-Aleut language, one of the three official 2It may be convenient to collapse out the parameter values entirely, and sample only the in- dices (MacEachern 1994). As this makes the conditional distributions on the hyperparameters of G0, W and A0, non-conjugate, however, a Gibbs sampler becomes not only computationally in- tensive but also inefficient (since the sampling step for the hyperparameters must then itself use MCMC or rejection sampling). With fixed hyperparameters, a sampler would look like this: (85) Algorithm. 1. For i in 1; : : : ;N (where N is the total number of data points), zi is resampled from the distribution P(zi = z; for z = z j for some j 6= i) Nz A;S fN(yijA T xi;S) f (A;SjYz;Xz;A0;W) dA;S where f (A;SjYz;Xz;A0;W) is the posterior normal Inverse Wishart density. The resulting integral works out to Gp k 00 2 jW00 j p 2 jY j k 2 p p 2 Gp( k 2 )jW j p 2 jY00 j k 00 2 , where Y = Y+YzY Tz +AT0W 1A0 XzY T z +W 1A0 T W 1 +XzXTz 1 XzY T z +W 1A0 Y00 = h Y+YzY Tz + yiy T i i +AT0W 1A0 h xiy T i +XzY T z +W 1A0 iT W 1 +XzXTz + xixTi 1 h xiy T i +XzY T z +W 1A0 i W = W 1 +XzXTz 1 ; W00 = W 1 +XzXTz + xixTi 1 k = k +Nz, k 00 = k +Nz +1 P(zi = z; for z different from all z j; j 6=i) a A;S fN(yijA T xi;S) fN (AjA0;S;W) fIW (SjY; k ) dA;S as before. 2. Sample a value x from a beta distribution with parameters a +1, N. Then sample a from a gamma distribution: (a) with parameters a+ jZj, b logx, with probability proportional to (a+ jZ ? 1) (b) with parameters a+ jZ ? 1, b logx, with probability proportional to N(b logx) Given a sample consisting of a sequence of indices, (say, the highest-posterior sample), A,S for a given cluster can be maximized analytically under the normal inverse Wishart posterior. Sampling is generally somewhat less computationally intensive than updating the values of the various factors in the integral, but might be expected to be less efficient; however, my own experience indicates that it is difficult to find good fixed values for W, and so I will use the algorithm in the text, and variants on it, throughout. 151 2900 2426 1952 1478 1004 5301100 924 748 572 396 220 2 F? F? 2900 2426 1952 1478 1004 5301100 924 748 572 396 220 9 F? F? Figure 3.3: Vowel tokens from Inuktitut (see below for a description of the corpus). The dimensions are the second and first formant, which correspond closely to vowel backness and height. The ellipses in panel A are 66% confidence regions for multivariate Gaussians estimated by maximum likelihood for just the =i= tokens, just the =a= tokens, and just the the =u= tokens. In panel B, the ellipses are for Gaussians estimated separately for the tokens of [i], [e], (=i= tokens out of, and in, the uvular context), [a], [], [u], and [o]. languages of Nunavut territory in Canada. Inuktitut has three lexically contrastive vow- els, /i/, /u/, and /a/. An allophonic process of Inuktitut retracts the tongue root during the pronunciation of vowels when preceded or followed by a uvular consonant (Dorais 1986; Denis & Pollard 2008). For the sake of having some notation, I will say that, in this environment, /i/, /u/, and /a/ are pronounced as [e], [o], and [], respectively; see Figure 3.3. The process is morphologically productive in the sense discussed in Chapter 1: the word for ?pen? is titirauti, =titiauti=, which is pronounced in citation form as [titeauti], with a final [i]; embedded before another morpheme starting with a [t], it keeps this pro- nunciation, as in ?your pens,? titirautitit, [titeautitit]; but before [q], as in ?do you have a pen?,? titirautiqaqtunga, [titeauteqqtu?a]. We reported the results of three statistical experiments using data from Inuktitut: Experiment 1 fit a mixture of Gaussians, by sampling using Algorithm 1 followed by the 152 selection of a high-posterior sample, to data from Inuktitut (with no predictors); Experi- ment 2 fit the model to the same data, but with the uvular-environment tokens corrected for retraction: the numerical difference between the corpus average hF1;F2i in the retraction environment, and out of the retraction environment, was removed from all of the tokens appearing in the retraction environment, simulating a perceptual correction; finally, Ex- periment 3 used Algorithm 1 on the original, uncorrected data, with a single predictor, coded as 1 if a token appeared in a uvular environment, and 0 otherwise. In all three experiments, the solutions reliably had three categories corresponding well to the phonemes of Inuktitut. However, we argued that the models in Experiments 2 and 3, which suggested versions of the phonetic transform hypothesis, were better because they captured the lawful relations between certain types of vowel tokens (those belong- ing to a particular vowel, in, and out of, the uvular environment). Although learners do gain receptive knowledge of this relation, it is not obvious how to recover this relation from a three-category mixture of Gaussians which obliterates the distinctions between al- lophonic categories. By increasing the number of data points to obtain a larger number of categories, (see Antoniak 1974), we then also showed that the five- and six-category solutions found by a mixture of Gaussians, although they resembled the allophones of the Inuktitut vowels somewhat, (with or without the subtle [a]/[] distinction), were too dif- ferent from the actual allophones to remain in the contextual pattern: the assignment of vowel tokens to learned categories was such that the tokens that were assigned to cate- gories in similar locations to [e,,o] were no longer reliably points that actually occurred in the uvular environment, and similarly for [i,a,u]. Experiment 3, on the other hand, showed a model which learned the phonetic content of the retraction rule (via the regression ma- 153 trices A), while the output of the mixture of Gaussians in Experiment 1 would evidently not be very useful for learning about the retraction rule, since the distributions are too inaccurate. This poor alignment with clear statistical clusters corresponding to phones is somewhat different from the results of other mixture of Gaussians studies of phonetic cat- egory learning, (Vallabha, McClelland, Pons, Werker & Amano 2007, Feldman, Griffiths & Morgan 2009), but those studies? data were collections of points sampled from Gaus- sians estimated to real-life phoneme category data; this makes the data fit the assumptions of the model, which makes it easier for it to learn. We used the raw data directly: 239 mea- surements of hF1;F2i at the steady state taken in Praat in a study on Inuktitut phonetics, (Denis& Pollard 2008), in single-word elicitations from one female Inuktitut speaker from Kinggait. (Upsampling to increase the number of data points was done nonparametrically using a two-dimensional kernel density estimate, according to natural mixing proportions obtained from the Nunavut Hansard corpus: Martin, Johnson, Farley & Maclachlan 2003.) A summary of how the models estimated in Experiments 1?3 compare to the true classification of vowel tokens in the corpus is given in Table 3.1. The classification perfor- mance showed statistically significant improvement for the three-category solutions with respect to Experiment 1 in both Experiments 2 and 3, and for all the models (with what- ever number of categories) in Experiment 3. This was one reason we gave for thinking that a phonetic transform model of allophony better supported learning perceptual categories than a model where allophony is categorical: the classification models we obtained when allophony was learned as a phonetic transform were slightly better than those obtained when it was not, and these models are thus presumably more like the native speaker?s per- ceptual model. We attributed this improvement to the fact that these models can handle 154 the fact that lexical categories have multiple contextually determined statistical modes. The other argument we gave in support of the phonetic transform hypothesis was that the six-category mixture of Gaussian solutions found in Experiment 1 were not well enough aligned with the allophones to preserve the quasi-complementary distribution be- tween allophones: we computed symmetrized KL divergence scores between categories over the probability of being/not being in a uvular environment to assess the degree of complementarity in their distributions (following Peperkamp, Le Calvez, Nadal&Dupoux 2006). The KL divergence scores showed inconsistencies due to incorrect classifications that meant that pairs that ought to have had relatively low KL-divergence (low degree of complementarity, thus less surface evidence for allophony) had higher KL-divergence, and pairs that ought to have had relatively high KL-divergence (high degree of comple- mentarity, thus more surface evidence for allophony) had lower KL-divergence. These statistics are summarized for the estimated six-category model in Table 3.2. Graphs of representative fitted models are shown in Figure 3.4. 155 Supervised baseline F Precision Recall 1000 data points 0.84 0.83 0.85 vs allophones 0.64 0.66 0.63 12000 data points 0.79 0.79 0.79 vs allophones 0.69 0.64 0.76 Raw (239 points) 0.78 0.78 0.78 vs allophones 0.68 0.63 0.74 Experiment 1 F Precision Recall K = 1 2 3 4 5 6 1000 data points 0.70 (+.02) 0.66 (+.01) 0.74 (+.03) 0 0.125 0.625 0.125 0 0 vs allophones 0.60 (+.01) 0.50 (+.01) 0.76 (+.02) 12000 data points 0.65 (+.01) 0.68 (+.01) 0.63 (+.02) 0 0.1 0.5 0.1 0.2 0.1 vs allophones 0.58 (+.02) 0.53 (+.01) 0.63 (+.02) Raw (239 points) 0.65 ( .03) 0.59 ( .03) 0.76 ( .03) 0.1 0.2 0.7 0 0 0 vs allophones 0.47 ( .04) 0.34 ( .01) 0.81 ( .02) Experiment 2 F Precision Recall K = 1 2 3 4 5 6 1000 data points 0.73 (+.02) 0.67 (+.02) 0.82 (+.02) 0 0.4 0.5 0.1 0 0 12000 data points 0.74 (+.02) 0.72 (+.02) 0.76 (+.02) 0 0 0.875 0.125 0 0 Raw (239 points) 0.63 ( .01) 0.49 ( .02) 0.88 ( .01) 1.0 0 0 0 0 0 Experiment 3 F Precision Recall K = 1 2 3 4 5 6 1000 data points 0.75 (+.02) 0.71 (+.02) 0.80 (+.02) 0 0.111 0.889 0 0 0 12000 data points 0.69 (+.02) 0.65 (+.02) 0.76 (+.02) 0.143 0 0.571 0.286 0 0 Raw (239 points) 0.69 ( .01) 0.64 ( .00) 0.79 ( .01) 0.125 0.125 0.75 0 0 0 Table 3.1: Summary of model evaluations from Dillon, Dunbar & Idsardi 2013, showing pairwise agreement between models and true classifications as F , precision, and recall scores (pairwise agreement: for each pair of data points, agreement or failure to agree on whether they belong to a single category); and the distribution of number of categories for the MAP estimate across ten chains, run with complementary 10% held out test sets. See Dillon, Dunbar & Idsardi 2013 for details. Supervised baseline: Gaussian estimated on the points from a given category; pairwise agreement vs three-way phoneme classification except where noted. 156 [i] [e] [u] [o] [a] [] [i] 0 0.810 0.033 0.321 0.330 0.846 [e] ? 0 0.478 0.098 0.093 0.000 [u] ? ? 0 0.138 0.143 0.504 [o] ? ? ? 0 0.000 0.109 [a] ? ? ? ? 0 0.104 [] ? ? ? ? ? 0 Table 3.2: Complementarity of estimated categories? context distributions (i.e., P(token is in uvular environment), measured by symmetrized KL-divergence. Values computed over a six-category mixture classification found in Experiment 1; categories labelled by visual inspection to map to the closest actual allophone. The bold values are pairs that are actually allophones, and thus should have relatively high scores; the under- lined values are pairs that should not be labelled as allophones, because they are members of the same retracted/non-retracted class. All of the underlined scores would ideally be substantially lower than all of the bolded scores, but this is not the case: the classification under this model obscures the complementarity of the distributions. 2900 2426 1952 1478 1004 5301100 924 748 572 396 220 F? F? 2900 2426 1952 1478 1004 5301100 924 748 572 396 220 F? F? Figure 3.4: Example fitted models from Experiments 1 (left) and 3 (right) of Dillon, Dun- bar & Idsardi 2013. The shaded ellipses are the supervised baseline (two different base- lines, phonemic, and allophonic, are used to highlight the relevant underlying statistical patterns). 157 3.3 Selecting transform environments 3.3.1 Mixture of linear models with variable selection In a real-life situation, the learner does not know which of the large set of possible predictor values have an effect on pronunciation, and which do not: the learner does not know what environments trigger allophony. Since there is an identity 0 in the hypothesis space for any given phonetic transform, (at least in our real-valued formulation), we might think that we do not need to know anything beyond what we have already said: if there is no effect, then the best estimate of the phonetic effect ought to be around zero anyway. However, from a practical perspective, it seems likely that the computational cost of not having to search for all the logically if some of them were obviously negligible would be improved; this would certainly be useful for our Gibbs sampler if there were large A matrices to be sampled. More importantly, it would also trigger a Bayesian Occam?s Ra- zor effect?the learner would be doing a model evaluation exactly like the toy example of the lexical decision task in Chapter 2. This would be true even if the distinction be- tween setting phonetic transforms to zero and removing them entirely had no impact on any aspect of perception or production?that is, if this aspect of our formulation of the grammar was not part of the shared structure as regards the function the grammar com- putes. It would nevertheless be a part of the structure shared with learning, because it would impact the evaluation measure, if not the actual computational cost of making the evaluation. We would then be making a very clear (not necessarily easy to test) empirical prediction about learners? preferences. See Chapter 2. 158 The standard way to select different subsets of some available variables in a Bayesian way (the variable selection problem) is what is called a spike and slab model (Brown, Van- nucci & Fearn 1998). This addition to the model simply attaches an indicator g j to each of the predictor variables x j. If all g j = 1, then the model is exactly as before. However, if some g j = 0, then the model fixes the effect of predictor x j to be exactly zero. The prob- abilistic interpretation of this is as follows: if g j = 1, then the conditional distribution of the jth row of A is just as before?we say that this row is drawn ?from the slab.? If g j = 0, however, then the conditional distribution of that raw is an degenerate distribution with all of its mass at zero, the ?spike.?3 It is easy to show that for a Gaussian distribution (like the distribution on the A matrix), the distribution of any of the components conditional on the others is still Gaussian. Since degenerate distributions at a particular point can be seen as Gaussians with zero variance centered at that point, we obtain, conditional on g , a degenerate matrix Gaussian distribution where all the rows of the mean with g j = 0 are zero and all the rows and columns of W with g j = 0 are zero as well. To retain the same form for the distribution of A as in the previous model, we make it so that, conditional on g j = 0, row j and column j of W will all be set to zero (the resulting conditional distribution on the rest of the matrix is still Inverse Wishart with different degrees of freedom); and, conditional on g j = 0, row j of A0 will be all zeroes (the resulting conditional distribution on the rest of the matrix is still Gaussian). This gives rise to the following model: 3It is easy to see why we would refer to the g j = 1 distribution as a ?slab? when it is a uniform distribution, any picture of which in two dimensions will look like a flat slab of probability mass; other models like this are still called ?spike and slab,? even where the full priors that are used for the g j = 1 distribution not even particularly diffuse, just because they have the same structure otherwise. 159 a Gam(a ja;b)(86) g 1 = 1 g j Bernoulli(t ) for 2 j h A0;g N (A0;g jM0;g ;S0;g ;Ih) A0; g = 0 W g IW (W g jF g ; l h+ h j=1 g j) W g = 0 G0 =N (AjA0;S;W) IW (SjY; k ) G DP(a G0) Ai;Si G y i jAi;Si N (ATi xi;Si) Although it is certainly possible to construct a Gibbs sampler in which g is updated conditional on the current values of W and A0, in practice such a sampler cannot move away from local optima for g , because the principal effect of a change to g is to propagate down to the likelihood (throughW and A0) and thus impact the assignment of observations to categories in the Chinese restaurant process in a way which is reflected somewhat subtly in W and A0. Thus the values of g and z are sampled in block (throughout, h p location matrices subscripted with values of g are reduced to contain only the rows with g j = 1 and h p scale matrices subscripted with values of g contain only the rows and columns with g j = 1): 160 (87) Algorithm 2. 1. For j in 2; : : : ;h: (a) For g j = 0;1: i. A A g , W W g , A 0 A0;g , F F g , M 0 M0;g ii. h g2g j g+ g j , l l h+h iii. p g j jF j l 2 2 h l 2 (2 p )h Gh ( l 2 )jW j l +h 2 jS0 j h 2 etr n 12F W 1 o etr n 12 S 1 0 (A 0 M 0 ) T (A 0 M 0 ) o iv. For i in 1; : : : ;N (where N is the total number of data points): A. z i is resampled as before by computing the posterior for each z jz i , substituting A , W , A 0 for A, W, A0 B. New parameter values are sampled as necessary, conditional on g j and g j C. p g p g posterior[z i jz i ] for the selected z i (b) Sample g j from hp 0; p 1i (c) Set all zi according to the z i proposed under g j 2. For each distinct z, let all Si such that zi = z be a sample Sz from an inverse Wishart distribution with parameters: Y0z = Y+YzYTz +AT0;g W 1 g A0;g h Xz;g YTz +W 1g A0;g iT h W 1g +Xz;g XTz;g i 1 h Xz;g YTz +W 1g A0;g i k 0z = k +Nz 3. For each distinct z, let all Ai; g such that zi = z be zero and Ai;g be a sample Az;g from a normal distribution with parameters: A00;z = h W 1g +Xz;g XTz;g i 1 h Xz;g YTz +W 1g A0;g i S0z = the Si shared by the zi = z W0z = h W 1g +Xz;g XTz;g i 1 4. Sample W from an inverse Wishart distribution with parameters: F0 =F+ z2Z (Az A0) T S 1(Az A0); l 0 = l + jZjp where Z is the set of all distinct zi 5. Set vec(A0; g ) to 0 and sample the subset of vec(A0;g ) from a normal distribution with parameters: location = 2 6 4 h S0 Ih g i 1 + 2 4 z2Z S 1z ! 1 W g 3 5 1 3 7 5 1 " h S0 Ihg i 1 vec(M0;g )+ z2Z Sz Wg 1 vec(Ak;g ) # scale = 2 6 4 h S0 Ih g i 1 + 2 4 z2Z S 1z ! 1 W g 3 5 1 3 7 5 1 6. Sample a value x from a beta distribution with parameters a +1, N. Then sample a from a gamma distribution: with parameters a+ jZj, b logx, with probability proportional to (a+ jZ ? 1) with parameters a+ jZ ? 1, b logx, with probability proportional to N(b logx) Note that, during the sampling step for z, none of the existing parameter values for A are resampled in A for the proposal g j = 1, even if the current values were sampled with g j = 0; these values are still consistent with the proposal that g j = 1; for g j = 0, where the current values of A have density 0 in general, only the offending rows are updated in A 0 and A . For W we are not so lucky: the reduced matrix would be ruled out under the full inverse Wishart prior. The resampling step for W thus samples the full matrix (equivalent 161 to first sampling Wg and then sampling the rest from the prior, conditional on the sampled values of Wg , since with A0; g and A g set to 0 the relevant cells of the inverse Wishart scale matrix are unaffected by A g , even though it is sampled from W); the additional values simply serve as a proposed value in order to evaluate the likelihood integral and sample new parameter values in the Chinese restaurant process. The idea, in general, is that we sample all the model parameters in block with g , but bypass the full resampling steps for the numerical parameters until g has been fully updated. The probability with which g j is sampled is almost (but not quite) the joint posterior probability of g j, W, A0, and z. 3.3.2 Experiment: Inuktitut revisited To test this model we will use the same Inuktitut data as in Dillon, Dunbar & Idsardi 2013, with several basic questions in mind: 1. The model should reliably choose to set g = 1 for the uvular-environment predictor; does it? We will need to compare against a control predictor that should be set to g = 0. 2. Does the model reliably set the effects of other environments to be active also, and, if so, how well does this align with manual estimates of the size of these effects in the data? 3. The inference in this model is 2 (h 1) times as complex as for the old model; we may have concerns about the performance of the sampler as we increase h. 162 4. The feature marking the presence or absence of a uvular environment might be thought to be more or less available to the learner than features marking the presence or absence of various different segmental categories in the environment. We should compare how good the learner is at detecting the effects of the individual segments (in this case [q] and []) instead. 3.3.2.1 List of sub-experiments We now document the setup for each of the relevant experiments (numbered in se- quence following the Dillon, Dunbar & Idsardi 2013 experiments to avoid confusion): Experiment 4 We use the same uvularity predictor as before;4 we add a second predictor drawn uniformly at random from f0;1g. The data is as before and we manipulate the size of the data set again by upsampling.5 Experiments 5?7 As in Experiment 4, but we add the corresponding predictors for coronal (Experiment 5), velar (Experiment 6), and labial (Experiment 7) following conso- nants. Experiments 8?10 We combine the uvularity predictor with two (Experiment 8), three 4We use a different uvularity predictor than before. The earlier experiments used a predictor which was 1 if there was either a uvular preceding or following. Here, I used other predictors as well, and, since their effects might not have been similar when preceding and following, I restricted consideration to their effects when following the vowel; the post-uvular environment was used rather than the two-sided uvular environment for consistency. Results for the original predictor were qualitatively the same. 5In these experiments, I upsampled simply by adding Gaussian noise, rather than applying kernel smoothing, which is slightly different in that the variance is not fixed (though it is still small). The new samples are in multiples of the size of the original data set (239). This is because we had previously at- tempted to preserve the rough proportions of each phoneme?predictor value combination in the data set (or in fact to manipulate it slightly). Here I wished to simply preserve the proportions to make the data as com- parable as possible, but with multiple predictors, the size of the table of conditions increases, and the count in each cell decreases, thus making it more difficult to preserve the proportions accurately when increasing the size in non-integer multiples. 163 (Experiment 9), and four (Experiment 10) uniform random predictors, not correlated with each other. Experiment 11 As in Experiment 4, but we use two predictors, one for [q] and another for []. 3.3.2.2 Results As before, ten chains were run, each with a different held-out 10%. Three sample sizes were used (see footnote 5). Ten thousand burnin samples were drawn, followed by a sample of one thousand at a lag of seven. The highest posterior sample was selected. A small amount of annealing was found to help avoid the proliferation of categories; the temperature was lowered from 1 to 0.1 on z by exponential decay through to the end of burnin. Table 3.3 summarizes the results. 164 Experiment 4 (uvular) N F Precision Recall K = 2 3 4 5 6 7 R U 239 0.70 0.61 0.84 0.6 0.4 0 1.0 478 0.76 0.76 0.77 0.1 0.8 0.1 0 1.0 717 0.73 0.88 0.62 0.9 0.1 0 1.0 Experiment 5 (uvular and coronal) N F Precision Recall K = 2 3 4 5 6 R U C 239 0.67 0.56 0.86 0.8 0.2 0.4 1.0 0.1 478 0.73 0.74 0.74 0.2 0.5 0.2 0.1 0.2 1.0 0 717 0.74 0.90 0.62 0.9 0.1 0 1.0 0.4 Experiment 6 (uvular and velar) N F Precision Recall K = 2 3 4 5 6 7 R U V 239 0.68 0.58 0.84 0.7 0.3 0.1 1.0 0.1 478 0.76 0.75 0.77 0.1 0.8 0.1 0.1 1.0 0 717 0.73 0.91 0.61 0.4 0.4 0.2 0 1.0 0.4 Experiment 7 (uvular and labial) N F Precision Recall K = 2 3 4 5 6 7 R U L 239 0.64 0.50 0.86 1.0 0.1 1.0 0.1 478 0.73 0.73 0.75 0.2 0.6 0.1 0.1 0.1 1.0 0 717 0.73 0.90 0.62 0.8 0.1 0.1 0.1 1.0 0 Experiment 8 (uvular and two random) N F Precision Recall K = 2 3 4 5 6 7 R1 R2 U 239 0.64 0.50 0.89 1.0 0 0.7 1.0 478 0.75 0.68 0.85 0.4 0.6 0 0.1 1.0 717 0.78 0.77 0.80 0.1 0.7 0.2 0 0.1 1.0 Experiment 9 (uvular and three random) N F Precision Recall K = 2 3 4 5 6 7 R1 R2 R3 U 239 0.65 0.51 0.89 0.9 0.1 0.7 1 0.9 1.0 478 0.75 0.69 0.82 0.3 0.7 0.9 0.9 0.7 1.0 717 0.78 0.76 0.81 0.1 0.9 0.7 0.9 0.2 1.0 Experiment 10 (uvular and four random) N F Precision Recall K = 2 3 4 5 6 7 R1 R2 R3 R4 U 239 0.67 0.54 0.89 0.8 0.2 1.0 1.0 1.0 1.0 1.0 478 0.70 0.59 0.89 0.7 0.3 1.0 1.0 1.0 0.9 1.0 717 0.78 0.76 0.80 0.1 0.9 0.9 0.8 0.9 0.9 1.0 Experiment 11 (uvular stop, uvular fricative) N F Precision Recall K = 2 3 4 5 6 R q 239 0.77 0.78 0.77 1.0 0 1.0 1.0 478 0.72 0.75 0.69 0.7 0.1 0.2 0.2 1.0 1.0 717 0.65 0.80 0.55 0.1 0.9 0.1 1.0 1.0 Table 3.3: Quantitative summary of Experiments 4?11, described in 3.3.2.2. See Table 3.1 for an explanation. The columns at the right tabulate what proportion of the chains set g j to 1 for each of the variables. The R, R1?4 variables are the random control predictors. U is for uvular; V is for velar; C is for coronal; L is for labial. 165 We return to our questions: 1. Does the model reliably choose to set g = 1 for the uvular-environment predictor? Yes. Looking at both Experiment 4 and at all the other experiments including a uvular-environment predictor, (Experiments 5?10), the model consistently chooses to activate that predictor. In Experiments 4?7, with a random control predictor, this choice is clearly distinct from the choice of whether to activate the random predictor, which is activated rarely. 2. Does the model reliably set the effects of other environments to be active also? (a) Coronal: No. In the runs with the most balanced F-scores (478 data points), which also have the highest proportion of runs with three categories, the coro- nal predictor is deactivated. In the runs under the larger data set, the coronal predictor is activated reasonably often; this may be related to the poorer align- ment of these categories with the true phoneme categories. (b) Velar: No. The pattern is the same as for coronal. (c) Labial: No. As for coronal. 3. Does performance decline as we increase h? Yes. The model correctly deactivates the random predictors consistently when there are no more than three predictors to- tal (and even then only given that the category model is reasonably good: consider Experiments 5 and 8, where a random predictor is sometimes activated for two- category solutions). This suggests a very small limit on the number of predictor variables that can be meaningfully selected among. 166 Velar Coronal Labial Uvular F1 F2 F1 F2 F1 F2 F1 F2 a -38 12 -20 67 23 24 47 -69 i -19 128 -37 62 16 -101 101 -376 u -23 31 -39 75 -52 42 57 -102 Table 3.4: Difference between the Inuktitut vowel mean conditional on a particular follow- ing consonant place and the mean elsewhere, for the four different places of articulation. 4. Can the model detect the effects of the individual uvular segments ([q] and []) instead of their disjunction? Yes. Compare the results of Experiment 11 to Experiments 5?8, where, out of three predictors, only one was reliably activated; here, both the [q] and [] predictors are activated. To investigate these results somewhat further, we consider the effects of coronal, velar, and labial environments. Should we expect to find an effect of either of these two types of segments? If so, how do the model?s estimates compare to reality? The latter question is moot. For the coronal model, further investigation revealed that none of the three-category models had the coronal predictor activated; there was only one three-category model with the random control predictor active. The same held for the velar model, except that the random control predictor was active in two three-category models. For the labial model, none of the three-category models had either the labial or the random control predictor active. Thus no comparison could be made. The data itself shows that the means condi- tional on these environments are substantially more similar to the means elsewhere than in the case of uvular following consonants. See Table 3.4. The second issue is the model?s performance as we increase h, which is poor (with respect to the selection of predictors) as we go past three total. A reasonable explanation 167 is that the sampler might not move efficiently through the high-posterior regions of the hypothesis space, and in particular might either get completely stuck in a locally high- posterior region, or just move very slowly. This is definitely the problem: analysis of burnin shows the three-random-control model leaving the state with all g j set to one on an average of 0.1 percent of samples, scattered throughout burnin, despite the higher posterior value of these samples. At least for the Inuktitut data, this does not hinder the exploration of z: the high-posterior mixtures are generally reasonably good approximations to the three Inuktitut phonemes; the use of contextual predictors can improve the alignment to the true categories, but never seems to substantially worsen it. More efficient methods for sampling g should be investigated; we may be able to take this fact into consideration at least in some cases. 3.3.2.3 Discussion Given the above discussion of the linear phonetic transform hypothesis, we said nothing to imply that every contextual dimension would necessarily have an associated phonetic transform. This is what we are forced into, given the model presented in the previous section. In this section, we have introduced a model that learns which contexts trigger phonetic transforms. It can be said to be ?learning the environments? for these shifts. Although this characterization might be counter to the intuitions of some, because we provided the model with a list of potential triggering environments, it should be noted that the existence of such a list (finite or infinite) is implicit in any model that says there is a discrete set of transforms, and any debate about whether the list ?exists? can only be 168 a debate about how that information is represented (though representational questions are obviously highly relevant to the evaluation measure). The role of the Bayesian Occam?s Razor in a model like this is to guard against selections of less sparse solutions for the sake of only small improvements to fit (e.g., to capture the negligible ?effects? of the randomly generated predictors), in this case via the dimension of A0 and W. This yields a testable prediction, and the most careful empirical test of this part of the theory would look like this: (88) Find a satisfactory experimental methodology for training people (presumably infants) on novel phonetic categories, and assessing the resulting phonetic maps after training (89) Construct a data set which exposes subjects to sequences with phonetic tokens appearing in different environments, ensuring as well as possible that the environments provide salient cues which are not themselves subject to category learning (perhaps by tightly controlling their variance) (90) Manipulate, across conditions, the size of the phonetic effects of two different predictors; a critical comparison is between a condition with very different effect sizes for two predictors, one predictor?s effect being very small, (different condition), and one with two comparable, reasonably large effect sizes in different directions (same condition) (91) Evaluate the integrated likelihood for the set of phonetic category models with and without each of the predictors 169 (92) The subjects? resulting phonetic maps should reflect net zero influence of the small-effect predictor in the different condition, while the phonetic maps in the same condition should reflect influence of the predictors in proportion to, and in the direction of, their acoustic effects This experiment is reasonable but is beyond what psycholinguistics can currently achieve. First, there is one standard procedure for training on new phonetic categories (Maye, Werker&Gerken 2002) but it is unreliable (Peperkamp 2003, Gulian, Escudero&Boersma 2007). Second, although there is a standard method for probing listeners? phonetic maps (compare discrimination abilities for pairs of sounds in different parts of the phonetic space), fine-grained studies of what the changes in this measure due to contextual effects look like have not been done, to my knowledge. There are also unexplained phenomena found in this measure, such as effects of order of presentation of stimuli (Kuhl 1991). Third, discrimination measures in this experiment would be discrimination in context, which allows listeners to discriminate on the basis of the context, interfering with the probe of the phonetic map for the vowels; thus a different way of probing the phonetic map altogether would be preferable. Finally, although we could of course base likelihood computations off the Gaussian models used here, some predictions based on empirically grounded measures of ?goodness? for a given phonetic mixture and set of data points would be better. Although the model?s predictions are clear, therefore, they are difficult to test at present. Nevertheless, until this model is embedded in a larger model (including phone segmentation, word segmentation, categorical phonological grammar, and so on), it serves as the best demonstration possible that the phonetic transform hypothesis is a 170 feasible one from the point of view of learning. 3.3.3 Experiment: sex and gender differences The idea behind the model presented in this chapter is more general than allophony, in two different ways, which I will briefly demonstrate now. As outlined at the start of the chapter, we assume that context representations drive the effects of phonetic transforms. However, the mechanics of the mixture of linear mod- els do not rule out these scalars being real numbers. This says nothing of the cognitive architecture, of course, but it means that it is also possible for us to construct models where the predictors are continuous, not discrete values (presence or absence of a certain environment)?if we like to think about the (epiphenomenal) surface phones, then in that sense this makes the inventory of phones nonenumerable. Second, the predictors do not need to be anything to do with the segments in the im- mediate context; they could be any other property of the observation. One problem which has exactly the same structure as the allophone problem is the problem of indexical differ- ences in sociophonetics: characteristics of the speaker that are realized in the phonetics, including both the aspects which are physical, such as sex differences due to vocal tract length, and those that are the result of applying sociolinguistic knowledge in production, such as (strict) gender differences (Foulkes, Scobbie & Watt 2010). In this section, I demonstrate the applicability of the model to these more general cases by modelling a small vowel system with speaker variability, including both categor- ical and continuous predictors. 171 Experiments 12?13 English corner vowels, =i=, ==, =u=, taken from Hillenbrand, Getty, Clark & Wheeler 1995. 347 observations consisting of the first three formants, taken at a steady state, with three predictors: f0 (fundamental frequency), sex (male or female), and age (adult or child). See Figure 3.5. Estimation of a regular mixture of Gaussians (Experiment 12) and the current model (Experiment 13). Figure 3.5: English corner vowels. Shown here with no division by predictor; with divi- sion by age (children blue); with division by sex (females red); from left to right. As before, 10000 burnin samples were drawn, before taking a sample of size 1000 at a lag of 7, with ten chains each, with non-overlapping 10% subsets held out. The highest posterior sample was taken to be representative of a chain. For this model, the qualitative predictors were set to 1 and 1. Table 3.5 shows a quantitative assessment of the results. Experiment 12 (English, mixture of Gaussians) N F Precision Recall K = 2 3 4 347 0.95 0.98 0.93 0.7 0.3 Experiment 13 (English, variable selection) N F Precision Recall K = 2 3 4 R Sex Age f0 347 0.99 0.99 0.99 1.0 0 1.0 1.0 1.0 Table 3.5: Quantitative summary of results of mixture of Gaussians versus variable selec- tion model on English corner vowel data. The variable selection model reliably correctly rejects the control predictor, indi- 172 cating that there is no technical issue with spurious solutions as there was before. All the other predictors are consistently selected; this is consistent with the fact that the predictors all have substantial effect. An example model is shown in Figure 3.6. Figure 3.6: English corner vowels, one fitted model. Filled ellipses are Gaussians fit in a supervised way to individual sub-categories: no division by predictor; division by age; division by sex. Thicker lines are realized (epiphenomenal) categories: the dotted line is the prediction when the categorical predictors are coded as 0 and f0 is at its mean value for that category; green ellipses show the effect of f0, with predictions at the 25th and 75th percentile f0 for the given category; red ellipses show the two predictions for sex; blue ellipses show the two predictions for age. The idea that sociolinguistic and other talker differences could be handled using ex- actly the same cognitive apparatus as allophony is new; here I do not explore it in great depth, but I believe it is viable, and, given a phonetic transform view of allophony, there is nothing against uniting these two things. For the skeptical modularist worried about ex- tralinguistic information contaminating the grammar, is important to remember that (i) the information triggering the transform is extralinguistic, and is provided to the linguistic sys- tem in order to decide how to process the input (in a broad sense ?which grammar? to use, although that characterization applies equally to allophonic conditioning)?the phonetic representation coding the transform itself is not extralinguistic; and (ii) normalization for sociophonetic factors is just a fact (Evans & Iverson 2004)?so it needs to happen some- where along the line in the phonetic system. This model does that. Since the structure of 173 the contextual and the indexical variability problems is so similar, the idea that the pho- netic computations fundamentally change their character to deal with one as opposed to the other seems like it should not be the null hypothesis. Beyond this suggestion, the results indicate that learning relies to some extent on talker normalization: there is improvement in the resulting solutions when the talker-level predictors are added to the model, as compared to when they are omitted. Of course, there are other sources of information besides the simple correlation of some aspect of the data with a talker variable which might cue the learner to find ways to ignore variabil- ity in phonetic realizations: hearing words which are identifiably the same pronounced quite differently by different speakers would presumably provide some information about speaker-level variability. However, using lexical information like this does not solve the problem of talker variability by itself. In principle, hearing a few high-frequency words pronounced systematically differently from what is expected under the listener?s existing phonetic model could be a strong cue to the existence of a transform at the talker level; but learners do need to be able to generalize from this word to a model that will help them make sense of the stream in the future, and so it would not be enough to simply say that learners can fill in the category identity from context. Rather than just treating talker vari- ability as a kind of mispronunciation, this model allows for sociophonetic and other types of talker variability to be treated as systematic. 174 3.4 Proposed model: learning with features 3.4.1 Background: features, geometries, and the contrastive hierarchy In this section, a model is proposed (but not implemented) which learns phonetic categories that are cognitively structured using binary valued distinctive features. I review the reasons why one might want to make such a move now. Recall that, beyond phonetics, the ultimate problem of phonology is how the brain links phonetic representations with memory (the lexicon); the elements of the lexicon have some information relevant to phonetics (in addition to the syntactic and semantic infor- mation that links a word with its grammatical properties and its meaning); the information in a single lexical item is usually though to consist of a sequence of ?segments?; the in- formation in a single segment is usually thought to consist of feature?value pairs, and the values are usually thought to be binary. The idea behind the use of binary valued features in phonology is thus that they are the smallest units of lexical storage. A given set of feature value pairs induces an equivalence class, namely, the set of all objects with that set of feature?value pairs. For example, the vowel =i= is often described as being [+high][ back]. All phonetic realizations of =i= belong to this class. The vowel =u= is then usually described as being [+high][+back]. All phonetic realizations of =u= belong to this class. Notice how this is different from the models we have been using up to now. In particular, there is a logically possible class [+high] to which all phonetic real- izations of =i= and =u= belong; the mixture models we have been using treat each phonetic category as atomic, but feature-based models treat a phonetic category as complex; valued 175 features are the atoms. There are two crucial things here: first, segments belong to equivalence classes like =i= and =u= (and the assumption is often that nothing beyond these equivalence classes can be coded in memory, although this appears to be wrong: see, for example, McMurray, Tanenhaus & Aslin 2002). Second, segments are cross-classified: they belong to multiple equivalence classes. Even before grammar was given an explicitly mentalistic interpre- tation, and these equivalence classes were required to be psychologically active, many grammarians developed descriptions of phonological patterns that also crucially had to cross-classify segments. For example, P??ini?s 6th-century BCE grammar of classical Sanskrit contains the ?Shiva sutras,? which is a table of classes of Sanskrit phonemes set out as verses: each verse is essentially a list of phonemes and a dummy phoneme at the end marking a class that all those phonemes belong to. Rules in P??ini?s grammar then make reference to sequences which are of the form pP, where p is some phoneme symbol, and P is one of the dummy phonemes; this means the set of all phonemes starting from where p appears in the table, up to the end of the verse that P ends, where the ?table? is arranged by reading the verses left to right, from the first verse to the last. For example, all the sono- rants come in earlier verses than the obstruents, so to make reference to all the obstruents, one simply writes the symbol for the first phoneme that appears in the first verse with obstruents (which happens to be [dh]) followed by the symbol for the dummy phoneme at the end of the last line of obstruents (which happens to be [l]); however, one can make reference to narrower subclasses of obstruents, like the voiced unaspirated stops, by start- ing with the first such obstruent, [d], (from the second verse of obstruents), and ending with the dummy phoneme ending that verse, [], because that is the only verse containing 176 voiced unaspirated stops; see Kiparsky 1991. There are even more classes, too, not just the classes corresponding to to sets of verses, because one does need to start at the be- ginning of the verse?but the point is merely that (i) there are finitely many equivalence classes of segments?P??ini does not make reference to gradient phonetic detail?and (ii) the equivalence classes overlap. The reasons for (i) and (ii) in describing phonological patterns are clear: there are obvious patterns in pronunciation which seem to be of the form ?whenever a segment of equivalence class X is in environment Y, it is actually pronounced as equivalence class X0??the idea of a change from one form to another being backed up traditionally by mor- phological evidence about what the ?basic? segmental form of a morpheme is?or of the form ?segments of class X can never/only appear in the environment Y?; P??ini?s gram- mar, for example, describes Sanskrit with the rule ?obstruents [?[dh][l]?] change to voiced unaspirated stops [?[d][]?] before voiced stops [?[dh][]?],? making reference only to equiv- alence classes, and furthermore to equivalence classes cross-classified using features. As far as any grammarian knew, no reference to phonetic detail was necessary in describ- ing these patterns, ever: the pronunciations after one segment is changed to another are just like the corresponding pronunciations when they are not the result of a change (?un- derlying? in modern cognitive terminology), or when they are changed from some other class of segment. Crucially, for many processes, this turns out to be true (though not all, of course, as discussed above: see Chapter 5 for more discussion). The existence of such processes is the definitive linguistic argument for these equivalence classes, and, to the ex- tent that they must be cross-classifying to describe all such processes in a given language, for distinctive features. 177 Features can be leveraged for phonological patterns so that they have another ben- efit beyond defining overlapping classes, (iii), correspondence?so, in Sanskrit, the ob- struent [dh] changes to the corresponding voiced unaspirated stop, [d], not some other one. In modern linguistic theory, this process would be described as replacing the feature value for [voice] on those segments with [+voice], replacing the [continuant] value with [ continuant] (for ?stop consonant?), and replacing the feature value for [spread glottis] with [+spread glottis]. All the other feature values remain the same, and so different seg- ments can be made to change to different corresponding segments, by making reference to changes to particular features. There do indeed seem to be many processes that work this way?putting segments in correspondence according to independently motivated featural classifications?and, if one needed an argument for allowing correspondence, that would be it (but one does not, since correspondence does not require any extra mechanisms under modern feature theory).6 Moving forward to the twentieth century, the Prague school phonologists?most no- tably, Jakobson and Trubetskoy?like all other phonologists, had devices for (i), (ii), and (iii), and these are the ancestors of modern feature theory. There are several notable things about this literature. One thing which was not really new was that phonological features were required to have some phonetic meaning: [+high] needed to mean something having to do with the position of the tongue when pronouncing a vowel, different, at least in a relative sense, from [ high]?or to do with some acoustic or auditory property (in which case the feature would have some other name), or both at once. In fact, it had been true 6In P??ini?s grammar, this fact, correspondence, is actually not handled by exactly the same device used to define the two classes, obstruent and voiced obstruent?instead, by the ?closest place of articulation? clause, 1.1.50, which does not use the Shiva sutra verse features?but this is, at any rate, another cross- classification. 178 even for P??ini that phonological features were in some way phonetically grounded, as the Shiva sutras were clearly organized according to well-understood phonetic classifications of Sanskrit phonemes, but, in contrast with the Prague school (and even within the Prague school sometimes), some early twentieth century phonologists seemed to endorse the idea that the features were motivated entirely by the phonological patterns they capture?so that, say, the existence of the class of obstruents in Sanskrit would be motivated only by the fact that they are a set of segments that all undergo a change to voiced stops; this point of view will be discussed later, but suffice it to say that, at least for Jakobson, features were clearly phonetic, and were required to play a role in both speech perception and speech production. The strong idea is carried forward into mainstream phonological theory today: there is a fixed set of binary features, a set of substantive phonological universals. There is some difficulty posed by this, of course, since the pronuciation of, say =i=, will always be slightly different across languages, and so both in production and in perception it seems quite reasonable to say that, rather than each phonological feature having a fixed phonetic interpretation, there is a fixed set of biases towards forming various different phonetic kinds of features. A strong version of this idea is implicit in Jakobson, Fant & Halle 1952; Jakobson & Halle 1956: these works attempt to delimit all the possible phonological fea- tures and associate each with a description of their acoustic, articulatory, and auditory characteristics. The binary feature [strident], for example, has at its positive value spectra with a ?random distribution of black areas,? due to ?turbulence at the point of articula- tion,? and identification of manipulated speech sounds as being the [+strident] element of a pair of stops in perception experiments (which according to Jakobson, Fant and Halle 179 is the affricate, so, [t] as opposed to [t]) is most reliable when the duration of the sound is longer. At its negative value, (which Jakobson, Fant and Halle call mellow), it gives rise to ?spectrograms in which the black areas may form horizontal or vertical striations,? and in production lacks the ?supplementary barrier that offers additional resistance to the air stream? that the corresponding [+strident] will have, such as obstruction from the lower teeth or uvula. This gives rise to a strong suggestion that there is a one-to-one mapping be- tween auditory feature detectors tuned to certain acoustic properties, on the one hand, and gestures, on the other, which some theories of speech perception pursue (Fowler 1986); others propose that there is a specialized system early in perception that converts per- cepts into motor representations (Liberman & Mattingly 1985), so that all phonological computations can be done over production features, while still others propose that per- ceptual and production phonetics are linked only by a prediction from a ?forward model? to predict what the percept should be for a given lexical item, stored in terms of some production-oriented features (Stevens 2002). Finally, some authors have argued for a mix of perceptual and production features in the lexicon (Ladefoged 2005). Note that this pho- netic content in binary phonological features is asserted in spite of the fact that at some level the phonetics definitely contains graded information. Our assumption is that, even if the phonetic systems do have binary features in them, learned, language-specific computa- tions can be done over the graded information, not only over binary features; see Chapter 5 for discussion of the empirical issues. In addition to making phonological features play a key role in perception and speech production, Jakobson 1941 places even greater empirical demand on features. He claims that the oppositions they set up delimit how infants learn: infants go through a sequence 180 of increasing complexity in what sounds they can pronounce, first learning to make the distinction associated with one feature (say, consonants versus vowels), then refining this by adding another feature (according to Jakobson, high versus low on the vowel side, and oral versus nasal on the consonant side). There is a fixed set of binary phonetic features, and an ordering is stated over these features; children?s language acquisition follows this ordering, and makes a progression following these distinctions and not other ones. The reverse order was supposedly the order that distinctions were lost in aphasia. There is little empirical support for Jakobson?s claims (see ?The Acquisition of Phonological Invento- ries? for a review) but, more generally, it has been common throughout the literature to make phonological features act as a set of fundamental units common to various systems and processes and have broad effects in all of them. Given that the understanding is that (at least for a given language), there is a finite set of phonological features available, underspecification is the idea that a segment can be marked with some, but not all, of these possible features. There are two different ideas that sometimes get called ?underspecification.? One is that features do not need to be valued at all, and, instead, they are marked as present or absent; the other is that features do need to be valued, but not all the feature?value pairs need to appear. The first theory, a theory that features are privative, is fairly consistently understood to imply that the phonological grammar cannot change or condition on unspecified features, because they are not present in the representation to begin with. However, it is often unclear what the difference be- tween this theory and the binary value theory is as far as the phonetic interpretation of features goes: what is the difference between saying that the position of the tongue, or the value of the first formant, in a vowel, can be specified as [high], or contain no such informa- 181 tion, as versus the claim that the phonetic specification can either be [+high] or [ high]? In either case, there must be one specification that corresponds, phonetically, to high vow- els, and another that corresponds, phonetically, to low vowels. One answer is suggested by Lahiri & Reetz 2002: features may be underspecified in lexical representations, but those same features may be perceived?that is, the perceptual system may make use of them and attempt to look up words in memory using those features: the feature [coronal], which characterizes sounds like [t] and [n], is often thought to be underspecified in all languages. The prediction is that, in word recognition, other features, like [labial], which characterizes sounds like [p] and [m], should be able used as successful search probes for words with coronal consonants: the German word D?ne, ?dune,? should be accessed when a listener hears the non-word D?me, but the word Schramme, ?scratch,? should not be accessed when a listener hears the non-word Schrane. For some empirical evidence like this, see Cornell, Lahiri & Eulitz 2011; Lahiri & Reetz 2010; Scharinger, Lahiri & Eulitz 2010; Scharinger & Lahiri 2010. The second idea of underspecification, in contrast, does not just say that presence/ab- sence is just the way that positive and negative feature values are represented (with what- ever consequences that has); rather, the representation must contain a value, positive or negative, when it contains a given feature, but it does not need to contain every feature. There are thus three values, in the sense of three possible states, for a given feature, which sometimes get called+, , and 0; again, though, 0 values, when they mean ?unspecified,? generally imply that the feature is invisible to phonological operations. As for phonetics, the same perceptual predictions could be made, but, in this theory, because underspeci- fication implies that the feature is never specified at all, positively or negatively, there is 182 another possibility, suggested by Cohn 1993; Dyck 1995; Keating 1988: lack of specifi- cation implies greater variability in production, so that, for example, since Russian [x] is underspecified for the feature [back], but [], according to Keating, is specified as [+back], [x] has greater front?back variance in its pronunciation, in terms of the position of the tongue. Both theories can be combined, too: some features may be privative, others bi- nary. The notion of contrast is also relevant to much of phonological feature theory. Dresher 2009b argues that the Contrastivist hypothesis (Hall 2007) is to be found, implicitly or ex- plicitly, throughout the phonology literature: the idea that only a subset of the features that are needed to fill in all the relevant phonetic information will be able to be altered or conditioned on by phonological grammars, namely, those features which are contrastive. The notion ?contrastive? is usually understood with respect to the inventory of segments (specified as sets of feature?value pairs), and in particular the inventory of segment classes that are used somewhere in the lexicon. These set up the possible contrasts: for example, Inuktitut has a three-way contrast between =i= ([+high][ back]), =u= ([+high][+back]), and =a= ([ high][? back]). The reason it is difficult to say whether =a= is specified as + or for the [back] feature is not only because its phonetic realization is actually fairly central, (and so therefore neither clearly back nor clearly front), but also because it is the only low vowel that ever needs to be coded lexically, and, therefore, whichever of the two categories [ high][ back] or [ high][+back] the lexicon actually stores, it does not seem to ever use the other. We say that the feature [back] is not contrastive for low vowels in Inuktitut, and the Contrastivist hypothesis would therefore predict that the feature [back], whatever its value might be, is invisible to the phonological mapping when it appears on 183 a segment in combination with [ high]. Unfortunately, there is some difficulty in jumping from an idea of contrasts between segments to contrastiveness of features in this way, as Dresher points out: without the question about the phonetic backness of =a=, we could just as easily have said that there is a two-way height contrast between =u= ([+high][+back]) and =a= ([ high][+back]), but that, since there is no low front vowel corresponding to =i=, only the feature [back], but not the feature [high], is contrastive for =i=. Dresher appeals to an ordering of features to fully determine contrastiveness of features: one asks whether a particular feature is con- trastive for given all the contrastive specifications of the features previous in the order, so that [back] is not contrastive for =a= if high< back, while [high] is not contrastive for =i= if back < high. The idea is the same, however: by some criterion having to do with what segments are in the inventory?a criterion which will be dependent on setting a fea- ture ordering, according to Dresher?the specification of a given feature can determined to be contrastive, as opposed to redundant, for some subset of the inventory. The Con- trastivist hypothesis is often reduced to some version of contrastive specification (Steriade 1987, Dresher 2009b): non-contrastive features are always underspecified, and this is why phonological grammars cannot see non-contrastive features?they are absent. Throughout this section, we have seen that different empirical demands have been placed on distinctive features: are they there to explain phonological patterning only? which, in the cognitive view, makes them lexical units that the phonological grammar manipulates?or are they also part of the system of phonetics, either in perception or pro- duction or both? This question is frequently reduced to the following: are phonological features merely classificatory, grouping segments together for the purpose of stating the 184 patterns captured in phonological grammars, or are they (also) phonetic, specifying partic- ular information relating to perception or articulation?7 Chomsky 1964 attributes the first view to Bloomfield, and argues instead for lexical information to be grounded in universal phonetics, building on Jakobson?s ideas. The question is frequently further reduced to the following: are phonological fea- tures learned, or ?emergent,? (with no bias towards one type of feature or another coming from UG) or are they ?innate? (with substantial bias towards certain features, like vowel height, consonant place of articulation, and so on, with a bit of fine tuning learned for each language)? The reason that the ?weak versus strong? learning bias question is of- ten strongly linked to, or even equated with, the classificatory versus phonetic question (for example, Mor?n 2007, Mielke 2008) is that, first, sound patterns differ widely from language to language, and so the classification of sounds for patterning purposes must be different across languages, thus learned, thus totally arbitrary (this last step in the rea- soning is not justified by itself, but this is the argument that Mielke makes in favor of emergent features). The second part of the link seems to be that, since phonetic systems are limited by the auditory and articulatory systems, any phonetic features would need to be fairly tightly constrained in their content; this is true up to the ?any??one can also 7The understanding is, furthermore, almost always that features group segmental categories, like [i] and [u]. One line of argument against the classificatory view, therefore, would be that the basic facts about the segmental categories used in language cannot be learned, or even properly described, without reference to phonological features. The arguments above about how features must have some cognitive use in the phonetic system, if correct, strongly suggest an argument like this: if perceptual cognition works using features, and segments per se are not used in perception to at all, then in order for the classificatory view to be right, lexical storage would need to first wipe out all the featural information and turn the output of the phonetics into atomic categories, only to recreate a (different) system of features, and this would seem unlikely given how often phonological feature systems recapitulate phonetically grounded classes. If a phonetic category learning model doing featural analysis of the phonetics to form categories, as versus one treating categories as atomic, could be shown to be better in some way, it would be another suggestive argument of this kind. 185 imagine features that are phonetically grounded but learned in order to better encode the phonetic contrasts in a language. (Mor?n?s argument is that, since sign language and spo- ken language use different articulators, features must be learned: apart from relying on this assumption that features could not be phonetic and also be quite plastic, this also rests on the assumption that there could not simply be two different types of features used in the phonologies of sign versus spoken language.) To sum up: distinctive features are dimensions of cross-classification for speech sounds. Today, they have a standard interpretation as basic units of lexical storage within segments. They are usually understood to be either binary (either ?present/absent,? i.e., ?specified/underspecified,? or else ?+ value/ value?) or ternary (?absent/present with value +/present with value ?). Most phonological theories also use features to set up correspondences between segments. There is support for the equivalence classes and cor- respondences they set up from phonological patterning (again, see Chapter 5 for more discussion of the evidence for true equivalence classes in phonological processes). They are also understood to play a role in speech perception and in speech production: for per- ception, some authors take them to be fundamental units, while others take the segment or syllable to be fundamental, but still take features to be lexical units which can influence perception. For production, researchers agree that production systems need to make ref- erence to the various different motor dimensions of a segment, and that phonemes per se are not fundamental cognitive units in production; theories of perception also recognize the fact that the auditory system is multidimensional, and differ over how the perceptual dimensions are connected with production dimensions, and whether the systems share a common set of binary features (that is, whether the two sets of are linked by a very strong 186 homomorphism preserving the precise coding of all the lexical items they are associated with). Finally, different views of the role of features in phonetics correspond to an active debate in phonological theory, that of classificatory versus phonetic features. Minimally, however, there is good reason to think that that some cross-classification into equivalence classes is necessary to handle phonological patterning, and, since binary or binary+un- derspecification type features are the standard way to do this, it is worth asking how we would incorporate this into a phonetic category learning model, and, thus ultimately, into a phonetic grammar learning model. 3.4.2 Background: Bayesian category models with features Many statistical models also use binary valued features in a similar way to phono- logical feature systems. This is different from saying that there are many statistical models for binary- or categorical-valued data?it is generally implicit in any phonological feature theory that, if any of the phonetic values for features need to be learned, then the rele- vant ?data? (input in the auditory system) is not categorical. Instead, what it says is that these models learn binary-valued parameter vectors from potentially continuous-valued input. In particular, there are a number of category models?which is to say, mixture models?in which each category is in some sense ?made up of? a collection of categorical feature?value pairs. To see this, start with the Chinese restaurant process model?which does not have this property?considering Algorithm 1: the indices (category labels) zi are categorical in the sense that they are drawn from from an enumerable set. There is no reason that the 187 index set they are drawn from had to be the integers; the algorithm would have worked exactly the same if the indices had been Unicode symbols: Q, ?, ?, and so on. In this model, a category is uniquely associated with a single element of an index set U , an element which is in turn associated with a parameter vector q , from a different set Q (in our case hA;Si pairs). Specifying a particular value of q fills in the specification of a likelihood function connecting the model with the input (it gives us a ?component of the perceptual map? in our case). The association between elements of U and Q is arbitrary. Consider now how this differs from the phonological feature systems described in the previous section. A single segmental storage unit (category) is associated with a set of feature?value pairs. If all the features are binary (or n-ary, etc) then we can say that the feature values are all drawn from a single set U , but each category is represented, now, not by a single value u 2U , but by a mapping c :F !U , taking elements of some set of ?possible features? as input (we discussed features like [high] and [back] above: these would be two different elements of F ). Put aside for the moment the fact that the features (the elements of F ) are understood to have content (that is, there ?is? a feature [high] that is different from the feature [back] not because of the values it can take, but because of something about what it ?means? to be [+high] as opposed to [+back]). There is a still more fundamental difference between the CRP model, known as a latent class model, and the feature model of lexical storage: categories are complex, not simplex. A single observation (if we are thinking about observations of individual segments) must be paired with more than just a single element in order to be considered a ?member? of a particular category. In the statistics literature, models in which a single ?category? is actually a complex object, are called latent feature models. We will review some of 188 these models now. However, before we continue: it is important to pay attention to the fact that these models, unlike the feature model of lexical storage, almost always work under the assumption that features do not have a priori content; a particular category will be represented in these models as just a sequence of feature values, not a mapping from features to values. A sequence can of course be seen as a mapping c in which the domain F is the integers, but, for the purposes of these models, it can just as easily also be seen as a mapping in whichF is the Unicode table?the order that features are in in a model like this is just a way of keeping straight distinct items, not a way of giving them content; there is no sense in which models like this could ?have? a feature [high] or [back] or anything along those lines, because the features are interchangeable. We will see this spelled out right now, and we will return to it later on, and discuss how to extend these models to obtain more constrained latent feature models. To begin with an example: the standard Bayesian latent feature model is the Indian buffet process (IBP) due to Griffiths & Ghahramani 2006. In this model, the number of features is infinite and values are binary. The sampling scheme set out by Griffiths and Ghahramani is very similar to Algorithm 1 above, except that each observation must now be associated with, not a single category, but a vector of features; rather than evaluating, for each existing simplex category, the probability of assigning the observation to that category, plus some probability for adding a new category, we evaluate the probability of assigning each feature the value 1 for this observation, and, with some probability, set some number of new features to 1. In the IBP, the probability of setting feature k to 1 for a new observation (conditional on feature specifications for the previous observations) is NkN , where Nk is the number of previous observations with k = 1, and N is the total number of 189 observations (including the new one), weighted by the likelihood under that specification, as in the DP; the distribution on the number of new features used is Poisson with rate parameter aN , where a is a hyperparameter, while the unconditional distribution on the total number of features used is Poisson with rate parameter a , meaning that we should think of a as the average number of features used by any given observation. Similar to the Chinese restaurant process, the ?add to existing? probabilities increase in the number of other observations with the property, only, here, the property is not ?belongs to the given category? but rather ?has the given feature as 1.? Unlike in the Chinese restaurant process, however, the probability of creating a new (previously unobserved) category is not limited to the probability of adding new features: it is possible to create a new category out of old features, simply by selecting a combination that has not been added before. Latent feature models thus decompose categories. A ?category? in a latent feature model like the IBP is some combination of binary feature values; but remember that cate- gories need to be associated with parameter values to do anything. If categories are decom- posed in a latent feature model, it would seem to suggest that we must have decomposed the parameter values as well?after all, if the new feature combination f1 = +/ f2 = was not required to have anything in common, via the feature f1 = +, with the category f1 = +/ f2 = +, then the practical value of decomposing the categories using features be- comes much less obvious. What is the equivalent of sampling a new parameter value in the IBP? In general it depends on what the features are modelling (they do not need to be mod- elling phonetic categories), but the key is that there is some parameter value q f associated with each feature f , and there is some structure in the set of likelihood functions which is preserved under conjunction of features, so that l [ f1^ f2^ ] = l [ f1] l [ f2] . The 190 example that Griffiths and Ghahramani use is of multivariate Gaussian likelihoods with locations differing as a function of the category?rather like our own phonetic category models?and with being the operation that simply sums the locations of the Gaussian likelihood functions, so that, for example, m f1^ f2 = m f1 + m f2 . This is actually a linear model just like our own: written in vector notation, if z is the binary column vector rep- resenting a category, written out as zeroes and ones, and we stack the various Gaussian locations, one per row, in a matrix A, then the location for a particular category is AT z.8 The key differences here are that: (i) the equivalent of the predictor vector x is latent, not observed, (ii) it is infinitely long (and so in fact we should really call it a function, not a vector), and (iii) the MLM model picks out a mixture of A matrices, whereas this IBP based model will only have one A matrix. Notice that it still gives rise to infinitely many latent categories, however, because there are infinitely many latent features. Some examples of what good models for the Inuktitut data would look like (focusing on just the non-retracted allophones) are shown in Figure 3.7. Most notable Bayesian latent feature models are intended to extend IBP in some way. For example, the dependent IBP (Williamson, Orbanz & Ghahramani 2010) is an extension to the IBP to allow the assignment of features to observations to depend in some measure on the values of a set of observed predictors; the power-law IBP (Teh & G?r?r 8Griffiths and Ghahramani call the individual m values, rather than the binary values, the ?feature val- ues.? They separate out the binary feature vector as a ?sparseness vector,? which we might instead call a ?variable selection? vector, or a ?feature selection? vector, following our terminology above. This differ- ence with respect to the phonologist?s perspective on features is merely terminological: what phonologists call ?binary feature values,? we might instead call the ?feature selector values? associated with a segment; what phonologists call the ?phonetic contents? of a feature,? we might instead call the ?feature value,? here meaning the ?value? associated with the feature. For current purposes, we will stick to the phonologist?s terminology?the IBP learns binary feature values plus some associated parameters, the intrinsic content of those features. 191 Figure 3.7: Three different ways that the clusters in a three-vowel system might be de- composed if the likelihood function in a latent feature model simply adds Gaussian means. All of the clusters here share some feature f0, which need not be enforced (but could be). In the leftmost model we show a combination of features which is predicted to be possible but which should never be used (if the model finds a cluster assignment consistent with the right one): something like [+high][ back][ front], where we are calling the ?zero? setting for a feature the setting (although we could have called it the underspecified setting just as easily). 2009) allows the number of features used by an observation to follow power-law distri- butions but is otherwise the same; and, while the IBP feature assignments have the same distribution regardless of the order in which the assignments are made across observations (they are exchangeable), a general framework for handling dependencies among observa- tions by coding relations between them in a tree structure is the phylogenetic IBP (Miller, Griffiths & Jordan 2009). The Beta process (Hjort 1990) might be seen as an exception in that it is a latent category model, and did not start from the basis of IBP (it was devel- oped earlier); however, BP turns out to have IBP as a particular construction of a special case (Thibaux & Jordan 2007), so that all the extensions of IBP are also extensions of Beta processes. To take the Beta process as an example of a more general latent feature model: while the feature value assignments under an IBP have a single hyperparameter a regulating the number of features used by an observation, under a BP prior, an additional parameter c explicitly trades off the addition of new features against old ones, similar to 192 the Dirichlet process prior: the probability of using a given existing feature k goes down from NkN to Nk N+c 1 , while the parameter of the Poisson distribution on the number of new features added is now scaled down from aN to ca N+c 1 ; the IBP is the case where c = 1. Smaller values of c encourage the reuse of features, and larger values discourage the reuse of features. Once the IBP is understood, however, the general flavor of all Bayesian infi- nite latent feature models is fairly clear. Of course, latent feature models need not have infinitely many features: for exam- ple, it is simple enough to add an inference for the value of x to the MLM described above, so that the model becomes a Dirichlet process mixture of finite latent feature models. As phonological feature theories often attempt to delimit a finite universal feature alphabet, this might seem to be a key difference between phonological feature theory and the promi- nent current Bayesian latent feature models (and a favorable one, since finite-dimensional prior distributions are easier to work with and derive). However, there is a general uncer- tainty about the universality of the set of features, and so it would be imprudent to enforce this constraint too strictly; more importantly, there are deeper differences to be reconciled. Before proceeding, it is worth making one point very clear: despite the allusions to the ?presence or absence? of a feature in the current section, the binary features in a statistical latent feature model are not necessarily privative. We can think of the binary values just as easily as mapping to + and as to present or absent. This means that we can use the exact same model to capture either full-specification binary feature models or underspecification models with privative features only; which we choose is a matter of how we interpret the feature values, and in particular how we choose a likelihood (for example, whether we choose to use a likelihood like Griffiths and Ghahramani?s, where 193 the 0 value for a feature implies an actual zero effect on the likelihood function). 3.4.3 Feature-based phonetic category models: goals for future research In the standard IBP model, the content of a feature is drawn from some distribution which is the same across all features. When any new feature is introduced, it is always associated with a parameter value taken from the same set, drawn from the same probabil- ity distribution (like the fixed prior distribution over Gaussian location components in the Griffiths and Ghahramani model discussed above). This is at odds with the usual phono- logical model: if a [high] feature is used to specify some category, it is surely the case that another [high] feature will not be employed, a constraint which means something if (and only if) there is some phonetic constraint on what it means to be a [high] feature. This is possible to have even if the precise phonetic values for a feature need to be learned, and which case what is universal for a given feature is some particular distribution over pos- sible parameter values, not a particular phonetic value. This constraint on how parameter selection works comes in two parts. Here is how we state the first part: (93) FM1. The selection of parameter values for a given feature can be separated into two steps: (i) select a distribution on parameter values; (ii) select a parameter value from that distribution. Universal phonetics approaches to feature theory claim that the distribution over distributions in step (i) is concentrated at a small number of biologically fixed points (although it would not be inconsistent with the prevailing opinion to think that some probability mass is held out for ?totally learned? features). 194 Here is the second part: (94) FM2. Step (i) above is dependent on the other step-(i) selections in the following strong way: conditional on a distribution G1 being associated with a feature, the probability of G1 being assigned to a new feature is zero. The other way that the IBP-like models do not align with phonology is when it comes to underspecification. In particular, the binary-valued underspecification hypothesis says that features can be omitted, not only specified as one of their two values. Conceptually, this is a matter of specifying an inference over now ternary-valued features in two steps, first, via a distribution on whether the feature will be present or absent, and then via a distribution on its binary value (in addition to specifying its phonetic content). The Contrastivist hypothesis further asserts that there is some independent and mean- ingful notion of contrast which drives what gets specified lexically: in particular, certain feature specifications contain information which is not contrastive given the rest of the specification. However, as discussed above, the classical notion of ?contrastiveness,? by which a feature must non-trivially partition a finite set of already-learned categories com- plete with binary feature specifications, presupposes a classificatory view of features, or, at least, a classificatory view of underspecification. A system learning phonetic feature specifications using a contrastiveness criterion, on the other hand, would need to use a very different notion of ?contrastiveness,? one based on the degree of explanation of the phonetic data?because before learning the categories, that is all it would have available to set its criteria around. 195 There are therefore two very different types of models that could be built: one, the standard classificatory view, would divorce underspecification from phonetic feature specification, and build it into a second step in which the lexical representation for an ob- servation is determined. A data point?s ?surface? featural specification z would determine the phonetic likelihood function, but a higher-level, ?lexical? feature specification l would also need to be inferred, from which z would be be predicted; the parameters of a mapping between l and z is what would need to be learned. The ?surface? level does not actually to be the conventional ?phonetic surface? level, (in a larger model, it might be the input to the phonological mapping from the lexical side instead), but the mapping should be able to predict features that must be present on z but which are absent on l. A single-state transducer with lexical representations subject to some simplicity-biased prior would be a good place to start. The alternative approach would be to make the idea of ?contrastiveness? translate entirely into an evaluation of the phonetic likelihood function, thus, some notion of ?good- ness of fit.? In the IBP, the probability of setting an existing feature to 1 or 0 is determined by the number of observations with that feature set to 1 and the likelihood of the ob- servation after setting the feature to 1/0. Consider the leftmost model, for example, in Figure 3.7: in this hypothetical set of clusters I have not drawn the clusters correspond- ing to m ( f0)+ m ( f1) and m ( f0)+ m ( f1); but such feature assignments ([ high][+front] and [ high][+back]) are possible. The idea is that such feature assignments would just never be made because they would not give very high likelihood to any observations: they would not fit the data well. In a phonetic sense, there ?is no contrast? between [+front] and [+front] or [+back] and [+back] for the low vowels. Rather than being interpreted 196 as meaning that there are no already-formed phones corresponding to these features, we would instead interpret ?non-contrastive? as meaning there is no perceptual evidence to support phones corresponding to these feature combinations, that is, for low vowels, the features [front] and [back] area not phonetically contrastive. In the IBP, it is not the case that there is a preference to have a feature set to zero once it is in use: sparser assignments of 1s to existing features are not ?simpler? for the IBP. Rather, any time a feature is in use by at least one observation, then an additional selection of 1 or 0 will need to be made for all the other observations. However, this does mean that consequently the probability of a feature vector will be scaled down across the board, because one more probability needs to be multiplied in, and this reduction in probability is clearly an instance of the Bayesian Occam?s Razor?but with respect to how many features are in use by at least one observation. The relevant pair of larger and smaller model hyperparameters is a set of K previously-in-use features versus a set of K + 1 features, and a BOR is yielded because the shape of the distribution over the first K feature values is the same under one of the choices for the Kth (in fact, both, due to the independence of feature selections for a given observation). Thus sparseness is enforced for the set of features in use taken as a whole, but once a feature is in use by at least one observation, then it is in use, and the selection must be made for all observations; the only way to get the BOR effect is to take the feature out of use entirely. With the addition of a second inference, however, we can do the following: for each feature in use, choose whether to specify the feature; choose its feature value if it is specified. Now it is not only a smaller set of features overall that will be preferred, but also a smaller number of specified features for a given observation, because each 197 decision to specify a feature induces its own BOR effect on the feature specification vector for that observation as a whole. It would still not be possible to learn an outright ban certain feature combinations because they are ?redundant? in the classical contrastivist sense?this model still ultimately relies on the likelihood function to assess ?contrast? and unusual observations still might give rise to unusual feature combinations?but it would at least give rise to the possibility of underspecification of binary feature values, and there would be a preference for underspecification. To sum this up, the two possible versions of contrastive underspecification are as follows: (95) FM3a. For each observation, the feature specification must be converted to a lexical representation via some mapping that must be learned. Underspecification can arise if the mapping enforces sparseness on the number of features. (96) FM3b. The specification of feature k for observation i, zik, is a two-part choice: first between 0 (underspecified) and 1 (specified), and then between + and , conditional on the first choice coming out as 1. The first choice relies on a separation between lexical and non-lexical featural represen- tations, via a mapping which sets out what are and are not possible (or likely) lexical representations; the mapping needs to predict non-lexical from lexical feature specifica- tions. Enforcing sparseness of lexical representations can therefore penalize lexical rep- resentations which are equally predictive but more complex (like full specification ver- sus underspecification). This is a version of the classical contrastivist idea: features will be underspecified if their binary values can be predicted from other values. The second 198 choice does not make such a separation. This is a model where the choice to specify or underspecify relies only on how well the observed phonetic values can be predicted from one specification or the other. It relies only on what might be called a notion of ?phonetic contrast,? and it does so no more than any other phonetic category learning model we have talked about. Finally, we must discuss the likelihood function. Above, we gave examples of what phonetic features might be learned under a Griffiths and Ghahramani-type linear Gaussian model. These models did not look very much like the usual feature models in phonology, because a feature value could either be 1 or 0, and, if it was 0, it induced a zero change in the likelihood. There was no way to have a single feature give rise to two different non-zero effects. However, recall our change in coding in the gender model above: the same set of induced categories can be learned with a different representation if we code the feature values as 1 and 1 rather than 1 and 0. See Figure 3.8 for an example of what this might look like. It is much more like what we would conventionally expect phonological features to do for vowels. Strictly speaking, it is not necessary to change to an underspecification model in order to use this coding, (a standard IBP model would work), but that is necessary in order to get a model to come out looking like Figure 3.8: the low vowel category =a= is underspecified for backness. A more important difficulty in specifying the likelihood function is the problem of the covariance. The problem does not arise for Griffiths and Ghahramani, because their features have fixed covariance. Recall our discussion above, in which we said that we would deal with the problem of needing to apply to full category map ?phonetic repre- sentations,? which somewhat oddly, implied that we could sometimes add the shapes of 199 Figure 3.8: One way that the clusters in a three-vowel system might be decomposed if the likelihood function in a latent feature model with feature values +1 and 1 adds Gaussian means. All of the clusters here share some feature f0, which need not be en- forced (but could be). If points are to be assigned assigned to the cluster with center at + m ( f0) m ( f1), then this model is only possible if there is also a zero value?that is, if it is an underspecification model. Those points (the vowel =a=) will need to be un- derspecified for the feature f2, the backness feature. Under the model discussed in the text, it will still be possible to assign tokens the feature values f0 = 1; f1 = 1; f2 = 1 and f0 = 1; f1 = 1; f2 = 1, corresponding to the locations + m ( f0) m ( f1) + m ( f2) and + m ( f0) m ( f1) m ( f2), but we would expect these points to have implausibly low likelihood. phonetic categories together, by saying that it simply never arose that we would need to do that. Now, we need to do something almost exactly like that: we need to build cate- gories out of smaller pieces, and the linear Gaussian model says that we add the pieces together (with +). It is almost certainly not the case that we could coherently account for all the differences in the shapes of (perceptual) phonetic categories that arise using addi- tion of covariance matrices. There are a number of ways we could get around this while keeping the Gaussian assumption, (we could hold the orientation of the Gaussian fixed 200 and hope that we could get away with adding and then squaring to obtain the eigenvalues of the covariance matrix), but all would add some complications to inference. A proper exploration of how the featural composition of a phonetic category determines its shape, if at all, is in order; it would be interesting to take the idea seriously that the addition of features is by and thus affects the shape in the same way as phonetic transforms do. I will leave an implementation for future research, but it is worth reviewing some of the technical issues here for the interested reader: FM1 makes the likelihood more difficult to compute?but not impossible, particularly if it is possible to integrate out the parameter values that go into the likelihood functions; FM1 places unresolved empirical demands on the construction of the model?but the assumption could be dropped; the re- search strategy would be to assume that features are totally learned until modelling results deem it necessary to add a bias, and the research program would begin by evaluating how well modelling results align with the linguistic typology of features; if there is truly no held-out probability for learned features, then FM1 makes the feature model finite, which is easy to work with; FM2 makes the distribution on the sequence of parameter values nonindependent, but it could be dropped if UG were such that choosing the same fea- ture distribution twice would be unlikely to yield high likelihoods (or simply by dropping FM1 altogether, as discussed); FM3a requires additional machinery be added to a stan- dard IBP model, but there is nothing particularly novel about that machinery (including a probability of adding particular features to z conditional on the makeup of l); FM3b con- stitutes a fairly small change to the IBP. Finally, and perhaps most importantly, we need to work out a proper model by which combination of features translates into combination of parameters, keeping in mind our concerns about specifying the covariance. 201 3.5 Further issues Consider again the Inuktitut vowel data, shown in Figure 3.9. This data shows al- lophones with different covariances. Under the real-space hypothesis, we can interpret the size and shape of these distributions directly as representations of components of a phonetic map (or at least what must go into it and come out correctly identified). 2900 2426 1952 1478 1004 5301100 924 748 572 396 220 2 F? F? 2900 2426 1952 1478 1004 5301100 924 748 572 396 220 9 F? F? Figure 3.9: Second and first formant values for Inuktitut vowel tokens (repeated from Figure 3.3). This does not comport with how we have implemented the LPT. There are at least three possibilities: Additions to covariance It is possible to make whatever adjustments we need to the scale parameter of a Gaussian of using addition; more generally, we can also adjust what- ever representation of a category we have using addition, including the scale (on some other representation the location and scale might not be independent). Changes to The way that works is not simple addition; it only looks like simple addition if we consider the location of categories. 202 Scaling of dimensions The perceptual dimensions are scaled differently from the acoustic dimensions. The categories do have the same covariance matrices in perceptual space. The idea behind the last suggestion is that the vowel space might be on a different scale, and, if we simply changed the scale on the axes, we would get all the covariances looking like they were the same shape across allophones. (There are, of course, various psy- chophysical scales?bark, mel, etc?which are presumably more like the low-level au- ditory input than Hertz, but these are fairly close to linear for much of the vowel space, and at any rate do not impact the appearance of this data much.) The problem with this is that the allophones, taken as a whole, seem to be rescaled in a way that does not depend only on their location in the vowel space, but also on the fact that they are allophones: in addition to being shifted downward and back, they are scaled down in their variances. If the scale being different were the thing responsible, we would expect the scales to be the same between allophones of a single phoneme, given how close they are to each other in formant space. The first two possibilities are very difficult to distinguish, as it will generally be possible to change our representation of the shape of the category; the details will depend on both the evaluation measure and the way scale effects track . I will leave the empirical details to be worked out; but there are several interesting patterns that can be found in the retraction data, which may prove relevant. First, the retracted vowels have smaller variance in both dimensions than the non-retracted vowels, and the retracted low vowel also has a smaller variance F1, although the variance increases 203 in the F2 dimension; this suggests there might be something systematic about the change in scale, either tracking the allophonic/underlying status of the category, or the degree to which it has been retracted. The pattern is imperfect, and we will see in a moment a case where the changes in scale are somewhat more arbitrary; but we will also see a suggestion of a pattern like this again in Chapter 5. The vowels are also closer together when they are retracted. This is almost certainly due to physical factors: limitations on the position of the tongue demand that the low vowel not be shifted as far as the two high vowels. This should influence our thinking on how works: as we reach the edge of the vowel space, the sample covariances must necessarily become more compressed. Whether the perceptual maps reflect the sample covariances in this way is a separate question, but if they do, then needs to operate in a way that takes this into account. We might instead be led to think that the locations follow from the absolute position in phonetic space to some degree: perhaps the three are specified as equal transforms, but the computation of corrects for the limits on possible vowels. There are some facts, however, that suggest strongly that this is not the explanation for the compression of the vowel space. Consider the example shown in the leftmost panel in Figure 3.10 of retraction in Kalaallisut, which is another Inuit language, the official language of Greenland; it has the same vowel inventory and the same rule as Inuktitut. The retraction in Kalaallisut appears to be more pronounced than in our Inuktitut data. The variances show a different pattern than in Inuktitut, as they are not clearly scaled down in the retracted vowels in the way they are in Inuktitut. There is some reduction in the variance in the F2 dimension, but this could just as easily be attributable to the effect of 204 Figure 3.10: Vowel categories from Kalaallisut (see below for a description of the cor- pus). The vowel ellipses in the left panel are unretracted (grey) and retracted due to the influence of a following uvular consonant (red); in the middle panel, not fronted (grey) and fronted due to the influence of a preceding coronal consonant (green); in the right panel, unretracted and unfronted (grey; in fact, all the grey ellipses in all three panels are unretracted and unfronted) and both retracted and fronted (yellow). reaching the back of the vowel space. Now, Kalaallisut also has an allophonic rule which fronts vowels slightly, particularly =u=, after coronals. This is shown in the middle panel. Here, again, the vowels are closer together than in their default pronunciations, but this, too, could be attributable to reaching a corner of the vowel space, and it could literally be the effect of a grammatically active constraint on possible outputs, limiting the ultimate locations of the targets?rather than a rule which happens to respect this constraint. This might be our explanation if we considered the two processes separately; but it cannot be correct. The right panel shows what happens when both processes apply: the vowels that have been both retracted and fronted are still tightly clustered together, even though they are now in the middle of the vowel space. The action of the two transforms is such that the vowel space winds up compressed, although there is no physical reason for it to be so any longer. We may also be able to use this case to shed light on the operation of per se. For the retracted vowels, the difference between fronted and unfronted =u=?where fronting 205 is most pronounced?is far less than in the unretracted =u=, a difference of about 122 Hz in F1, as opposed to 395 Hz for unretracted =u=. This suggests that is not simply doing addition of the (rescaled) acoustic values. It should not be taken as relating to associativity or commutativity: we have been given no independent evidence about different ways of combining the effects of transforms. Rather, we have been shown one way?the actual way?that phonetic transforms combine, and we have seen that it deviates from ordinary addition. One approach to handling this might that the phonetic representations are all rela- tive, not absolute, and they therefore combine by something more like multiplication?but how to apply this so that it allows us to state the full range of transforms, and at the same time state the categories themselves, is not totally clear. Another issue raised by the application of this model to talker variability?or rather, by the presentation of a concrete model of talker variability perception?is the distinction between I-language and E-language (Chomsky 1986). An I-language is simply whatever (?internal?) state of a particular human?s brain permits it to process, produce, and assign meaning to language utterances?the mental grammar, but the term is set up to contrast against an E-language, which is an (?external?) set of utterances, or utterances paired with referents, or situations, or whatever other collective externalization of language one could find to label ?a language.? As Chomsky points out, it is much easier to say precisely what constitutes an I-language than what constitutes an E-language, and, in spite of some researchers who take the study of language to be the study of E-languages, it seems more reasonable to study I-languages, not only because the study of mind is interesting, but also because we ought ultimately be able to derive whatever properties an E-language happens to have from properties of the minds that speak it. 206 One nice consequence of this is that it deproblematizes the question of what consti- tutes a ?language,? familiar from comparative linguistics: why are Bosnian/Serbian/Croa- tian/Montenegrin different ?languages? when their standard forms are in fact basically the same apart from the reflex of Proto-Slavic yat, while the three sub-?dialects? of Croatian (Shtokavian, Kajkavian, and Chakavian) are not mutually intelligible? The answer is that it is beside the point; the terms ?language? and ?dialect? are put to inconsistent use de- pending on the circumstances, and the desperate need for a clear distinction is motivated only by the misconception that E-languages are natural kinds. The I-language perspective makes it all right if they are not, and if, say, E-languages are only derivative properties of interacting collections of I-languages, which is to say, individual people?s grammars which have slightly different properties from one person to the next. Grammars may be more or less similar, but there need not be any meaningful points at which two grammars change from being idiolect-different to dialect-different to language-different. Nevertheless, taking the speaker model to its logical extreme, in which each speaker has a separate sub-model, there is a question that is raised: why does the learner construct a common model for all these speakers in the first place? That is, why are the speaker- dependent phonemes not simply separate phonemes? After all, if everyone around you has a different grammar, why not simply take this at face value? Why assume that they speak the same ?language?? There are reasons, of course: simplicity biases militate against simply adding new lexical items, phonemes, and so on, just because they were spoken by someone different; having a noise model, without which even within-speaker variability would be impossible to process, allows slightly different pronunciations to be assimilated across speakers as well; and the apparatus must exist to handle allophony within a sin- 207 gle speaker, and so effects of talker-level variables certainly can be partialled out in the phonetic grammar. Nevertheless, the world did not have to be this way: it could have been the case that I-languages were devices which the brain constructed specially to op- erate on input from one single other I-language, but it is not the case: although they can contain adaptations to handle individual-level variability, I-languages are not (only) mod- els of individual I-languages. In the speaker model, I set the scalars coding male/female, young/old to be 1 and 1, which meant that the common, neutral content of a category did indeed have a cognitive status (it is the intercept in the A matrix); even if this had not been the case, the common category label would have been shared across different levels of the talker variables despite the fact that the phonetic interpretation of the category at each of those levels differs. Thus, in one way or another, properties common to speakers, and not only to individuals, have a cognitive status. This is not surprising, of course, nor does the fact that the grammar gives some inter- nal status to a common ?language? mean that that ?language? must align with constructs like ?English,? ?Dutch,? or ?Chinese.? (Presumably, however, there is some status given to arbitrarily constructed language/dialect-level indicators that allow us to make use of the information that a speaker used the word yall or vous-autres to switch us, categori- cally, to a system that will make sure we say coke or liqueur douce rather than pop, as we might normally be inclined to say?and which will, more importantly, give us some general expectations about that speaker?s phonology, grammar, vocabulary, and so on.) Nevertheless, it does mean that ?I-language? should not be the lens through which we view everything. Finally, the discussion of a model in which features rather than atomic categories are 208 learned raises the question of what, then, it would mean to be an allophonic process at all for the purposes of our conjecture: that allophonic processes?those processes that do not output existing lexically contrastive categories?are phonetic and gradient. What would it mean if there was no primitive notion of ?category?? We could modify the criterion in one of two ways: we could either say that the output of a process was a strict allophone if it generated a novel combination of (possibly elsewhere contrastive) features; or only if it made use of a nowhere-contrastive feature. This latter is dangerous, however, as it implies that allophonic processes simply add feature?value pairs, (or at least can), which is at odds with the way we described phonetic transforms as working. The problem is that the notion of ?non-contrastive category? in the definition of a strict allophone was really intended to be a descriptive, or even an external, description of a phenomenon, for the purposes of attaching an explanation in the form of phonetic transforms; it does not need to be replaced when we move from an atomic representation of lexical segments?which are internal?to a feature-based one. I discuss these issues at greater length in Chapters 4 and 5. 3.6 Summary In this chapter, I have highlighted the existence of context-dependent phonetic gram- mar, and put forward a new hypothesis about it, namely, that context-dependent phonetic transforms are linear and additive. I have used this to show not only how we can learn the phonetic contents of allophonic processes, and how this can improve our ability to learn phonetic categories, but also to show how we can learn the environments in which 209 they occur. I have also gone over the basic statistical principles by which we can set up phonetic category models, and shown what would be involved in making such a phonetic category model reflect the idea from phonology that categories are inherently decomposed into features (as well as some of the barriers to doing that). I have presented some pho- netic data bearing on the way that the context-dependent phonetic transformations work. Finally, I have raised a further conjecture: all allophonic processes are phonetic. The rest of this dissertation investigates this idea further. 210 Chapter 4: Phonetic transforms I: The cognitive architecture The spirit talks in spectrums He talks mother earth to father sky ?Joni Mitchell, ?Don Juan?s Reckless Daughter? 4.1 The phonetic surface Chapter 3 introduced the linear phonetic transform hypothesis (LPT) and then put forward the conjecture that allophony is due to the application of phonetic processes. Be- fore we begin any discussion of the implications of this, we need to say what we mean by ?allophony.? We are referring to the existence of a certain type of relation between phonetic categories, which is to say a certain type of relation between equivalence classes over phonetic tokens. What kind of equivalence class? Take it as given that we can sort out which seg- ments belong to which categories, including the categories we would like to identify as allophones, not as learners but as analysts: that is, suppose we are given a descriptive oracle which can turn a speech signal into a sequence of what are ordinarily called phones (unlike the mixture of Gaussians procedure discussed above in the first set of Inuktitut experiments, which was not good enough to deliver such an oracle). Although the reader 211 may have already gleaned that our theory implies that nothing even approximating this oracle is part of phonological cognition, in order to have a coherent definition of what constitutes an allophone, phones must exist descriptively. Whether such an oracle would be based on acoustic statistics, or some kind of measurement of the muscle movement involved in production, or whether it would actually be impossible to build, talk about al- lophones, descriptively, is talk about the output of such an oracle; that is all we are saying here. Given an oracle, there are two related patterns I will be folding under the term ?al- lophony.? First, the very shallow property of complementary distribution. Complemen- tary distribution simply says that a pair of oracle categories never occur in exactly the same segmental context (which we need to take to mean ?in the same n-phone sequence, for some reasonable n, often 3). If we take any transcription from the oracle, take two categories and replace all instances of both with question marks, then, if they are in com- plementary distribution, the remainder of the transcription is always sufficient information to restore the question marks, and conversely (up to a minimal n). Now, there are many things that get called allophony that do not show complementary distribution, and vice versa: Non-surface allophony ?Obscured? complementary distribution is often still considered allophony. Take the words write and ride. With respect to the pronunciation of the second one, [rajd], the pronunciation of the first is relatively short, central, and slightly raised in many Northern dialects of North American English: [rjt]. If an oracle could distinguish [j] and [aj], they would be in complementary distribution in 212 many cases like this. The key difference is the voicing on the following segment. But North American English also pronounces both writer and rider with the flap [] as in water [w?r] in place of both the coronal stops [t] and [d] ([] is in complementary distribution with both of those). As a result, writer is pronounced [rj?r] and rider [raj?r]. So, in fact, the oracle would tell us that, looking at this larger sample of the language, [j] and [aj] are not actually in complementary distribution. Still, they are ?allophones? under many descriptions, and especially if the phonological grammar works in multiple steps, of which a conversion of [t] and [d] to [] is just one. At the level of representation before this conversion happens, [j] and [aj] do stand in complementary distribution. Given the oracle, they do not stand in complementary distribution, but they fall into the same class as things that do if we assume a bit about the grammatical system that underlies this complementary distribution. Mismatching pairs In Spanish, [b] is in complementary distribution with [?] and [d] is in complementary distribution with [?]. However, as [?] and [d] have the exact same restricted distribution, it is also true that [b] is in complementary distribution with [?] and [d] is in complementary distribution with [?]. Nevertheless, [b] and [?] are not considered allophones, and neither are [d] and [?]. Again, this is because ?al- lophony? refers to one of a number of descriptive generalizations intended to feed a theoretical account of what is going on, and, in particular, the distributional relation between [b] and [?] is thought to be the result of the grammar turning [b] into [?], but the distributional relation between [b] and [?] is not thought to be the result of turning [b] into [?] (similarly for the other case). Again, descriptively, if all we presuppose is 213 the transcription, then all four pairs stand in complementary distribution, but they do not all fall into the same class once we add this particular theoretically-committed rider about what is going on cognitively. Crazy pairs In English, [?] and [h] are in complementary distribution, because [?] only occurs syllable-finally and [h] never does, but no one would say they are allophones for the same reason that [b] and [?] are not allophones. This is only different from the mismatching pairs case in that the two are not even supposed to be indirectly related; it is a complete coincidence that the two have complementary distributions, and no one attempts to claim that there is any common cause in the form of a process giving rise to either of these two positional restrictions. In short: allophony is some causal account for (certain) cases of complementary distri- bution, which, given an oracle that allows us to talk about phones, (?oracle categories? henceforth) is itself just a descriptive generalization. Although both may go beyond a finite corpus, allophony is something reasonable people could disagree about, and com- plementary distribution is not (again, assuming they had access to the transcription oracle). In addition to complementary distribution, I will also relate phonetic transforms to strict allophony (see Chapter 1). Now, this idea presupposes more than an oracle: the idea is that there are oracle categories that appear in the transcription but have no corresponding lexical category. Thus it assumes an analysis of the lexicon. The first important idea about this will be that phonetic transforms, regardless of whether they are patterns that would traditionally fall under ?allophony,? (in Chapter 5 we will see some interesting cases where they are not), are virtually assured to give rise to an analysis with almost exactly that shape: 214 if we had an oracle, we would find categories that emerge from the phonetic grammar but are not coded lexically. (As we will see, one difference is that the categories are not given a categorical phonological representation anywhere.) The other term for the allophonic processes in analyses like this is non-structure-preserving: they introduce information that cannot (or at least is not) coded in the lexicon. The second part of this chapter will discuss some of the known descriptive generalizations about non-structure-preserving processes, and show how the current analysis hangs on to those. Now, barring the distributional pattern being obscured by other processes, such sit- uations should give rise to surface distributions that are complementary; and this idea of surface-only categories is hardly theory-specific. However, as referenced in passing in Chapter 1, depending on the theory, that may or may not be the mechanism accounting for complementary distribution patterns. We will see presently that analyses in Optimality Theory usually take a slightly different approach. The rest of this chapter will thus proceed like this: first, take it for granted that pho- netic transforms account for complementary distribution, and note that such an analysis is a kind of non-structure-preservation, or strict allophony analysis of complementary dis- tribution (and of course other cases?to be discussed in Chapter 5). The first part of the chapter then discusses a major consequence of this, which is that there is no surface rep- resentation in the conventional sense. This puts the analysis at odds with the analysis of many of these patterns in Optimality Theory as ?surface phonotactics?; and it puts it at odds, albeit less sharply, with the classical analysis as categorical rules. I will then explain why the architecture is the way it is, and not otherwise, in terms of the special nature of phonetic non-structure-preservation; this will be seen to be related to a known empiri- 215 cal pattern I call the ?lateness of allophony.? Finally, previewing Chapter 5, I will start to sketch the reason that not only the analysis, but the theory, is at odds with the conventional surface representation?in other words, why the architecture should lead us to treat com- plementary distribution patterns as non-structure-preserving phonetic transforms. The key will be Bayes? Rule. 4.1.1 Background: Surface representations The way we have talked about learning phonological categories up to now, it has been as if there were a finite set of lexical categories: possible segments that can be stored as part of a lexical item. We made some brief reference to the fact that our statistical models do not really comport with the assumption of finite inventories, but it is a standard assumption in linguistics (and, at any rate, they do imply that inventories are enumerable). To understand the idea of a finite lexical inventory, we need to first ask what it means cog- nitively. The idea is simply that there is some restriction such that every stored lexical item will always be a string over some finite alphabet LI, the lexical inventory. They did not always agree on how these restrictions were to be specified, but phonologists thinking cognitively thought this way for the better part of the twentieth century: Jakobson 1941 tracks the child?s development of phonemes, (the ?sounds that are used to distinguish the meanings of words,? 29), and makes it clear implicitly that, at each stage, including the adult state, the set of possible phonemes remains finite (?while the succession of phono- logical acquisitions in child language appears to be stable ?, [t]here are children who ? still have not completely mastered their phonemic system at school age,? 46); much of 216 this follows from the understanding that lexical inventories are made up of specifications of binary phonetic features, and the universal set of possible phonetic features is finite (?In the succeeding pages we shall list the individual features that together represent the phonetic capabilities of man?: Chomsky & Halle 1968, 299). However, the usual view goes beyond just this, and asserts that the lexical inventory of a language is a subset of this finite universal lexical inventory; in other words, there is a hard, language-specific restriction on what can and cannot be coded in the lexicon. This view is to be found in most work bearing on the question of lexical inventories through to the beginning of Op- timality Theory. To take some notable examples: Halle 1959 derives this from a general economy condition, ?Condition (5): In phonological representations the number of spec- ified features is consistently reduced to a minimum?; similarly Chomsky & Halle 1968: ?Languages differ with respect to the sounds they use and the sound sequences they permit in words. Thus each language places certain conditions on the form of phonetic matrices and hence on the configurations of pluses and minuses? that may appear as entries in the classificatory matrices of the lexicon? (381); Aronoff 1974 constrains a class of ?allomor- phy? rules by enforcing an early version of a structure-preservation condition, ?that they cannot introduce segments which are not otherwise motivated as underlying phonological segments of the language,? implying, given the finiteness of the universal inventory, that the lexical inventory is also finite; Kean 1975, when she writes, ?Of the set of possible seg- ments characterized by the distinctive features, it is evident that some are present in nearly every language, with others only occasionally occurring,? (6?7), can also be seen as im- plying such hard restrictions, depending on what is meant by ?present?; Archangeli 1984 derives restrictions on lexically possible segments from an underspecification condition: 217 ?In a language?s underlying representations only the features that are distinctive in that lan- guage, that is, features which actually are necessary to distinguish two sounds, have values specified,? (43); and Avery & Rice 1989 introduce the Node Activation Condition??If a secondary content node is the sole distinguishing feature between two segments, then the primary feature is activated for the segments distinguished. Active nodes must be present in underlying representation,? (183), to be held in tension with the Universal Markedness Theory, which ?supplies minimal structure, ensuring that unmarked values are absent in underlying representation while marked values are present,? and ?can be overriden only if the NAC requires additional structure,? (184). Since specifying an unmarked value would then immediately allow for a host of new segments to be represented, this implies that lexicons are forced into not only not representing any of these segments, but not being able to represent them, if it is not needed to distinguish (presumably) any pair of lexical items. In all of this work, there is some mechanism that is part of language-specific or language-universal phonological cognition that limits the possible segments that can be stored in lexical memory to some finite language-specific set, a lexical inventory. This is different from a surface inventory. That is, it is different from a restriction on what segments are phonetically possible. What might such a thing mean? At the external level of realized gestures and auditory signals, the notion of the inventory, in the sense of hard, language specific restrictions on what is possible, is not really very useful; any look at ultrasound or auditory data (or the individual acoustic points on the vowel plots found throughout Chapter 3) will confirm that the set of possible phonetic realizations is not only infinite but extremely broad. While it might be possible to rule out a few things entirely, like English speakers pronouncing clicks, even this would be quite tenuous. This 218 is not what is usually meant by a surface inventory. Moving up the chain of cognition: there is substantial ambiguity (not quite disagreement) in the literature as to the role of binary features in the cognitive systems that are closest to these external states, the motor and perception systems. Still, it is usually understood that there is some level of very fine-grained representation at least coming close to capturing the detail in the continuous externalization. Thus it is not at this ?gradient phonetic level,? either, that the notion of a surface inventory is appropriate. Rather, consider the level we have been dealing with in Chapter 3: a set of phonetic categories, each specified as some parameter values (see Chapter 3), and these parameter values each in turn specifies a different phonetic recognition model. Now, these phonetic recognition models deal in graded data, coming from this ?gradient phonetic level,? but the models themselves form an enumerable set, regardless of whether they form a mixture of Gaussians or a mixture of linear models. More importantly, that set is a subset of the whole set of possible phonetic recognition models. This is a crucial idea that we will use below: whatever a language might specify in terms of how binary features are realized in the gradient phonetics, as long there are fewer possible categories than possible gradient phonetic realizations for them, then in a sense we can say there is a ?surface inventory,? an inventory of phonetically interpretable segments (surface categories). A surface representation, then, is a representation consisting exclusively of elements of the surface inventory, the output of a stage of phonological grammar (seen as the mapping from lexicon to phonetics) that deals exclusively in changes from lexical categories to others, or to (exclusively) surface categories. As we alluded to in Chapter 3, the phonological analysis that took place up to the 219 early twentieth century was done without access to detailed phonetic measurements, and, when consideration was given to phonetic detail, the detail was generally given a cat- egorical description (for example, all of the phonetic transcription systems invented for spelling reforms or for transcribing unwritten languages throughout the nineteenth century, including the IPA, were finite alphabets, despite often capturing relatively fine details?see Kemp 1994; and Sapir, while often giving quite vivid phonetic description, nevertheless generally only did this in order to provide a detailed articulatory explanation for a given sound). Sapir, and to a greater extent Halle, in The Sound Pattern of Russian, and to a still greater extent Chomsky and Halle, constructed phonological grammars as collections of rules changing lexical representations that hewed precisely to a set of already laid-out ?possible segments,? including the rules that handled allophony. Of course, there was no principled reason for this: Chomsky and Halle?s idea that role of the phonological gram- mar was, uncontroversially even now, a system that would start from binary features and ?gradually convert these specifications to integers,? (65), a statement which tolerates rules dealing in these integer values?but they did not do this other than for stress. So, given that complementary distribution and related patterns were explained as cases of strict al- lophony, and given that this allophony was handled by changing representations from one category to another, with all the rules applied to a lexical representation, one would there- fore derive a surface representation with all the allophones marked as separate categories. We will call such a representation an SC-representation. Crucially, such an analysis im- plies that the SC-representation has a cognitive status?that is to say, it makes it part of a structure induced by the model phonological system which we could, and, under the strongest interpretation, would, demand be shared by perception, production, and/or 220 learning, under some homomorphism. Certain theories place explicit demands on SC-representations and are very difficult to reformulate without them. Key to natural generative phonology (Hooper 1976) is the True Generalization Condition, which requires that ?all rules express transparent surface generalizations, generalizations that are true for all surface forms,? (13). This could be used to prohibit, say, the analysis of writer and rider described above where there is a rule changing [aj] to [j] in an environment that does not exist in the SC-representation?before voiceless stops, (where the [aj] has primary stress or the following vowel is unstressed), whereas the flap [] is not voiceless. The generalization is true at one step in the operation of the proposed phonological grammar, (before flapping), but not in the SC-representation. Now, if there is no SC-representation in which [aj] and [j] would appear, then there is no way that the NGC can even be evaluated as stated for this particular pattern, because there is no representation supporting a generalization about the oracle category [j]. It only makes sense to talk about comparing ?the generalization? expressed by a rule to the SC- representation if the SC-representation exists and the rule is a categorical rule. The same holds for the voicing of the flap: if the flap is only present in a gradient representation, then there is no way for the NGC to be evaluated for the pattern in question. Again, the SC- representation is different from just any ?surface representation.? The SC-representation is one in which allophones, whatever is to be said about them, are distinguished as seg- mental categories. All of this applies equally to any theory in which the grammar or constraints on the grammar crucially refer to surface representations: if there is no SC-representation, then the scope of these generalizations needs to be carefully re-examined. Crucially output- 221 referring constraints are precisely the direction that phonological theory took starting in the 1980s (McCarthy 1986, Paradis 1988) and culminating in the development of Optimal- ity Theory in the early 1990s, (Prince & Smolensky 2004), which is a theory of grammar in which all the free parameters are statements that make crucial reference to outputs. These outputs are nearly always understood to be categorical (see below for some discussion of exceptions), and at least some of the grammatical statements made in these theories are crucially SC-representations (namely, those that make reference to allophones) Regardless of whether the phonetic allophony proposal turns out to be correct, a re-examination of all these cases needs to be made: there is no argument for an SC- representation to be found anywhere in the literature. It follows in a theory in which the patterns that are complementary-distribution-like are all explained by appeal to categor- ical changes made to underlying representations that do not code certain non-contrastive segments; but there was never any requirement that we do so (and, in fact, Chomsky and Halle reference Sledd 1966 in the context of gradient rules; he describes Southern Amer- ican English dialects using context-dependent gradient rules, albeit stated in words). Now, in the usual Optimality-Theoretic formulation, furthermore, complementary distribution is in fact often not accounted for by changes made to underlying representa- tions via the grammar; it is not always treated as allophony in the non-structure-preserving process sense. Rather, since the theory turns around the idea that the learned aspects of the grammar only pick out different ways of constraining outputs, or lexicon?output map- pings, and not lexical information per se, there is no hard lexical inventory in standard Optimality Theory. The result is that there is no restriction on which of the SC categories should be stored, given that an SC category is part of a complementary distribution pat- 222 tern, and so a general principle (Lexicon Optimization) pushes the learner to store the observed surface category. The only case where this really would not work is understood to be active morphological alternation (titirauti?titirauti+qaq+tunga), because that other information constrains how a morpheme should be stored. Cases of complementary distri- bution without morphological alternations are usually treated as phonotactic knowledge, constraints on legitimate sequences in surface representations; and these are obviously stated over SC-representations. This is the usual analysis even when the very same seg- ments also some times participate in alternations: in some cases, a surface constraint will lead to an actual alternation, but in other cases the same surface constraint will just ex- press a generalization about possible sequences of segments morpheme-internally, a static generalization about lexical items stored faithfully with the surface SC categories. This reduced burden for the phonological grammar making changes in explaining complementary distribution actually undermines what little motivation there was for the SC-representation in the first place (if complementary distribution is due to grammatical processes, and the only grammatical processes are categorical?even though no one ever said this?then there must be an SC-representation). The use of SC-representations is an unexamined assumption that would change the analysis very much if it were dropped. Prince and Smolensky?s discussion of the fact that additional mechanisms of ?lexicon op- timization? along the lines of Chomsky and Halle?s symbol-counting evaluation measure are needed to even get the learner to unify two pronunciations of a single morpheme in one lexical entry could also have led theoreticians to question the value of SC-representations, but, so far as I know, it was never vigorously pursued. There is empirical work to be done in pursuit of the question of whether SC-representations 223 exist as well. Speech perception seems to generally treat allophones differently (see Chap- ter 3 for some discussion). These have yet to be properly explained, and they are especially puzzling under the theory that complementary distribution is just like any other phonotac- tic pattern, because the behavioral pattern is often just the opposite of what would be expected intuitively: listeners are less sensitive to the distributionally wrong allophone being used, in spite of the fact that complementary distribution is the strongest possi- ble case of distributional regularity, which ought to make illicit sequences extra salient. Another problem for SC-representations, whatever the analysis of complementary distri- bution and its ilk, would be if perception under allophony was crucially determined by gradient properties of the signal in ways that comparable contrastive category perception is not. 4.1.2 Status of surface representations under a phonetic transform hy- pothesis In this section, I outline the phonetic transform architecture, and say why accounting for complementary distribution patterns as the result of phonetic transforms leads to our throwing away the SC-representation in favor of a more abstract surface representation, which I will call the AC-representation. Figure 4.1 is what the architecture of phonolog- ical grammar looks like under the current view, compared to what it looks like under the conventional view. The current architecture, shown on the left in Figure 4.1, is one in which, like in conventional views, lexical memory is a collection of strings, and the role of the cat- 224 Memory Memory Regular relation Regular relation AC-representation Linear additive phonetic transforms SC-representation Peripheral encoding Peripheral encoding Figure 4.1: Current (left) and conventional (right) phonological architectures. egorical phonological grammar is to map from a combination of these strings to some output string (or vice versa). It is well known (Johnson 1972, Kaplan & Kay 1994) that no known phonological pattern exceeds the computational capacity of a finite state de- vice, which is to say that, in general, the phonological mapping is some regular relation. This is a descriptive, not a theoretical statement: even if some theory of phonological grammar can express non-regular relations, actual phonological grammars are not. As in the conventional view, too, the surface representation must be translatable into a gradient representation interpretable by a peripheral (sensory or motor) system. To say a bit more about how this works here: the AC-representation specifies a sequence of segmental categories z1 : : :zk; selecting a zi induces a context representation xi constructed as a function of z1 : : :zi 1zi+1 : : :zk. We have not said exactly how this works, but a simple way to understand it would be a concatenation of all the feature?value pairs, as in Chomsky & Halle 1968, Chapter 8. Each zi also provides a context-independent category representation ri, and a function t Ti associated with category zi that applies to xi?in the real-valued version presented in Chapter 3, we multiply xi, a vector of 0/1 or 225 1/1 feature specifications, by a matrix of independent context effects?and return a ?net contextual shift,? si. The final gradient peripheral encoding is then obtained as pi = ri si. Crucially, although nothing about the architecture actually commits us to seeing t Ti as literally treating the context elements as scalar multipliers for some context effects, in the way that the real-valued Chapter 3 version does, we are committed to t Ti being a linear additive function of xi. The context xi, whatever its content is exactly, can be viewed as a combination of independent formally possible contexts, xi = a1 am, and si = t Ti(a1) t Ti(am). The structure of si is induced by the same operation, , which relates p i to ri and si. This mode of sequence interpretation does not imply that the gradient phonetic out- put is necessarily a neatly temporally segmented sequence of the phonetic information associated with each segment. We know this is not true: acoustic cues and gestures over- lap across segments in a way that makes it clear that segments are units that are, at best, strictly internal to the cognitive system, and are not ?physically real.? It does not imply this for two reasons. First of all, the phonetic realization of a segment is crucially context- dependent in this model: a segment gets to affect the phonetics many times, once as zi and then again for each z j 6=i where it appears as part of the context. Secondly, although in the Chapter 3 models we translated each category in the surface representation into a single element of a phonetic representation sequence, more realistically we want to translate each category into a sequence of phonetic representation elements. Allowing for the output to have temporal overlap (?frame overlap? in the automatic speech recognition literature) would then have no additional architectural consequences whatsoever if the overlapping parts of two adjacent frames straddling an abstract segment ?boundary? (meaning that the 226 first is considered part of segment i and the second is considered part of segment i+1, even though they actually overlap in time) were always along independent dimensions of the gradient representation. In general, there will be consequences, as we will be required to say how the phonetic representations in the overlapping regions combine, (see Browman & Goldstein 1993 for a sketch of how this might work in production). Combination under overlap then also makes the inverse problem more complex. For the moment, however, the point is simply that saying that the ?sequence? of segments is abstract with respect to the actual phonetic representation?s temporal ordering. Finally, I should point out that the architecture itself does not say that xi needs to be a function of only the categorical AC-representation, but I will assume that it is. The alternative one might envision is that it also contains some information which is gradient (peripheral, phonetic?as in F0 transform in the sociolinguistic model presented at the end of Chapter 3). The consequence of this would be to set up potential conflicts between the output of one segment?s interpretation and the contextual input to another. We would be required to say something about how the different mappings for the different segments combine to resolve this conflict?most naturally, just compose them, but this would imply that the the translation to and from gradient phonetics takes place as if in multiple ?stages,? and we would then need to say how those stages combine (that is, state an ordering). In what follows, I will assume that this is not the case. The reason for this is not empirical but rather a research strategy. The consequences of making this particular assumption are rich but simple to work out; the alternative theory subsumes this one, and, in the absence of any particular empirical motivation one way or the other, I hew to the theory that is logically prior. Although this might seem somewhat limiting, this formulation is clearly 227 sufficient to cover a wide range of different contextual phonetic effects. (In the second half of the chapter, we will see that some extensions of the architecture would force us to say that we can sometimes translate back from some gradient interpretation object(s) to their categorical representation, which would only be legitimate sometimes, not always; but this is different. See below.) The crucial point about the AC-representation that makes it different from an SC- representation is that the results of applying phonetic transforms are not coded in the AC-representation. Rather, a segment which may have contextually-variant pronunci- ations is represented, alongside the contextual information, and phonetic interpretation combines these to give a single phonetic realization. An immediate consequence of this is that no contextual variant pronunciation that is the result of applying contextual pho- netic transforms is be coded lexically as its own segmental category. In contrast, an SC- representation will code a different category for each contextual pronunciation variant of a segment. (We will see below that this actually forces the general shape of the architecture upon us.) An analogy will help: suppose I am tuning a ukulele and I am given a slip of paper that tells me to tune the four strings to G4, C4, E4, A4. I adjust the tuning pegs to give a particular tension in the strings. Once I do this, four particular frequencies (392 Hz, 262 Hz, 330 Hz, and 440 Hz) will be encoded in the tension in the four strings. If I pluck the G string, a note with a fundamental frequency of 392 Hz will sound; conversely, if I play a loud 392 Hz tone in the room, the G string will be maximally resonant, and I can thereby identify the note that was played. Now, in a sense, the categories G, C, E, A are encoded in two places?in the strings of the ukulele, and on my piece of paper. This fact 228 remains true if I believe that ukulele tunings are subject to contextual variability?for ex- ample, it remains true even if my piece of paper also tells me how to adjust each string to the so-called ?Canadian? tuning?A3, D4, F]4, B4: multiply the frequency of each by an equal-tempered whole tone, 6 p 2, and additionally divide the frequency of the first string by an octave, that is, 2?should the environment demand it. There are still only finitely many tunings licensed by my piece of paper, and, even if a piece of paper of this kind went on forever, it could only ever give me a discrete subset of the physically possible ukulele tunings. However, there is nothing on the paper that says A3, D4, F]4, B4; rather, it gives me two-step instructions for obtaining this tuning, via G4, C4, E4, A4. Were I to play a loud tone in the room in an attempt to work out the single ?paper object? corresponding to a particular string and its tension, I would turn up nothing. Of course, the two represen- tations of the frequencies are also different types of things: if I want make another tuning adjustment, I need paper instructions, and if that tuning adjustment is dependent on the current state of the ukulele, then I need to first work backwards from the tension in the strings to obtain the tuning frequencies. If the only way that I can be given instructions for tuning is in terms of paper-type categories, then I will be stuck if I come up empty, which will happen when I am in the Canadian tuning, unless I know how to go all the way back to the ?underlying? categories G4, C4, E4, A4. Mutatis mutandis, this means that there can be no feeding relations from strict allophonic outputs to categorical phonology, or from strict allophonic outputs to the categorical environments for further allophonic processes. When phonetic transforms apply, the resulting categories are epiphenomenal with respect to the categorical phonology. See below and Chapter 5 for the implications of this. Given that the individual contextual variants of a segment have no categorical repre- 229 sentation, we can see that applying phonetic transforms will naturally give rise to comple- mentary distribution: any alignment of the output of transforming category A with some realization of a separate categoryB is accidental, and should result in pronunciations which only ever occur in a particular restricted environment (this will be discussed further in the next section and in Chapter 5). In contrast, a string transformation from A to B should have a result that behaves exactly like the category B, even if B is subject to idiosyncratic phonetic adjustments for each environment in which it appears: if the realization of a par- ticular [t] is simultaneously subject to the phonological instruction ?change to [d] due to a following voiced segment? and the phonetic instruction ?front by a particular amount due to a preceding [i],? the degree of fronting should be that of a [d], not of a [t], if the two are different, and the base phonetic representation to which this fronting applies should be exactly that of a [d] in another environment. A particular phonetically-adjusted realization of [t] should appear in one particular environment and nowhere else, because it cannot be coded directly in the lexicon and can only be obtained by applying phonetic interpreta- tion, but a [d] derived from a [t] (fronted or not) will not exclusively occur before voiced segments, but will be the same as lexical or otherwise-derived [d] in other positions. To re- iterate: a phonetically-adjusted realization of [t] ?cannot be coded directly in the lexicon? not because there is an explicit grammatical constraint on what categories can be coded in the lexicon, but because it is not the right type of object; this results in complementary distribution. See Chapter 5 for expanded discussion of the issues under interactions of phonetic processes. We have therefore now answered the question of how phonetic transforms relate to, and account for, complementary distribution: the phonetic (oracle-) categories that are 230 obtained are not coded categorically, and cannot be reproduced for another segment or in a contradictory environment except by accident. Conversely, oracle-categories which are not in complementary distribution cannot be related by phonetic transform. This account of complementary distribution aligns with the classical notion if ?strict allophony? in the sense that the contrasts between the various phonetically transformed realizations of a category are not coded lexically, and are thus ?non-contrastive.? Whether phonetic transforms in fact will account for any particular cases of com- plementary distribution traditionally falling under the umbrella of ?allophony? is another story; however, suffice it to say that they can, and, to preview Chapter 5, they should, all things being equal, provided there is enough of the right kind of evidence available to the learner. 4.1.3 Problematic and unproblematic appeals to surface representations in phonological theory The absence of SC-representations under this theory is in conflict with the premises of certain arguments and proposals in phonology. In this section I document some high- lights of which appeals to surface representations conflict and which do not; the ?surface representations? that do not conflict could just as easily be AC-representations, and are therefore perfectly acceptable under the current architecture. 231 4.1.3.1 No problems with AC-representations Let us being with a straightforward example of the reasoning; this case will turn out not to be a problem. McCarthy 1986 presented some convincing evidence for the Oblig- atory Contour Principle (ultimately rooted in a proposal of Leben 1973). The principle unifies a class of exceptions to rules and restrictions on possible lexical items across lan- guages by disallowing certain sequences of identical segments. Some cases are given in (97) and (98): Afar (Bliese 1981)(97) Syncope Rule: 2 6 6 6 6 6 6 4 V long stress 3 7 7 7 7 7 7 5 ! ? , #(C) 2 6 6 4 V long 3 7 7 5Ca ?Cb V except if C a = Cb default case: ?amila+? (swamp grass +acc.)! ?aml+ i (swamp grass +nom.) diib+ te (marry + 3.sg.fem)! dib+ e (marry + 1.sg) exceptional case: mid?ad?u+? (fruit+acc.)! mid?ad?+ i (fruit +nom.) alal+ te (race +3.sg.fem)! alal+ e (race +1.sg) 232 Tonkawa (Hoijer 1949,Kisseberth 1970)(98) Syncope Rule:V! ? #CVCa ?Cb V within stems, except if C a = Cb default case: notoxo- (hoe, stem)! notx+o (hoe+3.sg) pitena- (cut, stem)! pitn+o (cut+3.sg) exceptional case: hewawa- (die, stem)! hewaw+o (die+3.sg) hamama- (burn, stem)! hamam+o (burn+3.sg) The Afar syncope rule deletes the second of two short vowels in open syllables at the beginning of a word, if it is unstressed and not word-final. (An unrelated process deletes stem-final short vowels before the nominative i-suffix.) An exception is made when this vowel falls between two identical consonants, however. Similarly, in Tonkawa, a syn- cope rule deletes the vowel in the second of two CV syllables when it is not word-final (an arguably separate process deletes the first of a sequence of two vowels). Again, an exception is made when the vowel falls between two consonants. McCarthy links this to a restriction on Arabic triconsonantal roots: only the second two consonants, but not the first two, can be identical (there can be stems like samam, ?poison,? but no stem like sasam). The explanation assumes autosegmental theory: rather than being a sequence of segmental feature bundles, (see Chapter 3), a lexical item is actually a graph, where 233 features are ?associated? with each other, and, as a special case, associated with ?timing units? that give rise to the segmental ordering. Importantly, a single set of consonant fea- tures can be associated with two different timing units, in this case, to two consonants. The Obligatory Contour Principle bans having two adjacent segments with the same fea- tures unless it is because they are sharing the same feature bundle, where ?adjacent? in Arabic means adjacent consonant-wise. Sharing of features is highly restricted, and in Arabic is limited to the second two, not the first two consonants. While this might seem baroque, McCarthy presents evidence from language games that in samam-type stems, the second two consonants behave as a unit (and the interpretation of ?adjacency? with respect to just consonants is needed independently throughout Semitic morphology). In Afar and Tonkawa, adjacency is interpreted segment-wise, and the rules do not have the power to ?restructure? the associations between timing units and features after they apply; thus syncope would give an output violating the OCP, thus it fails to apply. Thus the ex- plicit condition on both rules can be dropped, assuming that the behavior of the rule when its output would violate the OCP (skip the rule) is well-defined. The OCP is a classic case of an output filter. What is crucial here is not that it the OCP is a filter, but rather that it makes reference to outputs. Rather than putting a condition on the input to the rule, a condition is placed on the output that determines whether the rule will apply. This is not forced by the Afar or Tonkawa patterns, but it is necessary in order to unify the condition with the Arabic triconsonantal root restriction under the OCP, because, given the segment-wise interpretation of ?adjacency? in Afar and Tonkawa, the only sense in which there are adjacent identical segments banned in either is in the output. This is not merely a restriction on surface representations: it is 234 a restriction on all representations?but it therefore extends to surface representations. If the surface representation is an SC-representation?a surface representation making use of non-contrastive features, to be reinterpreted here as phonetic?it raises a potential problem under the current theory. Since the crucial question is whether the OCP can be evaluated, the question is whether the OCP can be verified after the rule applies. In the original formulation of the Afar and Tonkawa rule exceptions, restrictions on the forms that the rule applies to, this is not an issue; the features are present in the AC-representation, which is the input to phonetic interpretation, and thus it would be fine if these rules were phonetic no matter what. In this case, there is little reason to worry: the deletion of a segment is in no way a good candidate for reinterpretation as ?non-contrastive feature change.? Since there is no independent reason to think that either rule is phonetic,1 I will assume that there is no problem here; but this is the kind of problem we would be looking for. It comes up under other analyses as well. Gouskova 2003 reanalyses Tonkawa syn- cope under Optimality Theory. In this analysis, the environment for vowel deletion is deduced as being on those vowels which would permit an ideal trochaic footing of the string (as indicated by Phelps 1975, the syncope is more general than the second syllable, and the more general rule reapplies from left to right). This does two things: first, it gives 1With one exception: according to Bliese, in the Northern Afar dialect, there are certain lexical excep- tions to the syncope rule, such as the verb alif, ?to close?: for the first person singular, we get alif+e rather than alf+e. As we will discuss in the next section, we do not expect to find lexical exceptions among pho- netic rules. However, apparently, the vowels that escape deletion in this way are often reduced. Although Bliese only presents this as a passing remark, this reduction could be attributable to a phonetic rule, if it were found to be phonetically shorter or more centralized than the same short vowel in a non-syncope (but otherwise maximally similar) position. In this case, there should be no OCP effects, but such a prediction is moot anyway, as reduction (as opposed to deletion) should not be sufficient to trigger even a phonetic OCP effect. Besides this, the language would need to be such that these exceptions happened to coincide with cases where the OCP would be relevant. 235 an account of the environment for the rule in terms of constraints falling out of a metrical analysis; second, this environment is derived by way of output constraints: the string is adjusted to accommodate an ideal metrical analysis. As discussed in Chapter 1, this is in fact the only grammatical mechanism standardly made available in Optimality Theory. The consequence of this is that the Optimality Theory analysis both changes the representation of the grammar, (relevant primarily to learning, as discussed above), and fails to assert the sequence-of-ordered-representations structure that can be induced by the composition of operations: the grammar does not say how the actual computation of the candidate output forms proceeds. However, what is crucial here is that this analysis induces a still stronger appeal to output representations, because, however the computation proceeds, the output must include both the result of deletion and the relevant metrical analysis. Take the evaluation for [notxo], ?he hoes,? for example: (99) =notoxo+o= RhType=T Stress-to-Weight Max(V) Clash R (n?t)(x?) ** * (n?.to)(x?) *! * (no.t?)(x?) *! * * * In the correct output candidate, a vowel is deleted (in fact, two); but, in order to evaluate this candidate, (for example, to determine where and whether the RhType=Trochaic con- straint is violated by having moras grouped together into feet with stress on the second one, and where and whether the Stress-to-Weight principle is violated by having a light syllable stressed), we need to be able to examine the stresses and feet. An operation of deletion feeds an evaluation that makes reference to metrical structure. The analysis then evaluates the output of syncope to determine where and whether syncope should apply. 236 If we were to find reason to think that syncope should be treated as phonetic under the current architecture, it would be extremely problematic. Feet are entirely abstract and absent from the phonetic representation, and, although stress and syllable weight do have gradient phonetic correlates, the RhType=Trochaic and Stress-to-Weight principles assign violation marks on the basis of categorical differences (heavy/light, stressed/unstressed). The output representation being evaluated would thus need to be both gradient and cate- gorical, which, according to this architecture, is impossible. This contrasts sharply with the analysis of Tonkawa given by Noske 1993, which treats syncope as the result of remov- ing unsyllabifiable vowels. Although the analysis makes heavy use of output constraints to derive the optimal syllabification, this syllabification can be (in fact, crucially is) fully determined on the basis of the non-syncopated segmental representation; unlike in the OT analysis, no syllabification or other suprasegmental analysis of syncopated outputs needs to be done. Thus no problem would be posed if syncope were phonetic. To sum up: grammatical output constraints can in principle be in conflict with the absence of the SC-representation?but not all analyses making use of a categorical surface representation are problematic: (i), if there is no reason to attribute any of the relevant processes to the phonetic component, an AC-representation is sufficient; and, (ii), to be ?relevant,? a process needs to feed the evaluation of the output constraint, or, more generally, to feed further (categorical) grammatical computations; otherwise there is no potential conflict. 237 4.1.3.2 Historical changes We now turn to cases where one or both of these issues clearly does arise. Padgett 2003 discusses post-velar fronting in Old East Slavic and its synchronic residue. Modern Russian is generally understood to have a lexical contrast between palatalized and plain (arguably velarized) consonants. Labial Dental Postalveolar Palatal Velar Stop p pj t tj k kj b bj d dj j Fricative f fj s sj j: x xj v vj z zj Affricate ts tj Nasal m mj n nj Lateral l lj Rhotic r rj Glide j Table 4.1: The consonant and vowel inventory of Russian. Early Common Slavic had no palatalization at all. A change took place that took k; ;x > t; ; before long and short i and e, (which were the only front vowels), as well as j, and then the appearance of velars became perfectly predictive of the following vowel being back. The Old East Slavic post-velar fronting was a diachronic process by which the vowel [] fronted to [i] following velars. Padgett suggests that this was because there was no contrast between [] and [i] following velars, and, under the pressure to maximize phonetic contrasts posited under Dispersion Theory, (Flemming 1995), [] fronted in order to be as far away from the high back vowel [u] as possible. This was not possible following non- velar consonants, say [p], according to Padgett?s analysis, because there it was possible for [i] to appear, and a dispreference for merging lexical distinctions made such a fronting 238 undesirable. (Confusingly for us, the first change was called the first palatalization; this is confusing because this was not the change that resulted in the palatalization contrast that is evident from the table, and that we are going to need to talk about below. This hap- pened much later, but before the post-velar fronting. The first palatalization was instead a change that changed velars to palatal fricatives, historically by way of a pronunciation with palatalized secondary articulation.) Padgett?s analysis is done using Optimality Theory. Deviating from standard as- sumptions, Padgett assumes that the OT computations operate on entire languages, not just to compute the pronunciations of individual morphemes or utterances. What this means cognitively?what cognitive process this is supposed to correspond to?is not en- tirely clear. Each one of Padgett?s tableaux is evaluated over a small collection of multiple idealized ?words,? pi, p, kji, and so on, with each candidate listing an output for each one of the words. He says that such a collection of words is an ?idealized language.? The reason he must include multiple items is clear: Dispersion Theory requires a comparison among different forms, (for example, it is the presence of [u] in some forms in the language that forces [] forward, not just facts about about [], nor any lexical representation in which it appears, nor any surface pronunciation). Flemming 1995, in introducing Dispersion Theory, does something similar, in using OT computations to evaluate ?systems of con- trasts,? rather than individual underlying?surface pairs. In Padgett?s analysis, this is made necessary because of the presence of a family of constraints called Space(color) 1n , each of which assigns a violation for every pair of items in the (complex) output candidate that differs only in a vowel contrast spanning less than 1n of the ?color? dimension for vowels, corresponding roughly to the second formant. For example, if pi, p, and pu were all in 239 the candidate, then Space(color) 12 would assign two violations, Space(color) 1 1 would assign three, and Space(color) 13 would assign none. For the candidate containing only pi, t, and pu, only Space(color) 11 would be violated, and it would assign one violation. These multi-form candidates are also necessary for the Merge constraint, which assigns a violation for each pair of non-distinct output candidate elements that correspond to dis- tinct inputs: the input pi1, p2 would incur a violation if paired with pi1, pi2, but not with pi1, p2, where the subscripts indicate which inputs correspond with which outputs. I will interpret Padgett?s multiple-element candidates as implying that the computation of a sur- face form for any given input representation needs to simultaneously take into account the implications of the grammar for every other possible input representation. That is, I will interpret Padgett?s tableaux as representing part of the normal phonological grammatical computation for a form, as opposed to some other, unspecified part of linguistic cogni- tion; the only difference from the normal conception is that, to operate the phonological grammar one needs to operate on all forms at once. This is consistent with everything that Padgett says. Here are the computations Padgett gives for the earlier and the later stages of Old East Slavic, before and after the post-velar fronting came into the language. The Space constraint is Space(color) 12 . 240 (100) Before post-velar fronting: Post-velar fronting grammar: pj i1 p2 pu3 k5 ku6 tj i4 Merge Id(color) Space R pj i1 p2 pu3 k5 ku6 tj i4 *** pj i1 p2 pu3 kj i5 ku6 tj i4 *! ** pj i1 p2 pu3 kj i5 k6 tj i4 *!* *** pj i1;2 pu3 kj i5 ku6 tj i4 *! ** pj i1;2 pu3 k5 ku6 tj i4 *! * * pj i1 p2 pu3 k5 ku6 tj i4 Merge Space Id(color) pj i1 p2 pu3 k5 ku6 tj i4 ***! R pj i1 p2 pu3 kj i5 ku6 tj i4 ** * pj i1 p2 pu3 kj i5 k6 tj i4 ***! ** pji1;2 pu3 kj i5 ku6 tj i4 *! ** pji1;2 pu3 k5 ku6 tj i4 *! * * The crucial difference in the two grammars in (100) is the ranking of the Space constraint. The ranking Ident(color) Space puts faithfulness above the need for dispersion, thus barring post-velar fronting, while the reranking Space Ident(color) allows dispersion to take effect. The existence of a three-way contrast elsewhere blocks any movement of the central vowel forward, as this would violate Merge, but, after velars, there is no contrast, which frees the reranked grammar to front [k] to [kji]. There are several things which are sources of potential conflict between this analysis and the architecture proposed here. As before, the two sticking points for any categories that are proposed to be in the surface representation is whether they are in complementary distribution and whether they (fail to) correspond to any lexically-specified categories. As always, I leave the worked-out explanation of why these are problems for Chapter 5, where it will be shown that the grammatical treatment of patterns satisfying either of these two notions of ?allophone? as phonetic is not actually a hard restriction, but rather a 241 preference; I will specify what the factors that go into the assessment are, given a rational learner. For the moment, however, I continue to simply assess what would go wrong when these patterns are indeed treated as phonetic. Here, we may note two things like this. First, the distribution of [] and [i] is com- plementary. Although it is not shown, a high-ranked constraint, Palatalization, enforces palatalization of all consonants before front vowels. At this stage of the language, the constraint actually shows its effects across all consonants, not just velars, a point which I will discuss in a moment. This means that the appearance of [] or [i] is actually totally predictable, because [i] will always be preceded by a palatalized consonant. Second, the distribution of [k] and [kj] is also complementary, because [kj] only appears before front vowels, and [k] never appears before front vowels, and more generally this is true across the velars. This is actually not true for the other places of articulation, despite the fact that I show what appears to be a consistent complementarity for the non-velars, represented by [p], in (100) in the input forms (I deviate slightly from Padgett here in indicating the palatalization on the non-velars?see below). This is because of the deletion of the high short front vowel in certain positions, (the ?front yer,? written ? in historical Slavic lin- guistics and probably pronounced []), which left some instances of palatalized non-velars followed by something other than front vowels; the first palatalization had already pro- tected the palatalized velars from this fate, by lexically eliminating the possibility of velars before [] (and eliminating velar?front vowel sequences across the board). This too will be discussed shortly. In assessing the question of lexical contrast?that is, whether it is fair to say that [k]/[kj] is actually not marked lexically, thus providing another reason to judge the differ- 242 ence to be a purely phonetic one, not appearing in the AC-representation?we must be careful. This is because Padgett appeals to the principle of Richness of the Base, which is just a name for the conventional architectural assumption in Optimality Theory stating that the grammar serves only to constrain the mapping between lexicon and phonetics, not to place restrictions on what is allowed to be in the lexicon (for example, statements of lexical inventories). The nearest equivalent to asking whether a category can be marked lexically is ordinarily to ask whether it actually is marked lexically. However, this does not work here, as the interpretation of Richness of the Base is complicated by Padgett?s appeal to complex inputs that contain all lexical items. Richness of the Base would imply that the evaluation should always be over an input?output pair that contains not every possible lexical item, not just every actual lexical item (whatever this means in Padgett?s idealized language). Since this would make the Dispersion Theory analysis Padgett gives impossi- ble to maintain,2 he states that all of his analysis takes place in the post-lexical component, and the restrictions on inputs are due to neutralizations in the lexical phonology (see be- low in this chapter for more discussion of this distinction and its status in the present theory). The relevant question for Padgett is therefore whether a distinction is present at 2In the case of (100), this means that kji (among infinitely many other things) also needs to appear in the input, and thus have a corresponding item in the output. This will incur a violation of Merge for the candidate that is currently shown as winning in the second tableau, with post-velar fronting; but then the output candidate on the last line, with fronting of [p] to [pji], will be no worse by Merge. Space will be no better off, because the addition of kji will mean that two violations are incurred among the velar-initial elements, too, and it will not be to the advantage of the winning candidate that there is an extra ?space? to fill left by the absence of kji: richness of the base would imply that there is no such space. One response might go like this: ?A higher-order Richness of the Base principle is at work. In this theory, the grammar needs to assign a consistent output across all possible subsets of the lexicon, when each is submitted for evaluation. The selected input sets are just particular examples.? This really will not work, however: aside from the fact that adding the whole lexicon back in will give inconsistent results, as indicated in the text, subsets of the sets Padgett gives are also problematic: if we exclude [pji] from either of the input candidates Padgett gives, then the crucial violations of Merge that rule out fronting [p] are also alleviated, leading to another inconsistent result. 243 the output of the lexical phonology. This is also sufficient for us: if it has already been determined that the relevant grammatical changes operate on representations for which the [k]/[kj] distinction is not marked, then it would seem quite clear that this distinction could be interpreted as arising in the phonetic component. This line of reasoning will be worked out in further detail later in the chapter, and an analysis of the post-velar fronting phenomenon (albeit one which does not in fact look like this) will be given momentarily. First, however, I address one side issue that comes up in Padgett?s analysis: the Space constraints make crucial reference to the phonetic interpretation of particular fea- tures. It is not totally unreasonable to think that the information operated over in the categorical phonology might have some general phonetic content which could be evalu- ated in something like the way suggested here. As we discussed in Chapter 3, it is at least plausible to assume some general constraints on the interpretation of particular features, even if these are vague and subject to further specification in the phonetic component (and in fact, such constraints are the minimum needed in order to say that features are not sim- ply classificatory); however, in the categorical phonology, Space amounts at any rate to to simply penalizing particular combinations of categories. To give the ?fraction of the total width? dimension real force, it would be reasonable to instead propose that the Space constraints are part of the phonetic component, which, again, undermines the possibility of doing the analysis over surface representations, and would lead us to question the status of the SC-representation. However, the constraints would need to be substantially refor- mulated: as stated, in the phonetic grammar, they should be satisfied by small phonetic adjustments, not only by the wholesale fronting of [] to [i]. I now discuss how this case would be handled if some or all of the relevant gram- 244 matical operations were taken to be phonetic. The []/[i] alternation is a good candidate for a phonetic transformation. The distribution of the two is complementary, and the distinc- tion is not crucial to any lexical contrast, nor to triggering any phonological process (con- sider Padgett?s assertion that the alternation can be handled in the post-lexical phonology). This becomes still clearer when we consider the fact that the complementary distribution of []/[i] holds across all the other consonants, too, not only the velars. This is still true to- day in Russian, and the traditional analysis of this fact in Russian is that the palatalization contrast, which is phonemic, (albeit somewhat marginal for velars where it is mainly con- trastive in loanwords), and triggers an allophonic change from [i] to [] in the presence of a plain consonant. There must be some such process?that is, there must be cases where an underlying [i] surfaces as []?given morphological alternations like pot+i! pot, sweat + nom/acc pl. (versus putj+i! putji, road + nom. pl.). The analysis of Dresher 2009b, following that of Jakobson 1929, is that the distinction became non-contrastive following another change in Old East Slavic, the shift from non-contrastive to contrastive palatal- ization. (This was alluded to above, and will be explained in a moment.) This is slightly different from Padgett?s analysis in asserting that the []/[i] distinction stopped being coded lexically, although both capture the distributional pattern. This is what we will assume here. Dresher further supposes that the underlying representation of this vowel was then as [ back], and that the alternation in the vowel was then attributed by learners to exactly the rule just described, with plain consonants, marked as [+back], spreading this feature to following [i] to yield []. This does not account for the post-velar fronting, since the language had only [] fol- lowing velars, and evidently had only plain, and not palatalized, velars in this position. 245 Thus the presence of plain velars would seem to have supported the maintenance of these plain?[] sequences, given this rule, not a shift to [kji]. Where Padgett explained this shift by a pressure for vowel dispersion, however, tolerated only in the velars due to the lack of a palatalization contrast, Dresher explains the shift using a version of Contrastive Speci- fication (see Chapter 3). The absence of a palatalization contrast in the velars, following the first palatalization, led to their becoming underspecified for the [back] feature. This implied that they could not trigger the backing that the other plain consonants triggered. Thus the vowel surfaced in its underlying [ back] form. This would be one possible analysis?simply translate the rule backing [i] to a pho- netic rule triggered by a [+back] feature?if we assumed a Contrastive specification ac- count. We would then still need to account for the palatalization of velars in this posi- tion. Padgett accounts for this by highly ranking a Palatalization constraint (asserting, in Dresher?s featural analysis, that if consonants are followed by [ back] vowels, they must be [ back]); it is ranked over both Ident(pal) (i.e., Ident(back)) and kj. In fact, this is the same ranking that Padgett uses to account for a precursor to the first palatalization in an ear- lier stage of the language, an ranking which, as he points in a footnote, was also extended to pj, (where p stands in for all the non-velars), by the time of the post-velar fronting. This secondary palatalization on all consonants is in fact very important, as Dresher points out, because it was this palatalization that is understood to have been reanalyzed as phonemic just prior to post-velar fronting: [], which also triggered the alternation, was substantially reduced in certain positions, and eventually disappeared (dn! djnj! djnj, day: word- final position is a reduction position and the immediately preceding syllable is not). Some residue of this palatalization process still exists today, triggered by [e], (sjirot+e! sjiratje, 246 orphan +dat., versus sjirot+i! sjirat, orphan + gen.), plus a handful of idiosyncratic suf- fixes. Dresher assumes that the more general palatalization persisted, but, at the stage of post-velar fronting (and since) was crucially ordered after the rule backing [i]. That rule has no effect following velars, because they are underspecified for [back], but for other consonants bleeds palatalization. However, neither of these explanations deals directly with the fact that the grammar inducing the [kji] sequences would not be assessed as a good fit to the phonetic evidence available to the learner at the time of the post-velar fronting, which evidently has all those sequences as [k]. A better fit to the data could be obtained, under Padgett?s analysis, by ranking kj high, and, under Dresher?s analysis, by filling in the missing [+back] feature prior to the backing rule, or analysing backness as contrastive for velars, with the presence of [k] being a more decisive factor in coming to this conclusion than the absence of [kj]. Why should any of these legitimate analyses have been dispreferred by learners? I conclude that they simply would not have been, and the alternative opened up by the phonetic approach is a way of saying that the shift was initially gradient and thus poten- tially gradual in time. This is important, because the emphasis throughout this dissertation has been on the tradeoff in learning between goodness-of-fit and goodness-of-grammar (e.g., simplicity considerations). In order to justify deviations from fit to the data, the grammars that fit the data must be correspondingly dispreferred (at least?or they might even be impossible). Since the data that the learner is attempting to explain at this stage consists of [pji], [p], and [k], but crucially no [kji], the analysis should not deviate too much from fitting the data unless there is a corresponding increase in the inherent preferability of the grammatical analysis proposed. 247 An analysis which is both gradient and gradual goes as follows: the learner first ar- rives at an analysis which does not deviate from this pattern at all, and thus, at this stage, the language is rather different from what has been previously proposed. There is a pho- netic []/[i] rule, either fronting, backing, or deriving both from an archiphoneme. Backing provides a better fit to the data, because (for independent reasons) there is no word ini- tial [], but there is word-initial [i], and so, presumably, any single-feature environment for fronting will compromise the distribution somewhat. Whatever residue of palatalization can be maintained for non-velars, on the other hand, must be a categorical rule, and thus must feed the []/[i] rule. Crucially, therefore, this alternation must exclude the underlying representation of []/[i] (=i=) as an environment, or else the distribution that forms the basis for the []/[i] alternation would be undone. At this stage, then, [k] is maintained. Once we have specified the velars as active triggers for the allophony, we can adapt the contrast-based ideas of the previous analyses, which both rely on the absence of a palatalization contrast for velars to tolerate the innovative [i] pronunciation in that envi- ronment, to motivate a shift in the appropriate environment. However, rather than attempt to use the relative goodness of grammars to do the work, I propose that functional pres- sures were at work. In particular, the goal of fronting was phonetic enhancement in the direction of the underlying =i= (Keyser & Stevens 2006). This implies that speakers gra- diently both backed and fronted =i= after velars. Before filling in the detail suggesting why this should lead to an eventual reanalysis as [kji], rather than [k] followed by backed and fronted [i], I point out that, while there is no good theory of what conditions should license or encourage a particular phonetic enhancement of a segment, we can put forward two suggestions: (i) the unenhanced pro- 248 nunciation should be perceptible to the speaker as a non-prototypical realization of some AC-category; (ii) the enhanced pronunciation should be a more prototypical realization of that same category. The first condition is complex; it says that the unenhanced pro- nunciation should be perceived as a realization of some particular category, but it should also be perceived as deviant. This will not be the case for most processes. It will be the case for phonetic processes in domains where gradient acoustic perception is reason- ably good, such as vowels or fricatives; however, since allophony, for whatever reason, seems to reduce perceptibility when it occurs, (see Chapter 3), more than just allophony will likely be required to be perceived as non-prototypical; the auditory apparatus must be independently equipped to detect the relevant acoustic differences, and the best case of this will be, as in this case, where there is an existing contrast: differences on the high end of the F2 dimension, which cue both palatalization and the backing of [i], (Hamann 2003), are needed independently to detect the contrastive palatalization difference in the non-velars. (It should also be the case that sociolinguistically stigmatized neutralizations will be sensed to be non-prototypical, albeit in a very different sense. However, this is the nature of functional pressure: it is a confluence of disparate forces pushing the grammar in a particular direction.) The second condition says that the enhanced pronunciation should be more prototypical; for this to take effect, all the same conditions hold as for the first condition. However, notice that this also does the work of the Merge constraint: to the degree that the percept is also perceptible as a good realization of some other category, or sequence of categories, then, through the joint probability, the speaker?s degree of belief in a prototypical representation of the target category is also decreased, no matter how prototypical it may actually be. 249 Why should speakers make gradient changes to satisfy these pressures, rather than categorical changes? Furthermore, isn?t the idea that speakers are making changes to F2 ambiguous as to whether it is the vowel or the degree of velar palatalization changing, given what we just said? Speakers, of course, cannot make changes in F2 directly, and, under reasonable assumptions about production, will be able to manipulate palatalization and vowel frontness somewhat independently; if so, under these assumptions, speakers must enhance vowel quality and not palatalization: there is no reason to think that the velar tokens would be perceived as anything but prototypical in this context. The changes will be gradient rather than categorical, however, because, given a large change in fronting, the velar would stop sounding prototypical, due to the overlap in acoustic cues. In sum, unlike in many cases of allophony, speakers should have been sensitive to the phonetic difference. They would therefore attempt to enhance to get just enough fronting so that allophonic [k] sounds like [ki] without sounding too much like [kji] (notice that it does not matter that palatalization is not contrastive for velars, just that it is de- tectable). Despite the phonetic backing process, speakers innovated a process of phonetic fronting applying to the same vowels, but only after velars; for learners, the two processes combined by , but eventually the cue became highly ambiguous, and a grammar gen- erating [kji] became preferred (see Chapter 5). A new categorical palatalization rule was introduced, palatalizing velars before [i], thus lifting the need to back. While this explanation admittedly suffers from the lack of a full model of speaker innovation, we do have a reasonably well-developed learning model; given the right like- lihood functions to learn palatalization and consonant categories, and the necessary im- provements to the variable selection sampling scheme, the category and allophony model 250 discussed in Chapter 3 will be usable in testing these predictions largely unchanged. How- ever, much more work is needed in incorporating categorical processes into the model. The trading relations I discuss here and in Chapter 5 depend heavily on what precisely the prior for phonological grammars looks like. To sum up, we have now seen a case where the assumption of an SC-representation was crucial to the analysis, that of Old East Slavic post-velar fronting. However, in addi- tion to being incompatible with our current assumptions, I took this analysis to be unsat- isfactory, in that it proposed a sudden move to a highly innovative grammar, a problem shared even by previous rule-based analyses which would just as easily have tolerated AC-representations. I thus proposed a new account, suggesting a gradual drift towards the innovative system by means of additive phonetic transformations. Although this sug- gestion is still somewhat informal due to the lack of a full learning model, (an issue which is nevertheless shared by all other accounts), it is much better specifed under the current assumptions than it might be otherwise, because we have a precise notion here of what it would mean to have a phonetic rule doing the enhancement. 4.1.3.3 Opaque allophony I finish this section by simply pointing out that it is not only unorthodox global- evaluation theories like Padgett?s analysis that demand SC-representations. Any account of a strict allophony-type pattern done using categorical phonological grammar is equally a candidate for reanalysis that would force AC-representations, although many would be substantially less interesting than the Old East Slavic case. I list a few of these presently. 251 However, it is important to highlight one special case, namely, analyses of opaque patterns which attempt to reanalyse the pattern to eliminate the opacity (we called this ?obscured complementary distribution? above; we are about to go through exactly that case, short- ening/centralization/raising?henceforth ?raising??and flapping). An allophonic process in principle either (i) can be obscured in its environment or in its output by the action of some other process or (ii) can obscure by its action the environ- ment or output of another, previously applied, process. We will review the implications of the current architecture in Chapter 5, but, suffice it to say, the prospects are better than under monostratal Optimality Theory, which cannot get the composed relation to arise out of a combination of the two relations in cases like this. To make this less abstract, consider the case of raising and flapping. In that example, we saw that some English speakers pro- nounce the diphthongs in write and ride as [rjt] and [rajd] respectively. These diphthongs have an allophonic analysis based on the voicing of the following segment, but the com- plementary distribution is obscured. This is because there is another allophonic process, flapping, which makes writer and rider come out with (what appear to be) identical con- sonants following, [rj?r] and [raj?r] respectively. To get this kind of interaction between two processes arising from independent constraints is impossible in standard monostratal Optimality Theory (for an explanation, see McCarthy 1999). Mielke, Armstrong & Hume 2003 begin with the possibility of not treating the obscured raising distribution as being due to allophony, but rather just due to a static phonotactic fact. Recall from above that this is a standard tack taken on the Lexicon Optimization hypothesis. In many cases, this almost works, as most of the supposed alternations are actually morpheme-internal, but the grammar still needs to handle what alternations there are across morpheme boundaries. 252 In this case, that would be a problem, at least if the patterns were to be captured in the natural way: a monostratal grammar built out of surface constraints cannot be built by enforcing the surface generalization that [j] appears before voiceless consonants and [aj] occurs before voiced consonants, because it is not always true: the flap is voiced. In this case, however, the claim is that there are no such alternations, because the diphthong?flap boundary is never at a morpheme boundary (notice that the [t] in writer, flapped or not, is just as morpheme-internal as the [t] in write). Thus the opaque grammar simply never arises. Putting aside whether this is correct, (Idsardi 2006 presents evidence that it is not true in Canadian English: i-th, y-th ! j?, wj?, not ! aj?, waj?), the problem is still a serious one in general.3 This does not turn only on the question of AC-representations versus SC-representations: the nature of OT grammar itself is also potentially at issue, and this kind of problem arises whenever there are two processes interacting in particular ways, allophonic or not. How- ever, the issue of how to specify the grammar is still directly tied to the representation of allophony. If the obscuring process (in this case, flapping) is treated in a second mod- 3Another possibility which is always available in principle is to deny that the opaque case emerges in the same way as the transparent cases. Some effects can be derived in this way in Optimality Theory using local constraint conjunction: in particular, cases of counterfeeding order, where a rule does not apply in what appears to be the correct configuration because that configuration was actually generated by a later rule. This leads to underapplication with respect to the surface environments, and so additional markedness constraints can simply be added to the grammar to suit the particular nuances of the environment in which the rule fails to apply as expected. The solution amounts to stating two generalizations and adding an exception for the underapplication. The general mechanics of this solution would not work for overapplication due to counterbleeding, but there is always provably some set of constraints, however unnatural, that will capture the pattern (Kaplan & Kay 1994). For example, in this case, the raising rule overapplies. Clearly we cannot ban raised vowels before voiced segments, because the flap is voiced, but we can ban raised vowels before each of the voiced segments except for flap; the trick is then constructing constraints that handle pattern for flapping environments, which is to raise only preceding an underlying voiceless segment. But constructing such constraints is not a problem for monostratal Optimality Theory: it is a problem for a version of correspondence theory that only allows statements about dependencies between input and output to be made with respect to a surface segment and its single input segment. What is needed here is dependent on the next input segment; but, again, this is not an issue for the architecture, it is a question about the theory of Con. 253 ule of the phonology, as in the current architecture, or as is generally the case in Lexical Phonology (Kiparsky 1985),4 then its output is not represented at the surface (of the other module), and so the problem of getting the underlying representation of an allophonically obscured environment to be visible dissolves: there is no change at the surface repre- sentation. The problem becomes slightly more complex if, as here, the process whose environment is obscured is also arguably strictly allophonic?and thus also in that second module itself, and thus potentially re-obscured by the effect of the second process apply- ing in that module. The current architecture, however, solves this handily, because, unlike in standard Lexical Phonology, the allophonic processes apply as if simultaneously (see above and Chapter 5). We have thus raised the following idea: certain opacity puzzles in surface-oriented phonology rest, naturally, on a particular conception of the surface which is not the one we are discussing here, just like the analysis of post-velar fronting and many others; but, more than this, the very fact that they are puzzles rests on this particular conception of the surface. Once a different architecture is adopted, the problem of visibility of environments on the surface evaporates for processes in the second module; and given the character of that second module, as proposed here?input-oriented, but not derivational?the problem of opaque interactions evaporates within that module, not in spite of, but because of the 4One can easily list the bevy of alternate solutions proposed to replicate these invisibility effects, which are given rise to by the inherent ordering of the post-cyclic/word-level or post-lexical levels in Lexical Phonology. A case of a post-cyclic rule that has received a lot of attention is the Levantine Arabic epenthesis rule discussed by Brame 1974. A short epenthetic vowel is inserted in certain environments, but it seems to be invisible to the phonology. See Kiparsky 2000 for a survey of recent solutions to this particular problem that attempt to replicate the invisibility effect without importing the architectural cut. See below for further discussion of the relation to Lexical Phonology. I will not discuss this case further, not because I do not believe that the epenthesis is plausibly gradient, but because attempting to develop a theory of phonetic processes that can handle wholesale insertion would take me too far afield. 254 fact that the phonetic module is not derivational. Again, this will be taken up in greater depth in Chapter 5. In this section, I have gone through three examples of analyses that make explicit reference to SC-representations: in one, it was not crucial; in another, it was crucial, and we developed a new analysis of these historical facts which fixes the problems for the SC- crucial grammar, and removes an implausible-sounding leap from one grammar to another from both analyses; and, finally, we went through a case where representing allophony as phonetic made a very big difference to how the grammar could be stated, namely, a par- ticular case of opacity. There are, of course, many other simpler cases of reliance on SC-representations which do not require much elaboration: to reiterate, all uses of ?surface phonotactics? to capture allophony?whether the case is one, like Mielke et al.?s analysis of Canadian English, where the result is supposed to be wholly static knowledge, with no alternations, or one where the surface phonotactics is supposed to give rise to morphological alterna- tions as well?all rest on crucially incompatible assumptions. This was discussed in the context of learning models in Chapter 3, a list which includes, minimally, Boersma 2001, Boersma & Pater 2007 Hayes & Wilson 2008, Blanchard & Heinz 2008, Jarosz 2011, Elsner, Goldwater & Eisenstein 2012. Although these learners generally do quite well, Chapter 3 argued that, in the face of the difficulty in learning SC-representations, the (rel- atively simple) task of learning allophony was put to more practical use if embedded in the phonetic learner, allowing for a reduction in the complexity of the surface representa- tion, from an SC-representation to an AC-representation, taking away the motivation for applying learners like this to allophony. This section has outlined the broader framework 255 in which this is situated. In the next section, I will probe the consequences more deeply, highlighting the connection to Lexical Phonology (Kiparsky 1982). 4.2 The Lateness of Allophony 4.2.1 Background: Structure-preservation and the cycle Since the linguistic computation needs to map not just between phonetics and strings of lexical items, but in fact between phonetics and hierarchically organized collections of lexical items?morphosyntactically complex objects?something needs to be said about how the phonological computation interacts with this structure; one approach would be to assume that the phonological component just maps between phonetics and an unbro- ken concatenation of all the lexical items in an utterance, completely insensitive to any morphosyntactic structure. Another approach would be to make the phonology sensitive to only the lowest-level boundaries?divisions between the leaf nodes of the morphosyn- tactic tree, but nothing more. However, another approach suggests that the phonological computation hews closely to the morphosyntactic tree. The phonological cycle was in- troduced in the 1960s as a strong instantiation of this idea. In the original version, the phonological grammar applies once to generate an output string at the leaf nodes; then erases the first layer of bracketing and applies again; and then erases the next layer of bracketing and applies again; and so on, until all brackets have been erased. This did a fair bit of work in the SPE analysis of English stress, so that the derivations for ?rchestr?- tion and ?nf?st?tion were, roughly: 256 [[rkstret]Vjn]N [[nfst]Vetjn]N First cycle Main stress rule [[rkstr 1 et]Vjn]N [[nf1st]Vetjn]N Alternating stress rule [[1rkstr 2 et]Vjn]N ? Pretonic weakening ? ? Auxiliary reduction rule ? ? Spirantization ? ? Palatalization ? ? Glide deletion ? ? Bracket erasure [1rkstr 2 etjn]N [nf1stetjn]N Second cycle Main stress rule [2rkstr 1 etjn]N [nf2st 1 etjn]N Alternating stress rule ? ? Pretonic weakening ? [nf3st 1 etjn]N Auxiliary reduction rule ? [2nf3st 1 etjn]N Spirantization [2rkstr 1 esjn]N [2nf3st 1 esjn]N Palatalization [2rkstr 1 ejn]N [2nf3st 1 ejn]N Glide deletion [2rkstr 1 en]N [2nf3st 1 en]N Bracket erasure 2rkstr 1 en 2nf3st 1 en Then, after the cyclic rules applied, a post-cyclic rule of vowel reduction applied in syllables not marked with stress: 2rk?str 1 e?n, but 2nf3st 1 e?n. The morphological differences in the two words?deriving from orchestrate versus infest?combined with the cyclic ap- plication of the stress rules to protect the second syllable of infestation from reduction. The emergent descriptive generalization about English stress was later called into ques- 257 tion, and a new analysis, which did not make use of cyclicity in this way, was proposed (Halle & Kenstowicz 1991: see below for discussion of the other implications their anal- ysis had). However, the inside-to-outside application model had in the meantime accrued several other empirical patterns to its credit; one type, Strict Cyclicity Condition, or SCC, effects, which were handled under a small modification to the Chomsky and Halle frame- work; and level ordering effects, which prompted a more radical revision to the model in the form of Lexical Phonology. The latter will be discussed shortly. In all of the cyclic computation models, however, a second block of non-cyclic rules had to always be distinguished. For example, vowel reduction had to apply once, after the cyclic computation; it had to be after, or else the stress placement rules would not have worked properly. In the grammars of many other languages, such post-cyclic rules also appeared. A curious ?post-cyclic syndrome? then began to emerge: post-cyclic rules were non-structure preserving, which meant that they almost never hewed to the (hypothesized) underlying inventory of the language; that is, they were, largely, strictly allophonic. This cut in the phonology did not have anything to do with the cycle per se. Even without the cycle, a division of phonology into early structure-preserving rules and late allophonic rules was plain: having established a partial order over the rules in a grammar, one would almost always find that no non-structure-preserving rule was ordered before a structure- preserving rule (in fact, as I will discuss in Chapter 5, within the late block, the rules also almost never fed each other). A division saying roughly that structure-preserving rules fed non-structure-preserving rules was not new. In fact, one type of pre-generative theory that had a division like this had driven one of the early vociferous battles surrounding generative grammar, the debate over and break from the structuralist taxonomic phoneme 258 level. However, this only became clear within generative grammar once there was an ar- chitectural cut that could make the distinction clear. The other important symptom of the post-cyclic syndrome was that post-cyclic rules were without lexical exceptions, whereas cyclic rules were generally riddled with them. To fully understand the context in which we should be talking about allophony and lateness, we must first cover a few other facts and theories. The relevant theories all incorporate hierarchically structured cyclic domains for the application of phonological grammar. In the Aspects/SPE model, one could see a ?domain? as corresponding to a subtree of the morphosyntactic structure assigned to the sentence. This model became less popular after an effort was made to separate morphological from syntactic structure in different components of grammar. At the same time, within the morphological sys- tem, a demand for higher-order cyclic domains emerged. Lexical Phonology maintained the cycle, but divided it up into different strata, or levels. The level L(I) was for all the morphology closest to the stem; in English, it was home to the stress-shifting affixes (-ity, -ic, -ate, ?); the level L(II) contained stress-neutral affixes (agentive -er, -ness, -hood, ?), as well as compounding. This cleaned up a distinction that was already made in SPE, wherein what wound up being the L(I) affixes were treated as lexical-item-internal bound- aries marked with a ?formative boundary symbol? +, a weaker boundary than the ?word boundary symbol? # introduced between layers of morphology. Further levels were also proposed, with authors disagreeing on how many there should be (the current consensus within Stratal OT, the modern descendant, is two). Each level could in principle have its own set of rules, although some authors posited constraints that forced some sharing across levels (Halle & Mohanan 1985). The principles of the cycle (most notably the the 259 recently discovered SCC) were generally understood to still hold within levels, however; thus there were two types of domain: the morphological cycles, and the levels that set off a nested grouping of these morphological cycles, with L(I) affixes nested within L(II) affixes and so on, but still giving rise to cyclic domains within levels. More recent work, mostly following Distributed Morphology, (Halle & Marantz 1993), has returned to the integrated morphology?syntax model. In the meantime, syntax, too, has developed a type of higher-order cyclic domain than just the bare syntactic subtree. There have been sev- eral versions since the 1970s, (Bresnan 1972, Uriagereka 1999, Chomsky 2001), and in most current versions these higher-order cyclic domains?specially designated subtrees, CP, vP, DP, and sometimes others?are called phases. Recent work has attempted to unify the phonological cyclic domains with Phase Theory (Piggott & Newell 2006, Compton & Pittman 2010, Slavin 2012). Still other types of theories have suggested domains defined by an independent phonological (prosodic) hierarchy, unrelated to the morphosyntactic structure, but these are largely irrelevant for current purposes. In this section I will discuss two things: first, the deduction of the order non-structure- preserving follows structure-preserving under the current architecture. Then, since other work on this question is usually embedded in a model with cyclic domains, (either just in the Aspects/SPE sense or also in the second-order Lexical Phonology/Phase Theory sense), I will discuss how this architecture comports with those theories. The first part will be easy, the second part will be left open as clearly problematic. That is because, as we will see, a natural way to interpret the allophonic rules when we combine the current architecture with cyclic phonological domain theory will be as rules applying at the end of every cyclic domain. This is natural either because (i) such a thing already has some em- 260 pirical support, with blocks of post-cyclic rules applying at the end of every higher-order domain (Lexical Phonology); or (ii) because the theory demands that the higher-order domains be domains for phonetic interpretation (Phase Theory), and making this empir- ically meaningful demands that we try to hew as close as posisble to having this mean actual phonetic interpretation in the gradient sense we have been working with. This is problematic. I outline the reasons and sketch some suggestions for proceeding at the end. To clarify point (i): Berm?dez-Otero & McMahon 2006 and Berm?dez-Otero 2013 attribute the following examples of blocking of allophonic processes in English to cyclic domain differences: (101) Vernacular London Goat Split ! =?l: where . represents a syllable boundary Stem L(I) L(II) hole hl holey hl#i holy h.li Walpolian wl.p.l+ i?n 261 (102) Northern Irish Dentalization ft;d;n; lg! dental=?(V)r Stem L(I) L(II) train tren sanitary s?nt+ ?ri shouter t#?r (103) Canadian Raising fa;ag! central,raised= 2 6 6 4 ? hstress< 1i 3 7 7 5 2 6 6 4 C voice 3 7 7 5 2 6 6 4 V h stressi 3 7 7 5 Stem L(I) L(II) Eiffel f?l i-th +? eyeful a#f?l The generalization they give is that L(I) affixes form part of the domain for allophony? the end-of-domain rules apply at the end of L(I). (In fact, as we will see below, if this generalization is correct, then it is a better diagnostic for the cyclic domain of an English affix than the usual diagnostic of stress-shifting, because that has since been undermined.) Thus one might think that the appropriate sense of ?late? is ?at the end of L(I),? at least for English. However, there are also allophonic processes that apply across both L(II) suffix boundaries and indeed word and phrase boundaries. The first set of allophonic processes are usually called ?post-cyclic? and the second ?post-lexical? (Booij & Rubach 1987). This is extremely confusing, and so, to keep things straight, I will continue to refer to blocks L(I) and L(II) as particular examples of ?higher-order cyclic domains,? which I 262 now begin to abbreviate as HOCDs; the non-cyclic rules applying at the end of each block become, in my terms, end-of-domain rules, EOD rules. If this model is right, then there are separate blocks of allophonic rules for each HOCD; if it is not right, then there is only one block of allophonic rules, following the outermost HOCD (whether there are HOCDs or not). I begin, however, by putting this aside?just assuming that there is one block of allophonic rules?and deducing their lateness. 4.2.2 Structure-preservation and phonetic transforms ?Deducing? the lateness of non-structure-preserving processes may sound strange given the architecture proposed above. Particularly if we continue with our simplifying assumption, made up to this point, that the phonetic transforms must be non-structure- preserving and the categorical phonological grammar must be structure-preserving?which we will elaborate on and nuance here and in Chapter 5?it sounds like, by proposing this architecture we have already just stipulated that non-structure-preserving processes follow structure-preserving processes. The point is this: even before we ask about the association between phonetic in- terpretation and non-structure-preservation, we can ask the separate question of whether the order ?categorical phonological grammar before phonetic interpretation? must be the architecture, and indeed it must. Consider the following very weak core principle of what we understand to be meant by phonological grammar?what I have been distinguishing from phonetics by calling it categorical. What makes it ?categorical? is, minimally, this: (104) Coarseness of Phonology Principle. The set of distinct single-segment 263 representations which appear in the set of all possible grammatical outputs for a given phonological grammar can be placed in correspondence with only a proper subset of the universally possible single-segment phonetic representations. This says that, for any particular phonological grammar, the set of segmental distinctions made in legitimate grammatical outputs, in toto, is of lesser cardinality in the Cantor sense than the set of segments that the phonetic system can distinguish. It does not exactly say that the set of universally possible segmental phonological representations is of lesser cardinality than the set of universally possible segmental phonetic representations: but, in that case, a grammar that placed no constraint, either input or output, on the grammatically possible segments, so that there would therefore be a way of phonetically interpreting the outputs that allowed for any possible phonetic distinction, would violate the CPP. Such a grammar would leave us without any explanation for the basic empirical fact discussed in Chapters 1 and 3 that the changes phonological grammars make are categorical. Now, suppose that, under our model, we fix a context x and execute the mapping from possible phonological output segments to their corresponding phonetic interpreta- tionsPx. For any one of the possible segments, suppose we then change the context to x0 and find the result px0 . There is nothing in this architecture that says that px0 2Px or anyPy6=x0; only assumptions about phonological grammar that violate the CPP could en- sure such a thing. Therefore, even if there were some segmental category that represents px0 in another context, (think: some way of representing Inuktitut [o] in the phonological representation other than the sequence [u][+uvular]), logically, the phonological grammar would be forced to neutralize it to something else; anything else would be impossible by 264 hypothesis. For all intents and purposes, this means that it is perfectly legitimate for there to be some (oracle) category in this system for which the only possible way of generating it is by applying a transform to a particular segment in one particular context. Now it is a short jump to the conclusion that context-dependent phonetic interpre- tation cannot feed phonological grammar. For if it did, its application would sometimes be uninterpretable or fruitless, restored to some new categorical representation only to be inevitably neutralized to some other. Notice that this does not imply that all the segments in the input need to be phonet- ically interpreted at the same time. It is fully consistent with this feeding constraint on the architecture that we could have a variant architecture, tolerating ?mixed? interpreted? uninterpreted sequences. Above, we happened to stipulate this out of existence, but were it true, the reasoning here would tell us that, during the phonological computation, there could be already-interpreted elements that the phonological grammar would have to stu- diously avoid ever making reference to. This would be almost indistinguishable from the architecture proposed here, except that (i) it would open up the new possibility of gradient environments for allophones; and (ii) it would allow, in principle, for cyclic non-structure- preserving processes, diagnosable perhaps by their otherwise unexplained restriction to derived environments (SCC; see Kiparsky 1982). As far as I know, there are no such cases, and so, for the time being, point (ii) remains moot, and point (i) remains barred by stipulation. It also does not rule out the more general possibility that one could have phonetic interpretation feed phonological grammar, but then, only those representations which could be restored as phonologically interpretable categories could be operated over further. The same reasoning applies: this is basically indistinguishible, although it would 265 be worth investigating certain exotic cases further. Having shown, then, that phonological grammar more or less has to feed phonetic grammar, and not the other way around, I now consider what this has to do with non- structure-preservation. Recall that the fact to be explained is that non-structure-preserving rules are late; the obvious late block in our architecture being the phonetic component, it is natural to try to show that category changes which are non-structure-preserving in the clas- sical sense?introducing a distinction made nowhere in the lexicon, or explicitly barred in the lexicon?are phonetic. Now: what I have just outlined amounts to saying that phonetic transforms are po- tentially non-structure-preserving in a different sense?they are at least capable of intro- ducing distinctions banned in the output, not the input, to the phonological grammar. I will show how we get from here to an implication in the correct direction; and then to an implication, not from ?non-structure-preserving in the classical sense? but from ?in complementary distribution.? Now, throughout this chapter and the preceding one, I have made numerous references to the idea that the association between phonetic transforms and allophony was not a perfect one. Thus what I will show here is simply how to reverse the implication, given that it is imperfect in both directions (the details will be saved for Chapter 5). The answer, of course, is Bayes? Rule. Now, there are two different claims being made: the first is about non-structure- preservation in the classical (lexical inventory) sense, and the second is about comple- mentary distribution. I will start with the first one, about processes giving rise to non- lexically-specified categories. We start with this fact which will hold of any gradient phonetic learner: 266 (105) The prior probability of the (context-transformed) phonetic representation for any category being exactly identical to any the (context-transformed) phonetic representation for any existing category is extremely low. Let oi;x be the oracle category corresponding to category i in context x. For any category?context pair j;x0 6= i;x: Pr oi;x = o j;x0jlexicon;grammar Pr oi;x 6= o j;x0jlexicon;grammar . Now, hold fixed a lexicon in which there is a collection of tokens that are all coded using the same category, even though any high-likelihood phonetic implementation will split them into two different oracle categories. Bayes? Rule can help us to go from the above to see just when it will be the case that Pr transform i in ctxt xjlexicon;8 j;x0;oi;x 6= o j;x0 > Pr new cat. in ctxt xjlexicon;8 j;x0;oi;x 6= o j;x0 . That is, when will it be the case that an oracle category in some context being distinct from any existing oracle category given by the lexicon and the existing phonetic implementation (thus, non-structure-preservation) means a phonetic transform (late rule) is better for the learner than a phonological rule? I will spell out all the technical and empirically testable details in Chapter 5. It will, of course, have something to do with complementary distribution. To see this, think about the following relevant facts, which speak to the converse of the second claim: (106) Given a grammar with phonetic transforms perturbing category i in context x, the probability of there existing in the phonetic interpreted output a distinct oracle category for context x is just the probability of the phonetic transform for x being non-zero: Pr [transform of i in context x 6= 0jlexicon;grammar]. Given that the transform is non-zero, there will be two oracle categories in complementary 267 distribution. Bayes? Rule can help us go from the above to see just when it will be the case that Pr[transform i in ctxt xj complementary distribution]>Pr[new cat. in ctxt xj complemen- tary distribution]. Suffice it to say, this will generally be true. I leave the details, as well as the discussion of purported cases where this preference seems not to have held, despite complementary distribution, for the discussion of incomplete neutralization in Chapter 5. 4.2.3 Issues with phonetic EOD blocks in HOCD theories Consider now the empirical evidence given by Berm?dez-Otero and McMahon that led us to conclude that English post-cyclic rules applied within a narrow HOCD we called L(I); this domain is well within the scope of primary stress and resyllabification, setting up a bifurcation in the domain corresponding to firm ?word?-like boundaries. As the notion ?word? is neither a surface property nor obviously a natural kind, this is unsur- prising. However, if HOCD blocks all have phonetic implementation at the end of them, then it implies phonetic implementation in some sense ?within words.? Crucially, if the inside-to-outside computation model is correct, then the mapping to phonetic output for a nested HOCD is a sufficient substitute for the tree in the mapping at the outer HOCD. To the extent that phonetic interpretation in the ?outside? domain depends crucially on phonological, rather than phonetic, information ?inside? a nested domain?call this ille- gal HOCD interaction?then one of two things needs to be done: (i) we need to claim that phonetic interpretation can actually be undone; (ii) we need to explain why that informa- tion also needs to be present in the phonetic object anyway. At least where the relevant 268 phonological information is some categorical representation of an allophone, (i) should surely be impossible, but any case where we need to appeal to that device weakens the notion of ?phonetic interpretation? substantially. I now go through some cases that look problematic on the surface. The first type is stress adjustment. The positions of phrase and compound stresses are almost always the relevant stresses as determined word-internally?that is to say, the position of stress is computed at the inner domain and not adjusted at any outer domain?but the final degree of stress is in general determined at the outer domain (see Chomsky & Halle 1968 for English nuclear stress, and see Cinque 1993 for German and Italian; although some have suggested that the position of secondary stresses within German words is mobile and de- pends on sentence context, Knaus, Wiese & Domahs 2011 present clear evidence that it is determined within the same domain as main stress). The degree of stress is gradient infor- mation; however, it is hardly ?phonetically interpreted.? To the extent that sentence-level stress adjustments preserve the language-specific phonetic interpretation of the abstract ?degree of stress? marking, it must remain abstract, subject to the latest possible inter- pretation. Thus, posting end-of-domain phonetic interpretation commits us to a two-stage process of phonetic interpretation for stress; it also seems to commit us to saying that determining sentence-level degree of stress is done in a way that is qualitatively differ- ent from the way word-level degree of stress is marked, which is inconsistent with many metrical theories. Altering the degree of stress on a particular syllable, whether the change is gradient or not, requires identifying the position of the stress-marked syllable in some representa- tion. The position of stress is treated as a position in a sequence, on a grid line, or in a tree; 269 but at any rate it is discrete. One might, at first blush, think that this could explain why the position of word stress is not changed at the phrasal level: we might expect that the discreteness of the sequence is erased. This would not make sense, however, as the correct syllable still needs to be identified in order to change the degree of stress. Indeed, there is no particular reason to think that the structure scaffolding the position of stresses is ever required to be converted to a representation which is gradient in the relevant sense. A few cases where the position of word stress is apparently adjusted according to position in a larger domain do exist, and they are informative. The principle case is stress retraction, as in the English rule shifting Marc?l Pr?ust to M?rcel Pr?ust and the related German as well as Tiberian Hebrew patterns (see Prince 1975, Liberman & Prince 1977, Dresher 2009a); analyses of these retraction phenomena using a hierarchical metrical grid or tree only require moving the top-level prominence marking to align with a position fully de- termined by the next level of the structure. Thus the visibility of this information could yet be heavily restricted. Even granted that phonetic rules can access metrical structure, however, one wholly problematic case is posed by the pausal forms in Tiberian Hebrew (Prince 1975). These marked forms crucially appear with a restricted distribution?before a major phrase boundary: (107) UR Contextual Pausal katab+ta you wrote katabta katabta katab+u they wrote katbu katabu zaqen+u they are old zaqnu zaqenu A recent analysis by Dresher (2009b) makes the computation of word-internal metrical structure sensitive to the prosodic context, nearly from the start: immediately after the 270 initial unmarked main stress rule applies, (penultimate for vowel-final words, otherwise final), a strengthening which affects metrical structure takes place in the pausal-form con- text: in Dresher?s metrical grid account, in the pausal-form environment, ?assign a right bracket and a grid mark at every prosodic level up to the Intonational Phrase on a vowel to the right of a left bracket?; vowels which head an Intonational Phrase are then lengthened. Since there are no subsequent metrical operations needed at the lower or intermediate lev- els, this derivation itself is unproblematic, assuming that the lengthening is phonetic at the IP level; but the environment for this strengthening is inherently phrasal, requiring the ?inner-level? computation to be sensitive to a small amount of ?outer-level? information. At present, this is a case of strictly illegal HOCD interaction; both this exceptional case and the fact that it is exceptional must be explained. One might think, however, that stress is just different and is not subject to the bounds of the HOCD architecture. Segmental phonology presents a much bigger problem. Gen- erally speaking, any segmental rule which applies across ?word? boundaries is going to be a problem if ?word? is understood to refer to some HOCD triggering phonetic inter- pretation. The Polish rule of allophonic palatalization (Booij & Rubach 1987) affects all consonants before [i] and [j]. It also applies across word boundaries. That means that the information ?consonant? and ?high front vowel? needs to be preserved in the output of phonetic interpretation. It is possible to reconcile this with a weak sense of phonetic in- terpretation: the segment [i] is spelled out, but this information can be converted back to categorical information if necessary. Contrast this with our architectural argument above: there it was the fact that allophones could not be mapped back to their own phonologi- cal categories that blocked further interactions; but the relevant information here is not 271 introduced by allophony. (For now I am putting aside the problem raised by Booij and Rubach?s claim that a word-level EOD rule of retraction bleeds palatalization; see Chapter 5.) Similarly, an inner HOCD may provide the trigger, not the target, for a process in an outer domain. Kiparsky 1982 and Halle & Mohanan 1985 assign English regular inflectional morphology (plural -[z] and past tense -[d]) to L(III) and L(IV) respectively, for simple reasons of its position outside derivational morphology and compounds. The standard analysis is that both are subject to epenthesis of a short vowel following stem-final sibilants (plural) or coronal stops (past); then devoicing following stem-final voiceless obstruents. Both processes clearly depend on phonological information in the stem which appears to be categorical.5 Again, the only way to handle this is to weaken the notion of ?phonetic interpretation.? (It is also worth noting that the standard analysis, with two separate processes in a feeding relation, could not work here if the processes are both phonetic: see Chapter 5.) As for stress, it is usually accepted that there are no illegal metrical interactions of the kind discussed above within English words: one of the key properties L(II) affixes is that, unlike L(I) affixes, they are never stress-shifting (although -al is stress-sensitive: arrival, edital). However, as Halle & Kenstowicz 1991 point out, stress-neutral L(II) af- fixes in fact sometimes mysteriously appear strictly inside the scope of stress-shifting L(I) affixes, and not only receive stress as a result, but appear to do so over their underlying, 5English homorganic nasal place assimilation is another such possible case (un- is an L(II) affix; un+be- lievable assimilates in place to [mb?liv?b?l]); this is likely an EOD rule, as it also generates un+fair [fr], with a result that is not lexically contrastive. However, the spreading nature of this interaction suggests an- other analysis in which the triggering information really is phonetic. If that means gradient, then we could in principle predict gradient effects depending on small deviations in the pronunciation of the following place?so, more like coarticulation?but in fact this reasoning is highly debatable. 272 pre-interpreted segmental representations: p?tent#ab?l versus p?tent?ab?l+ity. Halle and Kenstowicz simply mark each suffix as being stress-shifting or not; each stress-shifting affix completely erases the stresses previously assigned and starts the metrical structure anew. This undermines the assignment of these suffixes to separate HOCDs, but there is no reason to think that stress needs to be computed at the innermost HOCD. Contrast this with the Latin clitic?host stress pattern discussed by Halle & Kenstow- icz 1991. Stress is antepenultimate if the penult is light. Stress can shift under enclisis (?bi, ?where,? ub?#libet, ?wherever?), but it does not always shift to the position predicted un- der the stress rule (l?mina, ?doorways,? limin?#que, ?and doorways,? lim?na#que). The explanation does not rely on -que destructively modifying the existing metrical structure; rather, it relies on the fact that it cannot. The addition of -que assigns a line 1 asterisk, but only to unbracketed line 0 asterisks?not to already bracketed and projected ones. The first two syllables of limina already having been bracketed, the only legal place to mark an asterisk is at the third syllable. No destructive modification of information in the nested HOCD is needed to get the stress shift, suggesting that in Latin it would be possible to compute the stress at both HOCDs, and limit interactions in the way suggested above. (For Halle and Kenstowicz, it does require a revocation of extrametricality on the last syllable, but this is an idiosyncrasy of their theory.) Finally, resyllabification phenomena are pervasive and clearly require access to cat- egorical information, but only up to a point: (sea[l]), ((sea.[l])(o.ffice)), where compound- ing appears to be L(II), but ((the)(sea[l]))((offered)((a)(donut))), with no resyllabification at a higher HOCD. The categorical identity of [l] is necessary in order to recompute the light/dark allophony. 273 In sum: the well-known association between non-structure-preservation and late application (with its notable consequence of failure to be blocked in non-derived environ- ments) is deduced under the current architecture, but needs to be weakened in an inside-to- outside phonetic interpretation theory. In particular, although the crucial step in the deduc- tion makes use of the fact that only gradient information is visible after interpretation? and gradient interpretation is incompatible with categorical phonological grammar?any attempt to compute the phonetic interpretation of an entire utterance by doing complete interpretation of its parts seems to run into immediate difficulty by this logic, because computation seems to require some access to categorical, and not only gradient phonet- ically interpreted, material inside nested domains. I noted that some limited amount of interpretation-reversal could alleviate the problems. See Chapter 5 for a bit more discus- sion. Finally, I note that some measure of constrained non-structure-preservation in the cyclic phonology can be tolerated; it is not a hard constraint. On the other hand, as first hinted at by Kiparsky 1985, any cases where post-cyclic processes are genuinely structure- preserving substantially weaken a theory of gradient post-cyclic processes, and that is cer- tainly also true here. (A relevant case brought up by Kiparsky is Vata ATR harmony, but his question?he does not know the answer?is not whether it is structure-preserving, but whether, given that it is not, it bears Cohn 1990?s interpolation-type hallmarks of phonetic harmony. This would be an interesting case to examine further phonetically.) 274 4.3 Summary In this chapter, I have addressed two consequences of taking seriously the notion of context-dependent phonetic interpretation, as proposed in the current architecture: under the assumptions of LPTH, I have discussed the absence of conventional ?surface represen- tations? coding allophones in a categorical way?which I call SC-representations, to sug- gest ?surface category??and their replacement with more abstract representations, which do not code predictable allophones?which I call AC-representations, to suggest ?abstract category.? I have also discussed non-structure-preservation, the definitional property of strict allophony, and deduced the ?phonology-first? architecture, which is this theory?s answer to the ?post-cyclic? or ?post-lexical? rule block in Lexical Phonology and related theories. I have pointed out some of the problems with this architecture, when taken seri- ously as a model for cyclic phonetic interpretation, and suggested tentative solutions. 275 Chapter 5: Phonetic transforms II: Linguistic phenomena Everything that happens will happen today, and nothing has changed, but nothing?s the same. ?David Byrne, ?Everything That Happens Will Happen Today? In this final chapter, I show that incomplete neutralization is not only easy to han- dle on the current theory, but actually predicted to be a very general case for phonetic rules. I show how the posterior evaluation measure handles various key phonetic patterns (strict allophony, incomplete neutralization, true neutralization), and why the first should tend to be phonetic, the last should tend to be phonological, and why incomplete neutral- ization might sometimes be learned as complete (phonological) neutralization, leading to language change. I then show that the opaque ordering exhibited by Canadian Raising is not only easy to handle, but predicted to be the only possible type of interaction between phonetic transforms. In order to maintain this empirically, I need to clearly separate the kinds of phonetic processes handled by phonetic transforms from another type of appar- ently phonetic process, phonetic spreading, which I propose takes place via (deletion and) sharing of phonetic features. I finish by discussing a few outstanding issues. 276 5.1 Incomplete neutralization Nothing in the theory above says anything about how phonetic transforms deal with the existence of other phonetic categories in the system. Although we have alluded to the idea that phonetic transforms create ?new categories,? not generally expected to be iden- tical to the realization of any other existing phonetic category, we have not said anything about the possibility that the result might be very, very similar. Incomplete neutralization is exactly this. Incomplete neutralization refers to a pro- cess that gives contextual variant pronunciations that are similar, but not phonetically identical, to some other segmental category. The best known example of incomplete neu- tralization is German word-final devoicing (Port & O?Dell 1986); however, many other voicing alternations show evidence of being incomplete (Catalan: Wheeler 2005; Dutch: Ernestus & Baayen 2006; Polish: Slowiaczek & Dinnsen 1985; Russian: Pye 1986). The English flapping alternation is arguably incompletely neutralized for voicing, a fact which will be discussed later in this chapter (Braver 2011). Turkish devoicing does not appear to be incomplete, on the other hand, although the data (Wilson 2003) is not perfectly clear, and the Korean neutralization of th, s, s*, t and th to t in coda position is also evidently phonetically complete (Kim & Jongman 1996). To give a clearer picture of the phonetic effect of incomplete neutralization, I have combined the results of two acoustic studies on German stops in Figure 5.1. Figure 5.1 shows that stops in both final and initial position are less phonetically different than stops in intervocalic position (the means are closer together).1 Stops in ini- 1The intervocalic position I have chosen rather arbitrarily as the basis for comparison just because it 277 Figure 5.1: Duration of voicing versus duration of aspiration for German stops in three positions. The left panel shows intervocalic position (a ?lenition? position); the middle panel superimposes coda position (the ?final devoicing? or ?final fortition? position); and the right panel superimposes word-initial position (the ?initial fortition? position). The data are taken from summary statistics presented by Jessen (1998), except for the coda position data, which are taken from summary statistics presented by Port & O?Dell (1986). Categories are plotted as though they were Gaussian, even though their distributions are all truncated at zero. Categories are plotted as though they had zero covariance, but their covariances were actually just not reported in either study. tial position show little to no detectable voicing (in fact always zero for voiceless stops) but the aspiration contrast is roughly the same. Stops in final position show an incomplete neutralization in both voicing and aspiration, but their voicing remains phonetically dis- tinct from initial position (almost certainly because they are preceded by sonorants in final position; for final and intervocalic position the voicing is so-called ?voicing into closure,? measured from the offset of the preceding sonorant, while for initial position any voicing prior to aspiration is counted, as the point at which closure is initiated cannot be deter- mined from the phonetic data). The data is presented principally for illustrative purposes. As far as I know, no single experimental study has measured both the final devoicing con- text and either of the two contexts studied by Jessen. As the experimental conditions and the tools used for measurement were somewhat different across the two studies, the com- highlights the contrast best. This environment is usually understood to be subject to ?additional? voicing, but whether one of these environments actually represents the basic phonetic representation of the contrast, and, if so, which, is impossible to know just by looking at this data. 278 parison of the results is imperfect. The main point, however, is that, in both cases, the laryngeal contrast is partly neutralized, but underlying voiced and voiceless stops remain phonetically distinct. Much of the debate surrounding incomplete neutralization during the 1980s seemed to rule out the possibility that it could coexist with true neutralization: the followup study of Fourakis & Iverson 1984 tried to rule out the possibility that speakers were enhancing the contrast on the basis of the orthography, on the assumption that anything else would be problematic; and Port & Leary 2005 explicitly uses incomplete neutralization as an argument against the existence of categorical phonological processes. However, the cur- rent architecture is in line with a handful of recent proposals that recognize the need for both categorical and gradient context-dependent operations (van Oostendorp 2006; Braver 2011). Furthermore, given the discussion of Old East Slavic post-velar fronting in Chapter 4, the same grammatical mechanism may be assumed to be at work in phonetic enhance- ment, and so both things?that voicing can be incompletely neutralized, and that the de- gree of neutralization can change as a response to pressure to enhance a contrast?can be true. The question, of course, is how incomplete neutralization, or more generally, pho- netic grammar, can coexist with phonological grammar, and still permit phonological grammar to do its job. The basic questions one needs to answer, then, are what evidence would lead the learner to think that it should be attributable to a phonetic rule and which would not. In Chapter 4, we said something to the effect that complementary distribution would lead a learner to posit a phonetic rule; in this section we flesh that out. Comple- mentary distribution will of course break down to a fair degree for a real speech perceiver 279 (not an oracle) under incomplete neutralization; I discuss when the learner would find it appropriate to maintain a phonetic rule nevertheless, and when it will abandon this. I then discuss briefly the kinds of additional pressures that will lead a learner to alter the phonological grammar to explain the effect as a ?real? neutralization. Before moving on to the main discussion of how these patterns will be treated by the learner, however, it is worth noting some general patterns that seem to hold in the pho- netic data. The order holding between the means across categories on both dimensions remains fixed across all three contexts, but the difference is smaller in the fortition posi- tions, except for initial position, where the difference in aspiration is slightly larger than in intervocalic position. The ordering between the variances along the two dimensions also remains fixed within categories and across categories, across all three contexts; in initial and intervocalic position the variances are the same for voiced stops. Future work should attempt to check these descriptive generalizations to rule out the possibility of interference from the difference in methodology (the numerical difference in the variances in coda po- sition is particularly salient, but in principle any or none of these generalizations could be spurious). They are all suggestive of the pattern laid out for Inukitut and Kalaallisut retraction and fronting in Chapter 3, and further study will help to support generalizations about . Regardless of the details of how phonetic transforms work, however, the fact that the phonetic distributions for the two categories are not identical to each other in any position indicates that what is happening is not that one category is changing into another at the AC-representation level. I view the rough preservation of the variance structure as welcome, but tentative, additional support for the uniformity of this class of process with the allophonic processes discussed in Chapter 3. 280 5.1.1 Empirical predictions 5.1.1.1 One category, one process We left the framing of the relation between complementary distribution and phonetic rules in Chapter 4 as: ?phonetic rules imply complementary distribution, but why should the converse hold??i.e., that complementary distribution implies or strongly suggests a phonetic rule.? Actually, there are several things about this that need to be cleaned up before we can proceed: (108) Complementary distribution needs to be swapped out for degree of complementarity of distribution. (109) Suggests means increases the posterior probability. (110) Phonetic rule needs to be swapped out for magnitude of phonetic transform. (111) The last two points imply that suggests a phonetic rule needs to be swapped out for shifts the posterior distribution of the phonetic transform magnitude away from zero. (112) Complementarity of distribution can only be evaluated with respect to a particular hypothesized category structure. The first point is a consequence of the fact that we do not actually perceive categories: we get tokens and infer categories. Thus we cannot ever expect the complementarity of distribution to be any more perfect than our inferred categories. The second point should be self-explanatory by now. The third point is not actually quite right: we will be asking a 281 qualitative, and not a quantitative question; but to start with just consider that the question ?is there a phonetic rule? means ?how much bigger is the phonetic process than zero,? and the reasoning will be easier. The fifth point is the most important. No structure ?just emerges? from any finite data set, apart from the data itself. Complementary distribution is no different. The usual definitions of complementary distribution are based on the sim- plifying assumption that there is a correct assignment of tokens to a set of phones that could be given by some oracle, and in any particular case we assume that our transcriber has done this assignment right. However, the oracle categories are just one of infinitely many structures that could have been assigned to a finite collection of phonetic obser- vations. Thus, in order to talk about complementary distribution at all, we need to hold an assumption about how the ?true? category assignment goes and implicitly assume that such a true assignment exists; it does not simply suffice to be given a phonetic corpus. For the current purposes, I will make life easy and assume that the true oracle phones are a pair of equal variance Gaussians which may differ in location. Each one gives rise to pronunciations in a different context, either x or x0. Simplifying, I hold constant that the learner is attempting to learn the base representation r and the transform representation t for only one single category. Given this, if the locations for the two (true) categories are exactly the same, then I take this to be the case where there really only is one oracle category. I will call the oracle Gaussian associated with x the category A, and I will call the one associated with x0 the category B. Complementary distribution is about the probability distribution of contexts in which a particular category appears. Even besides the issues about categories versus tokens, this is a problematic definition if the categories A and B 282 are simply defined as those categories associated with the contexts x and x0! We instead take the context to be the output of some decision rule for the category given a particular observation, under the intuition that complementary distribution asks how well we could use the ?apparent? categorization to predict the context. Thus here the context distribution which is to be assessed as complementary or not is the distribution of predicted x/x0 given the actual category (either A or B). The degree of complementarity d is, as in Chapter 3, the symmetrized KL-divergence. Hold constant NA, the number of observations that are actually from category A, and NB, the number of observations that are actually from category B. Let N3A and N 3 B be the number of A and B observations correctly classified by the decision rule, respectively, and N7A = NA N 3 A and N 7 B = NB N 3 B . Then we have: " log N7A N3B NB NA ! N7A NA + log N3A N7B NB NA ! N3A NA # (113) + " log N3B N7A NA NB ! N3B NB + log N7B N3A NA NB ! N3B NB # = " N3B NB N7A NA # log " N3A N 3 B N7A N 7 B # Where we are going is to look at the learner?s posterior distribution over transforms Tx0 , and how it shifts as a function of d. However, we cannot make any such inferences about what the learner will conclude based on the actual value of d for any finite corpus: any particular value of d can be associated with a wide range of posterior distributions over Tx0 . Instead, we can integrate over all possible data sets for a given pair of categories 283 A and B to obtain the expected value of d. Since a given data set is summarized by the number of correct and incorrect classifications, we have, for fixed NA, NB: E [d] = NA N3A =0 NB N3B =0 p N3A A3 (1 p A3 )N 7 A p N3B B3 (1 p B3 )N 7 B " N3B NB N7A NA # log " N3A N 3 B N7A N 7 B # (114) The summand is always positive and finite, and, as long as p A3 ; p B3 0:5, then it is increasing in both. The most obvious decision rule is to select the maximum-likelihood category and toss a coin if the probabilities are equal. For a pair of equal-variance Gaus- sian distributions, the decision bound is a hyperplane at the midpoint between the two means?in our case, the hyperplane normal to t which passes through r+ 12t?and pA3 = p B3 =: p3 is just the total probability beyond the decision bound to the side of the Gaus- sian which faces the other category. Since this is always the ?smaller? side, for jtj > 0, p3 > 0:5. Thus the summand and thus E[d] increases in p3. There is only one way for p3 to increase, however, for fixed variance: increase jtj. Thus if the expected degree of complementarity of distribution increases, then the separation between the categories must have increased. Now, what does this have to do with the learner? In the limit, as the size of the corpus goes to infinity, the mean values for the two clusters will be r and r+ t, and it is easy to show that the mean of the posterior on r ; t will be Gaussian and centered at a combination of the prior mean with these empirical means (with some small corrections for correlation). Thus, as t moves away from zero, so will the posterior for t shift in that direction. Thus: as complementarity of distribution increases, the magnitude of the 284 preferred phonetic transform also increases. This would not exactly be what we were originally after?complementary distribu- tion leads to a phonetic rule, not leads to a larger-magnitude phonetic rule?were it not for the variable selection parameter g . Recall from Chapter 3 that this model allows the learner to disable particular phonetic transforms by fixing them at zero (g = 0). Recall further that no a priori bias towards one or the other value of g is necessary. The joint posterior on g in conjunction with the other hyperparameters it regulates gives rise to a Bayesian Occam?s Razor effect (see Chapter 2). In our case, this takes the expression for the joint posterior from this (ignoring S)? (115) f (r; tjX ;Y ) = f (r; tjX ;Y;A0;W) f (A0) f (W) dA0;W ?to this: f (r; tjX ;Y ) = 1 2 f (r; tjX ;Y;A0;W) f (A0) f (W) dA0;W(116) + 1 2 d (t) f (rjX ;Y;A0;red;Wred) f (A0;red) f (Wred) dA0;red;Wred The densities f (A0;red) and f (Wred) are meant to be the ?reduced? densities for the distri- butions on these hyperparameters wherein the t row of A0 and the corresponding rows and columns of W are fixed at zero to yield a degenerate Gaussian. These densities are neces- sarily numerically greater over the regions where t = 0, because (see Chapter 3) they are obtained as the conditional distributions of the relevant submatrices?satisfying similar- ity, (see Chapter 2), and thus giving rise to the Bayesian Occam?s Razor. The consequence 285 is that the posterior probability will be higher for some particular ?no transform? solution than for some other solution with a given non-zero t if the following ratio is greater than one: (117) f (Y jX ;r;0) f (Y jX ;r0; t) f (r;0jA0;W) f (A0) f (W) dA0;W+ f (rjA0;red;Wred) f (A0;red) f (W;red) dA0;red;Wred f (r0;tjA0;W) f (A0) f (W) dA0;W Even for the most favorable circumstances, the Bayesian Occam?s Razor will give rise to a bias for the simpler solution (via the difference in the determinants of the hyperparameters S0 andF for A0 andW across the full and reduced cases). In the case where the maximum likelihood solution is close to t = 0, the posterior probability on t (since it is Gaussian) will be reasonably high at t = 0, and the left-hand (likelihood) ratio will be not much less than one. Thus for distributions which are non-complementary because the distributions of phonetic values in the two contexts are exactly overlapping, or close to overlapping, the learner?s bias will clearly favor a system without a phonetic rule (and this is the only possible way the distributions could show low complementarity given the equal-variance Gaussian assumption). This is why complementary distribution leads the learner to posit a phonetic rule, and non-complementary distribution leads the learner not to. 5.1.1.2 Two categories, one process There is little to be said about the relation between complementary distribution and the choice of the learner to posit two categories or one, except that there is none. What I mean is that, if there are two clusters that the learner is faced with, then the learner may take them to be allophones of the same category if they are in complementary distribution (to 286 whatever degree). However, the learner may also take them to be two separate categories, regardless. If both clusters are fairly normal and the two are in strongly complementary distribution, then this will require that each of the two separate posited categories have zero transform; nothing more. There is nothing wrong with this solution except that it is subject to countervailing simplicity bias.2 The tradeoff between the simplicity bias against this solution and the simplicity bias against the one category, one process solution?given that both achieve precisely the same likelihood, assuming the two categories have equal variance?is complex and sensitive to the prior. Thus I will say nothing more about it. However, consider this our base case: the two oracle categories are in perfect comple- mentary distribution, as before. Now consider the following two cases: (i) category A systematically lacks any ob- servations from x0 but category B is mixed (we could say that there are three categories, two x, and one x0 overlapping perfectly with one of the two x); (ii) both categories A and B are mixed (we could say there are four categories, in the obvious way). Now, clearly, the SKLD will be greater for case (i) than for case (ii). Case (ii) obviously should give rise to two different categories with a simplicity bias favoring no transform, just as in the previous section. For case (i) it is still clear that two categories are needed, but what is more subtle is the fact that positing a non-zero transform that perfectly overlaps an exist- ing category?a kind of ?quasi-structure-preservation??leads to no boost in likelihood! 2The role of BOR in the Dirichlet process mixture model is somewhat more complex than for the variable selection prior. First, there is also a non-BOR simplicity bias effected by the ?rich get richer? property. Second, the consideration of the addition of a previously unused category must be made point by point, and so the expression illustrating where the BOR comes from in a DP mixture for a data set of size N is extremely unwieldy. The idea is the same, though: probability of making use of an already-used category is scaled by the likelihood under that category, whereas for a previously unused category, the probability is scaled by the likelihood integral over all possible parameters. If the likelihood is already quite high for the existing category, then it will tend to go down when integrated. 287 Thus this too should be dispreferred by the BOR. However, now consider what happens when we tweak case (i) to look like incom- plete neutralization by overlaying the additional x0 copy of category A (call it A0) not ex- actly at the same location as category B, but close to it. Assume that it consists entirely of observations from context x0. As the separation between the two means increases, the best achievable likelihood by a two-category, one-transform model begins to exceed the maximum likelihood solution under a two-category, no-transform model, and there will reach a point, just as in the one-category case above, where the scale tips against the BOR in favor of a phonetic rule. Exactly the same reasoning holds for the case where the covariance differs between A/A0 and B, and, of course, both ways of being incomplete neutralization?different location and different category shape?will in general be com- bined. Thus we see that, at a certain point, (which will depend on the particular data set), the increase in likelihood permitted will outstrip the BOR effect induced by the choice to set g = 1. All of the foregoing analyses assume that categories nicely hew to the assumptions of the Chapter 3 linear Gaussian model, and of course, this is not exactly true. However, the point should be clear: whatever data makes a ?good? incomplete neutralization is different from what makes a ?good? two-category, no-transform solution in terms of the likelihood; and, considering only one category, complementary distribution data makes ?good? a phonetic-transform solution. Whatever the correct theory of phonetic category maps and turns out to be, there may be cases for which there is actually no gain in like- lihood, like the the quasi-structure-preservation case discussed above. For these cases, we predict that the learner will choose the simpler model if there is one. In general, however, 288 the learner?s preferences will be driven by a combination of fit to the data and bias (in- cluding simplicity). This illustration of the corner cases shows how the reasoning goes, and suggests, if allophony really is phonetic and incomplete neutralization really is in- ferred by the learner in the face of just the types of phonetic data illustrated above, that the current phonetic category conception shares at least some important properties with the correct one. Stepping back further, it gives a clear picture of why and when allophony (and incomplete neutralization) should be a late phonetic rule: under the current model, these solutions lead the learner to a better tradeoff between faithfulness to the data and markedness of grammar. 5.1.1.3 Two categories, one categorical process We have showed that, when two clusters both seem to contain observations from both of two contextual environments, the best result would be two separate categories, and zero contextual transforms. This is one case of non-distribution that would therefore lead the learner not to posit a transform-. We have also showed that cases where the two clusters are in totally complementary distribution are predicted under a single-category, single transform solution?that many cases, in particular those where the two are plau- sibly Gaussian with the same variance, would lead to single-category, single-transform systems. Thus, complementary distribution, all things being equal, leads the learner to non-structure-preserving late phonetic rules. Finally, we also showed that there were ex- otic ?quasi-structure-preserving? solutions that could be supported in case a single cluster was largely devoid of observations from some particular environment, but that these would 289 be dispreferred by the prior. We showed what the intermediate cases looked like: two cat- egories and a single non-zero transform would be preferred in a whole host of cases in between classical strict allophony and quasi-structure-preservation, all of which we might call ?incomplete neutralization.? Now we would like to say something about what the grammar would do in the face of the quasi-structure-preservation data. One option is nothing: as we said, it is just a case where one region happens to be missing data from one particular context; there is nothing special about this, and so nothing more really needs to be said. It is usually thought, however, that phonological grammars nevertheless do say something about this. Without an explicit prior for phonological grammars, it is difficult to say much more. Furthermore, as was discussed extensively in the preceding chapters, there are two classes of solutions to this problem, depending on the theory and depending on the circumstance: simply state a static restriction, to the effect that ?this category never appears in this context,? and let that be the grammatical knowledge; or implement a causal explanation, to the effect that ?this category becomes this other category in this context.? Consider these two types of statements and the predictions they make: 1. We subsequently consider sequences in which a segment is very likely some cat- egory A and its surrounding segments are very likely the impossible/neutralizing context x to ?not fit? very well (both cases) 2. We subsequently consider the neutralization target B to be one of two lexical seg- ments, A or B (second case) Clearly, in both cases, the grammar is more restrictive with respect to the set of AC- 290 sequences that will be tolerated. This is a good thing, because any licit AC-sequence will have higher likelihood (see Chapter 2 for the short discussion of restrictiveness in proba- bilistic inference). If the grammar is variable-length, this will presumably trade off against simplicity, which will disprefer the additional constraints or rules needed to guarantee this restrictiveness (again, see Chapter 2). However, one thing is clear: so long as it is possible to do so, the learner will have a motivation to ban segment A in context x. This has nothing to do with morphological pressures, which are often intuitively thought to be the principal motivation for the grammar ?explaining? the absence of a particular phone sequence. See Jarosz 2006 for models making this point in another way. I will not attempt to adjudicate between these two types of formulations; as dis- cussed above, using an Optimality-Theoretic grammar, it is possible for a surface con- straint to give rise to the same ambiguity about the underlying identity of a segment in the context x, without additional (more complex) grammatical apparatus, if other morpholog- ical pressures demand it. Similarly, under a solution which forces all instances of A in context x to be realized as B, there is no need for an explicit surface constraint banning A in that context (except in the case, irrelevant here, where other processes create new instances of A in that context). It is worth pointing out, however, that, the preference for restrictiveness notwith- standing, morphological pressures can also be relevant. In particular, the learner may have reason to collapse two similar strings as instances of the same morpheme, with a common underlying representation (see Chapter 2, Chapter 4). In this case, restrictiveness will encourage the formulation of grammatical constraints that rule out free variation in the realization of this morpheme, and restrict each of the different variants to the contexts in 291 which they occurred. The effect of this is that, for a given instance of the neutralization target B in the context x, other facts about the surrounding environment will boost the posterior probability of either (i) underlying representation B; or (ii) underlying represen- tation A and a grammar requiring, not merely tolerating, B in the context x. Information of type (ii) would boost the probability of either a quasi-structure-preservation solution or a phonological solution; we would rely on the variable selection prior more strongly dis- preferring the quasi-structure-preservation solution than the phonological grammar prior disprefers the neutralization solution, if we wanted to ensure we get the ?classical? phono- logical neutralization solution back. The architecture of grammar then becomes very im- portant: cyclic reapplication following the SCC (see Chapter 4) and process interactions (see below), along with the homoscedasticity assumption of our linear Gaussian model, or whatever is actually to be said about the phonetic interpretation of phonetic transforms, become crucial pieces of evidence pushing the learner towards one solution or the other. 5.1.2 Summary Incomplete neutralization is not a mysterious phenomenon, nor, I have argued, should it ever have been. It is simply allophony that never quite got its ducks in a row; but the phonetic grammar is forgiving. As far as the grammar is concerned, it is simply al- lophony. The only sticking point arises when we consider the fact that the ?quasi-structure- preservation? solution coexists with the phonological solution in the set of possible gram- mars. The reason that late rules are not structure-preserving, according to this theory, is entirely due to a bias for simplicity. I have called the previous section ?Empirical predic- 292 tions? not because I believe that we could test any specific predictions tomorrow; absent a particular prior for phonological grammars, what exactly the system learners will arrive at remains something of a mystery. Rather, I believe simply believe that I have laid out the corner cases well enough that, with a more complete theory, we could make predictions, and we could start to constrain such a theory by examining changes currently going on for which we have phonetic corpora. The German incomplete neutralization facts appear to be stable; unless some other process comes into the language with which it could potentially interact, there is no reason to think that learners are going to upend its incompleteness any time soon. However, if the two categories were to change in all positions to become more equal in shape, then, under the current instantiation of the theory, the learner would have much more trouble telling the incompletely neutralized allophones apart. Learners would eventually be predicted to make a saltatory leap from a phonetic to a categorical process. This is presumably an unmarked case for the phonemicization of an allophone and the introduction of a categorical process into a language. 5.2 Phonetic process interactions 5.2.1 Predictions We come to the last topic to be explored in this dissertation: interactions among phonetic transforms. Interactions, broadly construed, are simply those considerations that need to be taken when two or more mappings are thought to be potentially applicable to a particular input?and yet there is only one output. Do the mappings compose, with the output to one being the input to the next? If so, which way do they compose (or does it not 293 matter)? Is there a different kind of combination at play, whereby one input is submitted to several mappings, and then some other function, or set of functions, is used to combine the outputs? The issue turns only tangentially on whether a grammar is represented as the specification of a set of grammatical sub-mappings, or as a set of output constraints, or whatever else we might think. Any useful large collection of mappings will inevitably be constructed out of combinations of elementary mappings, and phonological grammars, broadly construed, are no exception. That is because by ?represented? we generally refer only crucially to some structure followed by the learner; it makes no sense to think about the receptive or productive computations taking place by ?computing? output constraints without ?computing? outputs!?and it is hard to imagine either of these computations, given their wide range of possible inputs, not somehow being structured into a combi- nation of elementary processes (deletion, spreading, footing, elementary transducers, or whatever else we might propose). Rules about how these elementary operations combine are always necessary, whether to the net operation of a grammar learned as individually specified grammatical processes or to the theory of Gen.3 Consider, for example, the Canadian Raising pattern discussed several times now: write, ride, writer, rider: [rjt], [rajd], [rj?r], [raj?r]. There are essentially two processes involved: a process changing the vowel quality (Raising) and a process shortening the closure duration and roughly neutralizing the voicing contrast for the following coronal 3This cuts both ways: for phonetic grammars, although we have assumed that the prior acts as if indi- vidual processes are directly specified, we could equally imagine a theory where phonetic transforms were specified indirectly via constraints. The questions about how elementary phonetic transforms combine to give rise to the net result would not change; what would potentially change is how any given grammatical mapping is to be decomposed (and that means that the implications of any particular empirical case might change). In addressing particular cases below, I will continue to assume that two general phonetic processes that can be identified independently are objects that will need to be combined using . 294 stop (Flapping). Assuming the diphthong is to be taken as a single segment, Raising is clearly non-structure-prserving, and thus obviously phonetic under the current theory, (where ?obviously? stands in for all the qualifications outlined above). The Flapping rule is phonetic for similar reasons.4 The interaction between these two processes is exactly as predicted under the strong phonetic transform hypothesis, whereby only categorical information can feed phonetic processes (?in the same EOD block, if the rules are level ordered). (118) =rad= a =rat= =rat?r= =rad?r= a Under the current hypothesis, this kind of interaction is completely prototypical. The input to the phonetic computation is of a different type than the output, thus composition is impossible, and the only choice for application is simultaneous application, followed by reconstruction of the output using another set of functions (in this case, the sequential relation is the same, but implicitly there is some kind of phonetic sequencing function that combines the outputs; and in general, for transforms applying to the same segment, 4If Flapping is a phonetic transform, we predict that the neutralization is highly likely to be incom- plete. The phonetic facts are nuanced. The preceding vowel duration, which is a near-universal correlate of voicing, is incompletely neutralized. The closure duration differences are miniscule and not consistently in the same direction across experiments, and thus there is very likely no difference at all (=d= =t= 4 ms: Charles-Luce, Dressler & Ragonese 1999; +0:9 ms: Herd, Jongman & Sereno 2010; +0:4 ms: Braver 2011; Braver also finds only a 0.7% difference in percentage of voicing into closure; neither of the latter two studies report variances, but the histogram given by Herd et al. seems to suggest close alignment of distributional shape). The qualitative difference between the two dimensions may be due to a floor effect in the closure duration (stop closure durations do not generally go below 25 ms, and the closure durations in these sets of results are close to this minimum), but this leads to a potential alternate explanation: the process is truly neutralizing, but the vowel shortening is phonetic and allophonic. This explanation is par- ticularly salient in light of the fact that the much shorter duration of [j] as opposed to [aj] is usually lumped in with the effects of Raising. However, suppose this is correct: Shortening must still be sensitive to the underlying voicing status of the flap, and, since Shortening is phonetic, under our theory, Flapping still must be phonetic (or at least must follow the relevant phonetic EOD block). 295 we also use ). This is different from the previous approaches to Canadian Raising and related phenomena. The surface-constraint approach struggles because it is different: in particular, it struggles because the environment for Raising cannot be correctly stated with respect to the outputs. The derivational approach is fine, but it works differently: the environment for (crucially) Flapping is stated over an input that is the output of Raising. This is not necessary, but it is forced under the derivational theory. Instead, the current approach states that the input for both rules is the same object, the output of the categorical phonology, which I have called the AC-representation. Chomsky & Halle 1968 argued against earlier attempts to introduce simultaneous application. One reason was that it could not handle all patterns: it undergenerated.5 Another reason was that, although most cases of simultaneous application can be given an analysis under a derivational theory, the one case of a simultaneous application pattern that is impossible for derivational theories does not seem to be attested: thus simultaneous application never really needed and it seemed to overgenerate. Putting these two things together, simultaneous application was neither necessary nor sufficient to handle all of the 5Most of what they discussed there and elsewhere in this regard falls outside the scope of phonetic rules. However, in one example, they do invoke Raising. The interaction is with Pig Latin, however, which is a language game and thus makes the case somewhat more difficult to reason about. The example goes like this: all English speakers with Raising who know Pig Latin will pronounce sight ! ightsay with [j]; but some speakers will pronounce sigh! ighsay with [aj], while others will pronounce it with [j]. Simultaneous application of Raising and Pig Latin would predict only [aj], never [j], since the voiceless [s] does not follow [aj] underlyingly. It is hard to know what type of rule Pig Latin is, but the possibility of [j] would imply that some speakers treat it as reordering the segments in a way that phonetic interpretation of the individual segments can see. Level ordering might help to explain this difference, with [aj] being amenable to a ?Late Pig Latin? solution, if Berm?dez-Otero is correct that Raising does not apply at the word level, (see Chapter 4), or if level ordering could give rise to a similar effect because of failure to resyllabify. Early Pig Latin would be an application of the same rule at a narrower HOCD. The Pig Latin rule would in any case need to be able to rearrange segments and resyllabify on the output of phonetic interpretation. The ?real? example they give, following Joos, of ?Dialect B,? in which Flapping precedes Raising to yield [tjpraj?r] has been argued never to have existed (Kaye 1990) on the simple grounds that Joos claims the speakers were schoolchildren in 1942, but Kaye could find no trace of them as adults in 1990. If these speakers did ever exist, though, for them, Flapping must in fact have been categorical Voicing, [tjprajd?r]. 296 phonological mappings. With our newly restricted focus on allophonic (/incompletely- neutralizing/quasi-structure-preserving but late) rules, I will reexamine the question of sufficiency in this question. The question of necessity is interesting as well, but I have less to say about it. The case given by Chomsky and Halle that they said would make it necessary was like this: given rules of the form A! Y=?X , B! X=?Y , in a mutual feeding relation, the pre- diction of simultaneous application is that ABY sequences will surface as AXY , and BAX sequences will surface as BYX . The two rules are both triggered by the underlying, not the derived, environments, and so applying one rule has no effect on the operation other. This pattern is impossible under either ordering if the combination is by direct composition: either the output Y of the first rule should trigger the second, or the output X of the second rule should trigger the first. However, the resultant pattern, they said, interestingly does not exist. Under the current theory, a clearer way to write these two hypothetical rules would be as A! y=?X , B! x=?Y , where y and x are two phonetic representations not interpretable as Y or X . Such circumstances are rare to begin with, and I can find no relevant examples; however, the current theory predicts that this is not a systematic gap. 5.2.2 Possible counterexamples Rubach 1984, Booij & Rubach 1987 propose that the end-of-domain rules applying at the word level, the post-cyclic rules, are distinct from those applying at the end of the derivation in toto, including across word boundaries, the post-lexical rules (see Chapter 4). We have already seen in Chapter 4 that in a model with cyclic spellout, any kind of 297 interactions between end-of-domain rules are problematic for the theory that these are all phonetic interpretation rules; but Booij and Rubach?s post-cyclic rules are extra problem- atic: not only can they interact with post-lexical allophonic rules, they can interact with each other, they can tolerate exceptions, and they can be neutralizing. What separates them from the cyclic rules is that they can apply morpheme-internally, and they do not feed any of the cyclic rules in their own domain. There problems are clearly real: the Surface Palatalization rule adding noncontrastive palatalization to consonants preceding [i], but the post-cyclic rule of Retraction taking [i] to [] can bleed it. This rule has excep- tions and also itself crucially follows the r-Spellout and Yer Deletion post-cyclic rules. A reanalysis of Polish phonology which avoids reference to the problematic post-cyclic rules would be going too far afield, since there are already questions about interactions between end-of-domain rules; we must accept for the moment that there may be indeed an end-of-domain block which is not phonetic, limiting the scope of the theory substantially. However, they do report an interaction between two post-lexical Polish voicing rules. Polish has a rule of Final Devoicing which applies word-finally to all obstruents, as in sad, ?orchard,? with a devoiced final consonant, versus the nominative plural sad+y, where the d is voiced. Polish also has a rule of Regressive Voicing Assimilation by which obstruents assimilate in voicing to following stops or voiceless fricatives, morpheme- internally and across morpheme boundaries, as in pro?+b+a, ?request? (nominalization), with ? pronounced as voiced due to the following b, and to all following obstruents across word boundaries, as in kryzys gospodarczy, ?economic crisis,? with voicing on s due to the following g (we will discuss the value of collapsing the two cases together in a moment). Now, what happens if a word-final voiced segment, which ought to be subject to Final De- 298 voicing, is also followed by a voiced segment, which ought to trigger Regressive Voicing Assimilation to voice a voiceless segment? The result is that the segment is voiced, as in sad wi?niowy, ?cherry orchard,? with the initial [v] in wi?niowy evidently responsible for the fact that d is voiced. For these two rules to be phonetic is not a problem for the current theory: the envi- ronments ?word-final? and ?following voiced obstruent? should sum using and the two transforms should thus combine using . The segment is predicted to be both devoiced and voiced, and would probably be somewhat distinct in voicing from underlying voiced stops in other postions. However, this is somewhat strange, as it requires that underlying voiced stops be subject to assimilation to a following voiced stop. This is not implausi- ble, but presumably the phonetic difference would be rather small normally. On the other hand, at least if the transcription is to be trusted, then re-voicing is evidently sufficient to push a largely voiceless segment to being perceived as voiced. It is clear that we predict that these re-voiced segments ought to (almost surely) have some different voicing sta- tus from both devoiced and underlyingly voiced segments, but we really do not want to predict that the difference between re-voiced and devoiced segments is barely detectable, which is what would be predicted under a naive view of . Now, we do not know exactly what the operator is, and so it might be possible that phonetic voicing would combine with phonetic devoicing in a way that would yield the desired effect, which is (again presumably) that re-voiced segments are more like voiced than devoiced segments. One approach would be to say that is not generally commuta- tive, justifiable if here is not commutative, perhaps because the context inside the word domain (?word-final?) must be ?added? first. Another approach would be to recall from 299 Chapter 3 that appears to do something more than just add raw acoustic measurements, (naturally), but in particular that it does something consistent with some kind of rescaling of the bounds of the phonetic space (see above also). Even that appears to be problematic, however. Recall that in Chapter 3 we showed data from Kalaallisut suggesting that Post- Coronal Fronting and Uvular Retraction, when both applied, had effects on vowels that were lawfully related to their effects when only one applied. However, in that case, we saw that the effect of either was smaller acoustically when they combined, and that led us to a particular conjecture about transforms as scaling wherein the number of transforms applying would predict the degree to which the phonetic space is scaled down. Here, on the other hand, the effect of Regressive Voicing Assimilation to combine in the appropri- ate way with devoicing, the phonetic space needs to be scaled up under the application, as versus the non-application, of the transform. Now consider that there is yet a third voicing assimilation process, Progressive Voic- ing Assimilation, which applies to sequences of two obstruents within morphemes and across morpheme boundaries, (but not across word boundaries), when the first is voiceless and the second one is a voiced fricative, as in bitw+a, ?battle? (nom. sg.), [bjitfa], ver- sus bitew+n+y, ?warlike,? [bjitevn]. Notice the complementarity of environments word- internally with respect to Regressive Voicing Assimilation: that rule has a special excep- tion when the second consonant is a voiced fricative, which is exactly the case where Progressive Voicing Assimilation applies. The way that Booij and Rubach handle this is by ordering Progressive Voicing Assimilation before Regressive Voicing Assimilation and letting the one bleed the other (if Regressive Voicing Assimilation applied first then we would get [bjidva]). In fact, the rules must be ordered like this because Progressive 300 Voicing Assimilation is a post-cyclic rule and Regressive Voicing Assimilation is a post- lexical rule. We cannot do this here if both rules are phonetic. Furthermore, if both rules apply, then we expect something more like [bjidfa]. One solution is simply to keep the en- vironment for Regressive Voicing Assimilation as we state it above, explicitly excluding the fricative case. Another approach would say that voicing assimilation is simply qualitatively differ- ent from the other phonetic rules we have been looking at so far. Following the intuition that Regressive Voicing Assimilation?s yielding of either voicing or devoicing, depending on the trigger, suggests a kind of spreading, we might posit a simple deletion of the target voicing feature and then some interpolation of voicing according to the remaining marked feature, in a manner following Cohn 1990. We could then retain the simultaneous appli- cation of Progressive and Regressive Voicing Assimilation: both voicing features delete word-internally when the second is a voiced fricative, and the result is voiceless not be- cause the first was voiceless but because voicelessness is the default; this is somewhat strange, however, since, intervocalically, voicing ought to be the natural interpolation. If it is correct, though, then we still need to say something about the case where a gradient process of Final Devoicing could also apply: does the voicing information on the word- final obstruent get removed by Assimilation, filled in phonetically and then modified in a gradient way using by Devoicing, or does Devoicing fail to apply because the feature has been removed? According to the transcription, the result is consistent with the latter, but if these processes really are gradient then the question is worthy of further investi- gation. Generally speaking, the introduction of a second type of late/phonetic operation, feature removal, demands that we say how it interacts with itself and with , in particular 301 cases or in general. In sum, the logic of the problem is as follows: Regressive Assimilation arguably, and certainly according to the standard description, counterfeeds Final Devoicing, which is the type of rule interaction predicted for phonetic processes applying to different seg- ments (see above), but which is not what is predicted when the processes apply to the same segment, as in this case. Rather, we predict something which does not have any natural correspondent in the classification of categorical rule interactions: combination with . If it is actually not true that RA genuinely negates FD, then we are faced again with the question of exactly what it means for the two to combine with in the usual way. As we have some doubts about the validity of this phonetically, at least given the coarse description of the output, we raise the possibility that RA is not a phonetic rule combining with . It would not do to say that RA is a phonological rule that makes the resulting offending non-devoiced/re-voiced segment [+voice], because this would do nothing to solve the problem. The interaction with Progressive Assimilation is arguably less impor- tant, because the only issue, if the two are both phonetic processes combining with , is that the prior distribution on complex environments for phonetic rules might be thought to disprefer the statement of the RA environment which explicitly excludes the fricative case on grounds of simplicity; still, this is a possible grammar for which the available data in favor ought to be abundant, so it seems likely that the problem would dissolve. Before making any commitments, let us consider the facts from Dutch, which are almost exactly the same (Zonneveld 1983, Grijzenhout & Kr?mer 2000). Dutch also has a rule of Final Devoicing (pad+en, ?toad (pl.),? [pd?n], versus pad, ?toad,? pronounced sim- ilarly to [pt], but seemingly a phonetic process: Ernestus, Lahey, Verhees & Baayen 2006). 302 The rule does not apply only at word boundaries, but also finally in narrower phonolog- ical domains, as in goud+achtig, ?gold-ish,? [xut.x.t?x], the outer domain of which is evidently the main stress domain, but (at least according to Grijzenhout & Kr?mer 2000) does not trigger internal resyllabification, making the internal domain evidently a stronger one than the narrowest domains of affixation, (for which Grijzenhout reports resyllabifi- cation), but weaker than the word-level domain. Dutch also has Regressive Voicing As- similation applying across the board to assimilate obstruents to the voicing on following stops, as in eet+baar, ?edible,? given as [edbar]. Furthermore, Dutch also has Progres- sive Voicing Assimilation, which applies, as in Polish, to voiced fricatives after voiceless obstruents. Unlike in Polish, both of these processes apply (as stated) across word bound- aries. According to Grijzenhout, RA does not apply in the case of the second consonant being a voiceless fricative, unlike in Polish; instead, PA is extended to this case too. This would predict, for an underlying voiced?voiceless fricative sequence, that we would get voiced?voiced; RA would predict voiceless?voiceless. Surprisingly, Grijzenhout reports voiceless?voiceless in this case (vriend+schap, ?friendship,? with the d voiceless). How- ever, since this is a sufficiently strong boundary to be an FD environment, she attributes this to FD applying first to devoice the d; then PA applies. Examples from weaker bound- aries or morpheme-internal cases could help us sort this us out, but the L(I) suffixes in Dutch are all vowel-initial, according to Booij 1977, and morpheme-internally, the fact is that on the surface there are no voicing mismatches, and even if there were underlyingly, there would be no way to tell which segment was voiced and which voiceless. I will return to this in a moment. For now, I put aside the case where the second segment is a voiceless fricative, and 303 just consider the remaining cases. The interaction between Dutch RA and FD is just like in Polish. If the sequence is underlyingly voiced?voiced, and the environment is both an RA and an FD environment, then the result is voiced?voiced?either by blocking or re- voicing?whatever the phonetic facts turn out to be: the compound leef+baar, ?liveable,? is reported as [levbar] (in spite of the quirky spelling as f, which reflects neither the voic- ing on the underlying source morpheme, lev-, (as in lev+en, ?to live?), nor the surface pronunciation). The interaction between Dutch PA and FD is now relevant; in Polish, there is no possibility for such an interaction. The relevant case is underlying voiced?voiced frica- tive. The relevant example is raad+zaam, ?advisable,? which is said to surface as [ratsam]. The implication is that FD does indeed feed PA, consistent with the analysis of the voiced? voiceless fricative case. This is a problem. The logic is much like the cases reviewed in Chapter 4: there we reviewed the problem of interpreting higher-order cyclic domains as phonetic interpretation domains, which was a problem because the categorical informa- tion sometimes needed to be preserved. We said that it could in fact be preserved in those cases, by a weakening of the theory to allow recovery of categorical information from gradient information when it at least exists; but that after all the cyclic domain did not need to be interpreted as a phonetic interpretation domain. Now we seem to be faced with a much more serious problem. There is no issue about cyclic domains to be deferred, and we are facing the case where the reason we take the categorical information to be absent is because we suppose that it was never there in the first place. After FD, the only categorical information that is recoverable should be the underlying +voice value, but now we need the voicelessness to spread to the assimilated segment. 304 Now consider again the three types of solutions laid out for the Polish problem. First, that RA is typed just like FD: the two processes combine with , although we then need to qualify exactly what the effect of is. This simply denies that RA fully negates the application of FD. Second, RA is not like FD: the two combine in some other way, and in particular, RA is a deletion process, and deletion processes are allowed to bleed phonetic transforms. A third way which would not be applicable for RA would be to give up: to say that RA is a phonetic process that combines with , but it can block the application of FD, in this case not because a phonetic processes can supposedly be sensitive to gradient information in the contextual environment, but in fact because they can apply or fail to apply on the basis of precise phonetic value output by another process of the phonetic grammar. This would not work for the same reason that making RA phonological would not work: it makes the wrong predictions. Now apply this logic here. The problem is not that PA needs to undo FD, but rather that FD seems to need to precede PA. The other difference is that there are two different targets involved. The analog of the first approach would be to say that, by using to combine PA voicing the second segment and FD on the first segment, we somehow wind up with a voiceless second segment. This does not have much sense to it. There is not even the slightest independent motivation for making FD into a qualitatively different type of process. As for lifting some limitations on process interactions, this will do here: if PA is an process, then a simple admission of gradient environments into the theory, in violation of our research strategy, would say that, in the voiced?voiced fricative case, FD first applies to give a devoiced segment, and then PA takes as input a context equal to, not the value of the voicing feature, but some gradient phonetic value; it would perhaps adjust 305 the voicing in proportion to the voicing on the preceding segment. We could get something like this make more sense if, again, PA is qualitatively different, and not an process. The way this worked for RA was that RA could bleed FD if RA removed the voicing feature on the segment that is the target of FD. If PA removes the voicing feature on the second segment, however, then that segment is crucially not the target of the transform FD. The result follows if PA removes the voicing feature, then FD applies to affect the voicing on the first segment before interpolation takes place. This is ?before? in the informational sense: the output of interpolation and FD together is the predicted output of interpolation applying to the output of FD; FD applying to the output of interpolation would give voicing on the second segment but voicelessness on the first segment. The phonetic facts may actually turn out to be subtle, but, if they are just as reported, then, to make this ?before? even more natural, we might say that it actually could not have been any other way: the second possible computation, where FD applies to only the first segment, is illicit. This would follow if the segmental boundary were no longer applicable in the interpolation case?FD treats the voicing information across the two segments ?as a unit??and so there would actually be only one way to compose the two. This, in turn, would follow under an autosegmental phonetic model where the phonetic voicing information did not need to be associated uniquely with one segment, but could be linked to two segments simultaneously, and if in fact in this case it had to be for whatever reason. This is largely consistent with what we said about RA, but there is one issue: there we said that RA was deletion of one of the two segments? voicing features, but FD was blocked because the voicing feature was gone. Here we are saying that PA is deletion of one of the two segments? voicing features, but FD is not 306 blocked because the voicing feature is not gone. Evidently, if it is the case that FD takes the segmental boundary to be inapplicable for the purposes of what gradient information it applies to, it is not the case that FD takes the underlying segmental division between the first-segment voicing feature to be inapplicable for the purposes of determining whether it applies. Further manipulation of our intuitions to make this state of affairs appear to follow from something is not necessary?we can simply leave this to be the fact of the matter in this case. However, it may be worth investigating the possibility that this is only the fact of the matter in this particular case because the environment for FD crucially refers to the boundary following the particular segment in question. There are other ways to look at it that might make it a general fact; but we are too deep in speculation at this point. To sum up: the Dutch and Polish cases both show interactions between what we would like to say are phonetic processes (in the case of FD, for empirical reasons). How- ever, they show two problematic patterns for the present theory: RA either shows a po- tentially problematic effect of in both languages, virtually undoing the effect of FD on underlying voiced segments, in spite of the fact that there is no good reason to say it applies to voiced segments in any way, much less in this way; and PA in Dutch seems to be fed by FD. These two patterns together suggest a different approach to assimilation processes, one in which, although they may be late, they are not phonetic transforms that combine with . Regressive Voicing Assimilation in Russian stands in the same relation with respect to Final Devoicing as these two cases of RA, and nothing more needs to be said, although some of the details are different.6 6First, RA in Russian spreads until it reaches a vowel; sonorants are optionally either transparent or 307 I also mentioned a few cases put forward by Booij and Rubach, and, in particular, the problematic interaction between Retraction and Surface Palatalization in Polish: de- spite the apparent lateness of that Retraction rule, the solution that would be forced upon us here would be that Retraction is a rule of the categorical phonology. Although we have said nothing about lexical exceptions in the present theory, if we did bar phonetic rules from tolerating lexical exceptions, such a solution would be further forced upon us, be- cause Retraction has lexical exceptions?but of course there is really no way to force an outright ban on effective lexical exceptions to particular rule environments as long as the phonetic grammar can perform outright deletion, because there is always the possibility that the AC-representation contains a segment which does not surface and only serves to affected. Second, [v] behaves anomalously, only triggering RA where FD has applied to it. RA and FD, however, affect it obligatorily. Most analyses treat [v] as being in some sense more sonorant-like under- lyingly, but, as discussed at length in Kiparsky 1985, the puzzle is how to trade out its non-obstruency (no obligatory FD/RA) for for obstruency (obligatory FD/RA, implementation as a fricative) in a way that avoids the unwanted effect of having it trigger RA. The solution of Hall 2007, following Avery 1996 and Rice 1993, is to posit that voicing on obstruents is simply not the same as voicing on sonorants. Russian [v] is anomalous in bearing neither the sonorant (SV) nor the obstruent (Laryngeal) voicing feature, and so it does not generally spread its Laryngeal feature. It is endowed with such a feature either late?prior to RA?in the case of FD/RA, which give it a Laryngeal feature, or very late?after RA?otherwise. The problem if all of this is simultaneous is that cashing out [v] as an obstruent for the purposes of FD will not be allowed to occur before it is cashed out as an obstruent generally. A solution similar to Kiparsky 1985 and Hayes 1984 whereby [v] is a sonorant (i.e., marked with SV), and SV-bearing segments do undergo RA and FD (here, processes that affect the Laryngeal feature only). I differ from these accounts, which treat voicelessness uniformly as the result of having a bare Laryngeal node, in, of course, asserting that it is the filling in of phonetic detail for Laryngeal that gives rise to both RA and FD. This detail may be option- ally interpreted on SV-marked segments. The interpretation of [v] as an obstruent involves removing the ?prophylactic? SV feature, which may be done simultaneously with RA and FD, and then forces the interpre- tation of the Laryngeal information. All this is more about the representation and phonetic implementation of [v]; the rule interactions are not a problem in any case. The question about Russian RA that bore on the issue of whether the structuralist phoneme should be maintained, namely, whether it should be split into two different rules depending on the phonemic status of the output, is answered here in the negative. This says nothing about whether the structuralist phoneme is in a sense maintained, by which I mean whether or not the result of voicing via RA is actually allophonically distinct from the corresponding underlyingly voiced segment, where those exist. There would be a reason to assert this if RA were due to a phonetic transform, which would predict that the result of RA would only be like the underlyingly voiced segment by an unlikely accident; but, although above I suggested an interpolation view of phonetic spreading, it is also possible that spreading does not necessarily to lead to non-structure-preserving gradient outputs, but will rather preserve the interpretation of the Laryngeal feature idiosyncratic to the given segment if it has one. Given that transforms like FD can affect gradient detail on several segments at once, there seems to be some tension here, but I defer the issue to a worked out theory of phonetic spreading and phonetic features. 308 block or trigger the application of a phonetic rule. (The analysis of synchronic Yer Dele- tion as floating segments by Rubach & Booij 1990 is a proposal that could be potentially reconstrued along these lines.) One more potential case of interaction of allophonic processes, brought to my at- tention by Andrew Nevins, is the supposed Laurentian French ATR harmony pattern de- scribed by Poliquin 2006. It is an accepted fact that LF has an allophonic Laxing alterna- tion in closed syllables (petit?petite, [ptsi]?[ptst]), except before voiced fricatives (?glise, [eliz]). Poliquin claims that LF also has a Regressive ATR Harmony rule which, depend- ing on the speaker, can give rise to various exotic patterns such as [smilitsd] and [similtsd], in addition to across the board harmony, as in [smltsd], for similitude. The exotic applica- tion facts are almost universally disputed by native speakers, (M. Gagnon, M. Brunelle, p.c.), and Poliquin?s phonetic studies are done over extremely small samples, and so fur- ther study is needed. However, if we take at least the existence of some kind of Harmony process at face value, whatever vowels it happens to be able to affect, it is a problem if the process of harmony is seen as taking the output of Laxing as input. The obvious solution is to give exactly the answer as for the RA/PA case: spreading is different. Furthermore, since Laxing applies at the source of the ATR feature, the (only) way to combine it with a sharing of the ATR feature is to apply Laxing to all vowels that share the ATR feature. This forces us into saying that there actually is a phonetic ATR feature (by which I simply mean ?dimension?) that gets modified independently of the place information. This is just as in the Dutch and Polish cases: we need to say that there is a mechanism by which the ATR information is shared across vowels. However, this ATR feature does not need to be present, but then subsequently deleted, on the preceding vowels in order to state Harmony, 309 as one might think following the analyses above (see Chapter 3 on contrastive specifica- tion to see the motivation for asking about this). What is crucial is merely the linking of the ATR feature to all the preceding vowels. Finally, although I have repeatedly deferred the issue of late, ?phonetic? deletion above (in Chapter 4), it comes up in an allophonic interaction case in Catalan (Mascar? 1976). Catalan Regressive Nasal Place Assimilation is counterbled by Cluster Simpli- fication, which deletes word-final elements of homorganic clusters, in just the way that would be expected if both are phonetic: venk, ?I sell,? is [be?]. The restriction of CS to homorganic clusters is to exclude cases like lp, but this restriction must be lifted for nasals in order for the simultaneous application to work. If deletion is like assimilation in being qualitatively different from phonetic transforms, (as suggested by the analysis of assimilation as involving delinking), then this case fills in a possibility we have not yet seen: the trigger of the NPA transform, rather than the target, (as in the RA?FD case), is removed (in the PA?FD case neither trigger nor target of FD is deleted). In that case, FD failed to apply. Here NPA does apply. The additional issue raised by this case is that NPA can apply across word boundaries and is fed by CS, as in venc vint pans, ?I sell twenty loaves of bread,? [b?bimpans]. This demands a level ordering solution wherein NPA ap- plies separately at two HOCDs (see Chapter 4), consistent with our discussion in Chapter 4 of these as being the only cases where allophonic feeding should be allowed. It also suggests that not even a trace of the adjacent intervening word-final simplified segment remains after deletion, (here the t in vint), intuitively consistent with the idea that deletion is qualitatively different from gradient transforms. 310 5.3 Statistics in linguistics To sum up the chapter: the association between true structure preservation and the phonetic component is explained by an exegesis of the learner?s preferences, with incom- plete neutralization folded into this category. No complete evaluation measure is avail- able yet for the phonological grammar, and thus for true neutralization. Quasi-structure- preservation is possible but will never emerge. There should be no feeding, (in the sense of feeding or bleeding), among phonetic processes in the same EOD block. As far as I can determine this is empirically confirmed, granting that spreading, and by extension presumably deletion, are different. There are also some late rules which are not phonetic, if Booij and Rubach are correct about the analysis of Polish. I will not pursue any other linguistic issues in this dissertation, except to make the passing remark that the spreading and transform cases could perhaps be unified in some measure if the transforms were all seen as non-contrastive features?with now bearing the burden of combining all fea- ture interpretations in a uniform way; the deletion analysis of the assimilation cases would perhaps remain somewhat mysterious. I finish by reflecting on the issues with which I began. I began this dissertation by assuring linguist readers that all of their apprehensions about statistical approaches to language were misapprehensions, and that, when statistics is assigned its proper role, there is absolutely nothing subversive about its use: no one is taking away your UG. In fact, some of the best arguments against ?general-purpose learning? come from statistics. For one, the bias?variance tradeoff refers to the fact that learning can give poor results (high prediction error) for one of two reasons: too much bias (hewing so firmly to 311 one type of solution that the pattern is missed); or too much variance (estimated model is too sensitive to small changes in the data, and thus makes incorrect predictions on new data). More complex models will in general be able to capture more nuances of the data, at the expense of the ability to generalize. Apart from militating against frameworks that are too flexible, this has an immediate consequence when related back to Chapter 2: if our grammatical theories really do have adjustable complexity, (and I have argued that in an important way they all do anyhow), then the Chomsky?s focus on simplicity in the 1960s (and indeed P??ini?s much earlier?see Kiparsky 2002) is probably more important to the learner than has been previously realized; and, if they do not have adjustable complexity, then we make such a move at our peril. Exemplar-based approaches should be seen in this light; see Geman, Bienenstock & Doursat 1992 and Hastie, Tibshirani & Friedman 2009. The no free lunch theorem (Wolpert 1996) states that there is no approach to learn- ing that will do universally well: the expected classification error depends on how close the posterior of the model is to the posterior implied by the process actually generating the data. This sounds obvious, but the consequence is that if one averages over all learn- ing problems, all algorithms perform exactly the same. The one glimmer of hope for the grammar learning problem is that the original theorem is stated only for supervised learning (access to the ?correct? answers during learning); no one has, to my knowledge, correctly reformulated the theorems for the unsupervised case, but it does not seem to me that there is anything in the theorem which could not be reformulated when extended to unsupervised learning. We should not come away with the message that the field of statis- tics and machine learning exists simply to prove that learning is hard: the point is simply that there is well-developed formal theory already on the market with the power to answer 312 the kinds of big-picture questions linguists care about. Some things which appear easy turn out to be difficult when studied carefully, and some things which appear difficult turn out to be easy. It should also, I hope, be clear by now that the proper role of statistics in linguistic theory has nothing directly to say about the other types of ?numerism? sometimes per- ceived by linguists to be pernicious, namely, gradience in grammar and frequency effects. This is not to say that these things are not relevant, but they are not about inference per se. Gradience in grammar refers to various things. One thing it sometimes refers to is gra- dient judgments; however, as thorough treatments of the issue tend to point out, (see, for example, Keller 2000), gradient grammatical judgments?and, more to the point, gradi- ent grammaticality?have been in the literature and have been taken to be important from the very beginnings of generative grammar (Chomsky 1975, Chomsky 1957, Chomsky & Halle 1968). Gradient judgments do not necessarily have anything to do with statistical theory?and they certainly do not need to be tied in any close way to frequency. They are important to the extent that judgments reflect ?degree of grammaticality? and to the extent that ?degree of grammaticality? is really the same evaluation as ?goodness of fit,? (or the foundation for it), which is an important tenet of Bayesian reductionism; they need not necessarily be the same thing, however. One thing that has not been extensively consid- ered since the days of the evaluation metrics, and which has been given new life here, is the issue of gradient biases?that is, a gradient UG. This is the default state of affairs for the Bayesian learner. Gradience in grammar also sometimes refers to fine phonetic detail being available where it is not normally thought to go, namely, in lexical storage. Some studies seem to 313 show this (Andruski, Blumstein & Burton 1994, McMurray, Tanenhaus & Aslin 2002). However, adding phonetic detail, contrary to the intuition, does not make anything about the way language works ?simpler? or ?more transparent? for the learner. This merely increases the complexity of the learning problem, and raises the possibility that learners would acquire phonetic details and then fail to be able to recognize small variations in pronunciation (overfitting?see above). Nor does the existence of phonetic detail say anything about whether categorical information is stored also. If one general lesson can be taken away from the analysis of learning in this dissertation, it is that the scope of the learning problem is quite vast no matter what, and, to make it manageable, a hierarchical Bayesian learner benefits from adjustable complexity to allow additional information to be learned only when necessary. Categorical phonological patterns still do exist, although, as I pointed out in the introduction, far too few phonetic studies have explored the empirical facts of categoricity given the availability of more and more large corpora in recent years. Finally, frequency effects are not the same as the capacity to reason statistically, and to the extent that reasoning statistically implies some kind of tracking of frequencies, this does not imply that these frequencies will show persistent effects in processing. One can easily imagine how online processing would benefit from frequency-sensitivity (better prediction, for example), but this does not necessarily lead us to any deep conclusions. In particular, arguments that frequency effects imply ?whole-item storage? tend to jump the gun somewhat. Frequency effects imply that frequency of something is, in some sense, tracked, but the link to lexical storage is indirect, and the link to lexical storage of entire items is tenuous. Careful thought along these lines has already helped to disentangle the various sources of frequency effects (Pylkk?nen, Stringfellow & Marantz 2002; Embick 314 & Marantz 2005). I would add only that, like coarse storage of phonetic categories and smaller numbers of categories, decomposed morphological representations will generally reap the general benefits of there being less to learn, and fewer ways to fail. This benefit due to reduction in the complexity of inference is not just an intuition; it can be stated mathematically as the Bayesian Occam?s Razor. In short: careful re-examination of any issue in linguistics through the lens of sta- tistical inference, I predict, reveal effects which are crucially due to learning, and which crucially need to be tied to both gradient biases and gradient goodness-of-fit-evaluation. The degree of precision with which we can reason about such issues with statistical theory in hand makes it indispensable. 5.4 Main findings I summarize again the main points and (logical) findings of this dissertation: 1. Statistics in linguistics (a) Brief argument: Simplicity-based evaluation of grammars is implicit in the standard practice of linguistics, and we should pay close attention to consid- erations of simplicity for this reason (b) Main argument: Bayesian inference can derive a preference for simpler gram- mars as long as grammars have some kind of structure to them; the linking principle is a Minimalist principle for prior construction called the Optimal Measure Principle 315 2. Modelling phonetic category learning (a) Main argument: A learning model that treats allophones as phonetic grammar outperforms a learning model that attempts to learn canonical surface repre- sentations on a simple problem using real phonetic data (b) Excursus: Statistical learning models for phonetic category learning can and should be fruitfully extended to handle learning with phonetic features 3. The phonetics?phonology interface (a) Main idea: Allophony is due to context-dependent phonetic transforms which take categorical phonological representations as their sole input i. Corollary: Canonical surface representations do not exist, forcing us to reanalyze certain patterns ii. Corollary: Allophonic processes should not feed each other A. Argument: Certain processes involving late phonetic spreading of a feature must be qualitatively different from other phonetic rules B. Conjecture: These rules involve delinking of phonetic features, and this is a qualitatively different process from phonetic transform appli- cation (b) Argument: Complementary distribution will encourage the learner to assign a pattern to the phonetic component of grammar i. Corollary: Allophony is late, confirming the empirical generalizations going back to the 1970s 316 ii. Corollary: Incomplete neutralization is just allophony (c) Argument: Allophony is unrecoverable i. Corollary: Allophony must be late, explaining the empirical generaliza- tion ii. Corollary: The cyclic spellout interpretation of phases or other types of cyclic level ordering is compatible with an interpretation as phonetic in- terpretation, so long as non-allophonic information remains recoverable 317 Bibliography Abramson, Arthur S. & Leigh Lisker. 1970. Discriminability Along the Voicing Contin- uum: Cross-Language Tests. Proceedings of Sixth International Conference of Pho- netic Sciences. 569?573. Adriaans, Frans & Ren? Kager. 2010. Adding generalization to statistical learning: The induction of phonotactics from continuous speech. Journal of Memory and Language 62. 311?331. http : / / www . sciencedirect . com / science / article / pii / S0749596X09001120. Albright, Adam & Bruce Hayes. 1999. An Automated Learner for Phonology and Mor- phology. Andruski, Jean, Sheila E Blumstein & Martha Burton. 1994. The effect of subphonetic differences on lexical access. Cognition 52(3). 163?187. Antoniak, Charles E. 1974. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics 2(6). 1152?1174. http://www. jstor.org/stable/2958336. Archangeli, Diana. 1984. Underspecification in Yawelmani Phonology and Morphology. MIT PhD Dissertation. Aronoff, Mark. 1974. Word-Structure. MIT PhD Dissertation. 318 Avery, J. Peter. 1996. The Representation of Voicing Contrasts. University of Toronto PhD Dissertation. Avery, Peter & Keren Rice. 1989. Segment Structure and Coronal Underspecification. Phonology 6(2). 179?200. http : / / journals . cambridge . org / production / action/cjoGetFulltext?fulltextid=2395536. Bach, Emmon & R.T. Harms. 1972. How do languages get crazy rules? In Robert Stock- well& Ronald Macaulay (eds.), Linguistic change and generative theory, 1?21. Bloom- ington, IN: Indiana University Press. Bar-Hillel, Yehoshua, Chaim Gaifman & Eli Shamir. 1963. On categorial and phrase- structure grammars. Bulletin of the Research Council of Israel F(9). 1?16. Berko, Jean. 1958. The child?s learning of English morphology.Word 14. 150?177. http: //books.google.com/books?hl=en&lr=&id=a1qJZpDU9YUC& oi=fnd&pg=PA253&dq=The+child%27s+learning+of+English+ morphology&ots=NCdkgaP9f4&sig=21ExUTBP7V2KS53n8czpcdENoVs. Berm?dez-Otero, Ricardo. 2013. The stem-level syndrome. Berm?dez-Otero, Ricardo & April McMahon. 2006. English phonology and morphology. In Bas Aarts & April McMahon (eds.), The handbook of english linguistics, 382?410. Oxford, UK: Blackwell. Berwick, Robert C, Paul Pietroski, Beracah Yankama & Noam Chomsky. 2011. Poverty of the stimulus revisited. Cognitive Science 35(7). 1207?42. http://www.ncbi.nlm. nih.gov/pubmed/21824178. Berwick, Robert C. & Amy Weinberg. 1984. The Grammatical Basis of Linguistic Per- formance. Cambridge, MA: MIT Press. 319 Blanchard, Daniel & Jeffrey Heinz. 2008. Improving Word Segmentation by Simultane- ously Learning Phonotactics. CoNLL 2008: Proceedings of the 12th Conference on Computational Natural Language Learning (August). 65?72. Bliese, Loren F. 1981. A Generative Grammar of Afar. Arlington, TX: Summer Institute of Linguistics. Bobaljik, Jonathan & Dianne Jonas. 1996. Subject positions and the roles of TP. Linguistic Inquiry 27. 195?236. de Boer, Bart & Patricia K. Kuhl. 2003. Investigating the role of infant-directed speech with a computer model. Acoustics Research Letters Online 4(4). 129?134. http:// link.aip.org/link/ARLOFJ/v4/i4/p129/s1&Agg=doi. Boersma, Paul. 2001. Empirical tests of the gradual learning algorithm. Linguistic in- quiry 32. 45?76. http://www.mitpressjournals.org/doi/abs/10.1162/ 002438901554586. Boersma, Paul & Jo Pater. 2007. Constructing constraints from language data: The case of Canadian English diphthongs. Booij, Geert. 1977. Dutch Morphology: A Study of Word Formation in Generative Gram- mar. Dordrecht: Foris Publications. Booij, G & J Rubach. 1987. Postcyclic versus postlexical rules in Lexical Phonology. Linguistic Inquiry 18. 11?44. http://www.jstor.org/stable/10.2307/4178523. Boomershine, Amanda, Kathleen Currie Hall, Elizabeth Hume & Keith Johnson. 2008. The influence of allophony vs contrast on perception: The case of Spanish and En- glish. In Peter Avery, B. Elan Dresher & Keren Rice (eds.), Phonological contrast. The Hague: Mouton. 320 Brame, Michael K. 1974. The Cycle in Phonology: Stress in Palestinian, Maltese, and Spanish. Linguistic Inquiry 5(1). 39?60. Braver, Aaron. 2011. Incomplete neutralization in American English flapping: A produc- tion study. University of Pennsylvania Working Papers in Linguistics 17(1). 1?11. Bresnan, Joan W. 1972. Stress and syntax: a reply. Language 48(2). 326?342. http:// www.jstor.org/stable/10.2307/412138. Browman, Catherine P & Louis Goldstein. 1993. Dynamics and Articulatory Phonology. Haskins Laboratories Status Report on Speech Research SR-113. 51?62. Brown, P. J., M. Vannucci & T. Fearn. 1998. Multivariate Bayesian variable selection and prediction. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 60(3). 627?641. http://doi.wiley.com/10.1111/1467-9868.00144. Bush, Christopher A. & Steven N. MacEachern. 1996. A semiparametric Bayesian model for randomised block designs. Biometrika 83(2). 275?285. http://biomet.oupjournals. org/cgi/doi/10.1093/biomet/83.2.275. Charles-Luce, Jan, Kelly Dressler & Elvira Ragonese. 1999. Effects of semantic pre- dictability on children?s preservation of a phonemic voice contrast. Journal of Child Language 26. 505?530. Chomsky, Noam. 1957. Syntactic Structures. The Hague: Mouton. Chomsky, Noam. 1964. Current Issues in Linguistic Theory. In Jerry Fodor & Jerrold Katz (eds.), The structure of language: readings in the philosophy of language. Englewood Cliffs, NJ: Prentice Hall. http://en.scientificcommons.org/48023268. Chomsky, Noam. 1965. Aspects of the Theory of Syntax. Cambridge, MA: MIT Press. Chomsky, Noam. 1975. Reflections on Language. New York: Pantheon. 321 Chomsky, Noam. 1981. Lectures on Government and Binding. Dordrecht: Foris Publica- tions. Chomsky, Noam. 1986. Knowledge of Language. New York: Praeger. Chomsky, Noam. 1995. The Minimalist Program. Cambridge, MA: MIT Press. Chomsky, Noam. 2001. Derivation by phase. In Ken hale: a life in language. Cambridge, MA: MIT Press. Chomsky, Noam & Morris Halle. 1965. Some controversial questions in phonological theory. Journal of Linguistics 1. 97?214. Chomsky, Noam & Morris Halle. 1968. The Sound Pattern of English. New York, NY: Harper & Row. Cinque, Guglielmo. 1993. A null theory of phrase and compound stress. Linguistic inquiry 24. 239?297. http://www.jstor.org/stable/10.2307/4178812. Clark, R & Ian Roberts. 1993. A computational model of language learnability and lan- guage change. Linguistic Inquiry 24(2). 299?345. http://www.jstor.org/stable/ 10.2307/4178813. Cohn, Abigail. 1993. Nasalization in English: Phonology or Phonetics. Phonology 10(1). 43?82. Cohn, Abigail C. 1990. Phonetic and phonological rules of nasalization. UCLA PhD the- sis. Collet, Pierre, Antonio Galves & Arturo Lopes. 1995. Maximum Likelihood and Mini- mum Entropy Identification of Grammars. CoRR cmp-lg/950. Compton, Richard & Christine Pittman. 2010. Word-Formation by phase in Inuit. Lingua 120(9). 2095?2318. 322 Cornell, Sonia a, Aditi Lahiri & Carsten Eulitz. 2011. ?What you encode is not necessarily what you store?: evidence for sparse feature representations from mismatch negativity. Brain Research 1394. 79?89. http://www.ncbi.nlm.nih.gov/pubmed/21549357. Cox, Richard T. 1946. Probability, frequency and reasonable expectation. American Jour- nal of Physics 14(1). 1?13. http://www.cco.caltech.edu/$\sim$jimbeck/ summerlectures/references/ProbabilityFrequencyReasonableExpectation. pdf. Crain, Stephen & Mineharu Nakayama. 1987. Structure dependence in grammar forma- tion. Language 63. 522?543. Dawid, A.P. 1981. Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika 68(1). 265. http://biomet.oxfordjournals. org/content/68/1/265.short. Dell, Gary. 1986. A spreading activation theory of retrieval in language production. Psy- chological Review 93. 283?321. Denis, Derek & Mark Pollard. 2008. An Acoustic Analysis of The Vowel Space of Inuk- titut. Inuktitut Linguistics Workshop. Dillon, Brian, Ewan Dunbar &William Idsardi. 2013. A single-stage approach to learning phonological categories: insights from inuktitut. Cognitive Science 37(2). 344?77. Dorais, Louis-Jacques. 1986. Inuktitut surface phonology: A trans-dialectal survey. Inter- national Journal of American Linguistics 52(1). 20?53. http://www.jstor.org/ stable/1265501. 323 Dowe, D. L., S. Gardner & G. Oppy. 2007. Bayes not Bust! Why Simplicity is no Prob- lem for Bayesians. The British Journal for the Philosophy of Science 58(4). 709?754. http://bjps.oxfordjournals.org/cgi/doi/10.1093/bjps/axm033. Dresher, BE. 2009a. Stress assignment in Tiberian Hebrew. In Eric Raimy & Charles Cairns (eds.), Architecture and representations in phonological theory. Cambridge, MA: MIT Press. http://books.google.com/books?hl=en&lr=&id=BFofF9bxA1sC& oi=fnd&pg=PA213&dq=Stress+Assignment+in+Tiberian+Hebrew&ots= HjYCP22y8Z&sig=kMxLw2sQeawMz0aMtAm3JkCnX8A. Dresher, B. Elan. 2009b. The Contrastive Hierarchy in Phonology. Cambridge, UK: Cam- bridge Univ Pr. Dresher, B. Elan & Jonathan Kaye. 1990. A computational learning model for metrical phonology. Cognition 34(2). 137?195. Dunbar, Ewan. 2008. The Acquisition of Morphophonology Under a Derivational Theory: A Basic Framework and Simulation Results (MA Thesis). University of Toronto. Dunbar, Ewan & William Idsardi. The Acquisition of Phonological Inventories. In Jeff Lidz, William Snyder & Joe Pater (eds.), Oxford handbook of developmental linguis- tics. Oxford: Oxford University Press. Dyck, Carrie. 1995. Constraining the phonology-phonetics interface:With exemplification from Spanish and Italian dialects. University of Toronto PhD Thesis. Elsner, Micha, Sharon Goldwater& Jacob Eisenstein. 2012. Bootstrapping a Unified Model of Lexical and Phonetic Acquisition. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (July). 184?193. 324 Embick, David & Alec Marantz. 2005. Cognitive neuroscience and the English past tense: comments on the paper by Ullman et al. Brain and language 93(2). http://www. ncbi.nlm.nih.gov/pubmed/15781308. Ernestus, Mirjam & R. Harald Baayen. 2006. The functionality of incomplete neutraliza- tion in Dutch: The case of past-tense formation. In Laboratory phonology 8, 27?49. Berlin: Mouton De Gruyter. http://books.google.com/books?hl=en&lr=&id= e86YANpgyisC&oi=fnd&pg=PA27&dq=The+functionality+of+incomplete+ neutralization+in+Dutch:+The+case+of+past-tense+formation&ots= QQtuGE5cpC&sig=9mPEstqAPl52okmMY4FupwquOMM. Ernestus, Mirjam, Mybeth Lahey, Femke Verhees & R Harald Baayen. 2006. Lexical frequency and voice assimilation. The Journal of the Acoustical Society of America 120(2). 1040?51. http://www.ncbi.nlm.nih.gov/pubmed/16938990. Evans, Bronwen G. & Paul Iverson. 2004. Vowel normalization for accent: An investi- gation of best exemplar locations in northern and southern British English sentences. The Journal of the Acoustical Society of America 115(1). 352. http://link.aip. org/link/JASMAN/v115/i1/p352/s1&Agg=doi. Feldman, Naomi, Thomas Griffiths & James Morgan. 2009. Learning phonetic categories by learning a lexicon. Proceedings of the 31st Annual Conference of the Cognitive Science Society. 2208?2213. Ferguson, Thomas S. 1973. A Bayesian analysis of some nonparametric problems. The Annals of Statistics 1(2). 209?230. http://www.jstor.org/stable/2958008. Flemming, Edward. 1995. Auditory representations in phonology. UCLA PhD Disserta- tion. 325 Foraker, Stephani, Terry Regier, Naveen Khetarpal, Amy Perfors & Joshua Tenenbaum. 2009. Indirect evidence and the poverty of the stimulus: the case of anaphoric one. Cognitive Science 33(2). 287?300. http://www.ncbi.nlm.nih.gov/pubmed/ 21585472. Forster, Malcolm & Elliott Sober. 1994. How to tell when simpler, more unified, or less ad hoc theories will provide more accurate predictions. The British Journal for the Philosophy of Science 45(1). 1?35. http://bjps.oxfordjournals.org/cgi/ doi/10.1093/bjps/45.1.1http://bjps.oxfordjournals.org/content/45/ 1/1.short. Foulkes, Paul, James M Scobbie & DominicWatt. 2010. Sociophonetics. The Handook of Phonetic Sciences, Second Edition. 703?754. Fourakis, Marios & Gregory Iverson. 1984. On the ?incomplete neutralization? of German final obstruents. Phonetica 41. 140?149. Fowler, Carol A. 1986. An event approach to the study of speech perception from a direct- realist perspective. Journal of Phonetics 14. 3?28. van Fraassen, Bas. 1989. Laws and Symmetry. Oxford: Clarendon Press. Frisch, Stefan a. & RichardWright. 2002. The phonetics of phonological speech errors: An acoustic analysis of slips of the tongue. Journal of Phonetics 30(2). 139?162. http: //linkinghub.elsevier.com/retrieve/pii/S0095447002901762. Fromkin, Victoria. 1973. Speech Errors as Linguistic Evidence. The Hague: Mouton. Gagliardi, Annie. 2012. Input and Intake in Language Acquisition. University of Mary- land, College Park PhD Dissertation. 326 Gagliardi, Annie, Erin Bennett, Jeffrey Lidz & Naomi H Feldman. 2012. Children?s Infer- ences in Generalizing Novel Nouns and Adjectives. Proceedings of the 34th Annual Conference of the Cognitive Science Society. Gagliardi, Annie & Jeffrey Lidz. In press. Statistical Insensitivity in the Acquisition of Tsez Noun Classes. Language. Geman, S, E Bienenstock & R Doursat. 1992. Neural networks and the bias variance dilemma. Neural Computation 4(1). 1?58. http://www.mitpressjournals.org/ doi/abs/10.1162/neco.1992.4.1.1. Geman, Stuart & Donald Geman. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6(6). 721?741. Goldsmith, John. 1976. An overview of autosegmental phonology. Linguistic Analysis 2(1). 23?68. Goodman, N. 1955. Fact, Fiction and Forecast. Cambridge, MA: Harvard Univ Press. Goro, Takuya. 2007. Language-Specific Constraints on Scope Interpretation in First Lan- guage Acquisition. University of Maryland, College Park PhD Dissertation. Gouskova, Maria. 2003. Deriving Economy: Syncope in Optimality Theory. University of Massachusetts, Amherst PhD. Griffiths, Thomas L & Zoubin Ghahramani. 2006. Infinite Latent Feature Models and the Indian Buffet Process. Advances in Neural Information Processing Systems 18. Grijzenhout, Janet & Martin Kr?mer. 2000. Final devoicing and voicing assimilation in Dutch derivation and cliticization. In Barbara Stiebels & Dieter Wunderlich (eds.), Studia grammatica 45: lexicon in focus, 55?82. Berlin: Akademie Verlag. 327 Gulian, Margarita, Paola Escudero & Paul Boersma. 2007. Supervision Hampers Distri- butional Learning of Vowel Contrasts. ICPhS (August). 1893?1896. Hale, Mark & Charles Reiss. 2008. The Phonological Enterprise. Oxford: Oxford Uni- versity Press. Hall, Daniel Currie. 2007. The Role and Representation of Contrast in Phonological The- ory. University of Toronto PhD. Halle, M & Alec Marantz. 1993. Distributed Morphology and the pieces of inflection. In Ken Hale & Samuel Jay Keyser (eds.), The view from building 20: essays in linguis- tics in honor of sylvain bromberger. Cambridge, MA: MIT Press. http://scholar. google . com / scholar ? hl = en & btnG = Search & q = intitle : Distributed + Morphology+and+the+pieces+of+inflection#0. Halle, Morris. 1959. The Sound Pattern of Russian. The Hague: Mouton. Halle, Morris & Michael Kenstowicz. 1991. The Free Element Condition and Cyclic ver- sus Noncyclic Stress. Linguistic Inquiry 22(3). 457?501. Halle, Morris & K. P. Mohanan. 1985. Segmental phonology of Modern English. Linguis- tic inquiry 16(1). 57?116. http://www.jstor.org/stable/10.2307/4178420. Hall, Kathleen Currie&E Allyn Smith. 2006. Finding vowels without phonology? Montreal- Ottawa-Toronto Phonology Workshop (February). Hamann, Silke. 2003. The Phonetics and Phonology of Retroflexes. Utrecht: LOT Press. Hastie, Trevor, Rob Tibshirani & Jerome Friedman. 2009. The Elements of Statistical Learning. New York: Springer. 328 Hayes, Bruce. 1984. The phonetics and phonology of Russian voicing assimilation. In Mark Aronoff & Richard Oehrle (eds.), Language sound structure, 318?328. Cam- bridge, MA: MIT Press. Hayes, Bruce & Colin Wilson. 2008. A maximum entropy model of phonotactics and phonotactic learning. Linguistic Inquiry 39(3). 379?440. Heinz, Jeffrey. 2013. Computational theories of learning and developmental psycholin- guistics. In Jeffrey Lidz, William Snyder & Joe Pater (eds.), The cambridge handbook of developmental linguistics. Cambridge: Cambridge University Press. Heinz, Jeffrey&William Idsardi. 2011. Sentence and word complexity. Science 333(6040). 295?297. http://www.ncbi.nlm.nih.gov/pubmed/21764736. Henderson, Leah, Noah D Goodman, Joshua B Tenenbaum & James F Woodward. 2010. The structure and dynamics of scientific theories: A hierarchical Bayesian perspective. 77(2). 172?200. Herd, Wendy, Allard Jongman & Joan Sereno. 2010. An acoustic and perceptual analysis of /t/ and /d/ flaps in American English. Journal of Phonetics 38(4). 504?516. http: //linkinghub.elsevier.com/retrieve/pii/S0095447010000458. Hillenbrand, J., L.A. Getty, M.J. Clark & K. Wheeler. 1995. Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America 97(5). 3099? 3111. http://winnie.kuis.kyoto-u.ac.jp/members/okuno/Lecture/05/ Hearing/Hillenbrand-JASA-97-5-3099-3111.pdf. Hjort, NL. 1990. Nonparametric Bayes estimators based on beta processes in models for life history data. The Annals of Statistics 18. 1259?1294. http://www.jstor.org/ stable/10.2307/2242052. 329 Hoijer, H. 1949. Tonkawa: An Indian language of Texas. In C Osgood (ed.), Linguistic structures of native america. New York: Viking Fund Publications in Anthropology 6. Hooper, Joan. 1976. An Introduction to Natural Generative Phonology. New York: Aca- demic Press. Idsardi, William. 2006. Canadian Raising, Opacity, and Rephonemicization. The Cana- dian Journal of Linguistics / La revue canadienne de linguistique 51(2). 119?126. http://muse.jhu.edu/content/crossref/journals/canadian_journal_of_ linguistics/v051/51.2idsardi.pdf. Jackendoff, Ray. 1977. X-Bar Syntax: A Study of Phrase Structure. Cambridge, MA: MIT Press. Jakobson, Roman. 1929. Remarques sur l??volution phonologique du russe compar?e ? celle des autres langues slaves. In Selected works. Jakobson, Roman. 1941. Kindersprache, Aphasie und allgemeine Lautgesetze. Uppsala: Almqvist & Wiksells. Jakobson, Roman, Gunnar Fant & Morris Halle. 1952. Preliminaries to speech analysis: The distinctive features. Cambridge, MA: MIT Press. Jakobson, Roman& Morris Halle. 1956. Fundamentals of Language. The Hague: Mouton. Jarosz, Gaja. 2006. Rich Lexicons and Restrictive Grammars: Maximum Likelihood Learn- ing in Optimality Theory. Johns Hopkins University PhD thesis. Jarosz, Gaja. 2011. The Roles of Phonotactics and Frequency in the Learning of Alterna- tions. BUCLD 35 Proceedings. Jaynes, E. T. 2003. Probability Theory: The Logic Of Science. Cambridge: Cambridge University Press. 330 Jessen, Michael. 1998. Phonetics and phonology of tense and lax obstruents in German. Amsterdam: John Benjamins. Johnson, C. Douglas. 1972. Formal aspects of phonological description. The Hague: Mou- ton. Jones, Matt& BradleyC Love. 2011. Bayesian Fundamentalism or Enlightenment? On the explanatory status and theoretical contributions of Bayesian models of cognition. The Behavioral and brain sciences 34(4). http://www.ncbi.nlm.nih.gov/pubmed/ 21864419. Kaplan, Ronald M & Martin Kay. 1994. Regular Models of Phonological Rule Systems. Computational Linguistics. Kaye, Jonathan. 1990. What ever happened to Dialect B? In Joan Mascar? & Marina Nespor (eds.), Grammar in progress: glow essays for henk van riemsdijk, 259?263. Dordrecht: Foris Publications. Kazanina, Nina, Colin Phillips & William Idsardi. 2006. The influence of meaning on the perception of speech sounds. The Journal of the Acoustical Society of America 103(30). 11381?11386. http://www.ncbi.nlm.nih.gov/pubmed/16849423. Kean, Mary-Louise. 1975. The theory of markedness in generative grammar. MIT PhD thesis. http://en.scientificcommons.org/30563611. Keating, Patricia A. 1988. Underspecification in phonetics. Phonology 3(2). 275?292. Keller, Frank. 2000. Gradience in Grammar: Experimental and Computational Aspects of Degrees of Grammaticality. University of Ediburgh PhD Dissertation. 331 Kemp, Alan. 1994. Phonetic transcription: History. In R. E. Asher & E. J. A. Henderson (eds.), The encyclopedia of language and linguistics: volume 6, 3040?3051. Oxford, UK: Pergamon Press. Keyser, Samuel Jay & Kenneth N. Stevens. 2006. Enhancement and Overlap in the Speech Chain. Language 82(1). 33?63. http://muse.jhu.edu/content/crossref/ journals/language/v082/82.1keyser.pdf. Kim, Hyunsoon & Allard Jongman. 1996. Acoustic and perceptual evidence for complete neutralization of manner of articulation in Korean. Journal of Phonetics 24. 295?312. Kiparsky, Paul. 1971. Historical linguistics. In W.O. Dingwall (ed.), A survey of linguistic science. College Park, MD: University of Maryland Linguistics Program. Kiparsky, Paul. 1982. From Cyclic Phonology to Lexical Phonology. In Harry van der Hulst & Norval Smith (eds.), The structure of phonological representations, 131?175. Dordrecht: Foris Publications. Kiparsky, Paul. 1985. Some consequences of Lexical Phonology. Phonology 2. 85?138. Kiparsky, Paul. 1991. Economy and the Construction of the ?ivas?tras. In M. M. Desh- pande & S. Bhate (eds.), Paninian studies. Ann Arbor, MI. Kiparsky, Paul. 2000. Opacity and Cyclicity. The Linguistic Review 17. 351?367. Kiparsky, Paul. 2002. On the Architecture of P??ini?s Grammar. Kisseberth, Charles. 1970. Vowel elision in Tonkawa and derivational constraints. In J Sadock & A Vanek (eds.), Studies presented to robert b. lees by his students. Cham- paign, IL: Linguistic Research. Knaus, Johannes, Richard Wiese & Ulrike Domahs. 2011. Secondary stress is distributed rhythmically within words: an EEG study on German. ICPhS (August). 1114?1117. 332 Kuhl, Patricia K. 1991. Human adults and human infants show a ?perceptual magnet? effect for the prototypes of speech categories, monkeys do not. Perception and Psy- chophysics 50(2). 93?107. Ladefoged, Peter. 2005. Features and parameters for different purposes. UCLA Working Papers in Phonetics 104. 1?13. http://www.linguistics.ucla.edu/faciliti/ workpapph/104/1-PL_lsa_january_2005.pdf. Lahiri, Aditi & B. Elan Dresher. 1984. Diachronic and synchronic implications of declen- sion shifts. The Linguistic Review 3. 141?163. Lahiri, Aditi & Henning Reetz. 2002. Underspecified Recognition. Labphon 7. Lahiri, Aditi & Henning Reetz. 2010. Distinctive features: Phonological underspecifica- tion in representation and processing. Journal of Phonetics 38(1). 44?59. http:// linkinghub.elsevier.com/retrieve/pii/S0095447010000033. Lamb, Sydney. 1964. On alternation, transformation, realization, and stratification. Mono- graph Series on Languages and Linguistics 17. 105?22. Lasnik, Howard & Juan Uriagereka. 2002. On the poverty of the challenge. The Lin- guistic Review 19. 147?150. http://www.degruyter.com/dg/viewarticle. fullcontentlink:pdfeventlink/contentUri?format=INT&t:ac=j$002ftlir. 2002.19.issue-1-2$002ftlir.19.1-2.147$002ftlir.19.1-2.147.xml. Leben, William. 1973. Suprasegmental Phonology. MIT PhD Dissertation. Legate, Julie & Charles Yang. 2002. Empirical re-assessment of stimulus poverty. The Linguistic Review 19. 151?162. 333 Liberman, Alvin M. & Ignatius G. Mattingly. 1985. The motor theory of speech percep- tion revised. Cognition 21(1). 1?36. http://www.ncbi.nlm.nih.gov/pubmed/ 4075760. Liberman, Mark & Janet Pierrehumbert. 1984. Intonational invariance under changes in pitch range and length. In Mark Aronoff & Richard Oehrle (eds.), Language sound structure, 157?233. Cambridge, MA: MIT Press. http://scholar.google.com/ scholar?hl=en&btnG=Search&q=intitle:Intonational+invariance+under+ changes+in+pitch+range+and+length#0. Liberman, Mark & A. Prince. 1977. On Stress and Linguistic Rhythm. Linguistic inquiry 8(2). 249?336. http://www.jstor.org/stable/10.2307/4177987. MacEachern, Steven N. 1994. Estimating normal means with a conjugate style Dirichlet process prior. Communications in Statistics: Simulation and Computation 23(3). 727? 741. MacEachern, Steven N. 1999. Dependent Dirichlet processes. http://stat.columbia. edu/$\sim$porbanz/talks/MacEachern2000.pdf. MacKay, David. 2003. Information Theory, Inference, and Learning Algorithms. Cam- bridge: Cambridge University Press. MacWhinney, Brian. 2000. The CHILDES Project: Tools for Analyzing Talk. Mahwah, NJ: Lawrence Erlbaum Associates. Marr, David. 1982. Vision: A computational investigation into the human representation and processing of visual information. New York, NY: Henry Holt & Co. Martin, Joel, Howard Johnson, Benoit Farley & Anna Maclachlan. 2003. Aligning and us- ing an English-Inuktitut parallel corpus. Proceedings of the HLT-NAACL 2003 Work- 334 shop on Building and Using Parallel Texts: Data Driven Machine Translation and Be- yond (June). 115?118. http://portal.acm.org/citation.cfm?doid=1118905. 1118925. Mascar?, Joan. 1976. Catalan Phonology and the Phonological Cycle. MIT PhD thesis. Maye, Jessica, Janet F.Werker&LouAnn Gerken. 2002. Infant sensitivity to distributional information can affect phonetic discrimination. Cognition 82(3). B101?B111. http: //www.ncbi.nlm.nih.gov/pubmed/11747867. McCarthy, John. 1986. OCP effects: Gemination and antigemination. Linguistic Inquiry 17. 207?263. McCarthy, John J. 1999. Sympathy and phonological opacity. Phonology 16. 331?399. http://journals.cambridge.org/abstract_S0952675799003784. McMurray, Bob, Richard N Aslin & Joseph C Toscano. 2009. Statistical learning of pho- netic categories: insights from a computational approach. Developmental science 12(3). 369?78. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid= 2742678&tool=pmcentrez&rendertype=abstract. McMurray, Bob, Michael K Tanenhaus & Richard N Aslin. 2002. Gradient effects of within-category phonetic variation on lexical access. Cognition 86(2). B33?42. http: //www.ncbi.nlm.nih.gov/pubmed/12435537. Mielke, Jeff. 2008. The Emergence of Distinctive Features. Oxford, UK: Oxford Univ Press. Mielke, Jeff, Mike Armstrong & Elizabeth Hume. 2003. Looking Through Opacity. The- oretical Linguistics 29. 123?139. 335 Miller, Kurt T, Thomas L Griffiths & Michael I Jordan. 2009. The Phylogenetic Indian Buffet Process : A Non-Exchangeable Nonparametric Prior for Latent Features. Ad- vances in Neural Information Processing Systems 22. Mor?n, Bruce. 2007. Phonological Segment Inventories and their Phonetic Variation: a substance-free approach. Generative Linguistics in the Old World (GLOWXXX). Niyogi, P & R C Berwick. 1996. A language learning model for finite parameter spaces. Cognition 61(1-2). 161?93. http://www.ncbi.nlm.nih.gov/pubmed/8990971. Noske, Roland. 1993. A Theory of Syllabification and Segmental Alternation: With Stud- ies on the Phonology of French, German, Tonkawa and Yawelmani. T?bingen: Max Niemeyer. van Oostendorp, Marc. 2006. Incomplete devoicing in formal phonology. Padgett, Jaye. 2003. Contrast and post-velar fronting in Russian. Natural Language and Linguistic Theory 21(1). 39?87. http://link.springer.com/article/10.1023/ A%3A1021879906505. Paradis, Carole. 1988. On Constraints and Repair Strategies. The Linguistic Review 6. 71? 97. Pearl, Lisa. 2007. Necessary Bias in Natural Language Learning. University of Maryland, College Park PhD Dissertation. Pearl, Lisa, Sharon Goldwater & Mark Steyvers. 2010. How ideal are we? Incorporating human limitations into Bayesian models of word segmentation. Proceedings of the 34th Annual Boston University Conference on Child Language Development. Cas- cadilla Press, Somerville. http://homepages.inf.ed.ac.uk/sgwater/papers/ bucld09-onlineseg.pdf. 336 Pearl, Lisa & Jeffrey Lidz. 2009. When domain-general learning fails and when it suc- ceeds: Identifying the contribution of domain specificity. Language Learning and De- velopment 5. 235?265. http://www.tandfonline.com/doi/abs/10.1080/ 15475440902979907. Pentus, Mati. 1993. Lambek grammars are context-free. Logic in Computer Science. 429? 433. Peperkamp, Sharon. 2003. Phonological acquisition: recent attainments and new chal- lenges. Language and speech 46(Pt 2-3). 87?113. http://www.ncbi.nlm.nih. gov/pubmed/14748441. Peperkamp, Sharon, Rozenn Le Calvez, Jean-Pierre Nadal & Emmanuel Dupoux. 2006. The acquisition of allophonic rules: statistical learning with linguistic constraints. Cog- nition 101(3). B31?841. http://www.ncbi.nlm.nih.gov/pubmed/16364279. Peperkamp, Sharon, Mich?le Pettinato & Emmanuel Dupoux. 2002. Allophonic Variation and the Acquisition of Phoneme Categories. Proceedings of the 27th Annual Boston University Conference on Language Development. Perfors, Amy, Joshua B Tenenbaum & Terry Regier. 2011. The learnability of abstract syntactic principles. Cognition 118. 306?338. http://www.ncbi.nlm.nih.gov/ pubmed/21186021. Phelps, Elaine. 1975. Iteration and disjunctive domains in phonology. Linguistic Analysis 1. 137?172. Phillips, Lawrence & Lisa Pearl. 2012. ?Less is More? in Bayesian word segmentation: When cognitively plausible learners outperform the ideal. Proceedings of the 34th Annual Conference of the Cognitive Science Society 2011. 863?868. 337 Piggott, Glyne & Heather Newell. 2006. Syllabification, stress and derivation by phase in Ojibwa. McGill Working Papers in Linguistics. 1?47. http://scholar.google. com/scholar?hl=en&btnG=Search&q=intitle:Syllabification,+stress+ and+derivation+by+phase+in+Ojibwa#0. Pinker, Steven & Alan Prince. 1988. On language and connectionism: Analysis of a par- allel distributed processing model of language acquisition. Cognition 28. 73?193. Pisoni, David B. 1973. Auditory and phonetic memory codes in the discrimination of consonants and vowels. Attention, Perception, and Psychophysics 13. 253?260. http: //www.springerlink.com/index/88V549745842656J.pdf. Pitt, Mark a., Keith Johnson, Elizabeth Hume, Scott Kiesling &William Raymond. 2005. The Buckeye corpus of conversational speech: labeling conventions and a test of tran- scriber reliability. Speech Communication 45(1). 89?95. http://linkinghub.elsevier. com/retrieve/pii/S0167639304000974. Poliquin, Gabriel. 2006. Canadian French Vowel Harmony. Harvard PhD Dissertation. http://roa.rutgers.edu/files/861-0906/861-POLIQUIN-0-0.PDF. Port, Robert & Adam Leary. 2005. Against formal phonology. Language 81(4). 927?964. http://muse.jhu.edu/journals/lan/summary/v081/81.4port.html. Port, Robert & Michael L. O?Dell. 1986. Neutralization of syllable-final voicing in Ger- man. Journal of Phonetics 13. 455?471. Prince, Alan. 1975. The Phonology and Morphology of Tiberian Hebrew. MIT PhD Dis- sertation. Prince, Alan & Paul Smolensky. 2004. Optimality Theory: Constraint interaction in gen- erative grammar. Oxford: Blackwell. 338 Pye, Shizuka. 1986.Word-final devoicing in Russian. Cambridge Papers in Phonetics and Experimental Linguistics 5. 1?10. Pylkk?nen, Liina, Andrew Stringfellow & Alec Marantz. 2002. Neuromagnetic Evidence for the Timing of Lexical Activation: An MEG Component Sensitive to Phonotactic Probability but Not to Neighborhood Density. Brain and Language 81(1-3). 666?678. http://linkinghub.elsevier.com/retrieve/pii/S0093934X01925556. Quine, Willard Van Orman. 1960. Word and Object. Cambridge, MA: MIT Press. Rice, KD. 1993. A reexamination of the feature [sonorant]: The status of ?sonorant ob- struents?. Language 69(2). 308?344. http://www.jstor.org/stable/10.2307/ 416536. Ringen, Catherine O & Robert M Vago. 1998. Hungarian vowel harmony in Optimality Theory. Phonology 15. 393?416. Rubach, Jerzy. 1984. Cyclic and Lexical Phonology: The Structure of Polish. Dordrecht: Foris Publications. Rubach, Jerzy & Geert Booij. 1990. Syllable structure assignment in Polish. Phonology 7(1). 121?158. http://journals.cambridge.org/abstract_S0952675700001135. Rumelhart, David E & James L McClelland. 1986. On learning the past tenses of English verbs. In David E Rumelhart, James L McClelland & The PDP Research Group (eds.), Parallel distributed processing: explorations in the microstructure of cognition. vol- ume 2: psychological and biological models. Cambridge, MA: Bradford Books/MIT Press. 339 Salor, ?, B.L. Pellom, T. Ciloglu & M Demirekler. 2007. Turkish speech corpora and recognition tools developed by porting SONIC: Towards multilingual speech recog- nition. Computer Speech and Language 21. 583?593. Sampson, Geoffrey. 2002. Exploring the richness of the stimulus. The Linguistic Review 19. 73?104. Scharinger, Mathias, Aditi Lahiri & Carsten Eulitz. 2010. Mismatch negativity effects of alternating vowels in morphologically complex word forms. Journal of Neurolin- guistics 23(4). 383?399. http://linkinghub.elsevier.com/retrieve/pii/ S091160441000028X. Scharinger, M. & A. Lahiri. 2010. Height Differences in English Dialects: Consequences for Processing and Representation. Language and Speech 53(2). 245?272. http:// las.sagepub.com/cgi/doi/10.1177/0023830909357154. Shieber, Stuart. 1985. Evidence against the context-freeness of natural language. Linguis- tics and Philosophy 8. 333?343. Slavin, Tanya. 2012. Truncation and Morphosyntactic Structure in Ojicree. McGill Work- ing Papers in Linguistics 22. 1?12. https://secureweb.mcgill.ca/mcgwpl/ sites/mcgill.ca.mcgwpl/files/slavin2012.pdf. Sledd, James H. 1966. Breaking, Umlaut, and the Southern Drawl. Language 42(1). 18? 41. Slowiaczek, L.M. & D.A. Dinnsen. 1985. On the neutralizing status of Polish word-final devoicing. Journal of Phonetics 13(3). 325?341. 340 Steriade, Donc. 1987. Redundant values. CLS 23: Papers from the 23rd Annual Regional Meeting of the Chicago Linguistic Society. Part Two: Parasession on Autosegmental and Metrical Phonology. 339?62. Stevens, Jon. 2011. Learning Object Names in Real Time with Little Data. Proceedings of the 33rd Annual Conference of the Cognitive Science Society. 903?908. Stevens, Jon, John Trueswell, Charles Yang & Lila Gleitman. 2013. The Pursuit of Word Meanings. http://www.ircs.upenn.edu/$\sim$truesweb/trueswell_pdfs/ Stevens_et_al_submitted.pdf. Stevens, Kenneth N. 2002. Toward a model for lexical access based on acoustic landmarks and distinctive features. The Journal of the Acoustical Society of America 111(4). 1872. http://link.aip.org/link/JASMAN/v111/i4/p1872/s1&Agg=doi. Teh, Yee Whye & Dilan G?r?r. 2009. Indian Buffet Processes with Power-law Behavior. Advances in Neural Information Processing Systems 22. Tenenbaum, Joshua. 1999. A Bayesian Framework for Concept Learning. Cambridge, MA: MIT PhD thesis. Thibaux, Romain & Michael I Jordan. 2007. Hierarchical Beta Processes and the Indian Buffet Process. International Conference on Artificial Intelligence and Statistics. 564? 571. Uriagereka, Juan. 1999. Multiple spell-out. In Samuel D. Epstein & Norbert R. Hornstein (eds.), Working minimalism. Cambridge, MA: MIT Press. Vallabha, Gautam K, JamesL McClelland, Ferran Pons, Janet FWerker& Shigeaki Amano. 2007. Unsupervised learning of vowel categories from infant-directed speech. Pro- ceedings of the National Academy of Sciences of the United States of America 104(33). 341 13273?8. http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid= 1934922&tool=pmcentrez&rendertype=abstract. Viau, Joshua & Jeffrey Lidz. 2011. Selective learning in the acquisition of Kannada di- transitives. Language 87(4). 679?714. Wan, I-ping & Jeri Jaeger. 1998. Speech errors and the representation of tone in Man- darin Chinese. Phonology 15(3). 417?461. http://journals.cambridge.org/ production/action/cjoGetFulltext?fulltextid=40714. Wheeler, Max. 2005. The Phonology of Catalan. Oxford, UK: Oxford University Press. Williamson, Sinead, Peter Orbanz & Zoubin Ghahramani. 2010. Dependent Indian Buffet Processes. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 6. Wilson, Stephen M. 2003. A phonetic study of voiced, voiceless and alternating stops in Turkish. CRL Newsletter 15(1). 3?13. http://scholar.google.com/scholar? hl=en&btnG=Search&q=intitle:A+phonetic+study+of+voiced,+voiceless+ and+alternating+stops+in+Turkish#0. Wolpert, David. 1996. The Lack of A Priori Distinctions Between Learning Algorithms. Neural Computation 8(7). 1341?1390. Yang, Charles. 2002. Knowledge and Learning in Natural Language. Oxford: Oxford Uni- versity Press. Yip, Kenneth & Gerald Jay Sussman. 1997. Sparse Representations for Fast, One-shot learning. Proceedings of the National Conference on Artificial Intelligence. Zimmermann, Malte. 2002. Boys Buying Two Sausages Each: On the Syntax and Seman- tics of Distance-Distributivity. University of Amsterdam PhD thesis. 342 Zonneveld, Wim. 1983. Lexical and phonological properties of Dutch voicing assimila- tion. In Marcel van den Broecke, Vincent van Heuven&Wim Zonneveld (eds.), Sound structures: studies for antonie cohen, 297?312. Dordrecht: Foris Publications. 343