ABSTRACT

Title of dissertation: SENSORY AND PERCEPTUAL CODES IN CORTICAL AUDITORY PROCESSING
Francisco Israel Cervantes Constantino, Doctor of Philosophy, 2017

Dissertation directed by: Professor Jonathan Z. Simon
Department of Electrical and Computer Engineering, Department of Biology, Institute for Systems Research

A key aspect of human auditory cognition is establishing efficient and reliable representations of the acoustic environment, especially at the level of auditory cortex. Since the inception of encoding models that relate sound to neural response, three longstanding questions remain open. First, the apparently insurmountable problem that cortical responses change fundamentally depending on the category of sound presented (e.g. simple tones versus environmental sound). Second, how to integrate inner, subjective perceptual experiences into sound encoding models, given that such models presuppose direct physical stimulation, which is sometimes absent. And third, how context and learning fine-tune these encoding rules, as adaptive changes that improve listening under impoverished conditions, particularly important for communication sounds.

In this series, each question is addressed by analyzing mappings from sound stimuli delivered to, and/or perceived by, a listener onto large-scale, cortically sourced response time series from magnetoencephalography. It is first shown that the divergent, categorical modes of sensory coding may be unified by exploring acoustic representations other than the traditional spectrogram, such as temporal transient maps. Encoding models of artificial random tones, music, and speech stimulus classes were substantially matched in their structure when the stimuli were represented by their acoustic energy increases, consistent with the existence of a domain-general, common baseline processing stage.

Separately, the matter of the perceptual experience of sound via cortical responses is addressed via stereotyped rhythmic patterns that normally entrain cortical responses at the same periodicity. Here it is shown that under conditions of perceptual restoration, namely cases where a listener nonetheless reports hearing a specific sound pattern in the midst of noise, one may access such endogenous representations in the form of evoked cortical oscillations at the same rhythmic rate.

Finally, with regard to natural speech, it is shown that extensive prior experience over repeated listening to the same sentence materials may facilitate the ability to reconstruct the original stimulus even where noise replaces it, and may also expedite normal cortical processing times in listeners. Overall, the findings demonstrate cases in which sensory and perceptual coding approaches jointly continue to expand the enquiry into listeners' personal experience of the communication-rich soundscape.

SENSORY AND PERCEPTUAL CODES IN CORTICAL AUDITORY PROCESSING

by
Francisco Israel Cervantes Constantino

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park
in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2017

Advisory Committee:
Professor Jonathan Z. Simon, Chair
Professor Matthew Goupell, Dean's Representative
Professor Behtash Babadi
Professor Daniel A. Butts
Professor Ellen Lau

The projects were funded by the US National Institutes of Health (NIDCD R01 DC 00843, 014085). I thank the Mexican Consejo Nacional de Ciencia y Tecnología for its support through its graduate scholarship program. I would like to thank Elizabeth Camenga, Katya Dombrowski, Benjamin Walsh, and Marisel Villafañe-Delgado for their substantive input in collecting data for, and proofreading, our first project. Thanks also to Elizabeth Nguyen, Natalia Lapinskaya, and Anna Namyst for their excellent technical assistance.

Table of contents

List of Figures (iii)

Chapter I: Introduction
    Physiological origins of MEG signals (1)
    Neural representations in the auditory system (8)

Chapter II: Functional significance of spectrotemporal response functions obtained using magnetoencephalography
    Summary (17)
    Introduction (18)
    Results (21)
    Discussion (33)
    General methods (41)

Chapter III: Dynamic cortical representation of perceptual filling-in for missing acoustic rhythm
    Summary (51)
    Introduction (52)
    Results (53)
    Discussion (63)
    General methods (68)

Chapter IV: Prior knowledge influences cortical latency and fidelity of the neural representation of missing speech
    Summary (74)
    Introduction (75)
    Results (78)
    Discussion (82)
    General methods (89)
Chapter V: Conclusions (96)
Appendix A (98)
Appendix B (104)
Appendix C (107)
References (108)

List of Figures

Figure 1.1 Magnetic field group generators (2)
Figure 1.2 Biophysical structures facilitating neuromagnetic signals (3)
Figure 1.3 Direct evidence of open field magnetic signals from layer-organized pyramidal cells (4)
Figure 1.4 MEG sensitivity and resulting field distributions: forward and inverse approaches (6)
Figure 1.5 The spike-triggered average method (10)
Figure 1.6 Stimulus ensemble qualitatively affects STRF estimate at higher-order auditory areas (13)
Figure 1.7 Iterative STRF estimation method via boosting (16)
Figure 2.1 Spectrotemporal encoding models of MEG signals from human auditory cortex (22)
Figure 2.2 Consistency between response function predictive model features and evoked potentials (24)
Figure 2.3 STRFs generated using different stimulus representations achieve different levels of functionality (27)
Figure 2.4 Interpretational power from stimulus representations across STRFs from different stimulus classes (31)
Figure 3.1 Neural representations of (un)modulated masked sound from a representative subject (55)
Figure 3.2 Percept-specific endogenous representations of patterned sound (57)
Figure 3.3 Rhythmic target power acts as a discriminant neural statistic for perceived rhythm (59)
Figure 3.4 Stimulus- and percept-specific spectrotemporal modulations of cortical activity during restored rhythm (62)
Figure 4.1 Cortical reconstruction of acoustically missing word-level speech envelope from noise by repeated replays of narrated story (80)
Figure 4.2 Frequent repetitions of natural speech speed up their cortical processing (82)
Figure A.1 Example of equivalence between standard evoked potentials and temporal response function components (98)
Figure A.2 Stimulus and transfer function differences across stimulus classes (99)
Figure A.3 Models of subject temporal response function principal peaks (100)
Figure A.4 Representation format transformed from early to mid latency speech processing at individual level (101)
Figure A.5 Addition of static nonlinearity to multitone response properties (102)
Figure B.1 Spatial filters associated with auditory steady-state responses (104)
Figure B.2 Neural representations of a rhythmic pattern embedded in noise (105)
Figure B.3 No systematic acoustic influence of ambiguous perception of stimuli (106)

Chapter I

Introduction

Physiological origins of MEG signals

Information processing in the nervous system relies on the ability of neuron cell membranes to transfer electric charge in an organized manner. Each transfer can be modeled as a point ionic current event, which in turn generates a magnetic field in its near vicinity. Current transfer constrained to a neural wire-like segment (a 'process') implies an effective displacement along the course of its main axis (Fig. 1).
By Maxwell's equations, a new magnetic field distribution is thereby created, with a geometry that can be represented as a series of concentric magnetic field contours whose strength decreases with distance. If all transfer locations and directions were distributed randomly, the resulting local fields would add and cancel each other at chance, superposing into a global picture of near-zero field at any given time, in a "closed field" configuration (Fig. 1). It follows that for an MEG signal to be measurable, some degree of underlying coherent anatomical organization is necessary, so that constructive superposition is favored (an "open field" configuration) and the ensemble acts as an effective magnetic source of greater strength than any of its constituents. Fortunately, in many locations the brain has a sufficient degree of organization for net fields from different locations to add constructively, and thus amplify into a neuromagnetic signal.

Figure 1. Magnetic field group generators. A single cell generator (left) with a mostly linear architecture may be active, with its charge effectively transferred along its axis. This is the source of a magnetic field distribution, whose strength can be depicted as contour lines that decrease with distance from the source (thick arrow); its direction (contour arrows) depends additionally on the source orientation. Given these constraints, the spatial organization (directions and orientations) of multiple current generators (right) determines how each contribution adds up externally: adding up as an open field, cancelling out as a closed field, or, most likely, taking an intermediate form as a mixed field. Activity may change from one moment to the next, and an aim of the MEG technique is to measure this global image, in particular how it evolves over time.

Still, even if a neuromagnetic signal is measurable, it is not necessarily directly interpretable. Fig. 1 illustrates the point that anatomical descriptions of the underlying generators are necessary for an unambiguous interpretation. At the individual level, neurons have a striking variety of shapes and sizes, which may translate into different current transfer properties. Multipolar neurons with long axon processes and many dendrites are frequently found in the mammalian central nervous system, featuring among them the pyramidal cell (Fig. 2), with a chief excitatory function in layered structures such as cerebral cortex and hippocampus. For our purposes, its main morphological characteristic is its single, long, thick apical dendrite, which dictates a principal axis of information transfer from dendrites to axon, and of (bidirectional) current flow along that axis. Configurationally, these units are locally arranged parallel to each other, and normal to the tissue surface layer. Globally, this histological pattern is maintained along the folded surface of cortex or hippocampus. When jointly active, events at individual units may be amplified, generating magnetic fields of sufficient strength to be measurable with current detector systems [1].

Figure 2. Biophysical structures facilitating neuromagnetic signals. Information transfer in a pyramidal excitatory cell from the hippocampus (left) flows upwards from dendrites to soma to axon. Most dendritic branches appear to end at the main apical dendrite, leading to a principal axis of current flow.
These cell types have an inverted-pyramid-like soma of about 20 µm average size and, histologically, often arrange bidimensionally in parallel, as shown by a silver stain of cerebral cortex (middle). Thus, if simultaneously active, the magnetic field contours surrounding their surface domain may superpose additively. This normal-oriented-to-surface configuration is a major building block of cortical layered structure, as is visible at the millimetre scale in a sample motor cortex section (right). Images modified from: NeuroLex.org, Stritch School of Medicine (Lumen), and [2].

The anatomical basis for neuromagnetic signals arising from these biophysical arrays was directly addressed in an in vitro study by Wu & Okada [3], using a slice of parallel pyramidal-cell tissue that would in principle facilitate an open field distribution (Fig. 3). Neuromagnetic signals were successfully measured from these samples, with consistent directionality according to the physiological current flow patterns predicted by stimulation at either extreme of the pyramidal cell set (soma or far dendrites). Depending on stimulation distance from the soma (near or far), evoked signals featured bi- or triphasic behavior, respectively, where the first phase is explainable by intracellular current flow (from stimulation origin to the opposite axial end), but the last phase flows in both cases from apex to base, consistent with simultaneous electrophysiology data. The difference between first and last phases was then interpreted in terms of cells that were directly stimulated (~30%) initially, versus late neuromagnetic signals originating from recurrent excitatory connections within the slice [4].

Figure 3. Direct evidence of open field magnetic signals from layer-organized pyramidal cells. (A) In the Wu & Okada experiment [3], a slice of hippocampus (CA3) tissue, where pyramidal cells are closely packed in parallel and fixed with equal orientations, is placed in the centre of a four-detector coil array (SQUIDs) screening magnetic fields traversing each coil's circular area. Cell somata and a few cells' dendritic processes are represented. The slice was then stimulated along either the apical dendrite or the basal dendrite (soma) regions, defining either of two opposite current flow directions. Modified from [4]. (B) The right-hand rule for the magnetic field caused by a current-carrying wire predicts its direction as it curls around the wire's axis; the directions are opposite depending on current flow as determined by stimulation location (bold). (C) In the coil array, a neuromagnetic signal consistent with the expected open field distribution will be recorded at coil detectors 2 and 4, but not at detectors 1 and 3, as the latter two lie along the current flow axis and no field traverses their circular area. Depending on the current flow direction, any magnetic field reaching coils 2 and 4 may be entering from below (negative) or above (positive) the coils (equivalently, exiting from above or below them). (D) The resulting neuromagnetic signals, observed only at coils 2 and 4, are near-mirror images of each other, with sign dependent on stimulation site. Inhibitory responses are blocked in this preparation.

While hippocampus and cerebral cortex both have an anatomical organization that favors representations of neural activity as equivalent dipoles, it is not the case that all neural activity spanning their entire surfaces is accessible to noninvasive MEG recording.
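The superposition argument above can be made concrete with a few lines of numerical simulation. The sketch below is a minimal illustration rather than a biophysical model: it sums the infinite-homogeneous-medium field of N point current dipoles at a single sensor location, with the patch size, per-unit dipole moment (~20 fA·m), and sensor distance all assumed purely for illustration. Aligned dipole moments (an open field) sum coherently, growing as N, while randomly oriented moments (a closed field) grow only as √N and remain close to cancellation.

```python
import numpy as np

MU0_OVER_4PI = 1e-7  # mu_0 / 4*pi, in T*m/A

def dipole_field(q, r0, r):
    """Field at r of a point current dipole q at r0 (infinite homogeneous
    medium): B = (mu0 / 4 pi) * q x (r - r0) / |r - r0|^3."""
    d = r - r0
    return MU0_OVER_4PI * np.cross(q, d) / np.linalg.norm(d) ** 3

rng = np.random.default_rng(0)
n = 10_000                                   # number of active units
sensor = np.array([0.0, 0.0, 0.05])          # a sensor 5 cm above the patch
pos = rng.normal(scale=1e-3, size=(n, 3))    # units inside a ~1 mm patch
q_mag = 20e-15                               # ~20 fA*m per unit (illustrative)

# Open field: every dipole shares the apical-dendrite orientation.
aligned = np.tile([q_mag, 0.0, 0.0], (n, 1))
# Closed field: orientations drawn at random, so contributions tend to cancel.
dirs = rng.normal(size=(n, 3))
scattered = q_mag * dirs / np.linalg.norm(dirs, axis=1, keepdims=True)

for label, qs in (("aligned (open field)", aligned),
                  ("random (closed field)", scattered)):
    B = sum(dipole_field(q, r0, sensor) for q, r0 in zip(qs, pos))
    print(f"{label}: |B| = {np.linalg.norm(B):.1e} T")
```

With these assumed numbers, the aligned case yields a net field on the order of femtotesla, against a far weaker random-orientation field, mirroring the open/closed field distinction of Fig. 1.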
The Wu & Okada study demonstrates an example of location-dependent signal measurement: virtually zero magnetic field can be recorded along the current axis, as the cytoarchitecture only builds up an open field elsewhere. An implication for human studies is that the convoluted cortical surface constrains which regions are more or less visible to MEG sensors (Fig. 4A). As a rule of thumb, areas lying within sulci will be oriented parallel to the scalp surface and thus visible, while domains located on gyri will be oriented normal to the scalp (with Heschl's gyrus a notable exception) and are thus magnetically inaccessible to sensors directly above. This problem is diminished by the presence of multiple sensors over the head array, some of which may be further away from the generator but in an adequate relative orientation to measure a signal from the radial source in question. This issue of visibility can be modeled by the "forward problem": that is, mapping activity from a hypothesized anatomical source, such as the superior temporal lobe, to a field distribution recorded over the sensor array. This minimally requires projecting the field predicted by the laws of electromagnetism onto the spatial distribution of sensors, to generate an image of the resulting magnetic field distribution. In real scenarios, it is the latter which is first available to the experimenter (Fig. 4B), so the "inverse problem", estimating the likely anatomical source of the current distribution given the magnetic field distribution, has to be addressed.

Figure 4. MEG sensitivity and resulting field distributions: forward and inverse approaches. (A) A cortical domain's organization and location relative to a detector coil dictate whether neuromagnetic signals may be retrieved from that domain. When apical dendrites across the superior temporal gyrus are active, a collective open magnetic field distribution is generated, but little of it will reach outside the scalp, rendering it invisible to a magnetic field detector at that location. Such neurons are radially organized, unlike tangential units from domain A, whose open field distribution exits and re-enters the scalp, and is thus able to reach and cross through one detector coil of the MEG sensor set. (B) A bi-hemispheric magnetic field distribution, the first activity source map available to the experimenter, arises during auditory stimulation. Red isocontour lines approximate regions of equal magnetic field that cross from the scalp into the MEG helmet surface, curl through exterior space, cross back into the helmet surface (with opposite sign), and then back into the scalp. The diagram overlay suggests its origin at the superior aspect of the temporal lobe.

Aside from spatial coherence requirements, open fields also gain strength when generator units are active with temporal coherence. Estimates of the minimum number of excitatory pyramidal units that need to be simultaneously active to generate a readable MEG signal are on the order of 10⁴ to 10⁵, based on current flow measurement resolution limits of 10 nA·m [5]; the minimum number of units may correspond, based on anatomical estimates, to columnar patch ensembles of about 0.6 mm² [6], although column densities may vary across different functional cortical areas. Coherence may of course vary over time, and MEG's excellent temporal resolution can be used to probe the course of distributed, highly synchronized activation patterns throughout cortical domains.
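For intuition on the two problems just described, the following hypothetical sketch condenses the forward model into a lead field matrix and solves the inverse problem with the textbook regularized minimum-norm estimate; this is not the procedure used in this dissertation. The array size, source count, noise level, and regularization constant are all assumed for illustration, and in practice the lead field is computed from head geometry and the laws of electromagnetism rather than drawn at random.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sensors, n_sources = 157, 40   # e.g., a 157-channel array (illustrative)

# Forward problem: a lead field matrix L maps source amplitudes j to
# sensor readings b = L @ j + noise. Here L is random for demonstration.
L = rng.normal(size=(n_sensors, n_sources))
j_true = np.zeros(n_sources)
j_true[3] = 1.0                                  # one active cortical patch
b = L @ j_true + 0.1 * rng.normal(size=n_sensors)

# Inverse problem (ill-posed): one classical answer is the regularized
# minimum-norm estimate, j_hat = L.T @ (L @ L.T + lam * I)^-1 @ b.
lam = 1.0
j_hat = L.T @ np.linalg.solve(L @ L.T + lam * np.eye(n_sensors), b)
print("estimated peak source:", np.argmax(np.abs(j_hat)))  # expect index 3
```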
Altogether, the physiological basis of neuromagnetic signals makes them amenable to questions of how and when cortical domains covary with experimental timeseries variables.

Neural representations in the auditory system

Finding straightforward encoding relationships, or neural representations, of sensory information has often been done from single unit activity. Information about sensory stimuli may also be encoded in domains spanning measurement scales other than the single unit, for example both sub-scale membrane fluctuations and supra-scale aggregate fields, such as multi-unit ensembles and the M/EEG signal. One consequence of interpreting sensory events within these neural fluctuations is that it extends neural coding strategies from spiking activity to smooth, continuous timeseries.

Encoding and representations: the receptive field

How does the human cortex represent sound? When responses from auditory areas are found to covary with a sound stimulus statistic, the feature in question is said to be represented in the neural response. If a set of represented features is found in a way that predicts both behavioral and physiological outcomes – for instance, an encoding predictor arising in both animal models and supporting human psychophysics data (cf. [7]) – then it qualifies as one of several possible solutions to the encoding problem.

The encoding question can be summarized as "how does a neural domain represent a physical variable x?". Part of this question is answered by the extensive tonotopic organization of the auditory system, which shows clustering of neighboring neurons by common response selectivity to neighboring spectral tuning regions. Refining the encoding question for the auditory system to "when does a neural domain represent x?", the task becomes one of considering spectral features that dynamically evolve over time. Given a neural domain response, what estimate represents neural selectivity in terms of both spectral tuning and temporal properties, such as latency or periodicity, at the same time? This description is embodied in the spectro-temporal receptive field (STRF), originally obtained by manipulating neuron spike datasets in a procedure known as reverse correlation or triggered correlation [8], [9]. This original method represented single unit selectivity: if each spiking time-series of neural output is reversed in time, then individual spikes conceivably denote time flags, or triggers, for particular features in the (time-reversed) neural input or stimulus, since each marks the occurrence of a neural event. The collection of such spike-triggered features (equal in number to the number of spikes in the set) is then the ensemble of stimuli that precede a neural event, and this ensemble is summarized by averaging and ordering by frequency, thus representing the correlogram of that neuron's STRF (Fig. 5) [10]. In practice, this procedure can be reformulated as a "black-box" operator mapping from sound input to a neural output such as spike activity data or, as later extended, continuous field activity [11], [12]. This latter option allows examination of the implications for neural processing stemming from auditory neural assemblies in the aggregate, such as in MEG.

Figure 5. The spike-triggered average method. In this example, a neuron's receptive field is described along two unspecified dimensions X and Y, which parameterize the addresses where both "positive" and "negative" stimuli (black/white) are delivered over the entire experimental session.
Positive or negative stimulus delivery at specific addresses in turn associates with the generation of one or more spike events. In the image, surface maps of all key addresses are summarized according to stimulus sign; the spike-triggered average is then the difference between maps. Modified from [13].

This systems-theory characterization rests on important assumptions and limitations: the output should not depend on future states of the system, nor on its entire past history; neural domains display nonlinearities as a consequence of simultaneous dependence on several parameters (and/or on several terms of the same parameter); and responses may depend on high-order features of the input – arbitrary stimulus categories being a classic example [14]. A fruitful approach has been to approximate the system by its linear representation, with the possibility of adding nonlinear terms of increasing order [11] when necessary. When the actual order is unknown (where "order" refers to the structure of present and past dependencies in the system), the generalized receptive field may require describing an arbitrary number of expansion kernels. In contrast, if prior knowledge of the actual or approximate order has been established, the second-order kernel of the generalized receptive field expansion directly relates to the STRF as a spectro-temporal response function; the second-order kernel and the STRF are identical if the system itself is of order two [11], [12], speaking to the severe theoretical restrictions on the ability of the linear STRF model to capture every aspect of system behavior. Within its limited predictive space, however, the potential return in interpretational power may be high, given that its domain overlaps with a fundamental organizational principle of the auditory system.

Testing linear models of auditory coding

Only in the unlikely case where the system's response is a mere proportionality relationship between input and output can it be said to be linear¹, with the actual receptive field entirely captured by the system's response function. More typically, unknown higher-order nonlinearities in the system will not be captured by the STRF, contributing to a decrease in the predictive power of the model. Therefore, estimation of linear STRF models enables assessment of the extent to which neural output is directly proportional to spectral change in the input over time (for this and other interpretations see [15]); if this extent is considerable, then the spectrogram is said to be linearly encoded by the neural domain in question [16].

¹This is different from the system's order. In this special case, all aspects of a second-order nonlinearity in the system are completely explained by the second-order kernel. In other words, analysis of the system at the second expansion term does not preclude the fact that the system may involve a linear transformation from input to output.

Among the critical issues related to testing STRF model validity is the choice of an adequate stimulus ensemble. For mathematical reasons, random noise stimuli such as Gaussian white noise (GWN) were instrumental in algorithmic implementations of the original system kernel expansion described above.
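Under GWN stimulation, the spike-triggered average recovers the linear kernel directly, which is why such stimuli were historically so convenient. The minimal simulation below illustrates this, assuming a toy threshold neuron with a single tuned spectrotemporal bin; all sizes and the spiking threshold are illustrative choices, not drawn from the studies cited here.

```python
import numpy as np

rng = np.random.default_rng(2)
n_t, n_f, n_lags = 50_000, 8, 30            # illustrative sizes

# A Gaussian-white-noise "spectrogram" stimulus, and a ground-truth kernel
# for a simulated neuron tuned to one frequency channel at lag 10.
stim = rng.normal(size=(n_t, n_f))
k_true = np.zeros((n_lags, n_f))
k_true[10, 3] = 1.0

# Linear drive (stimulus convolved with the kernel), spikes by thresholding.
drive = np.array([np.convolve(stim[:, f], k_true[:, f])[:n_t]
                  for f in range(n_f)]).sum(axis=0)
spikes = np.flatnonzero(drive > 2.0)
spikes = spikes[spikes >= n_lags]           # keep spikes with full history

# Spike-triggered average: mean stimulus slice preceding each spike,
# with row l holding the stimulus l bins before the spike.
sta = np.mean([stim[t - n_lags + 1:t + 1][::-1] for t in spikes], axis=0)
print("STA peak at (lag, freq):",
      np.unravel_index(np.abs(sta).argmax(), sta.shape))  # expect (10, 3)
```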
However, failures to generalize these estimates' predictive ability – that is, to apply the model estimate to novel stimuli other than those in the original estimation ensemble – arise especially beyond peripheral stages [17]. For example, the spike-triggered average method is inappropriate for STRF estimation from stimuli beyond white noise, such as natural speech [18] (Fig. 6); the rationales behind similar pitfalls are reviewed in [17]. This problem points to the key issue of identifying sound stimulation patterns adequate for the auditory stage in question, as the ascending auditory system shows tuning to increasing levels of spectrotemporal complexity [16]–[18].

Figure 6. Stimulus ensemble qualitatively affects the STRF estimate at higher-order auditory areas. For auditory neurons tuned to complex stimuli, the reverse correlation method yields spurious results under spike-triggered averaging, since responses are not driven by uncorrelated stimuli (left column). Strategies accounting for statistical differences between stimulus classes may extract aspects relevant to the general stimulus–response relationship, which may appear as consistently interpretable STRFs across different stimulus classes (right column), even from complex-tuned auditory areas, as illustrated here. Image from [18].

Overall, these observations suggest that the validity of a linear model based on an STRF may be affected not only by the intrinsic nonlinear nature of the system, but also by selection criteria laid down by the experimenter, namely the specific neural domain probed and the stimulus ensemble/class under consideration. In turn, this means that STRF estimates yielding poor predictive power do not necessarily invalidate the encoding relationship between the stimulus class used and the neural site probed; it only suggests that (i) given the encoding model parameters, a linear transform is unlikely to represent the input–output relationship deterministically; (ii) the estimation technique poorly approximates the linear aspect of the system; or (iii) a combination of both.

Estimating linear models of auditory coding

Robust estimation techniques attempt to eliminate the risk of falling under case (ii) above. Among these are optimization models that seek to reduce the error between a neural response and the response predicted by the STRF. In one such approach, error minimization occurs recursively, via step-wise modifications to the STRF. Boosting, an optimization approach in this spirit, belongs to a family of gradient descent techniques applied to auditory data [19]; its name denotes the strategy of first running an arbitrary, weak estimation algorithm to produce an estimate that is only required to perform slightly better than chance. The same algorithm is then run several times on different instances of the dataset (for example, re-weighted versions of the data). Initially low accuracy rates are improved ('boosted') once a single, more accurate estimate is built from the initial poor training outputs by incorporating them into a jointly fitted additive expansion [20]–[22].
Incorporation criteria and the choice of training subsets may vary; the method denoted here as boosting corresponds to a forward stage-wise (but not joint) fitting that follows a greedy heuristic, adding the contribution leading to the largest available mean-squared-error reduction at each step [19], [23]; such reduction is desirable because it amounts in turn to maximizing the predictive power of the model [24]. Operationally, STRF estimates by boosting are initialized as a null matrix of dimensions T×F, where T equals the number of time-lag bins and F the total number of frequency bins; optimization proceeds by separately exploring fixed increments and decrements in each single spectrotemporal bin. The exploration yields a total of 2×F×T possible candidates, among which the one giving the best mean-squared-error reduction is selected and accumulated onto the running STRF, in gradient descent fashion (Fig. 7). The procedure is iterated until further modifications introduce undesirable behavior, such as a sustained increase in mean-squared error [19], since the method is not guaranteed to find a global optimum. The final STRF estimate then consists of the history of locally optimal choices added recursively. Formulations of the implementation, along with a description of preventive measures against overfitting (e.g. cross-validation), are available in [19], [25].

Figure 7. Iterative STRF estimation method via boosting. At each step, the one address in the spectro-temporal map that most reduces the error between response and prediction is kept as a seed for the next iteration. As repetitions eventually lead to noisier estimates, stopping at an intermediate step (circled) prevents further reduction in error due exclusively to statistical properties of the stimulus–response ensemble used for training. Overfitting prevention is achieved by optimizing with respect to reductions in generalization error on novel stimulus–response ensembles not used in training. Image from [19].
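The procedure just described reduces to a few lines of code. The sketch below is a simplified rendition of boosting for STRF estimation, not the reference implementation of [19]: it scores all 2×F×T fixed-step candidates in closed form from the current residual, applies the best one, and stops at the first rise in validation error rather than a sustained one.

```python
import numpy as np

def predict(strf, stim):
    """Linear STRF prediction: convolve each frequency channel of the
    stimulus with its column of lag weights, then sum over frequency."""
    n_t, n_f = stim.shape
    pred = np.zeros(n_t)
    for f in range(n_f):
        pred += np.convolve(stim[:, f], strf[:, f])[:n_t]
    return pred

def boost_strf(stim, resp, stim_val, resp_val, n_lags=30, step=0.01,
               max_iter=2000):
    """Greedy boosting: repeatedly add +/-step to the single (lag, freq)
    bin whose update most reduces training MSE; stop when error on the
    held-out validation pair starts to rise (simplified early stopping)."""
    n_t, n_f = stim.shape
    strf = np.zeros((n_lags, n_f))
    # Per-bin stimulus energy, constant across iterations.
    energy = np.array([[stim[:n_t - lag, f] @ stim[:n_t - lag, f]
                        for f in range(n_f)] for lag in range(n_lags)])
    best_val = np.inf
    for _ in range(max_iter):
        resid = resp - predict(strf, stim)
        # Correlation of the residual with each lagged stimulus channel.
        corr = np.array([[resid[lag:] @ stim[:n_t - lag, f]
                          for f in range(n_f)] for lag in range(n_lags)])
        # MSE reduction of the best-signed step at each candidate bin.
        gain = 2 * step * np.abs(corr) - step ** 2 * energy
        lag, f = np.unravel_index(gain.argmax(), gain.shape)
        strf[lag, f] += step * np.sign(corr[lag, f])
        val_err = np.mean((resp_val - predict(strf, stim_val)) ** 2)
        if val_err > best_val:
            strf[lag, f] -= step * np.sign(corr[lag, f])  # undo and stop
            break
        best_val = val_err
    return strf
```

Scoring candidates via the residual correlation is equivalent to testing every increment and decrement explicitly, because a step s at one bin changes the training MSE by exactly (s² × energy − 2s × correlation) / T.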
Chapter II

Functional significance of spectrotemporal response functions obtained using magnetoencephalography

Summary

The spectrotemporal response function (STRF) model of neural encoding quantitatively associates dynamic auditory neural (output) responses with a spectrogram-like representation of a dynamic (input) stimulus. STRFs were experimentally obtained from whole-head human cortical responses to dynamic auditory stimuli using magnetoencephalography (MEG). The stimuli employed consisted of unpredictable pure tones presented at a range of rates. The predictive power of the estimated STRFs was found to be comparable to that obtained in the cortical single- and multiunit activity literature. The STRFs were also qualitatively consistent with those obtained from electrophysiological studies in animal models, in particular their local-field-potential-generated spectral distributions and multiunit-activity-generated temporal distributions. Comparison of these MEG STRFs with others obtained using natural speech and music stimuli reveals a general structure consistent with common baseline auditory processing, including evidence for a transition in low-level neural representations of natural speech by 100 ms, when an appropriately chosen stimulus representation was used. It is also demonstrated that MEG-based STRFs contain information similar to that obtained using classic auditory evoked potential approaches, but with extended applications to long-duration, non-repeated stimuli.

Introduction

Empirically measured sensory receptive fields and response functions offer analytical characterizations of the computations attainable by the auditory system [26]–[28]. Applied linear systems methods such as the spectrotemporal response function (STRF) [11], [14], [29] have similarly led to informative computational characterizations of central auditory neural function with respect to sound encoding and perception [30]. The STRF can be viewed as a representation of the approximate neural response to changing auditory input in time or frequency; any particular functional description will vary according to the location and role of the neurons. Different stimulus classes (e.g. artificially generated sounds vs. natural sounds) may produce related, but dissimilar, STRFs from the same neural unit, speaking to fundamental processing differences (and similarities) in auditory encoding [18], [31], [32]. An emerging view in electrophysiology is that the STRF may represent a snapshot of the entire network converging onto a neuron (or group of neurons) [32], incorporating this population's activity in its neural representation of the spectrotemporal features of the stimulus [30]. As seen here, STRFs also have a role in investigations of ensemble auditory coding, using neural recordings obtained from magnetoencephalography (MEG) or electroencephalography (EEG).

STRFs directly characterize the relationship between a sound stimulus and the accompanying neural response. For neural ensembles, rather than individual neurons, many individual linear components may be jointly pooled, perhaps even superadditively (depending on the underlying neuroanatomy and neurophysiology of the signal source). Also, as in the case of a single-neuron-based STRF, it may be methodologically simpler to use controlled stimuli rather than natural sounds [24], [33], [34]. It remains to be determined how the spectrotemporal features of ensemble-based STRFs correspond to time-varying event-related potential responses (and other standard MEG/EEG measures) as a function of frequency, and to what extent the STRF encoding model can provide analogous additional information beyond predictive power. Furthermore, the STRF estimate of a stimulus–response relationship may depend on the particular representation chosen for that stimulus; in particular, it remains unknown which specific stimulus representations are optimal for the purpose of matching STRF features to neural function [35], and whether any such choices can address the key question of how to generalize across stimulus classes, from artificial to natural stimuli. Finally, it is important to discuss the overlap between these non-invasively obtained STRFs and those available from local field potential (LFP) data or other invasive recordings.

In order to address these questions, evoked cortical activity recordings from healthy listeners were obtained with MEG during active listening to pseudo-random multi-tone patterns [33], [36] presented at a variety of rates.
STRFs were obtained per subject and condition, in order to assess the extent to which the MEG responses were linearly explainable by a sparse representation of the stimulus sound pattern, and whether rate-related changes are consistent with those found using invasive electrophysiological techniques. Peak components in STRFs and temporal response functions (TRFs) were identified and their latencies compared to those obtained with standard tone-based averaging. Alternative representations of the stimulus, including the auditory spectrogram, were used for reverse correlation in order to constrain the space of stimulus representations given the properties of the MEG cortical signal. Finally, these functionally informative STRFs were compared to those from datasets of studies using natural speech [37] and music [38]. This allowed an investigation of ensemble-dependent issues arising from STRF comparisons when using artificial vs. natural stimuli [31].

MEG-based STRFs are shown to functionally explain considerable amounts of response variability while revealing a parsimonious mapping of response features seen in classic averaging methods onto those obtained from dynamic stimulus timeseries. Quantitatively, the MEG-based STRFs account for levels of predictive power similar to single- and multiunit responses in auditory cortex [24]. Qualitatively, the mappings show reasonable correspondence with those from local field potential activity in animal models [39], [40] and manifest similar stimulus dependencies (e.g., density [33]). We find that similar STRF structure is seen across responses to stimuli as diverse as natural speech and music, demonstrating convergence across stimulus classes. This last result, however, depends on the use of a specific (sparse) representation of acoustic stimuli, the nature of which provides additional knowledge regarding the role of spectrotemporal modulations in predictive frameworks of auditory cortical representations over a wide range of dynamic sound classes.

Results

MEG cortical responses predictable from the STRF linear model. Potential successes of the STRF as a linear model to predict MEG responses from acoustic stimuli are evaluated by comparing the actual vs. predicted responses which, unlike spike-generated STRFs, are continuous waveforms (Fig 1a). Model predictions are obtained by linearly convolving the corresponding STRF with the stimulus representation, using cross-validation (separation of training data from testing data) to prevent overfitting, which makes this a conservative estimate due to noise present in the testing data [24], [34]. If instead only the training data are used, i.e., fitting to the same data as is tested, STRF estimates provide a stringent upper limit on how good any linear prediction can be. STRFs estimated using cross-validation predict well the large negative deflections (Fig 1a, red) that follow tone onsets (~100 ms post impulse), but, unlike those from training (Fig 1a, blue), are less accurate for positive excursions (both data sets summarized in Fig 1b). The ability of the STRF model to capture the encoding relationship between sound patterns and cortical responses can be measured as the fraction of response variability explainable by the linear model, estimated on an individual condition and subject basis, once intrinsic response variability (unrelated to the stimulus) has been removed [24], [34].
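In code, the two estimates contrasted above differ only in which data are scored. The following is a minimal sketch, reusing the hypothetical `predict` and `boost_strf` helpers from the Chapter I boosting example, and assuming a single contiguous train/test split in place of the full cross-validation scheme described in the Methods:

```python
import numpy as np

def fraction_explained(pred, resp):
    """Fraction of response variance captured by a prediction:
    1 - var(residual) / var(response)."""
    return 1.0 - np.var(resp - pred) / np.var(resp)

def crossval_power(stim, resp, split=0.9):
    """Conservative estimate: fit on an early segment of the recording,
    score on the held-out remainder (never seen during fitting)."""
    k = int(split * len(resp))
    strf = boost_strf(stim[:k], resp[:k], stim[k:], resp[k:])
    return fraction_explained(predict(strf, stim[k:]), resp[k:])

def training_power(stim, resp):
    """Upper-limit estimate: fit and score on the same data."""
    strf = boost_strf(stim, resp, stim, resp)
    return fraction_explained(predict(strf, stim), resp)
```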
MEG STRF predictions were found to range as high as 34% of variance explained across participants and rates, using cross-validated data. When the fraction of variance explainable by the model was compared with normalized noise power (or inverse SNR), the explainable fraction in the theoretical noiseless limit was estimated to be 23.0 ± 2.0% (mean ± st. dev.; CI: 19.0–26.9%) as part of a significant linear regression relationship (F = 45.9; p = 2.7×10⁻⁹; R² = 0.386), with an upper limit of 71% provided by training-data-only results (Fig 1c).

Figure 1. Spectrotemporal encoding models of MEG signals from human auditory cortex. a) A 7 s sample recording of an MEG response to a sparse multitone pattern (2 tones/s), with STRF-based predictions. b) STRFs were optimized by iteratively minimizing prediction error on the entire dataset, referred to as training (blue, r = 0.74), or alternatively on their ability to generalize (cross-validate) over testing datasets (red, r = 0.62). c) The predictive power of the STRF models is shown by linear regression of individual STRFs across participants and conditions on their corrected normalized noise power (i.e., inverse signal-to-noise ratio, an indicator of trial-by-trial reliability; see Methods). Extrapolation of performance to zero noise power gives a noise-corrected expected performance for both the conservative cross-validation-based estimates and the fundamental-upper-limit training estimates.

Fraction of response explained by STRF features consistent with standard evoked potentials. STRFs based on MEG responses display consistent spectrotemporal structure in the form of positive–negative–positive complex deflections (Fig 2a) coinciding with typical auditory cortical latencies (e.g. those of the P1–N1–P2 complex in averaged EEG responses to isolated tones). In particular, the multitone STRFs demonstrate strong negative responses at ~100 ms post impulse onset (STRF100). The specific STRF100 latency depends on stimulus frequency, varying by ~20 ms over the frequency range 180–700 Hz; at higher frequencies the latency is approximately constant (STRF100 latencies for 2 tones/s shown in Fig 2b, black). STRF100 latencies were found to follow standard tone-evoked M100 latencies [41]–[46] obtained under various conditions (Fig 2b; also Table 1). The correspondence suggests a quantitative link between the STRF100 and M100, and therefore between STRF-based techniques and ordinary auditory evoked cortical potentials. Analyzing the same experimental data using standard evoked response analysis instead (epoching and averaging over responses to all tones in the sparsest multitone pattern) demonstrates strong temporal correspondence at the group level (Appendix A Supplementary Fig 1).

Figure 2. Consistency between response function predictive model features and evoked potentials. a) Grand average spectrotemporal response functions based on multitone stimuli demonstrate a positive–negative–positive structured sequence between 50 and 200 ms following tone onset; tone cloud density introduces qualitative changes in relative amplitude and delays: at increased rates an early positive component (50–100 ms; STRF50) emerges, while the medium-latency negative component (100–150 ms; STRF100) attenuates, and a late positive component (150–200 ms), present only in the sparsest conditions, disappears.
b) STRF100 components are delayed by over 20 ms as tone carrier frequency decreases from 2 to 0.2 kHz, in a manner consistent with evoked potentials in single tone presentations [41], [42], [44]–[46] (Table 1), indicating a correspondence between impulse response functions obtained through reverse correlation and averaged evoked potentials. A common latency decrease across studies and conditions is observed for carrier frequencies in the 180–700 Hz range. c) Temporal response functions, obtained by reverse correlation with the stimulus envelope collapsed across frequencies, show features similar to the P1–N1–P2 complex commonly found in EEG evoked potentials [47]. Higher tone presentation rates result in the emergence of the TRF50 and in decreased amplitude and increased latency of the TRF100 (inset), as well as the attenuation of a later-latency positive deflection. Error bars are 1 standard error of the mean.

| Study | # of subjects [mean age] | Sound delivery [Sensor location] | Tone duration (ms) | Presentation rate [tones/s] | Peak finding method |
|---|---|---|---|---|---|
| Roberts & Poeppel, 1996 [41]; Greenberg et al., 1997 [42] | 5 [24–33 y] | Monaural [Contralateral, Left] | 400 | 0.7–1.3 | Equivalent dipole, Maximum RMS |
| Lütkenhöner & Steinsträter, 1998 [43] | 1 [28 y] | Monaural [Contralateral, Left] | 520 | ~0.4 | Maximum RMS |
| Roberts et al., 2000 [44] | 8 [-] | - [Both hemispheres] | - | - | - |
| Mäkelä et al., 2002 [45] | 11 [32 y] | Binaural [Both hemispheres] | 200 | 1 | Optimal sensor pair |
| Salajegheh et al., 2004 [46] | 11 [45.8 y] | Binaural [Both hemispheres] | 400 | 0.8–1.3 | Maximum RMS from optimal 12 sensors |

Table 1. Comparison of studies reporting M100 absolute latency values in response to pure tones, with participants, recording mode, stimulus details, and M100 peak determination method, where available.

To further investigate the correspondence between STRFs and evoked potentials (specifically the effects of tone density), reverse correlation was performed with respect to frequency-collapsed representations of the stimulus, generating the frequency-independent temporal response function (TRF, Fig 2c). The ~100 ms latency negative peak (TRF100) decreased in amplitude with increasing tone density, by ~60% across the modulation rate range studied, while its latency increased by 20% (see inset). In contrast, the ~50 ms latency positive deflection (TRF50) had the smallest amplitude in the sparsest multitone condition. Thus, sources with ~50 ms latency generate a strong increase in cortical activity with the transition from scattered to continuous pure tones, while sources with ~100 ms latency decrease in strength as they are delayed. Cortical activity in sources with 150 ms latency may also be present, provided the inter-tone interval is long.

STRF most informative for onset-based representations of the multitone stimulus. Methodologically, the acoustic representation of the stimulus used to generate the STRF may employ any number of available time-frequency representations of the sound, including the widely used spectrogram [19], [24], [48], [49]. One reason to consider alternatives to the spectrogram is to compare STRF features with evoked response features, since an evoked response to tones is calculated not with respect to the spectrotemporal duration of the tones but only to their onsets. Analyses therefore also included binary and sparse representations of the stimulus: single tones were modeled as trigger-like impulses timed to tone onset and organized by frequency.
Indeed, stimulus features known to be encoded by auditory cortex include onsets, offsets, and stimulus duration (in the form of sustained responses) [50]–[53]. Since the MEG signal is aggregated across synchronized individual neurons [6], evidence for those same encodings requires investigation. Reverse correlation techniques are well suited to this larger-scale analysis because they explore the outcomes of alternative stimulus representations that emphasize such features. The stimulus representations tested here (cf. Fig 3 insets) were:

(i) the ideal trigger representation;
(ii) the ideal edge representation (both onset and offset triggers);
(iii) the ideal stimulus first-order derivative (onset and negatively-signed offset triggers), which can itself be used to generate the trigger representation if followed by half-wave rectification;
(iv) the ideal stimulus pulse envelope, which has constant value from onset to offset (and which can itself be used to generate the previous representation if followed by differentiation);
(v) the actual acoustic stimulus passed through a filterbank with center frequencies identical to those of the tones, whose envelope is then extracted (see Methods); and
(vi) a generalized envelope onset representation obtained via half-wave rectification of the first derivative of the filterbank envelope output.

Only the last two can be applied to natural (non-discrete) stimuli, and so are especially important in later sections.

Figure 3. STRFs generated using different stimulus representations achieve different levels of functionality. STRFs generated from multitone patterns are functionally informative (e.g., comparable to evoked potential analysis) when each individual tone is discretely represented by its onset (top left), but not when represented instead by the timing of its temporal edges (middle left), signed edges (bottom left), or a discrete representation of the entire pulse duration (top right). Related to the spectrogram, the representation based on passing the acoustic signal through a series of filterbanks, then extracting envelopes per band (middle right), yields only barely discernible results. Extracting onset timing information from the filterbank, in contrast, was quite functionally informative (half-wave rectification of the first derivative of the filterbank output; bottom right). Critically, filterbank-based methods do not require a priori definitions of temporal edges and can be used for arbitrary stimuli. Color scales as in bottom right inset, except for Derivative STRF.
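Representations (v) and (vi) can be computed for arbitrary audio along the following lines. This sketch is an assumption-laden stand-in for the filterbank described in the Methods: a fourth-order Butterworth bank with half-octave bandwidths and Hilbert envelopes is assumed here purely for illustration.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def envelope_representations(audio, fs, center_freqs, bw_octaves=0.5):
    """Return (envelope, envelope_onset) time-frequency representations:
    per-band Hilbert envelopes of a bandpass filterbank, and their
    half-wave-rectified first derivatives (acoustic energy increases)."""
    env = np.empty((len(audio), len(center_freqs)))
    for i, cf in enumerate(center_freqs):
        lo, hi = cf * 2 ** (-bw_octaves / 2), cf * 2 ** (bw_octaves / 2)
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, audio)        # zero-phase bandpass filter
        env[:, i] = np.abs(hilbert(band))     # band envelope
    # Envelope onsets: keep only increases in per-band acoustic energy.
    onset = np.maximum(np.diff(env, axis=0, prepend=env[:1]), 0.0)
    return env, onset
```

For the multitone stimuli, `center_freqs` would be the tone frequencies themselves; for speech or music, any log-spaced set covering the band of interest serves the same role.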
The pulse representation, which can be viewed as an idealized envelope, produces STRFs that are essentially featureless (or at best, whose features are barely discernible above the noise floor). This result is unexpected, since typical auditory reverse correlation studies use a duration-based stimulus envelope representation [25], [54], and the temporal envelope is often hypothesized to be the response-driving feature. Similarly, the acoustic envelope representation (using a filterbank model; see Methods) also produced featureless STRFs. An attempt to re-create the onset representation (i.e. half-wave rectification of the derivative of the acoustic envelope representation) did, however, generate STRFs with features comparable to evoked potential analysis, and enables the extraction of onset-like information in general from diverse complex natural stimuli. Because of the remarkable agreement between the idealized and acoustic onset models, interpretations based on evoked potentials may extend to reverse correlation analysis applied to other stimulus classes where definitions of onsets are a priori unknown or not controlled for, such as natural sounds.

Convergent STRF models across artificial and natural stimuli. Because of its potential to reveal hierarchical processing mechanisms, a major goal in auditory reverse correlation has been to examine the encoding relationship for critical natural stimuli, including speech and other communication sounds. To this end, datasets from two previously unpublished studies on speech and music processing were submitted to the same analysis methods as the multitone pattern (Fig 4a), with stimuli represented by their envelope onsets. As with the onset-based representation of the multitone patterns, STRFs for speech and music exhibited qualitatively similar structures, with distinctive biphasic components near 50 and 100 ms after rising transient impulses (onsets), over the same investigated spectral region. Inspection of the stimuli under either envelope or envelope onset representations suggests that the latter procedure effectively increases the similarity of the underlying distributions across stimulus classes (Appendix A Supplementary Fig 2a). The frequency dependence of relative peak delays was also maintained for these stimulus classes (Appendix A Supplementary Fig 2b), but with class-dependent timing differences, suggesting a common fundamental mode of spectrotemporal cortical processing up to ~200 ms, after which notable processing differences appear according to stimulus class.

While neural data from the three studies were obtained from different subject groups, one subject did participate in those two studies and in a modified pilot version of this experiment (2.4 tones per second presentation rate); these data are presented in Fig 4b, again showing strong qualitative similarity to the group data along with class-dependent timing differences. This subject's topographic magnetic field maps, associated with the neuromagnetic signals derived in each of the three studies, are displayed in Fig 4c; mapping each STRF to overlapping spatial distributions is consistent with source activity at the superior aspect of the temporal lobes. To better illustrate class-dependent temporal differences across the studies, TRFs were obtained by collapsing STRFs across spectral bins, as shown in Fig 4d. These plots emphasize spectrally consistent changes in temporal processing due to stimulus class, along with relative amplitude differences.
As before, early activity appeared least prominent for the spectrotemporally sparsest stimuli; in the case of the single participant tested across all three stimulus classes, a high-temporal-resolution analysis of the multitone TRF100 shows that its dynamics are very close to those of the speech envelope counterpart (Appendix A Supplementary Fig 3a), with characteristic time constants of ~3 ms (Appendix A Supplementary Fig 3b). The response dynamics for music, however, do not follow similarly, which suggests that features other than overall acoustic onsets may contribute to synchronized auditory responses in these cortical populations.

Figure 4. Interpretational power from stimulus representations across STRFs from different stimulus classes. a) Group normalized STRFs from the multitone pattern experiment (N=15), and from studies on natural speech (N=12) and on music (N=15), reveal considerable structural similarity when stimulus onset is extracted as a driving feature of the neuromagnetic response. b) Neuromagnetic STRFs from the same participant across the tones, speech, and music studies show substantial consistency across stimuli when represented by their temporal envelope onsets per frequency band. c) The topographic distribution from the same subject as in (b) revealed strong bilateral consistency across classes, but with an increased left-hemisphere bias during speech processing. d) Top: Timing of major neuromagnetic activity peaks, as shown by TRFs derived from spectral integration of the STRFs in (a); results vary depending on stimulus class and/or context: the earliest positive and negative deflections change with increasing acoustic density, but also with the additional spectrotemporal complexity found in natural speech and music. Bottom: Group TRFs comparing speech envelope and envelope onset related activity. Timing differences are explainable by the differing acoustic representations in early (< 0.1 s) but not late activity peaks, suggesting the formation of a higher-order neural representation of elements in speech acoustics by ~100 ms. Only the first deflection's timing difference is explained by slope-to-maximum time differences between stimulus representations (inset, same color coding). Curves smoothed by a 5 ms moving average.

Cortical transformation of the natural speech envelope representation. In reverse correlation analyses, exploration of alternative representations of the stimulus may provide complementary insight into the functional operations of the auditory system. Fig 4a and Fig 4b show that for natural speech, STRFs based on the acoustic envelope (row 5) were functionally informative, consistent with prior approaches [25], [55]. STRFs based on the envelope onset representation (row 3) are similar, which is expected since the envelope onset is correlated with the original envelope. In terms of timing, the corresponding group TRFs (Fig 4d) show a difference of 43 ms between TRF50 peak components. This was found to be the same as the characteristic delay between the underlying representations, obtained by cross-correlation of the stimulus representations. Such a close correspondence is evidence that, at the level of the neural source of the TRF50, an increasing acoustic envelope operates as a fundamental auditory feature of the stimulus.
In contrast, the corresponding comparison of STRF100 peaks across the two representations (envelope and envelope onset) gives a much reduced difference of 20 ms (Fig 4d, Appendix A Supplementary Fig 5), not consistent with the acoustic differences between the corresponding representation peaks. Compression of the components' relative delays was observed across spectral bins (Appendix A Supplementary Fig 2b), as well as in individual temporal response functions (Appendix A Supplementary Figs 3a, 4).
Discussion
The present investigation describes STRFs as a series of response function mappings from artificial and natural sounds to auditory neural responses. It has been demonstrated that these STRFs possess similar predictive power to their single-unit cortical counterparts and, importantly, show strong similarities across stimulus classes when an acoustic envelope onset representation is chosen. Specific choices of spectrotemporal stimulus representations[35] result in STRF models that are not only predictive but whose temporal structure is highly consistent with that of standard evoked potential components. Comparison to spike-based spectro-temporal receptive fields. The spectrotemporal receptive field can be considered the spike-triggered average of the stimulus spectrogram, from auditory periphery[8], [14] or central nervous system recordings[27], [56]–[58]. Since reverse correlation is a more general principle than spike-triggered averaging[9], it has been used here to characterize and predict the neural responses of auditory systems where both input and output are continuous time-series[12], via the underlying response function of the system. Whether measured by spikes or continuous neural responses, neural systems are non-linear, so predictive linear models of central neural coding are necessarily incomplete descriptions of the underlying coding relationship and are bounded by the predictive power and interpretability they maintain within the limits of the linear regime[24]. The multitone stimulus employed here is comparable to a dynamic random chord stimulus[24], [59], [60], though it has more temporal degrees of freedom, allowing cross-frequency overlap in a continuing pattern that prevents constant tone presentation rates. It is more similar to dynamic random chords than to other artificial stimuli used to estimate STRFs, such as ripple noise and moving ripples[17], [61], which focus on stimulus modulations instead. The predictable fraction of variance in the evoked MEG source timeseries was found to be 19–27%, in close correspondence with the 18%[24] and 31%[62] predictive power from primary auditory cortex (A1) single/multiunit responses. Comparisons regarding predictive power (and other STRF properties) should also take into account fundamental differences in the underlying signal (spiking versus dendritic-origin activity) and its scale (neuron or highly local population versus meso-scale cortical patches[6], [63]), the animal model, and state (e.g., performing a task vs. resting vs. anesthetized)[30]. Qualitatively, the STRFs presented here exhibit a general broadband structure with frequency-dependent latencies and amplitude changes depending on stimulus density. Remarkably, similar properties appear in STRFs obtained from LFP in mammalian A1[39], [40], [64], featuring broadband inhibitory-excitatory component sequences and, often, frequency-dependent latencies.
Component latencies in mammalian A1 are ~50% shorter than here, which may be explained by reduced equivalent cortico-cortical transmission length delays[65] for the species involved in those studies. With respect to human studies, the component latencies reported here are consistent with multiunit activity[66] and high-gamma activity in electrocorticography (ECoG)[60] in the functional equivalent of A1, Heschl's gyrus. The STRFs obtained in such datasets principally reflect neural spiking, resulting in mappings with narrow-band features, consistent with their interpretation as units locally sampled along the tonotopic gradient of A1. Indeed, frequency selectivity becomes reduced for local field potential recordings[39], [64] (i.e. ECoG frequencies below high-gamma) as they sample redundant activity across distant recording sites with intra-cortical interactions[67], which may effectively smooth the spectral selectivity distribution[39]. Unlike local recordings, which due to high frequency selectivity can require that receptive fields be realigned by best frequency[49] to extract statistical features, MEG STRFs offer distributed access to more global cortical network domains. Analogous results may plausibly be expected from future human auditory LFP STRF studies using invasive procedures, given that these have typically focused on multiunit and high-gamma activity[60]. In addition, these MEG-based tone-generated STRFs show stimulus-dependent differences as seen elsewhere in the receptive field literature (see the review by Eggermont[30]), namely, amplitude decreases with density. This is consistent with awake primate results, where a three-fold increase in tone presentation rate (9.7 to 31 tones/s) may be accompanied by a magnitude decrease of about a third in the STRF maxima[33]; here, a similar multiplicative change in tone density (2 to 6 tones/s) produced a peak decrease of about half. Suppression of excitatory contributions[36], or emergent inhibitory activity throughout A1 single units[33], have been proposed as mechanisms for response field modulations observable from LFP recordings[64]. In cortical neurons, increased firing rates may accentuate depression rate imbalances between excitatory synapses and those with increasingly inhibitory activity[33], [40], [49], a known factor involved in receptive field modulations in somatosensory[68] and visual areas[69]. For auditory recordings, more inhibition may effectively increase responses' spectral specificity or bandwidth at higher tone densities[33], [64], the analog of which was not observed in the MEG STRFs (see Westö & May[70] for a cautionary note on interpreting inhibitory contributions to STRFs following dense stimuli). Among factors reducing STRF predictive power is the increase of inhibitory fields in estimates from single unit recordings[33]; in MEG this effect appeared to be mirrored by response function components of opposite sign to the STRF100. Further research is thus necessary concerning the coarse-grained level of analysis that is accessible via MEG/EEG, in comparison to that afforded by single/multiunit signals. Association with auditory evoked potentials.
Unlike traditional averaging methods, reverse correlation involves continuous delivery of a dynamic stimulus in order to generate a predictive model (of novel instances of the same sound class). It has been shown here that STRFs and TRFs can be directly compared to standard auditory evoked responses, namely the magnetic M50, M100, late auditory evoked responses, and the P1-N1-P2 complex in EEG[47]: (i) The earliest positive-polarity component, the STRF50, seen at higher multitone densities, is a temporal analog to the M50 response originating from Heschl's gyrus (including core/primary areas)[71], [72]; its amplitude may also be modulated by inter-stimulus intervals[72] at low presentation rates (<2 tones/s). Known modulators of M50 amplitudes include harmonic versus noise-like bursts[73], [74], prepulse inhibition[75], and automatic processing of redundant information as a form of sensory gating in paired-click stimulus designs[76], [77]. In terms of predictive power, this component did not generalize well over novel instances of the multitone random pattern; this is consistent with an adaptive role contributing to considerable changes in the response profile, dependent on the local context on the order of a few seconds or less. (ii) The subsequent major component, the negative-polarity STRF100, exhibited a magnitude decrease and delay increase with density, with a sharp transition after the sparsest density level. Suppression of the M100 response from supratemporal cortex has been observed in the transition from low to higher tone presentation rates[72], highlighting the interpretation of increased inhibitory effects that include generalized refractoriness among neurons at denser conditions[78]. This component is also subject to attentional modulation[79]–[81], which may reflect that individual tones in a densely populated scene fail to capture attention individually. Because of this component's involvement in tracking perceptual objects of an auditory scene[55], and of the increasing quality of flow and continuity in these artificial stimuli, sharp transitions in this component may suggest indices of 'crowding' relevant to the figure-ground separation problem[82], [83]. Accounts of spectrally-dependent latencies in the evoked M100 components[41]–[46] were consistent with those at this stage, and fall within the sensitivity domain for human voice pitch production and discrimination[45]. These latencies are also consistent with those of pitch-specific onset responses, whether elicited by complex tones or by centrally-generated Huggins pitch percepts[84], [85]. (iii) The second major positive-polarity peak appears only in response to the sparsest stimulus. Auditory event-related potentials at ~200 ms latency have been described in EEG as expectancy indices, exhibiting greater amplitudes for tones whose presentation in time is uncertain[86]. At denser conditions, shorter inter-stimulus intervals may reduce the analogous tone-evoked EEG P2 component, regardless of presentation within a repetition sequence or as an oddball, suggesting involvement of modulation mechanisms other than habituation[78]. On alternative representations of stimulus state. In addition to their predictive power, STRF profiles are functionally informative in a way similar to trial-averaged evoked responses to isolated stimuli[15], [18], [35]; this was the case for STRFs compared across stimulus classes when the stimulus representations were filterbank-derived onsets.
Other abstract representations of this stimulus pattern, including both temporal edges, their directionality, the duration of sustained acoustic energy, or, related to the latter, the spectrogram, did not appear to be similarly functionally informative, even though they contained, and extended, the information in the onset representation. Nevertheless, predictive power was similar across representations, suggesting that this metric alone is insufficient to expose which aspects of the stimulus map to the system's response. More complex tone patterns might allow predictive power to become more informative regarding the statistical characterization of a stimulus (cf. [87]). The lack of evidence for explicit neural encoding of offsets is in accord with neurophysiological evidence suggesting offset-encoding cells to be outnumbered by onset cells, and/or to have minor neural response profiles relative to onset encoders[50], [51], which in the aggregate would result in differential contributions to the neuromagnetic response. On extension to natural stimuli. Processing of environmental sounds, including conspecific calls, is a critical auditory task. Encoding models incorporating natural sounds with complex spectrotemporal structure provide powerful computational insights into the auditory system that may be inaccessible with synthetic stimuli only[18], [30], [31]. STRFs derived from invasive recordings from A1 perform similarly in terms of predictive power, using random tone chord stimuli, animal calls, environmental sounds, sound effects, and music[24], [48], at the population level. For some subset of these neurons, successful linear encoding of the spectrogram may also occur in the same unit for both artificial and natural vocalization encodings[32], [49]. The search for predictive models that generalize over novel stimuli not in the training set has proven difficult, however[32], [59]. The temporal statistics intrinsic to natural sounds may be critical[32], and some evidence from A1 STRFs demonstrates higher predictive power using conspecific vocalizations that are not dilated or compressed[88]; similarly, comparisons are also favorable for artificial and communication sounds controlled jointly for the span of their temporal and spectral modulations, but allowing differences in their amplitude fluctuations over time[32]. Observed stimulus-class dependencies in STRF spectrotemporal properties appear as small time-shifts of STRF features, plus the emergence of additional late activity for speech and music. Analysis of such differences suffers from confounds arising from statistical non-uniformities among the sampled classes[18], [19], [31], [32], and fully addressing this issue is beyond the scope of this investigation. The question of whether detailed class-dependent temporal coding frameworks may be achieved by means of linear methods remains open. On speech-derived STRFs. Advances in understanding cognitive processes relevant to speech processing have followed from reverse correlation studies that used the speech acoustic envelope (as represented by low-frequency, 1-15 Hz fluctuations in ECoG[89], [90] and MEG/EEG[25], [55], [91], [92] recordings).
We find that the speech envelope STRF100 component exhibits a spectrally-dependent latency similar to that of the M100 evoked response, thus suggesting a level of speech analysis that still contains independent spectral information. Although in contrast with findings of near-constant M100 latencies for certain synthetic vowels presented in isolation[45], reverse correlation methods over long natural speech presentations are better suited to probe domain-general processing in realistic conditions due to their extended sampling. Additionally, the methods may constrain the time course of the change in neural representations of human speech from a spectrogram-like to a higher-level form. The low-frequency speech envelope and its onsets are both operationally related to functionally informative STRFs. The systematic delay between the timeseries (peaks in the latter systematically precede those in the former) directly accounted for the relative difference between the resulting pair of STRF50 components under each representation. The acoustic mismatch could not explain, however, the reduced relative difference between the subsequent STRF100 pairs. The interpretation of a compression is consistent with current models of step-wise speech processing, where the formation of speech analysis units or objects is preceded by an earlier spectrogram-like representation of acoustics completed by ~80 ms post speech impulse onset. After this time, response functions did not account for the expected mismatch, suggesting a neurally-based progression into a modified stage of the neural representation of speech and adding to a body of MEG evidence for a cortical hierarchy of speech object representations (see the review by Zhang and colleagues[93]). Overall, these results demonstrate an important advantage of STRFs over the standard epoch-averaging methods commonly used in MEG applications, e.g., characterizing the phenomenology of disorders in clinical populations[94]: their ability to generalize to critical sounds beyond pure tones, most importantly natural speech. By providing both neural predictions and functional information, the STRF approach enables noninvasive study of developmental effects[95], learning and associative effects induced by tasks[96]–[98], or behavioral contexts[99], [100], thus potentially furthering insight into the role of dynamical representations of sound in auditory cognition.
General methods
Participants. 15 subjects (6 women, 23.2 ± 2.9 years of age [mean ± SD]), 1 left-handed[101], participated in the multitone study. 12 subjects (6 women, 24.1 ± 3.0 years of age), all right-handed native English speakers, participated in the speech study. 15 subjects (5 women, 21.0 ± 1.7 years of age), all right-handed, participated in the music study. Each subject received monetary compensation proportional to the study duration (approximately 1.5 hours). Subjects had no history of neurological disorder or metal implants. The experimental protocol was approved by the UMCP Institutional Review Board and, before each study session, informed written consent was obtained from the participant.
Stimuli. Multitone study.
Sound stimuli were constructed with the MATLAB® software package (MathWorks, Natick, United States) at a sampling rate of 44.1 kHz, and consisted of 50 s auditory scenes composed of pseudo-randomly presented 180 ms tones, each with frequency $f_i$ taken from a pool of 10 fixed values (range: 180-2144 Hz) in 2 equivalent rectangular bandwidth (ERB) steps[102], specified by $f_i = f_{i-1} + 2 \cdot 24.7\,(1 + 4.37 f_{i-1}/1000)$. For each frequency, tone onset times were uniformly distributed with a minimum inter-tone gap of 40 ms. Five tone presentation rates (2, 4, 6, 8, or 10 per second over all channels) were used separately. Tone onset times $T_{ij}$ were independent across frequency bands and selected in 20 ms bins. Individual tones were modulated with 10 ms raised cosine on- and off-ramps. Tone level was calibrated according to frequency based on the 60-phon normal equal-loudness-level contour (ISO 226:2003) in order to adjust for perceived relative loudness differences; relative gains to a 1 kHz reference were determined in 2 dB SPL steps.
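As an illustration of the construction just described, the following Python/NumPy sketch generates the 2-ERB-spaced frequency grid and a simplified tone cloud. The minimum inter-tone gap and the equal-loudness calibration are omitted for brevity, and all function names are hypothetical; the original stimuli were generated in MATLAB.

import numpy as np

def erb_frequencies(f0=180.0, n=10):
    # 2-ERB-spaced tone frequencies (Hz), following the step formula above.
    freqs = [f0]
    for _ in range(n - 1):
        f = freqs[-1]
        freqs.append(f + 2 * 24.7 * (1 + 4.37 * f / 1000))
    return np.array(freqs)   # ~180 ... ~2144 Hz

def tone_cloud(freqs, rate, dur=50.0, fs=44100, tone_dur=0.180):
    # Simplified multitone scene: tone_dur tones with 10 ms raised-cosine
    # ramps, 'rate' tones/s overall, onsets drawn in 20 ms bins
    # (the 40 ms minimum inter-tone gap and loudness calibration are omitted).
    rng = np.random.default_rng(0)
    x = np.zeros(int(dur * fs))
    t = np.arange(int(tone_dur * fs)) / fs
    ramp = int(0.010 * fs)
    win = np.ones(t.size)
    win[:ramp] = 0.5 * (1 - np.cos(np.pi * np.arange(ramp) / ramp))
    win[-ramp:] = win[:ramp][::-1]
    n_bins = int((dur - tone_dur) / 0.020)
    for _ in range(int(rate * dur)):
        f = rng.choice(freqs)
        onset = int(rng.integers(0, n_bins) * 0.020 * fs)
        x[onset:onset + t.size] += win * np.sin(2 * np.pi * f * t)
    return x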
Speech and music studies. For the speech study, a 60 s female-voice audiobook excerpt[103], narrated from The Light Princess (Macdonald, 1864), was used as part of a related study on reverberant speech processing[37]. For the music study, 55 s samples across 6 different instrumental musical styles, reflecting a variety of genres and traditions, were presented: orchestra, Symphony in F Major, No. 32, Movement I (Sammartini, c. 1740); swing, Cascades (Combelle, c. 1940); blues, Blues for B&W (Rogers & Hilden, 2003); sarangi, Raga Mishra Bhairavi: Alap (Narayan, 2002); pipa, Dance of the Yi People (Huiran, c. 1960); and a euphonium transcription of Dancing Night Wind (Benning, 1997). In all studies, audio signals were normalized and presented through the Presentation® software package (NeuroBehavioral Systems, Berkeley, United States), using audio equipment equalized to a transfer function approximately flat from 40 to 3000 Hz. Sound stimuli were transmitted to subjects via E-A-RTONE® 3A ear insertion tubes of 50 Ω impedance and E-A-RLINK® disposable foam intra-auricular ends (Etymotic Research, Elk Grove Village, United States) inserted in the ear canals.
Experimental design. For the multitone study, trials consisted of main tone cloud pattern scenes presented in series per block, generated anew for each subject. This resulted in trials that contained between 0 and 3 multitone density transitions and ranged from 70 to 120 s in duration. Each of the five main scenes was repeated 4 times, and only these data epochs were analyzed. After a brief training session, subjects were instructed to attend to the ongoing stimulus with their eyes closed and to report rate transitions via a button press. Optional rests were available every 5 trials, totaling 1.5 hours of recording time. Subjects received feedback on the correct number of transitions at the end of each trial. For the speech study, trials consisted of various story passages presented in random order at different reverberant noise levels. At the end of a trial, subjects were asked comprehension questions about the passage and rated its intelligibility. For the present study's purposes, analysis was based exclusively on reverberation-free, no-noise ('clean') trials, repeated 3 times across the experiment. For the music study, trials consisted of each of the 6 samples presented individually in random order. At the end of each trial, a 5 s clip taken from the same or a different piece was presented, and subjects identified whether it was an excerpt of the preceding trial. Each sample was repeated 3 times across the experiment.
Neural data recording. Magnetoencephalography (MEG) data were collected with a 160-channel system (Kanazawa Institute of Technology, Kanazawa, Japan)[104] inside a magnetically-shielded room (Yokogawa Electric Corporation, Musashino, Japan) at a sampling rate of 1 kHz. Superconducting quantum interference device (SQUID) sensors (15.5 mm diameter each) were uniformly distributed (~25 mm apart) inside a Dewar containing liquid-He refrigerant, with a concave outer surface fit to the average human head. The sensors are first-order axial gradiometers with 50 mm separation and sensitivity better than 5 fT/√Hz in the white-noise spectral region (> 1 kHz); three additional reference magnetometers, separated from the neural sensors, are arranged orthogonally to each other. A 1 Hz high-pass analog filter, a 200 Hz low-pass analog filter, and a 60 Hz analog notch filter were applied online. Sensor channels with saturating or zero responses over more than 12.5 s of recording time were excluded from analysis. Participants lay supine inside the magnetically-shielded room and were asked to minimize body movement, particularly of the head.
Neural data processing. Environmental noise. To eliminate environmental magnetic noise contributions, time-shifted principal component analysis[105] (TS-PCA) was applied, a process that discards optimally-filtered environmental signals recorded on the reference sensors. Reference sensors were the 3 physical magnetometers (see Neural data recording) plus 2 virtual channels obtained by independent component analysis[106] of the remaining data sensors, selecting the two components with the most unstructured broadband (0-500 Hz) power. Sensor-specific noise. Electronic sensor noise was removed via sensor noise suppression (SNS)[107], by substituting each channel signal with its projection onto the orthogonal basis space generated by all other sensors in the system. This method exploits redundant activity across elements of the dense array (where the number of channels exceeds the number of brain sources of interest) by attenuating components specific to any single channel. Spatial filtering. Data-driven spatial filters were derived per participant using responses evoked by repeated trials in each of the respective studies. Response epochs of 45-55 s duration were extracted, band-pass filtered (1–15 Hz) with a 2nd-order Butterworth filter, and delay corrected (~13 ms). A linear transformation based on this manipulation was obtained per participant[108] to generate spatial filters that correspond to magnetic fields generated by the left and right auditory cortex (Appendix A Supplementary Fig 3). This spatial filter was applied to the raw data, and the resulting neural signal, representing the most reproducible component of the evoked data, was selected as a single virtual sensor in analyses henceforth.
Neural data analysis. Spectrotemporal response function of stimulus representation. For multitone patterns, pure onset representations only carry information at times beginning with the onset of a tone. We formulate this representation as

$O(f,t) = \sum_{i,j} \delta_{f - f_i}\, \delta_{t - T_{ij}}$   (1)

where every onset has equal weight independent of its tone's frequency band $f_i$ ($i = 1, \dots, 10$), with $T_{ij}$ the onset time of the j-th tone with frequency $f_i$; $\delta_n$
is the discrete unit impulse centered at sample n. The input-output relation between this representation of auditory input and the evoked cortical response r(t) is then modeled by a spectrotemporal response function (STRF). For discrete data this linear model is formulated as

$r^{*}_{\mathrm{pred}}(t) = \sum_{f}\sum_{\tau} \mathrm{STRF}(f,\tau)\, O(f, t-\tau) + \varepsilon(t)$   (2)

where ε(t) is the residual contribution to the evoked response not explained by the linear system. Summing only over the frequency term allows evaluating the temporal profile of the response function model (TRF). Exploration of alternative stimulus representations requires substitution of the O(f,t) term in (2) by the analogous time-frequency representation of the stimulus (e.g., by a spectrogram S(f,t)). For all stimuli, stimulus envelope filterbank representations were obtained by passing the original waveform through a filterbank of ten order-1000 FIR filters with passbands at mid-values between neighboring $f_i$ (see Stimuli, above), starting at 143 Hz. Filter delays were compensated and the envelope in each band was extracted as above. Sampling rates were reduced to 1 kHz and signals smoothed by a delay-corrected 4th-order binomial FIR filter. Half-wave rectification (i.e., setting negative values to zero) of the derivative of the stimulus envelope filterbank output gave envelope onset representations of the stimulus signal based on the filterbank. Prior to reverse correlation, both envelope and envelope onset representations were transformed to dB scale. Linear STRF model estimation. STRF estimation was performed via boosting, a technique where the error estimate ε(t) (in Eq. 2) is minimized iteratively via sequential modifications to the STRF[19]. The name originates from the ability to improve ('boost') an estimate learning algorithm by establishing aggregate decision rules across a sequence of many estimation steps, each needing only slightly-better-than-chance accuracy[20]–[22]. This technique can then be implemented as a forward stage-wise fitting that follows a greedy heuristic, adding the contribution with the largest available mean-squared-error reduction at each given step[19], [23], in turn maximizing the predictive power of the model[24]. Operationally, STRF estimates by boosting were initialized as a null matrix of dimensions T × F, where T equals the number of experimental time bins and F the total number of frequency bins (F = 10; for TRF estimates, F = 1); optimization proceeded by exploring fixed increments and decrements per spectrotemporal bin individually. Among the resulting 2 × F × T possible choices, the outcome with minimum mean-squared error was selected as the next step in the running STRF estimate. The procedure was iterated, accumulating optimizations, until modifications instead produced a sustained increase in mean-squared error[23], since the method is not guaranteed to find a global optimum. This termination method effectively imposes a sparse structure on the STRF, which allows extraction of high-temporal-resolution features in the STRF even if only low-frequency content was present in the input waveforms (other STRF estimation methods, such as normalized reverse correlation[18] and generalized linear models, could also be used[19], [109], [110]). Detailed descriptions of the boosting algorithm implementation for timeseries data, including MEG/EEG, are available elsewhere[19], [25].
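A minimal Python/NumPy sketch of the boosting scheme just described follows. The per-candidate mean-squared-error comparison is simplified to a residual-correlation test (ignoring differences in per-bin stimulus power), and cross-validated early stopping is reduced to a simple patience rule; these simplifications, and all names, are assumptions rather than the published implementation[19], [23].

import numpy as np

def boost_strf(stim, resp, n_lags, step=0.01, patience=20):
    # stim: (F, T) stimulus representation; resp: (T,) response.
    # Greedy boosting: repeatedly nudge the single STRF bin that most
    # reduces the residual error, stopping after a sustained failure
    # to improve (a stand-in for cross-validated early stopping).
    F, T = stim.shape
    X = np.zeros((F, n_lags, T))          # lagged stimulus copies
    for l in range(n_lags):
        X[:, l, l:] = stim[:, :T - l]
    strf = np.zeros((F, n_lags))
    pred = np.zeros(T)
    best_err = np.mean(resp ** 2)
    stalled = 0
    while stalled < patience:
        resid = resp - pred
        # candidate scoring simplified to residual correlation
        corr = np.tensordot(X, resid, axes=([2], [0]))
        f, l = np.unravel_index(np.abs(corr).argmax(), corr.shape)
        delta = step * np.sign(corr[f, l])
        strf[f, l] += delta
        pred += delta * X[f, l]
        err = np.mean((resp - pred) ** 2)
        stalled = stalled + 1 if err >= best_err else 0
        best_err = min(best_err, err)
    return strf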
STRF predictive power bounds. The measured evoked cortical response r*(t) may include stimulus-independent noise, the presence of which is a consequence of the finite dataset size and leads to STRF model parameters that overfit the training data. Performance measured in the presence of such overfitting is necessarily an overestimate and can therefore be considered an empirical upper bound of model performance[24]. In contrast, the risk of overfitting can be minimized using cross-validation, where a fraction of the r*(t) timeseries (e.g., 90%) is reserved for model training, and testing is done on the remaining fraction, incorporating only the model's ability to generalize over novel stimulus instances. This would be expected to underperform with respect to an optimal model for the dataset in question, and so indicates a lower bound for its performance[24]. In practice, it is this conservative, cross-validated lower bound that is used for STRF estimates. Nonlinear extension. Linear encoding models may fail to characterize firing rate predictions based on effects such as threshold activity, past-history dependencies, dynamic range compression, synaptic transfer, and the non-negative distribution of the neuron response. At the single-neuron level, predictions can be improved via introduction of static nonlinearities derived by empirical fit[9], or via intermediate nonlinearities in more complex model hierarchies[110]. For coarse-grained continuous neural responses such as local field potentials and the MEG signal here, it appears that such model hierarchies may no longer apply well. When a static nonlinearity was incorporated using a linear-nonlinear (LN) model[111], [112], only a 2% improvement to predictive power resulted (quadratic fit, R² = 0.972, Appendix A Supplementary Fig 5), and so this was not pursued. Estimation of STRF predictive power and noise limit extrapolation. To assess STRF model validity, predictive power was estimated as the fraction of response signal variance that is stimulus-explained, corrected for the reduction of noise-related variance achieved by averaging[24]. Namely, for MEG response timeseries $r_1(t), \dots, r_N(t)$, where N is the number of repetition trials, total variance is expressed as the average of each trial's individual variance

$\mathrm{Var}(r) = \frac{1}{N}\left(\mathrm{Var}(r_1(t)) + \cdots + \mathrm{Var}(r_N(t))\right)$   (3)

while evoked variance can be expressed as that of the average response, $\mathrm{Var}(\bar{r})$. When N is large, the extent to which total variance is larger than evoked variance indexes reliability for the response source. Contributions to total variance Var(r) are then partitioned into those stemming from the evoked signal, with the remainder treated as noise:

$\mathrm{Var}(\mathrm{signal}) = \frac{1}{N-1}\left(N \cdot \mathrm{Var}(\bar{r}) - \mathrm{Var}(r)\right)$   (4)

$\mathrm{Var}(\mathrm{noise}) = \frac{N}{N-1}\left(\mathrm{Var}(r) - \mathrm{Var}(\bar{r})\right)$   (5)

such that estimates are corrected for cases where N is small. Often, STRF model estimates are optimized to produce accurate predictions of the evoked response only; in such cases, use of single-trial variance provides an additional statistic regarding the event-related contribution to the available recordings. Once a STRF model has been obtained for a particular condition and subject, its ability to predict the evoked response is assessed as the extent of evoked response variance that is not residual error, that is, $\mathrm{Var}(\bar{r}) - \mathrm{Var}(\bar{r} - r_{\mathrm{pred}})$. This expression is the model's predictive power, which, after division by the estimated signal power (eq. 4), represents the fraction of stimulus-evoked variance described by the linear STRF model, contingent on a given experimental condition and subject. Analogously, noise power in the same response may be normalized by the estimated signal power, providing the inverse of the proportion by which the procedure of averaging reduces response variability. When N is very large, a normalized noise power of, e.g., 10 indicates that averaging reduces variance in the evoked signal to almost a tenth of the original total variance. In the hypothetical case where averaging yields no reduction in variability (such as with identical trial response instances), the absence of variability reduction implies an absolute zero noise level. Empirically, each dataset's (condition and subject) predictive power can be indexed by the intrinsic noise power (e.g., Fig 1C). Assuming the responses have been measured from a similar population, regression analysis may produce an estimate of the STRF model class predictive power, via extrapolation to the theoretical noise-free limit[24].
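Equations (3)–(5) and the normalized predictive power translate directly into code. A compact Python/NumPy sketch (the function name is hypothetical; trials are assumed to be repeated-trial responses of equal length):

import numpy as np

def strf_predictive_power(trials, pred):
    # trials: (N, T) repeated responses; pred: (T,) model prediction.
    N = trials.shape[0]
    r_bar = trials.mean(axis=0)
    total_var = trials.var(axis=1).mean()                 # eq. (3)
    evoked_var = r_bar.var()
    var_signal = (N * evoked_var - total_var) / (N - 1)   # eq. (4)
    var_noise = N * (total_var - evoked_var) / (N - 1)    # eq. (5)
    predictive = evoked_var - (r_bar - pred).var()        # Var(avg) - Var(avg - pred)
    return predictive / var_signal, var_noise / var_signal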
Chapter III
Dynamic cortical representation of perceptual filling-in for missing acoustic rhythm
Summary
In the phenomenon of perceptual filling-in, missing sensory information can be reconstructed via interpolation from adjacent contextual cues, by what is necessarily an endogenous, not yet well understood, neural process. In this investigation, sound stimuli were chosen to allow observation of fixed cortical oscillations driven by contextual (but missing) sensory input, thus entirely reflecting endogenous neural activity. The stimulus employed was a 5 Hz frequency-modulated tone, with brief masker probes (noise bursts) occasionally added. For half the probes, the rhythmic frequency modulation was additionally removed. Listeners reported whether the tone masked by each probe was perceived as being rhythmic or not. Time-frequency analysis of neural responses obtained by magnetoencephalography (MEG) shows that, for maskers without the underlying acoustic rhythm, trials where rhythm was nonetheless perceived show higher evoked sustained rhythmic power than trials for which no rhythm was reported. The results support a model in which perceptual filling-in is aided by differential co-modulations of cortical activity at rates directly relevant to human speech communication. We propose that the presence of rhythmically-modulated neural dynamics predicts the subjective experience of a rhythmically modulated sound in real time, even when the perceptual experience is not supported by corresponding sensory data.
Introduction
The ability to overcome the problem of missing but important sensory information, such as a conversation obscured by heavy background noise, is ethologically valuable. Even when physical information may be lost entirely, restorative phenomena such as the auditory continuity illusion, phonemic restoration, and other forms of perceptual filling-in[113]–[115] allow for the percept of stable hearing in natural environments. These effects have long been hypothesized to rely on the brain's ability to conjecture a reasonable guess as to the nature of the missing fragments[113], [116]. Furthermore, as has been extensively argued, predictive coding is a task well suited for cerebral cortex[117]–[119], but systematic accounts of the endogenous cortical mechanisms responsible for these percepts remain unspecified.
Rhythmically-modulated sounds generate steady, predictable events for which disruptions and resumptions may indicate the grouping strength of dynamic perceptual streams[120], [121]. If replacement of these sounds by noise may, under some circumstances, preserve the perceived rhythm in apparent continuity, how are such streams instantiated at the neural level? Rhythmic sounds drive auditory steady-state responses (aSSR) in auditory cortex that can be recorded non-invasively via magnetoencephalography (MEG)[122]–[124], with responses to rhythmic rates <10 Hz being especially prominent[125]–[128]. To the extent that the neural responses track the stimulus rhythm, they can be considered sparse neural representations of the modulation rate. This experimental framework was employed to investigate the cortical effects of briefly masking and removing an ongoing low-frequency rhythmic pattern. We hypothesize that, for cases where perceptual restoration of the removed rhythm occurs, the neural signature of the removal is attenuated, akin to stabilization of a cortical representation, in line with perceptual grouping under dynamic continuity. This predicts that during perceptual filling-in, the dynamical evolution of a listener's cortical response retains oscillation in synchrony with the expected but acoustically missing rhythm. Listeners' perception of a continuous 5 Hz rhythmic pattern during masking was probed in a two-alternative forced choice task, where the acoustic pattern may or may not have been removed, with equal probability. Simultaneously obtained MEG responses were then partitioned according to both physical and perceptual conditions, using wavelet analysis to localize oscillatory responses in time and frequency. The finding of rhythmic aSSR-like responses in cases where perceptual filling-in occurs is consistent with underlying mechanisms requiring a sustained neural representation of the restored feature[114]. Importantly, it demonstrates dynamical restoration processes occurring at scales commensurate with informal speech articulation rates[129], as well as within MEG frequency bands that reflect cortical phase-locking to the slow temporal envelope of natural stimuli[25], [127].
Results
Sustained neural rhythm follows acoustic rhythm in noise. Subjects listened to four blocks (~14 min each) of a 5 Hz frequency modulated (FM) rhythmic stimulus, repeatedly masked by noise probes at pseudo-random times (see Methods). Half of the probes replaced the underlying rhythmic FM tone with a constant-frequency tone, and half instead simply masked the underlying rhythmic stimulus, here called non-rhythmic and rhythmic probes, respectively (Fig. 1A insets). Between noise masker segments, MEG responses to steady rhythmic intervals show strong aSSR, even on a per-trial basis. Noise masker segments generate strong transient onset-like responses, after which any residual phase-locked response may disappear, on average, for rhythm-absent probes but not rhythmically-driven probes (Fig. 1A). To determine whether across subjects this change results from a decrease in aSSR power, or from increased temporal jitter that would reduce the averaged aSSR, inter-trial phase coherence (ITPC) and power analyses were performed on single-trial and evoked data, respectively (e.g., Fig. 1B). The ITPC analysis reveals that, within the 0.55–1.22 s post probe onset interval, the ITPC difference is significant across listeners (N = 35; p < 0.001, non-parametric permutation test). Testing for evoked rhythmic power across listeners similarly reveals a significant difference (p < 0.001) within the 0.56–1.23 s post probe onset interval. Thus the dual phase and power analyses show that both decreased aSSR power and increased intertrial jitter contribute to the decrease of the neural 5 Hz component in rhythm-absent versus rhythm-driven probes.
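For reference, ITPC and evoked power at the rhythm rate can be computed along the following lines. This Python/SciPy sketch uses the analytic signal of a band-passed copy as a stand-in for the wavelet analysis used in the study, so the filtering choice and names are assumptions:

import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def itpc_and_evoked_power(trials, fs, f0=5.0, half_bw=1.0):
    # trials: (N, T) single-trial responses. Band-pass around f0, take
    # the analytic signal; ITPC is the resultant length of unit phasors
    # across trials, evoked power is the power of the trial average.
    sos = butter(4, [f0 - half_bw, f0 + half_bw], btype="bandpass",
                 fs=fs, output="sos")
    z = hilbert(sosfiltfilt(sos, trials, axis=1), axis=1)
    itpc = np.abs(np.mean(z / np.abs(z), axis=0))
    evoked_power = np.abs(z.mean(axis=0)) ** 2
    return itpc, evoked_power

A decrease in evoked power with preserved ITPC would indicate weaker but temporally stable responses; the dual decrease reported above implicates both amplitude and jitter.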
Figure 1. Neural representations of (un)modulated masked sound from a representative subject. (a) MEG responses before, during, and after a noise probe are shown (single MEG component obtained via spatial filtering; see Methods and Appendix B Supplementary Fig 1). The basic stimulus consists of a 5 Hz pulsatile (short duty-cycle) FM tone, centered at f0 = 1024 Hz, to which 1.24 s noise probes were applied. Insets: illustration of a non-rhythmic probe, where pulses are replaced by the constant tone (top); and a rhythmic probe, where the FM continues under the noise (bottom). Before and after the probes, phase locking to the main rhythmic stimulus is apparent even on a per-trial basis. Overlaid on each response raster, evoked activity (averaged separately for each probe type) reveals a measurable aSSR during rhythmically-driven probes (top) but not during rhythm-absent probes (bottom). (b) Top: Phase analysis at 5 Hz shows estimated phase-locking over time as measured by ITPC. During masking, ITPC values drop to near floor in rhythm-absent probes (orange) but only to half of baseline levels in rhythm-driven probes (blue). Bottom: Analysis of spectral power (also at the 5 Hz rhythm rate) likewise shows a considerable difference between probe types for this subject.
Sustained neural rhythm follows listeners' perceived rhythm in noise. In order to determine how neural representations of rhythm co-varied with perception, after each trial the probe was classified by the subject as perceived as rhythmic or as non-rhythmic. This resulted in a 2-by-2 partition of analyzed trials: (1) non-rhythmic probes perceived rhythmic ('filling-in'); (2) non-rhythmic probes perceived non-rhythmic (rhythm 'absent'); (3) rhythmic probes perceived rhythmic (rhythm 'present'); and (4) rhythmic probes perceived non-rhythmic (rhythm 'missed'). Fig. 2 shows the grand average evoked 5 Hz response power before, during, and after noise probes, for each combined condition of stimulus and percept. Transient (and broadband) masker-onset responses were evident during the initial 0.3 s post masker onset (cf. Appendix B Supplementary Fig 2); brief pre-causal dips accompanying these transients are due to convolution residuals from the continuous wavelet transform. For non-rhythmic probes (Fig. 2A), phase coherence dropped to almost 0% for both perceptual conditions (filling-in and absent, right panel). Rhythmic spectral power also dropped from the initial baseline for both perceptual states, but the decrease was on average 7.9 dB worse when subjects reported the rhythm absent than present (filling-in). Decreases were restored to baseline values by 0.8 to 1.2 s post probe offset (equivalent to between 4 and 6 rhythmic pulse cycles). Thus, within non-rhythmic probes, a sustained and significant percept-specific difference was observed in rhythmic evoked power (0.56 to 1.19 s, p < 0.001), but this was not the case for phase locking (p > 0.18).
Figure 2. Percept-specific endogenous representations of patterned sound. Grand averages (N = 35) of rhythmic evoked power and intertrial phase coherence, partitioned by probe type and reported percept.
Noise probe starts at the first vertical line at t = 0 s and continues until the next vertical line at t = 1.24 s. (a) Non-rhythmic probes: (Left) After an initial transient, rhythmic evoked power was reduced regardless of percept, but differentially, by 7.9 dB, depending on percept as present (magenta) or absent (orange). (Right) No significant difference was observed for ITPC, which was reduced to near floor during the probe. (b) Rhythmic probes: (Left) During masking, rhythmic evoked power drops by 9.5 dB on average, holding relatively steady for the duration of the probe. (Right) Similarly, inter-trial phase coherence drops by about 81% for the duration of the probe. For probes in which the rhythm was missed (brown), however, both evoked power and ITPC showed an additional reduction (only near the end of the probe) compared to rhythmically-driven probes (blue). Solid lines: mean across subjects and trials; color bands: bootstrap 95% confidence of the mean over subjects; grey bands: time intervals with no significant difference by percept.
For rhythmic probes (Fig. 2B), the masker was associated with an average relative decrease of 9.5 dB in evoked power regardless of perceptual condition (driven and missed), and with a relative decrease of ~75% in trial-to-trial phase locking. When subjects missed the rhythm, evoked power and inter-trial phase coherence both further decreased, with percept-specific decreases sustained over a longer period for ITPC (0.84–1.25 s, p < 0.001; right panel) than for evoked power (1.04–1.15 s, p = 0.008; left panel).
Rhythmic neural power as a discrimination statistic in a rhythm detection task. With the observation that differential neural processing of masked rhythm depends on listeners' percept, it was next investigated whether the observed divergence might have the properties of an internal variable underlying discrimination. Based on the previous result, we hypothesized that the 5 Hz target neural processing power in the final ~600 ms of the probe interval might act as such a variable. For each subject, a metric was created from the contrast in rhythmic evoked power, integrated over the 0.56–1.24 s interval of interest post probe onset. To illustrate the use of this latent variable as a discrimination statistic, bootstrap resampling of trials (with replacement) was used to produce distributions of evoked power sustained over the critical window (two representative subjects shown in Fig. 3A). A neural discriminability metric was then computed from their relative separation (see Methods). To assess the potential of this sustained evoked power to operate as a variable relevant to perceptual discrimination, the neural metric was compared with psychometric d′ scores that index the behavioral sensitivity of listeners to the detection task[130] (Fig. 3B, blue), with the result that the two are significantly correlated (ρ = 0.728, p = 1.04 × 10⁻⁶).
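A d′-like neurometric index of the kind described can be sketched as the normalized separation of the bootstrap distributions. The following Python/NumPy fragment is illustrative only; the names and the pooled-variance normalization are assumptions rather than the study's exact definition (see Methods):

import numpy as np

def neurometric_index(power_a, power_b, n_boot=2000, seed=0):
    # power_a/power_b: per-trial time-integrated evoked power for the
    # two trial classes. Returns the separation of their bootstrap
    # mean distributions in pooled-SD units (a d'-like score).
    rng = np.random.default_rng(seed)
    def boot_means(x):
        return np.array([rng.choice(x, size=len(x), replace=True).mean()
                         for _ in range(n_boot)])
    da = boot_means(np.asarray(power_a))
    db = boot_means(np.asarray(power_b))
    return (da.mean() - db.mean()) / np.sqrt(0.5 * (da.var() + db.var()))

With rhythmic versus non-rhythmic trials as inputs this yields the acoustic-contrast index; with filling-in versus rhythm-absent trials, the perceptual-contrast index discussed below.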
Figure 3. Rhythmic target power acts as a discriminant neural statistic for perceived rhythm. (a) Top: In two representative subjects, behavior covaries with empirically-derived neural discriminability distributions. Probability distributions of a given level of sustained (time-integrated) evoked power depend on the acoustic presence (blue) or absence (red) of stimulus rhythmic FM; a neural discriminability score (proportional to horizontal black bar length) can be obtained from them. In the first subject (left panel), the small overlap between the distributions gives high neural discriminability; for the second subject (right panel), both distributions overlap substantially, giving poor discriminability. Bottom: Next, empirically-derived neural distributions were obtained only from non-rhythmic probes (i.e., the red curves in the top panels), now conditioned instead by percept. A similar pattern in the distributions is observed. Distributions obtained via bootstrap. (b) Over subjects, the psychometric d′ sensitivity index (abscissa) correlates with the neurometric discriminability index based on acoustic contrast (rhythmic versus non-rhythmic probe, blue; ρ = 0.73, p = 1.0 × 10⁻⁶). Critically, behavioral sensitivity to 'filling-in' also correlates with rhythmic evoked power differences, despite the absence of stimulus rhythm, via the related neurometric discriminability index based on perceptual contrast (filling-in versus reported absent, magenta; ρ = 0.69, p = 6.1 × 10⁻⁶).
A related latent discrimination statistic, directly relevant to the phenomenon of filling-in, is computed with contributions only from endogenous (non-sensory) factors, by analyzing the responses to non-rhythmic probes exclusively (Fig. 3A, bottom). In these purely percept-specific (constant acoustics) distributions, neural power discriminability was defined analogously as the difference in rhythmic evoked power between filling-in and rhythm-absent trials, integrated over the time at which significant differences were observed at the group level in the previous section (0.56 to 1.19 s post probe onset, as in Fig. 2B). Just as for the acoustic contrasts, this discriminability index also correlates strongly with the psychometric sensitivity indices across listeners (Fig. 3B, magenta) (ρ = 0.745, p = 4.23 × 10⁻⁷). Thus, consistent with the properties of a latent discrimination statistic, sustained evoked power may account for both stimulus- and percept-specific differential processing, where the latter reflects only endogenous neural processes.
Spectrum of power increase in target-related neural rhythm dynamics with filling-in. Given the possibility that increased power at the 5 Hz rhythmic frequency would be accompanied by increased spectral power at other frequencies, it is important to consider whether the change arises as a power gain specific to the target frequency or as a modulatory effect over a larger spectral region that includes the target frequency band. By extending the wavelet analysis over a broader frequency range (1-25 Hz), the spectral extent of restoration was probed to address whether changes are target-specific, or instead accompanied by other activity that may be behaviorally relevant. Evoked power analyses across probe conditions and subjects reveal that the evoked response contains two frequency ranges, one centered on the target 5 Hz, and the other centered on the 10 Hz first harmonic (Fig. 4A). To analyze time-frequency power contrasts between conditions, the corresponding spectrograms (baseline corrected per frequency band) were subtracted. In particular, the 'driven' minus 'absent' map results in a contrast whose differences arise from synchronization to physical differences in the sound, while 'filling-in' minus 'absent' maps differences due entirely to endogenous activity (Fig. 4B, left panels).
For the first case, the defined 'synchronized' contrast (Fig. 4B, top left), group average data show a spectrotemporal region, from ~600 ms post probe onset until the end of the probe, of significant differential neural processing (p = 3.3 × 10⁻⁴), rooted in physical stimulus differences. The region is limited to the spectral neighborhood of the target (half maximum 4.1-6.7 Hz; maximum 3.8-7.5 Hz), which may be expected as smearing from Fourier/Heisenberg uncertainty. For the 'endogenous' contrast (Fig. 4B, bottom left), a similar profile was found (half maximum 4.1-6.6 Hz; maximum 3.8-6.8 Hz; p = 6.7 × 10⁻⁴), with additional enhancement around the target's first harmonic (0.4 to 1.1 s post probe onset; half maximum 9.7-11 Hz; maximum 8.9 to 11.9 Hz; p = 0.01). In a related analysis of a third partition contrast, 'rhythm-driven' minus 'missed', no spectrotemporal cluster of significance was found (p = 0.29).
Figure 4. Stimulus- and percept-specific spectrotemporal modulations of cortical activity during restored rhythm. (a) Wavelet power correlograms, in a 1-25 Hz frequency range, reveal qualitative differences in steady neural responses post probe onset, across participants (N = 32). Color arrows indicate spectrogram pairs submitted to difference contrasts, as follows. (b) Differences between spectrograms reveal differential processing under alternative percepts, whether based on different physical sounds (top left) or on endogenous restorative processes (bottom left), in both cases specific to the target 5 Hz frequency band. The latter case of filling-in generates enhanced sustained power in the first harmonic band (~10 Hz) as well. Synchronization maps are shown masked by regions of group-level significance, as determined by permutations within contrast pairs, performed independently across subjects ('driven', p = 3.3 × 10⁻⁴; 'filling-in' near 5 Hz, p = 6.7 × 10⁻⁴; 'filling-in' near 10 Hz, p = 0.01). The lower-rate rhythmic enhancements (~5 Hz) coincide spectrotemporally even though the sensory bases for each are different (right). White vertical lines indicate noise probe temporal edges.
Upon examination of whether the additional spectral information conveyed by these maps improved neural predictions regarding listeners' behavior, we found that neural discriminability indices based upon the 'synchronized' region in this section showed no improvement over the target-frequency-specific index obtained previously from 5 Hz-only measures (ρ = 0.53; p = 0.001). The 'endogenous' regions, jointly, likewise showed no improvement in predictive power of listeners' performance (ρ = 0.72; p = 1.4 × 10⁻⁶) over that of the target-based index alone. Separating these regions into 5 Hz and 10 Hz domains revealed that the lower (target rhythm) region was more predictive (5 Hz only: ρ = 0.73, p = 8.2 × 10⁻⁷; 10 Hz only: ρ = 0.44, p = 0.01). These results suggest that differential narrowband 5 Hz power is most critical to explain listeners' detection performance shown previously, and that for filling-in trials, some improvement also arises from integrating over a broadened filter to include neighboring target frequencies present in the average timeseries of endogenous neural activity.
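The contrast maps above amount to subtracting per-band baseline-corrected spectrograms. A minimal Python/NumPy sketch (names are hypothetical; the cluster-based permutation test establishing group-level significance is omitted):

import numpy as np

def percept_contrast(tf_a, tf_b, baseline):
    # tf_a/tf_b: (F, T) wavelet power maps for two conditions;
    # baseline: slice over pre-probe samples. Per-band baseline
    # correction in dB, then subtraction (permutation test omitted).
    def to_db(tf):
        base = tf[:, baseline].mean(axis=1, keepdims=True)
        return 10 * np.log10(tf / base)
    return to_db(tf_a) - to_db(tf_b)

For example, percept_contrast(tf_fill_in, tf_absent, np.s_[:n_pre]) would produce the 'endogenous' map, under the assumption that tf_fill_in and tf_absent are the condition-averaged time-frequency power maps.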
Discussion
The subjective experience of attending effectively to complex sound scenes in noisy environments can be substantially assisted by perceptual restoration. This effect was investigated here using MEG to record the neural dynamics of a steady temporal pattern while it is perceptually repaired. Measures of differential cortical processing contributed to the identification of a discrimination statistic predicting a subject's behavioral sensitivity. The data are consistent with the view that perceptual restoration is attributable to endogenous neural processes, emerging from learnable temporal patterns present in the tracked auditory object, at modulation rates that dominate natural communication speech sounds. Perceptual restoration, the effect of hearing the continuation of a sound regardless of an interrupting masker, includes descriptions of "auditory induction", "temporal induction", "perceptual synthesis", or "contextual catenation" of dynamic sounds in classic studies[131], [132]. It implies an ability to discount disruptive but extraneous interruptions to relevant acoustic signals, so much so that even noise-filled gaps are more likely to be discounted as such[132]. Where multiple interpretations of a relevant acoustic signal are possible (e.g., phonemes), perceptual restoration has been probed in identification tasks; for more constrained decision spaces, it may be probed based on sound delivery quality assessments, such as gap localization of the excised token signal (e.g., Warren's paradigm[132]), and by discrimination of noise-added versus noise-replaced token gap alternatives (e.g., Samuel's paradigm[133]). Our method subscribes to the latter approach, also referred to as 'filling-in', which emphasizes the signal detection strategy followed in cases where a listener's classification is inconsistent with the absence of the token in a gap[115], [134]–[140]. As has been noted[141], from the listener's utilitarian perspective, this effect of induction in a challenging environment is not aimed at the production of decision errors (or illusions) but at assisting against masking. Restoration refers to the perception of a token projected by a context (such as a speaker's intention), with apparent intactness[141]. Critical to this is a strong masker, along with contextual evidence favoring a specific acoustic token with high probability. This combination allows the inference that the lack of auditory evidence for the token could be ascribed to energetic masking[113], [116], [142]. A simple and compelling example of perceptual restoration is that of a pure tone followed by a brief noise-filled gap where the tone has been excised: this leads to a strong illusory percept of continuity of the tone[143]. The percept appears to rely on two related effects: the more obvious is the conveyance of the original signal as uninterrupted, but, critically, it is also accompanied by an attenuation of discontinuity boundaries[144]. Neural correlates of both effects have been observed in single units in macaque primary auditory cortex (A1), where up to 35% of sampled single units respond to a noise-filled gap as though the tone were continuously present[145], [146]. In some cases there is also failure of a transient response at the end of the gap[145]. For human listeners, there is evidence that such compensatory principles may extend to disruptions of dynamically modulated sound, including amplitude-modulated (AM) sound, single vowels, and consonants within words[120], [121], [135], [139], [147], [148], the latter of which fall under the concept of phonemic restoration[113], [115], [133].
Depending on the stimulus, neural correlates have been localized to different areas, including Heschl's gyrus for missing AM noise[147], the posterior aspect of the superior temporal gyrus for disrupted vowels[139], and wider brain networks including the superior temporal lobe in the case of missed phonemes[135], [148]. In addition, mixed evidence points to a basis for restoration in terms of endogenous modulations to boundary encoding: on the one hand, the search for differential onset responses to noise under restoration, indexing alternative encoding, has yielded negative results so far[138], [139]; on the other, induced narrow-band (3-4 Hz) desynchronizations that are restoration-specific, occurring after gap onset, have been suggested by results from EEG[139], [149]. In this study the differential temporal boundary encoding under restoration was not specifically addressed[149]; instead, the emphasis was on the neural representation of the missing rhythm itself, via measures of evoked rhythmic MEG responses. While restoration of continuous tones has been observed behaviorally for segments as long as 1.4 s[150], to our knowledge this is the first investigation in which cortical aSSRs are directly implicated in perceptual restoration, sustained in real time as a temporal code. That neural phase information was not reliable, despite an apparent continuity of the rhythm, is consistent with behavioral analyses suggesting that listeners may not track phase information under illusory FM continuity[120], [121]. A cortical EEG study by Vinnik and colleagues[138] showed no change to neural spectral power sustained along noise gaps embedded in a 40 Hz AM context stimulus during restoration; on the other hand, it has been shown that changes to neural spectral power in brainstem responses may occur during restored pitch of a missing 800 Hz carrier tone[140]. It is possible that while gamma-rate acoustic modulations can be represented cortically with a temporal code[125], [151], they are also at rates that involve pitch quality, a representation of which implies substantially distinct cortical coding modes[152] assisting restoration. In other sensory modalities, some restorative phenomena may fall in the category of perceptual experience that does not represent the absence of a physical stimulus but, rather, an alternative interpretation based on additional contextual information, e.g., the case of illusory induction of perceived kinesthetic trajectories[153], [154], and of spatial contours in certain visual displays[155]–[157]. Context-sensitivity in general is considered a requisite for cortical predictive coding[158], which in the case of hearing may depend on known priors regarding the sound's temporal dynamics. A compelling example arises from missing, but highly expected, click-like sounds that generate auditory onset-like responses locked to the nominal time of delivery of the missed sound[159]. Additionally, long-duration, rhythmic metric structures may produce endogenous neural locking to a subharmonic frequency of the actual acoustic beat when it has the potential to be perceived as the underlying rhythm, whether listeners are instructed to do so[160] or passively listen in the absence of instruction[161]. Correspondingly, the data here show that with perceptual restoration of masked rhythm, endogenous representational differences may emerge as early as 0.6 s post masking, at the target rhythm.
There is also activity at the first harmonic, 10 Hz, but there alternative explanations involving enhanced alpha activity[138] cannot yet be entirely ruled out, since with increased alertness on some trials over others, a systematic differential in spontaneous alpha activity might be responsible[162]. If filling-in and rhythm-missed trials were related to inattention, reduced vigilance might be expected to effectively increase alpha activity. We did not, however, find this; instead, filling-in trials displayed a narrow-band 10 Hz power increase strongly concurrent with the target duration, consistent with a harmonic of the endogenous 5 Hz rhythm. Alpha-band effects due to non-uniform attentional states should be investigated in future studies using rhythms whose first harmonics are not in the alpha band. Our data do not reject the possibility of spontaneous, temporally patterned cortical activity profiles influencing sensory processing, as in ongoing slow-wave activity that may interact with evoked signals as a temporally coordinated modulation of excitability across distributed cortical fields[163], [164].

Focus on the analysis of endogenous activity may address circumstances under which the brain repairs certain temporal features of highly stereotyped sound. This is part of the general problem of determining what relationship a neurally instantiated representation of a missed pattern has with a template representation mapping to actual acoustic experience. Solutions may offer key insight into biologically-inspired applications dealing with incomplete information. In particular, the modulation studied here corresponds to the temporal scale of syllabic production in human speech[165] and the slow temporal envelope of natural stimuli[166], raising the question of whether similar restorative phenomena exist during sequences of inner or imagined speech, as well as during auditory hallucinations.

General methods

Participants. 35 subjects (12 women, 25.7 ± 4.4 years of age) with no history of neurological disorder or metal implants participated in the study, and received monetary compensation proportional to the study duration (~2 hours). The experimental protocol was approved by the UMCP Institutional Review Board, and all experiments were performed in accordance with its relevant guidelines and regulations. Informed written consent was obtained from all participants before study sessions.

Stimuli. Four template sound stimuli were constructed with MATLAB® (MathWorks, Natick, United States), each consisting of ~15 minutes of a 1024 Hz tone frequency-modulated (FM) at 5 Hz, with a log-sinusoidal modulation range of 512–2048 Hz and a 20% duty cycle[128]. 420 rhythmic probes were created by adding 1.24 s of noise to the basic stimulus at pseudo-random times. Noise was generated de novo per probe, spectrally matched to the FM but with random phase. A fixed signal-to-noise ratio value was chosen from the -4 to 4 dB range per participant. 420 non-rhythmic trials were additionally created in the same manner, except that the underlying FM was replaced with a constant carrier frequency. Inter-probe intervals were 1.6 s plus a discrete Poisson-distributed random delay (λ = 1.2 s); each onset time was rounded to a multiple of the stimulus period (0.2 s), so that all probe onsets kept constant phase with the main rhythm.
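For illustration, a minimal MATLAB sketch of one such rhythmic template follows. The gating of the FM sweep within each 0.2 s cycle, and the treatment of the signal between pulses, are assumptions here rather than the published synthesis parameters.

    % Minimal sketch of the 5 Hz rhythmic FM stimulus (gating is an assumption).
    fs = 22050; dur = 10;                      % short excerpt for illustration
    t  = (0:1/fs:dur-1/fs)';
    fm = 5; fc = 1024;
    finst = fc * 2.^sin(2*pi*fm*t);            % log-sinusoidal sweep, 512-2048 Hz
    phs   = 2*pi*cumsum(finst)/fs;             % integrate instantaneous frequency
    x     = sin(phs);
    gate  = mod(t, 1/fm) < 0.2/fm;             % active 20% of each 0.2 s cycle
    x     = x .* gate;                         % pulse train of FM sweeps
    % Probe timing: 1.6 s plus a discrete Poisson delay, snapped to the 0.2 s
    % grid so that probe onsets keep constant phase with the rhythm:
    delay = 1.6 + poissrnd(1.2);               % Statistics Toolbox
    onset = round(delay/0.2) * 0.2;
    % A 1.24 s spectrally matched, random-phase noise segment would then be
    % added at this onset (rhythmic probes), or combined with a constant-
    % frequency carrier (non-rhythmic probes).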
Sound stimuli were delivered through Presentation® (NeuroBehavioral Systems, Berkeley, United States), equalized to be approximately flat from 40–3000 Hz, at a sound pressure level of ~70 dB. Sounds were transmitted via E-A-RTONE® 3A tubes (impedance 50 Ω) and E-A-RLINK® disposable foam intra-auricular ends (Etymotic Research, Elk Grove Village, United States) inserted into the ear canals.

Experimental design. After a brief practice session, subjects were instructed to push one of a pair of buttons according to whether they detected a 5 Hz rhythm. In order of importance, participants were instructed to: (i) wait until the probe ended before pressing a button, weighting accuracy over reaction time; (ii) respond only to the probe immediately presented; (iii) modify their choice by pressing the other button only if certain, and only before the next trial. Trials that did not meet these requirements, and corrected trials, were excluded (median 6.8% and 1.3% of trials, respectively). To avoid transient cortical dynamics associated with motor response execution[167], trials beginning less than 250 ms after the previous response were also excluded (median 6.3% of trials). To more evenly distribute the proportion of correct answers across participants, the masker signal-to-noise ratio (SNR) was fixed in advance at one of 0, ±1, ±2 or ±4 dB. Silent films were presented concurrently, which subjects were instructed to watch.

Data recording. MEG data were collected with a 160-channel system (Kanazawa Institute of Technology, Kanazawa, Japan) inside a magnetically-shielded room (Yokogawa Electric Corporation, Musashino, Japan). Sensors (15.5 mm diameter) were uniformly distributed inside a liquid-He Dewar, spaced ~25 mm apart, and were configured as first-order axial gradiometers with 50 mm separation and a sensitivity of 5 fT/√Hz in the white-noise region (> 1 kHz). Three of the 160 sensors were magnetometers employed as environmental reference channels. A 1 Hz high-pass filter, 200 Hz low-pass filter, and 60 Hz notch filter were applied before sampling at 1 kHz. Participants lay supine inside the magnetically shielded room under soft lighting, and were asked to minimize movement, particularly of the head. Every session had four experimental blocks. For seven participants, the experiment had to be ended early due to time constraints (mean 89% completion in these participants, minimum 75%); for one participant only 2 blocks out of 4 were recorded due to a transfer failure. Two participants requested pauses during a block, which was then terminated and later repeated in whole.

Data processing. A 1-30 Hz band-pass third-order elliptic filter, with at most 1 dB ripple and 20 dB stopband attenuation, was applied, and noise sources were removed as follows. Environmental noise. Time-shifted principal component analysis (TS-PCA)[105] was applied to remove environmental noise, using the three reference magnetometers (Nlags = 43). Sensor-specific noise. Sensor-generated sources unrelated to brain activity were subtracted using sensor noise suppression (SNS)[107]. Spatial filtering. A per-participant, data-driven model was used to synthesize spatial filters from the responses to the unmasked rhythmic sound stimulus via denoising source separation (DSS)[108]. The responses were structured as a matrix of dimensions T × N × K, where T is the number of samples (=1400), N the number of usable recording segments (average = 514.3), and K the number of active sensors (average = 156.8). This spatial filter selects the most reproducible aSSR component over trials, generating a single virtual sensor used in the remaining analysis.
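The core of this computation can be sketched as a generalized eigendecomposition biased toward trial-reproducible (evoked) power. This is a minimal illustration for the T × N × K array described above; the full DSS procedure of [108] additionally includes whitening and smoothing stages omitted here.

    % Minimal DSS-style sketch: sensor weights whose output is most reproducible
    % across trials (greatest evoked-to-total power ratio). X is T x N x K.
    [T, N, K] = size(X);
    Ctot = zeros(K);                        % total (single-trial) covariance
    for n = 1:N
        xn   = squeeze(X(:, n, :));         % T x K
        Ctot = Ctot + (xn' * xn) / N;
    end
    xavg = squeeze(mean(X, 2));             % trial average isolates the evoked part
    Cevk = xavg' * xavg;                    % evoked covariance
    [W, D]  = eig(Cevk, Ctot);              % generalized eigendecomposition
    [~, ix] = sort(diag(D), 'descend');
    w = W(:, ix(1));                        % most reproducible component weights
    virt = zeros(T, N);                     % virtual sensor time series, per trial
    for n = 1:N
        virt(:, n) = squeeze(X(:, n, :)) * w;
    end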
Data analysis. Trials were classified a posteriori, according to subjects' reports, into one of four groups: rhythmic trials perceived as such ('driven') or as non-rhythmic ('missed'); and non-rhythmic trials perceived as such ('absent') or as rhythmic ('filling-in'). Time-frequency analysis used a Morlet wavelet transform with a 0.2 s scale, permitting estimation of evoked spectral power at the bandwidth of experimental interest (5 Hz). For evoked power and ITPC contrasts, statistical clusters were found during which there were significant differences across experimental conditions, according to non-parametric permutation tests[168]. A measure of 'neural discriminability'

$\Delta P_i^{A,B} \equiv \int_{T_0}^{T_1} \left( P_i^A(t) - P_i^B(t) \right) dt$    (1)

is defined as the area between the two evoked power curves P obtained under conditions A and B for the i-th subject, computed over a fixed time interval (T0 = 0.58 s and T1 = 1.2 s post noise onset, on average) as defined by the statistical clusters of significance found at the group level for the given contrast AB. Measures of shifts in ITPC were computed in a similar way. Perceptual sensitivity of a subject in detection is given by d-prime analysis[130]: for each subject i, H_i, the fraction of rhythmic probes labeled rhythmic, and F_i, the fraction of non-rhythmic probes labeled rhythmic, undergo a z-transformation, $d'_i = z(H_i) - z(F_i)$[169].

To investigate whether the observed pattern of percept-specific differences was due to unintended acoustic or statistical properties of the stimulus constructs, stimulus probes were analyzed a posteriori. No significant differences were found in stimulus temporal modulations when partitioned by percept, within rhythmic (p=0.85) or non-rhythmic (p=0.84) probes (paired-sample t-tests, Appendix B Supplementary Fig 3). Subjects' reported percepts corresponded to the physical acoustics (presence or absence of rhythm) approximately five times as often as not, resulting in data pools with differing signal-to-noise ratio improvement from averaging. Inter-trial phase coherence measures therefore included bias correction[170], as small sample sizes are especially prone to bias. The unbiased estimator is based on the squared ITPC (also defined as the squared 'modified resultant length'[170]), which may be negative after estimated bias subtraction. To investigate the possibility of related biases in the rhythmic evoked power measures, post hoc two-sided non-parametric permutation tests were performed by pooling, for each subject, all trials from the two conditions to be compared, and instantiating resampled partitions of fixed size (the original per-subject sample sizes); the group-level test statistic obtained from the actual partition was then contrasted against those obtained at group level across the distribution of resampled instances. Using the 5 Hz evoked power difference between conditions in the same intervals of significance, it was found that responses to non-rhythmic probes show significantly greater power when reported as rhythmic versus non-rhythmic (0.56 to 1.19 s; p=0.007); a similar result held for responses to rhythmic probes, which also show significantly greater power when reported as rhythmic versus non-rhythmic (1.04 to 1.15 s; p=0.034).
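A single-subject sketch of this 5 Hz evoked-power contrast and its permutation test follows; the variable names (virt for the virtual-sensor trial matrix, idxA/idxB for the two trial pools) are hypothetical, and the actual analysis aggregated the statistic at the group level before resampling.

    % Evoked 5 Hz power via a complex Morlet wavelet (0.2 s scale), one subject.
    fs = 1000; f0 = 5; sc = 0.2;
    tw = -0.5 : 1/fs : 0.5;
    mw = exp(2i*pi*f0*tw) .* exp(-tw.^2 / (2*sc^2));
    mw = mw(:) / sum(abs(mw));
    evk = @(idx) abs(conv(mean(virt(:, idx), 2), mw, 'same')).^2; % average first
    tt  = (0:size(virt,1)-1)' / fs;
    win = tt >= 0.58 & tt <= 1.20;              % significance cluster interval
    PA  = evk(idxA);  PB = evk(idxB);
    dP  = trapz(tt(win), PA(win) - PB(win));    % Eq. (1) area statistic
    % Permutation test: relabel trials while keeping the original group sizes.
    pool = [idxA(:); idxB(:)]; nA = numel(idxA);
    nperm = 2000; nulldist = zeros(nperm, 1);
    for p = 1:nperm
        sh = pool(randperm(numel(pool)));
        Pa = evk(sh(1:nA));  Pb = evk(sh(nA+1:end));
        nulldist(p) = trapz(tt(win), Pa(win) - Pb(win));
    end
    pval = mean(abs(nulldist) >= abs(dP));      % two-sided p-value
    % Behavioral sensitivity per subject: dprime = norminv(H) - norminv(F);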
Potential systematic differences resulting from the per-subject signal-to-noise ratio were also investigated, but no evidence of differences was found, neurally (ρ=0.10, p=0.57) or behaviorally (ρ=0.33, p=0.054). One participant was excluded from the analysis for reporting zero perceptual differences from the acoustics.

Data availability. Relevant data are available in a public repository accessible at http://hdl.handle.net/1903/19593.

Chapter IV: Prior knowledge influences cortical latency and fidelity of the neural representation of missing speech

Summary

In naturally noisy listening conditions, for example at a cocktail party, noise disruptions may completely mask significant parts of a sentence, and yet listeners may still perceive the missing speech as being present. Here we demonstrate that speech-related dynamic auditory cortical activity, as measured by magnetoencephalography (MEG), which can ordinarily be used to directly reconstruct the physical speech stimulus, can also be used to "reconstruct" acoustically missing speech. The extent to which this occurs depends on how familiar listeners are with the missing speech, which is consistent with this neural activity being a dynamic representation of perceived speech even when it is acoustically absent. Our findings are two-fold: first, when the speech is entirely acoustically absent, it can still be reconstructed with performance up to 25% of that for acoustically present speech without noise; and second, the same expertise expedites cortical processing of natural speech by approximately 5 ms. Both effects disappear when listeners have no or very little prior experience with a given sentence. Our results suggest adaptive mechanisms of consolidation of detailed representations about speech, and the strong expectations this enables, as identifiable factors assisting automatic speech restoration over ecologically relevant timescales.

Introduction

The ability to interpret speech elements across interruptions masking a conversation is a hallmark of human communication[171]. In many cases, possessing contextual knowledge poses clear informational advantages for a listener in successfully disengaging from the masker and restoring the intended template signal[135], [139], [148], [172]. Information can typically be obtained from multimodal sources and/or low-level auditory and higher-order linguistic analyses, although it remains unclear how, and which, factors are most effective in assisting speech restoration under natural conditions. For instance, it is possible to identify cortical network activity profiles consistent with phonemic restoration, the effect whereby missing phonemes in a signal may nonetheless be heard[115], [133], in binary semantic decision tasks[148]; still, the description of the factors that bias perception toward either of two alternatives remains unclear. To this end, there is evidence that restorative processes may be influenced by contributions from audiovisual integration cues[173], lexical priming[174], and, within the auditory domain, predictive template matching[159] or even intentional expectations about temporal patterns in sound[160], [161]. It is clear that in order to lead to informational gain, potential contributors must be readily accessible before and during missing auditory input.
Presumably, the mechanism would involve (i) generation of a provisional template of forthcoming speech, (ii) storage of the template in a format compatible with the internal representation of ongoing sound, and (iii) later point-wise matching between the two, in what has been termed the zip metaphor[175]–[177]. In some cases, the informational value added by such a putative mechanism in ameliorating the neural representation of speech may also involve speeding up cortical processing during integration[178]. Here we test how natural speech tokens spanning several words may be represented cortically in the midst of masking noise, under varying levels of informational gain added by prior knowledge about the missed element.

The low-frequency envelope of speech indexes slow changes in acoustic energy over time, and is known to entrain and phase-lock neural activity in auditory cortex, as measured by magnetoencephalography (MEG) and electroencephalography (EEG)[25], [91], [179], [180]. Due to its characteristic timescale, the envelope is also related to prosodic attributes such as syllabic lengths and loudness, which themselves may include intonation, rhythm and stress cues. We hypothesize that by presenting the same verse units several times, it is possible to manipulate listeners' ability to develop detailed predictions about forthcoming elements in long speech sentences, plausibly forming a template of them that may serve for a form of point-wise matching at a later time, when spontaneous maskers disrupt the same parts of the narrated story. This implies the possibilities (a) that the template of the envelope may be decoded from cortical signals in response to noise, and (b), because the template must have been present in advance, that the mechanism could be facilitated at subsequent times by speeding up processing of the incumbent envelope token, at least indirectly.

We apply neural coding methods to neural responses in order to reconstruct the original verse template envelope[181], an approach that has been successfully applied in auditory electrophysiology[16], [18], EEG/MEG[25], [55], [91], electrocorticography[90], [148], [180], and fMRI[182]; and also to provide estimates of the forward stimulus-response mapping[25], [91] under normal speech conditions. From such decoding performance we assess the extent to which prior knowledge about speech may enhance endogenous representations that assist restoration of intended speech signals. In the case of forward models, we address the cortical latencies involved in natural speech encoding under the same conditions. The latter is a relevant question at least because (i) reduced processing times have been observed in visual contexts that facilitate integration of detailed predictions with auditory representations of incoming speech[172], [178]; and (ii) within timescales on the order of seconds or minutes, task-related adaptive changes can occur in the shape of stimulus-response mappings, which would in turn suggest mediation by cortical plasticity[98], [183] as a biophysical basis for restorative mechanisms given the present task demands. We provide evidence that the speech temporal envelope may be better reconstructed if listeners have obtained sufficient knowledge about a particular speech sequence, and that this effect extends to cases in which they are presented with noise instead.
The data also show that cortical latencies in the processing of natural clean speech can be reduced by on the order of milliseconds under similar conditions. Overall, the results suggest that formation of online templates of low-level features of frequently experienced speech may facilitate more efficient neural representations, by means of faster encoding and improved access to endogenous modulations time-locked to expected but missing speech, thus assisting its restoration.

Results

Reconstruction of missing speech timeseries from noise with context. Fixed-duration, spectrally-matched static noise bursts were used to mask word sets within a narrated story. Each noise probe was designed to have the same spectral composition over time as the replaced speech segment (Fig. 1A), but without any supporting temporal modulations in the low-frequency (2-8 Hz) envelope (Ding and Simon, 2012a; Giraud et al., 2000). For natural speech without masking, these low-frequency fluctuations entrain auditory cortical activity as recorded by MEG and, given a suitable decoding model, can be used to reconstruct the envelope of the original speech signal. Such linear decoders were created using unmasked speech and reverse correlation, to establish an optimal mapping from cortical activity back to the original speech envelope. To test whether the acoustic presence of a target is a strictly necessary condition for such reconstruction, listeners were exposed to extensive repetitions of some of the speech, and to less extensive repetitions (or none at all) of the rest. Sentences that were maximally repeated over the hour-long session (Fig. 1B, left) yielded the greatest relative performance in reconstruction of the envelope of the missing speech: approximately 25% of the performance for the actual speech presented free of masking. Lesser amounts of repetition resulted in further reductions in relative performance, down to a baseline floor level for masked speech with which the listener had little or no prior experience.

Because relative performance measures include data from clean speech reconstruction as references, it is important to verify whether reconstructions from noise alone independently reveal similar effects. Absolute effect sizes of repetition on reconstruction of the missing speech envelope were thus confirmed to display a similar pattern to the relative performance (Fig. 1B, right). To determine whether decoding success of the linear envelope model changed significantly across conditions, the Mauchly test of sphericity was run to evaluate whether corrections would be necessary for a subsequent repeated measures model. Results for independent reconstructions using exclusively noise-derived scores showed that this condition was not violated for the absolute effect of envelope reconstruction in noise epochs (χ2(5)=5.409; p=0.368). The subsequent four-level repeated measures ANOVA with subject and verse as predictors resulted in a significant main effect of repetition (F=3.332; p=0.023), with no interaction with subject (F=0.411; p=0.726), no interaction with verse (F=0.622; p=0.603), and no three-way interaction between repetition, verse and subject (F=1.229; p=0.304). Post hoc comparisons using t-tests with Bonferroni correction indicated that the average effect size at the High repetition rate differed significantly from the Control condition (t(34)=4.319; p<1.3x10-4) and from the Low repetition rate (t(34)=3.918; p<4.1x10-4).
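A minimal sketch of this sphericity check and repeated measures model follows, assuming a hypothetical 35 × 4 matrix scores of per-subject absolute effect sizes; the verse predictor of the full model is omitted here.

    % scores: subjects x conditions, columns assumed Control, Low, Medium, High
    tbl    = array2table(scores, 'VariableNames', {'Control','Low','Medium','High'});
    within = table(categorical({'Control';'Low';'Medium';'High'}), ...
                   'VariableNames', {'Repetition'});
    rm = fitrm(tbl, 'Control-High ~ 1', 'WithinDesign', within);
    mauchly(rm)                                  % sphericity test
    ranova(rm, 'WithinModel', 'Repetition')      % repeated measures ANOVA
    % Post hoc paired comparison with Bonferroni correction, e.g. High vs Control:
    [~, p] = ttest(scores(:,4), scores(:,1));
    p_bonf = min(1, p * 6);                      % 6 pairwise comparisons of 4 levels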
Figure 1. Cortical reconstruction of the acoustically missing word-level speech envelope from noise, after repeated replays of a narrated story. (A) A set of speech materials from a poem was repeatedly presented to 35 listeners, but every 4-5 s the signal was replaced with spectrally-matched noise (three instances shown in the spectrogram, bottom). This manipulation removes the critical temporal modulations related to the missed words, as shown by the slow envelope (top). (B) For identical material repeatedly presented over 30, 15, or 7.5 minutes or less of an hour-long MEG recording session (left), the missing dynamic speech envelope was reconstructed from the responses to the static noise maskers, with performance up to 25% of that obtained under clean conditions (insets, right; approximately 25%, 21%, 9%, and 8% across descending repetition levels). This effect on relative performance was not due to the fractional contribution from clean speech reconstructions, as analogous results were reproduced using the absolute reconstruction effects only (right), i.e. by assessing the independent performance of noise-trained decoders alone, suggesting an influence of prior experience on low-level sensory encoding of the temporal envelope over connected words. Error bars indicate confidence intervals for the means (Bonferroni-corrected α-level). [Panels: speech envelope and spectrogram with noise-replaced segments; noise-reconstruction effect size q by repetition rate.]

Expedited auditory cortical processing of frequent natural speech replays. The temporal response function (TRF) is a functionally informative statistic that can be used to predict the neural response to a given stimulus. When applied to natural sound conditions, it reveals information similar to that of evoked responses to pure tones (identifiable peaks of different polarities at specific latencies, corresponding to distinct neural sources and processing stages), but derived directly from the neural processing of the speech[25], [55]. We examined the effect of extensive prior experience on TRF temporal structure in general, and on a specific peak in particular: the TRF100, occurring 100-150 ms post envelope change (Fig. 2A). A significant latency shift of 5.3 ± 2.2 ms was observed for TRF100-High versus TRF100-Control peaks (t(33)=2.387; p=0.023), indicating that this cortical processing step may become expedited for listeners, compared on a within-subject basis (Fig. 2B). Across participants, the differences between repeated (High, Medium and Low) and baseline (Control) levels, in terms of the maxima of their cross-correlation functions, were shown to arise from significantly different distributions (D=0.294; p=0.043), suggesting that prior experience through repeated presentations effectively speeds up cortical processing even as early as 100 ms latency.

Figure 2. Frequent repetitions of natural speech speed up their cortical processing. (A) Temporal response functions across participants reveal a common cortical processing step about 100-150 ms after unitary variations in the speech envelope, referred to as the TRF100.
(B) Depending on context, the same processing step may occur at different times at the millisecond scale in high-resolution recordings, with processing of frequently-repeated speech occurring about 5 ms earlier than that of novel or sparsely presented sentences, within subjects. (C) Across subjects, the distribution of relative delays is consistently biased towards positive (earlier) values for the most extreme repetition conditions. (D) These shifts are obtained by cross-correlating TRF100 signals obtained per condition in each subject. [Panels: subject TRFs (Control versus High); subject-wise TRF100 delays; within-subject TRF100 cross-correlations; sorted delay distributions.]

Discussion

The perceptual phenomenon of sensory restoration relies on inference of the missing sections of a sensory signal. The results demonstrate that auditory cortical activity possesses critical envelope information for reconstructing missing fragments of speech replaced by noise, but only when listeners were previously and repeatedly exposed to the missing speech. The results suggest that access to, and maintenance of, a detailed representation of the stimulus, in a template format compatible with the acoustic envelope, is enabled by prior experience, which may additionally speed up cortical processing time; together, these point to the generation of a time-locked neural activity pattern consistent with the expected but absent sensory input. These findings complement those from designs based on perceptual reports at the phonemic level (e.g. <200 ms), which suggest that acoustic delivery is not a necessary condition for spectrogram reconstructability when interpretation of a phoneme is actively ongoing through noise[148], as long as the immediate acoustic context is present.

These results imply that neural activity matching processes must rely on endogenous activity, possibly as top-down restorative modulations of auditory cortex populations[144], [145]. Our data are consistent with the notion that this activity can be influenced by prior learning and storage of speech information, at the level of its explicit temporal structure. Under this interpretation, enhanced listener expectations about forthcoming speech tokens may predispose them to restorative encoding; but when contextual information is poor or insufficient, neural dynamics default to failure in predicting the missing stimulus. Spontaneous neural background activity known to influence perceptual processing in general includes the ability to entrain to complex natural signals such as speech[184], to optimize behavioral performance in detection tasks[185], or even to support the robustness of an illusory experience[149].

On the plausibility of auditory memory involvement

Besides neural coding, the adaptive capabilities of auditory cortical areas include analysis and storage of the relevance of sound features[186]. This process requires that memory traces be held, in a format considered to develop from a low-level sensory code, held in register for up to 15 s, into categorical terms that are more efficient for storage over the long term[187]. While in sensory format, storage has been argued to assist in the ability to restore missing fragments of a sound source, e.g.
as an internal replay of the fragment[188]; other potential perceptual effects of memory-based reactivations on auditory object representations, including attention, are also an area of current research[189]–[191]. The auditory effect studied here can be considered to belong to the multimodal class of attractive temporal context effects[192], a group of facilitatory mechanisms including the perceptual hysteresis[193], [194] and perceptual stabilization[195] effects of the vision literature. These are considered critical for improving invariance in the face of the external demands imposed by discontinuously fluctuating, broadly cluttered environments. Conceptually, this group stands opposite to contrastive temporal context effects, which are mainly suppressive, habituation- or fatigue-based biases that discount neural activity after repetitions, and effectively favor alternative perceptual manifolds to which neural activity has not yet adapted[192], [194]. These may include semantic satiation effects, namely the subjective experience of increasingly meaningless words after fast and prolonged repetition[196], [197].

On access and format of stored auditory representations

Over repeated sound stimuli, attractive contextual effects may rely on forms of implicit auditory memory, as these are considered to intervene regularly in sensory and perceptual encoding[198]. A clear example is the improved detection observed after sequential presentations of arbitrary noise structures, and its time-locked sensory potential covariates[199], [200]. Foreknowledge of acoustic features may adapt listeners to a likely communication source, as demonstrated by perceptual facilitation when advance notice of the identity of a forthcoming instrument is given[201], and by preferential activation in auditory association areas specific to speaker familiarity[202]. The notion that higher expectations of a dynamic sound pattern influence the level of detail accessible in sensory representations is supported by findings of differential activation in implicit memory tasks with varying rates of sensory update: initially, short storage intervals may be associated with activation of posterior superior temporal lobe areas, while over time activity can instead be mediated by structures in inferior frontal cortex[203]. Evidence from these studies is consistent with the hypothesis of variable memory trace formats, where high temporal resolutions may be available for readout at sensory buffers, and coarser ones elsewhere, at stores encoding categorical, higher-order input features (cf. [204], [205]).

On the subjective conditions of listening in noise

With regards to the cognitive state of listeners during masking, it is relevant to address whether the findings are consistent with conditions that normally lead to auditory imagery processes, which are (distinctly) analogous to perceptual restoration phenomena. In masked circumstances, sensory imagery is postulated to involve 'schemata', prior abstractions actively formed with perceptual input, better resolved with increased familiarity, and which remain online while an expected stimulus fails to be presented[206]. For these purposes, auditory imagery is defined as the persistence of an auditory experience without prompting by direct sensory input[207]; the methodological implication is that its existence is judged either directly, by subjective reports, or indirectly, by tasks and measures hypothesized to involve imagery with reasonable probability[206].
This latter approach comprises the study of conditions or stimuli that may automatically evoke auditory imagery, including substantial prior experience. Our findings suggest that among these indirect measures it is plausible to include perceptual coding principles of the missing envelope of natural speech. Behaviorally, this is consistent with findings that the prevalence of auditory imagery episodes depends on the level of familiarity with the original sound pieces[208], across natural sound classes (e.g. speech versus music)[209]. Neurally, the planum temporale is a major computational hub whose activation levels may correlate with self-reported levels of engagement with imagery, or with the vividness perceived by listeners[210]. While there is some agreement that imagery, and the related process of rehearsal, of natural complex sounds may be subserved by auditory association cortex areas (see the review in [206]), evidence of similar activation in primary areas is mixed (cf. [211]). There is dual evidence on the format of the representations sustained during active rehearsal, under both auditory-specific (sometimes termed 'echoic memory') and modality-general codes; these two types have been shown to occur over distinct locations on superior temporal cortical areas, and over distinct timescales, as transient (<5 s) versus sustained phases respectively[209], [212]. Therefore, our data are consistent with a common theme in auditory retrieval processes, by which task-relevant stimuli and/or features may rely on maintenance of (re)activated domains within the sensory representational space[213]. This is also supported by findings of retrieval processes in vision and hearing that involve reactivation of the sensory regions active during perception[214], something additionally found with auditory verbal imagery[215], [216], pointing to the notion that both involve overlapping processes[206].

On the low-level envelope representation during masking

Our suggestion that a key structural property of natural sound encoding lies with the acoustic envelope representation is compatible with preservation of a temporal coding scheme in auditory imagery, based on prior experience as a necessary context. However, while formation of auditory 'images' may entail activity consistent with that elicited by the original sound input[217], preservation of properties such as the temporal acuity of the original stimuli may deteriorate under imagery, depending on factors such as context and experience[218]. This is consistent with our relative effect sizes in the ability to reconstruct missing speech, which disappeared for relatively novel stimuli. In this sense, frequent "refreshing" echoes the auditory memory reactivation hypothesis, which states that storage of individual sound features occurs embedded in the context of neighboring patterns and sequences that can be represented by the auditory system as regularities. Reactivation is then the automatic process whereby variable sound input is matched to constancies extracted previously, and proximity between the prior 'rule' and the current 'update' token increases memory likelihood[205]. This description, which originated from oddball sequence studies, has a direct analogy in the present design because, across verses, the ability to form regularities differed by the uneven repetition rates, and therefore verse sequences may differentially reinstate prior constancies attained throughout the session.
In this interpretation, envelope features of the speech preceding a masker serve as referents enabling translation of verse regularities, learned and represented over the course of the experiment, into specific values under the same feature format[205], namely the acoustic envelope. This does not preclude that additional stimulus features are likely extracted and synthesized along with the timing information represented in the envelope[219], including higher-order linguistic elements. The idea suggests that restorative processes may also comprise learning or storage strategies for linguistically-informative templates in formats alternative to the envelope (e.g. [172]). Although outside the scope of this study, it is likely that mechanistic accounts of the restoration effect will invoke multiple levels of language analysis. The envelope chiefly indicates the timing of specific phonemic utterances in natural speech[91], [220], with the evidence for restoration pointing to adaptive use of these temporal cues in assisting real-time, natural listening conditions.

On adaptive changes to envelope encoding

The accompanying effect, that cortical processing timelines changed under the same circumstances that promoted restoration, suggests that active, task-related endogenous changes may be present in order to optimize low-level envelope processing with relevant experience. Plausibly, this 'speeding up' is related to increased excitability among populations normally active at the later stages of the immediately preceding processing step (presumably indexed by the TRF50 component). This may have a modulatory effect on the early low-level analysis stages (or be a consequence of facilitation already occurring there), something that could help improve prediction over the representational formats held by auditory areas. Determining the conditions under which the acoustic temporal envelope is relevant to initiating this endogenous process may in the future yield the technical ability to provide real-time, noninvasive indices of the subjective states by which a person maintains a template auditory pattern in register. Overall, the results manifest the brain's ability to form a model of a speech scene independently of feed-forward, bottom-up sensory information, driven instead by expectations and learned experience in general[221]. It will be interesting to address whether this may assist in strategies for potential stimulation principles seeking to circumvent certain forms of auditory peripheral damage, as in prosthetic devices.

General methods

Participants. 35 experimental subjects (19 women, 21.3 ± 2.9 years of age [mean ± SD]), with no history of neurological disorder or metal implants, participated in the study. Due to excessive artifact caused by misfit within the MEG helmet, data from one additional subject were not included. Each subject received monetary compensation proportional to the study duration (approximately 1.5 hours). The experimental protocol was approved by the UMCP Institutional Review Board, and all experiments were performed in accordance with its relevant guidelines and regulations. Before each study session, informed written consent was obtained from participants.

Stimuli and experimental design. Sound stimuli were prepared with the MATLAB® software package (MathWorks, Natick, United States) at a sampling rate of 22.05 kHz, and consisted of a recorded poem ("A Visit from St.
Nicholas", Moore or Livingston, 1823) from an online database (http://archive.org/details/AVisitFromSt.Nicholas-ByClementClarkeMoore-NarratedByGrantRaymond). All fourteen stanzas in the poem (from now on, stimuli) were separated (see Appendix for materials) and silence gaps within each were reduced so that stanzas were matched in duration (range: 13.1–13.6 s); these were then presented in 4 blocks. For the first block, 64 stimuli were used, by repeating several times those in the poem's first half. A 'High' frequency stimulus was chosen by selecting a stanza and repeating it in half of the cases (32/64), and this was done similarly for a 'Medium' and a 'Low' frequency stimulus, in a quarter and an eighth of cases, respectively. The remainder of the block was filled with 'Control' stimuli, namely the four remaining stanzas, presented either 1, 2 or 4 times within the block. Stimuli were randomized in order and concatenated in time. For the second block the same procedure was followed using material from the second half of the poem. Blocks 3 and 4 consisted of the same stimuli used in blocks 1 and 2 respectively, but with the order randomized again. The procedure was recreated de novo for each subject, resulting in a total of 35 different stimulus sets of about 1 hour each in total duration. Importantly, the choice of stimuli at a given repetition level was titrated across participants, resulting in 7 groups of 5 listeners each that underwent the same 'High', 'Medium', 'Low', and 'Control' stimulus selection. For each stimulus, 2–4 spectrally-matched (SM) noise probes of 800 ms duration each were applied at pseudo-random times, with a minimum of 2.5 s between probe onsets. Noise onset times were selected from a pool of values indicating syllable onset times, as per the rising slope maxima of the envelope. An expected 768 noise probes were presented per experiment, and each was individually constructed by randomizing the phase values of the frequency-domain representation of the underlying speech stimulus occurring at the same time as the masker, yielding a noise with equal spectral amplitude characteristics[222]. The original speech content occurring during the same time was removed and substituted by the respective SM noise, at a power level matching that of the clean original.
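A minimal sketch of this phase-randomization construction follows; the variable seg, the symmetry bookkeeping, and the power-matching step are illustrative, and the published procedure may differ in windowing details.

    % seg: excised speech segment (column vector) to be replaced by SM noise
    n   = numel(seg);
    mag = abs(fft(seg));                      % keep the spectral amplitudes
    ph  = zeros(n, 1);
    pos = 2 : ceil(n/2);                      % positive-frequency bins (no DC/Nyquist)
    ph(pos) = 2*pi*rand(numel(pos), 1);       % random phases
    ph(n - pos + 2) = -ph(pos);               % conjugate symmetry -> real waveform
    noise = real(ifft(mag .* exp(1i*ph)));
    noise = noise * sqrt(mean(seg.^2) / mean(noise.^2));  % match power to original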
Subjects listened to the speech sounds while watching a silent film, to keep them engaged. To maintain their attention on the auditory stimulus, after each probe they were to report via a button press whether they understood what the speaker meant during the noise.

Data recording. We recorded responses to the selected speech sequences using MEG, a non-invasive neuroimaging technique optimally suited to retrieving neural activity from human cortical regions such as the auditory cortex on the temporal lobes. Such recordings may reflect direct entrainment to the low-frequency modulations of speech, namely its acoustic energy envelope, with remarkable temporal resolution[25].

Data processing. Pre-processing and sensor rejection. The time series of K raw recordings s_k(t) from the MEG sensor array (sampling frequency 1 kHz) were submitted to a fast implementation of independent component analysis[223], from which two independent components were used as surrogate reference channels for environmental noise reduction. Independent components were selected if they contained the largest proportion of broadband (0-500 Hz) power; this selection was done by finding the independent component yielding the most power at each spectral bin (of fixed linear size, determined by dataset), and then computing the histogram of independent components that most frequently outnumbered all others in power across the spectrum. Because spectral bins are linearly spaced, and given the 1/f power spectrum of typical MEG fluctuations, this approach weighs favorably unusual components that consistently show extreme power at higher frequencies.

Environmental noise sources arising from unwanted electrical signals unrelated to the brain activity of interest were reduced by time-shifted principal component analysis (TS-PCA). This technique discards environmental sources that have dissimilar convolutive properties when they mix at the reference sensors of the MEG system, in contrast with the convolutive properties of sources that mix at the data sensors of the array[105]. Provided that the reference sensors record noise and no primary sources of interest, this mismatch is exploited as a basis for rejection: projections of recordings from the brain sensor array that do match, in their convolutive properties, those from the reference sensor recordings are removed via PCA. We set the TS-PCA parameters to N=200 taps (equivalent to the range ±100 ms at the original sampling frequency), and regressor principal components whose variance amounted to less than 10^-6 times that of the first component were discarded as negligible for numerical purposes. This combination of parameters is commensurate with a reduction to less than 3% residual noise in a simulation with three reference sensors in an MEG system[105]. Signal delays introduced by the TS-PCA procedure were corrected.

Sensor noise. Sensor-specific sources of unwanted electrical signals unrelated to the brain activity of interest were reduced by sensor noise suppression (SNS). Each channel recording was substituted by its projection onto the orthogonal basis spanned by all other channels[107]. This method exploits the redundancy of a dense array, where the number of sensors exceeds the number of brain sources, by rejecting sensor-specific components whose presence cannot be explained by the redundancy manifold laid out by the data in the other channels, potentially including sensor-specific noise from those channels themselves. Collectively, this separation does not necessarily eliminate all sensor-specific noise, since at each substitution noise can be imported from other sensors; yet this may promote instances where such redistribution adds these components incoherently, so that they become attenuated[108].

Data analysis. To assess low-frequency cortical entrainment, recordings were band-pass filtered between 1 and 8 Hz with an order-2 Butterworth design, correcting for the group delay created by the filtering procedure.
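One plausible realization of this zero-phase band-pass step is sketched below; the document specifies only the filter order and band, so the remaining choices are assumptions.

    fs = 1000;                            % MEG sampling rate, Hz
    [b, a] = butter(2, [1 8] / (fs/2));   % order-2 Butterworth band-pass design
    y = filtfilt(b, a, x);                % forward-backward pass cancels group delay
    % Note: filtfilt squares the magnitude response, so the effective roll-off
    % is steeper than that of a single order-2 pass.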
A data-driven spatial filter was derived, following trial-by-trial repeatability as the basis for a source-separation model[108]. Spatial filter coefficients from the most reproducible signal component (i.e. the one having the greatest evoked-to-total power ratio) in each individual subject's data, as obtained by denoising source separation, were applied to the sensor data as a weighted sum, forming the resulting virtual MEG output channel:

$s^{\mathrm{DSS}}(t) = \sum_{k=1}^{K} a_{1,k}\, s_k(t)$    (M.1)

This approach effectively improves signal quality, and the data-driven virtual sensor distributions were estimated from recordings of clean speech epochs only. This ensured that neural activity recorded during noise probes was projected onto the span determined by the neuromagnetic source representing the most reproducible processing modes of the original speech template stimulus.

Stimulus reconstruction. The ability to reconstruct speech from MEG epochs was assessed. Aside from the component described in equation (M.1), sources from the next three most reproducible components were obtained and submitted to a trained linear decoder estimation procedure (Figure 4), mapping from these sources back to the original template stimulus. These components are considered reproducible signals, in contrast with the bottom-ranked components, which may serve as a reference devoid of stimulus-related activity. In either case, reconstruction produces a timeseries whose similarity to the original envelope was assessed via Pearson's r correlation coefficient. These scores were contrasted, for reproducible (re) and, separately, reference (rf) response signals, for (1) reconstructions of clean speech from neural activity following clean speech, and (2) reconstructions of clean speech from neural activity following noise. The referencing procedure was introduced to obtain a necessary baseline in decoding performance, given that timeseries lengths varied across conditions as a result of the different repetition rates and verses involved, something that we observed may produce positive biases in r for shorter sequences, irrespective of the underlying relationship to the stimulus being decoded. To compute absolute reconstruction effect sizes, each pair of Pearson's r values (reproducible versus reference activity) was transformed to a Cohen's q[224] index by the transform

$q = \frac{1}{2} \left( \ln \frac{1+r_{re}}{1-r_{re}} - \ln \frac{1+r_{rf}}{1-r_{rf}} \right)$

Relative effect sizes were computed as the fraction q2/q1 of the absolute effect sizes obtained under the stimulus presentation conditions above.

Temporal response function of the stimulus representation. The input-output relation between a representation S(t) of the auditory input and the evoked cortical response r(t) is modeled by a temporal response function (TRF). This linear model is formulated as

$r_{\mathrm{pred}}(t) = \sum_{\tau} \mathrm{TRF}(\tau)\, S(t-\tau) + \epsilon(t)$

where ε(t) is the residual waveform, the contribution to the evoked response not explained by the linear system. As the stimulus representation, the envelope was extracted as the instantaneous amplitude of each channel's analytic representation via the Hilbert transform[225]; sampling rates were then reduced to 1 kHz and the result transformed to dB scale.
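Both the forward (TRF) and backward (decoder) mappings reduce to regularized least squares on time-lagged copies of the regressor. The sketch below illustrates this on a single virtual channel; the ridge parameter, lag window, and variable names (resp for the response, env for the envelope) are illustrative choices rather than the dissertation's exact settings.

    % resp, env: column vectors at a common sampling rate fs (assumed here)
    fs   = 200;
    lags = round(-0.1*fs) : round(0.4*fs);      % 100 ms pre to 400 ms post
    L    = numel(lags);  lambda = 1e2;          % ridge regularization strength
    X = zeros(numel(env), L);                   % lagged stimulus design matrix
    for j = 1:L
        X(:, j) = circshift(env(:), lags(j));   % edge samples should be trimmed
    end
    trf = (X'*X + lambda*eye(L)) \ (X' * resp(:)); % forward model: resp ~ X*trf
    R = zeros(numel(resp), L);                  % lagged response matrix (decoder)
    for j = 1:L
        R(:, j) = circshift(resp(:), -lags(j));
    end
    dec  = (R'*R + lambda*eye(L)) \ (R' * env(:)); % backward model
    rhat = R * dec;                             % reconstructed envelope
    c = corrcoef(rhat, env(:));  r = c(1,2);    % Pearson similarity score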
Statistical analyses. For the reconstructions, repeated measures ANOVA were run across all four levels ('Control', 'Low', 'Medium', and 'High' repetition) in order to detect any overall differences between their related means. For the temporal response functions, in each participant, activity related to the TRF100 component was obtained from the 100-200 ms window, and cross-correlations were performed between 'Control' and each of the other repetition conditions. The resulting peak delays were submitted to a non-parametric, one-tailed, two-sample Kolmogorov-Smirnov test for differences in the underlying delay populations.

Chapter V: Conclusions

A cornerstone of human auditory cognition is the dynamic interplay between the sensory and perceptual bases of sound encoding, as performed by the auditory cortex, a key structure dedicated to analysis and inference related to hearing. The study of each basis presents advantages and challenges. With regards to sensory encoding, access to a wealth of physiological characterizations of the stimulus-to-response mapping will continue to provide invaluable means of establishing a biological basis for computation. However, these means almost always involve non-human animal models, implying a relatively limited range of available cognitive tasks, especially those related to speech. On the perceptual side, incentives exist in attaining a comprehensive understanding of subjective states, and of the processes relating a human listener more efficiently to her environment, because this is crucial information for pressing mental health issues and communication disorders. Yet assessments of "inner experience" can be problematic, as they involve activity that is difficult to tag in time and can be prone to intractable amounts of between-subject variability.

This series of projects addressed these issues in part. First, the studies present a framework in which the analysis and representation of basic versus complex sound encoding models are, at the cortical level, substantially closer than previously assumed. Findings of regions where models overlap suggest a means to extract sensory, domain-general processing stages. Such models may also mirror the structure of encoding models derived across animal models in the electrophysiology literature. Therefore, an avenue for further biologically-informed hypotheses to enter neuroimaging research lies ahead in exploiting their joint accessibility to models of spectrotemporal coding.

Second, access to subjective information can be addressed by approaching perceptual coding models, which posit several alternative ways of representing the same stimulus. Across this manifold, the search for the neural representation closest to the receiver's experience can be narrowed, for instance, by setting up a sensory context that suggests what to expect and when. This tactic is compatible with the study of perceptual restoration phenomena, a set of strategies to fill in missed information. In these conditions, temporally-patterned rhythmic sounds were found to be represented by auditory cortex even in cases when they were only perceived to be present while being physically absent. This implies a means of access to endogenous, dynamic perceptual representations following the subjective experience of sound.

Last, in realistic conditions natural speech sounds are sometimes clear and predictable, and at times ambiguous or uncertain.
Dual sensory and perceptual coding mechanisms may help sustain stable hearing in the face of disruptions unrelated to our acoustic interactions; one such way is to speed up auditory cortical processing when conditions allow one to infer from prior knowledge what occurs next, and possibly to invest that gain at later instances, where uncertainty demands further exploration through the tree of perceptual possibilities. There is now evidence that adaptive modifications to the sensory coding model, in the form of facilitated cortical processing times, may stem from comprehensive knowledge of the speech sequence being listened to. The finding is accompanied, from a perceptual coding perspective, by an increased likelihood of restoration of prolonged missed speech sounds encoded in cortical activity. Overall, the results demonstrate how sensory and/or perceptual coding approaches may further expand windows of enquiry into a listener's personal experience of the communication-rich soundscape.

Appendix A

Supplemental information for "Functional significance of spectro-temporal response functions obtained using magnetoencephalography"

Relevant data are available in a public repository accessible at http://hdl.handle.net/1903/19601

Supplementary Fig 1. Example of equivalence between standard evoked potentials and temporal response function components. (A) The complete MEG sensor dataset from a representative subject (R1946) following tone delivery shows a typical 'butterfly plot' waveform pattern when the data are processed as a standard evoked potential (bandpass filter 1-15 Hz, averaging, and baseline correction, top). The standardized root mean square (RMS) across the sensor array reveals close similarity to the absolute value of the sparser temporal response function obtained for this subject via reverse correlation. (B)
Music shows the longest delays, although a subject group confound cannot be ruled out therefore absolute value comparisons across studies are for reference only. Supplementary Fig 3. Models of subject temporal response function principal peaks. (A) TRFs from the same subject display the timing of principal activity related to different stimulus features and classes, some of which are consistent with exponential growth/decay models. Processing timescales differ according to both stimulus feature and class being modeled. (B) Within the first 200 ms following both tone and speech stimulus onsets, early and late peak deflections of the neuromagnetic signal may be described as transient exponential decay/growth curves (cf. Fig 3C) for the one subject in which both stimuli were tested. Top: Following tone onsets, both deflections were fit by exponential models, achieving time constant estimates at high goodness of fit values (TRF50: τ=3.1 A" B" C" Representa)ve+temporal+response+func)ons+ Mul)tone+peaks+from+representa)ve+ Speech+peaks+from+representa)ve+ TRF50+ TRF100+ TRF50+ TRF50+ TRF100+ TRF100+   101 ms with 2.8-3.4 ms CI-95%; R2=0.989; TRF100: τ=3.5 ms with 3.2-3.8 ms CI-95%; R2=0.988). Bottom: Analog cortical activity described by the TRF and signal envelope from natural speech reveals expanded and similar temporal processing windows at early and late latencies respectively (TRF50: τ=6.2 ms; 5.0-8.2 ms CI-95%; R2=0.892; TRF100: τ=2.6 ms; 2.3-3.1 ms CI-95%; R2=0.967). Supplementary Fig 4. Representation format transformed from early to mid latency speech processing at individual level. In single subjects, TRF components were timed differently for the speech envelope and envelope onset representations, as indicated by grey lines. The difference between early stage components (left) was about 43 ms, which aligns with the average delay (black) between maxima in the representations. Individual response functions showed a considerably reduced delay at mid latency stage (after 100 ms, right), thus suggesting a transformation to the acoustic envelope-based representation by this time. ENV ONS Speech representation 20 30 40 50 60 70 80 90 Pe ak la te nc y [ m s] TRF50 ENV ONS Speech representation 100 110 120 130 140 150 160 170 TRF100   102 Supplementary Fig 5. Addition of static nonlinearity to multitone response properties. As intrinsic nonlinear response features may be potentially precluded by linear analysis, a static nonlinearity was estimated given the MEG response and response prediction by the linear STRF model approximation (data as in Fig 1B). The STRF prediction timeseries was binned according to its magnitude range (values normalized), and an average computed across all values in the MEG response timeseries that map to each bin. Although the graphical procedure produces an accelerating function of the linear contribution of improved goodness of fit over the linear prediction (R2=0.972 quadratic; R2=0.940 linear), cascading the linear prediction with this function marginally improves power explained by a <1.5% margin, suggesting that the linear portion accounts sufficiently for the original model’s predictive power.   103 Appendix B Supplemental information for “Dynamic cortical representation of perceptual filling-in of missing acoustic rhythm”   Supplementary Fig 1. Spatial filters associated with auditory steady-state responses. 
Appendix B

Supplemental information for "Dynamic cortical representation of perceptual filling-in of missing acoustic rhythm"

Supplementary Fig 1. Spatial filters associated with auditory steady-state responses. Spatial filters were obtained from participant datasets using responses from the unmasked acoustic pulse train only; the magnetic field distribution corresponding to that filter is displayed for each subject (N=35). The procedure constructs a fixed virtual MEG sensor on the basis of the most reproducible component of each participant's aSSR; neural dynamics during the noise probes are investigated using this virtual sensor. The large majority of the field distributions are consistent with MEG evoked potentials originating from bilateral auditory cortex. The distribution from the representative subject in Fig. 1 is highlighted. Units are in z-scores.

Supplementary Fig 2. Neural representations of a rhythmic pattern embedded in noise. (a) Representative stimulus-locked neural activity as measured by the virtual MEG sensor. After a transient noise-onset response, the acoustic presence of a rhythmic pattern (top) may elicit an auditory steady-state response (aSSR) weaker in magnitude relative to baseline levels; the acoustic absence of the target rhythmic pattern (bottom) entails a similar noise-specific onset response, but an apparent lack of the aSSR depending on perception. (b) Median data across subjects, with color convention as in (a); grey indicates individual subjects.

Supplementary Fig 3. No systematic acoustic influence on ambiguous perception of stimuli. (a) Spectral analysis of all rhythmic [respectively, non-rhythmic] noise probes in the experiment shows that, as per stimulus design, the spectral content of probes appears virtually identical regardless of a listener's subsequent report on their rhythmic [non-rhythmic] content. Spectra predominantly feature the 1024 Hz tone carrier, and FM interactions where expected (color code as in (b)). (b) To assess for unforeseen random temporal modulations appearing systematically in the probe distributions, each probe trial was cross-correlated with a stimulus segment consisting of the basic pulse train without noise. 5 Hz modulations in the cross-correlation envelope are observed only where the pulse train was actually present (i.e. rhythmic probes), because signal similarity peaks at periodic lags. Between partitions of acoustically identical probes, paired-sample t-tests found no evidence of a nonzero mean difference (rhythmic versus missed, p=0.85; filling-in versus absent, p=0.84).
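A minimal sketch of the control analysis in (b), assuming a toy 1024 Hz carrier pulsed at 5 Hz and using the standard deviation of the cross-correlation envelope as a stand-in modulation index; the sampling rate, probe duration, helper names, and the index itself are illustrative assumptions.

```python
# Sketch of the control analysis of Appendix B, Supplementary Fig 3b:
# cross-correlate each noise-probe trial with the clean pulse train and test
# whether acoustically identical probe partitions differ systematically.
import numpy as np
from scipy.signal import correlate, hilbert
from scipy.stats import ttest_rel

fs = 8000
t = np.arange(0, 2.0, 1 / fs)
# Toy 1024 Hz carrier gated at 5 Hz, standing in for the basic pulse train.
pulse_train = np.sin(2 * np.pi * 1024 * t) * (np.mod(t, 0.2) < 0.02)

def xcorr_envelope(probe, template):
    """Envelope of the probe/template cross-correlation; 5 Hz modulations in
    this envelope indicate periodic similarity to the pulse train."""
    xc = correlate(probe, template, mode="same")
    return np.abs(hilbert(xc))

rng = np.random.default_rng(1)
# Two partitions of acoustically identical probes (e.g. filling-in vs absent),
# paired across the same set of trials or subjects:
part_a = [rng.standard_normal(t.size) for _ in range(20)]
part_b = [rng.standard_normal(t.size) for _ in range(20)]

metric_a = [xcorr_envelope(p, pulse_train).std() for p in part_a]
metric_b = [xcorr_envelope(p, pulse_train).std() for p in part_b]

# Paired-sample t-test for a nonzero mean difference between partitions,
# analogous to the reported rhythmic-vs-missed and filling-in-vs-absent tests.
t_stat, p_val = ttest_rel(metric_a, metric_b)
print(f"paired t = {t_stat:.2f}, p = {p_val:.2f}")
```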
Appendix C

Supplemental information for "Prior knowledge influences cortical latency and fidelity of the neural representation of missing speech"

The full text of the narrated passage used as stimulus material is reproduced below.

'Twas the night before Christmas, when all through the house not a creature was stirring, not even a mouse. The stockings were hung by the chimney with care, in hopes that St. Nicholas soon would be there. The children were nestled all snug in their beds, while visions of sugar plums danced in their heads. And Mama in her 'kerchief, and I in my cap, had just settled our brains for a long winter's nap. When out on the lawn there arose such a clatter, I sprang from my bed to see what was the matter. Away to the window I flew like a flash, tore open the shutter, and threw up the sash. The moon on the breast of the new-fallen snow gave the lustre of midday to objects below, when, what to my wondering eyes should appear, but a miniature sleigh and eight tiny reindeer. With a little old driver, so lively and quick, I knew in a moment it must be St. Nick. More rapid than eagles, his coursers they came, and he whistled and shouted and called them by name. "Now Dasher! Now Dancer! Now, Prancer and Vixen! On, Comet! On, Cupid! On, Donner and Blitzen! To the top of the porch! To the top of the wall! Now dash away! Dash away! Dash away all!" As dry leaves that before the wild hurricane fly, when they meet with an obstacle, mount to the sky so up to the house-top the coursers they flew, with the sleigh full of toys, and St. Nicholas too. And then, in a twinkling, I heard on the roof the prancing and pawing of each little hoof. As I drew in my head and was turning around, down the chimney St. Nicholas came with a bound. He was dressed all in fur, from his head to his foot, and his clothes were all tarnished with ashes and soot. A bundle of toys he had flung on his back, and he looked like a peddler just opening his pack. His eyes--how they twinkled! His dimples, how merry! His cheeks were like roses, his nose like a cherry! His droll little mouth was drawn up like a bow, and the beard on his chin was as white as the snow. The stump of a pipe he held tight in his teeth, and the smoke it encircled his head like a wreath. He had a broad face and a little round belly, that shook when he laughed, like a bowl full of jelly. He was chubby and plump, a right jolly old elf, and I laughed when I saw him, in spite of myself. A wink of his eye and a twist of his head soon gave me to know I had nothing to dread. He spoke not a word, but went straight to his work, and filled all the stockings, then turned with a jerk. And laying his finger aside of his nose, and giving a nod, up the chimney he rose. He sprang to his sleigh, to his team gave a whistle, And away they all flew like the down of a thistle. But I heard him exclaim, 'ere he drove out of sight, "Happy Christmas to all, and to all a good night!"