ABSTRACT Title of dissertation: SOME STATISTICAL AND DYNAMICAL MODELS FOR THE ANALYSIS OF MICROBIAL ECOSYSTEMS AND THEIR GENOMIC DATA Senthilkumar Muthiah Doctor of Philosophy, 2019 Dissertation directed by: Professor Héctor corrada Bravo Department of Computer Science Embedded within their genetic makeup and ecology, microbes harbor unparalleled stories on natural selection, evolution and biomedicine. In modern biology, such stories are elucidated through rigorous interrogation of microbial ecosystems with a variety of theoretic and experimental techniques. These range from abstract, isolated mathemati- cal models to high-resolution sequencing technologies that probe every single nucleotide of a cell’s DNA. It is clear that inferences thus obtained are markedly sensitive to the unforeseen technical variability introduced during an experiment, and are limited by the tractability and robustness of the models in generating sound hypotheses. We have devel- oped statistical and computational tools to advance statistical inference for microbial ge- nomics by overcoming a subset of technical biases, and have explored certain interesting cases of microbial interactions and their evolution by developing tractable mathematical models. Compositional bias induced by the sequencing machine. A DNA sequencing ma- chine produces only percentage measurements (fraction molecules of a given type) of the DNA molecules in its input. When contrasting measurements from different inputs, one therefore obtains confounded inferences on absolute concentrations (molecules per unit volume). We theoretically analyze this compositional bias problem with significant gen- erality, and exploit it to develop an empirical Bayes approach to solve it under certain assumptions with particular emphasis on microbial sequencing technologies. Suicidal attributes of prokaryotic adaptive immunity. The recently discovered CRISPR systems provide the first examples of bacterial and archaeal adaptive immune systems op- erating against invading viruses over ecological time scales. Equally surprising as their adaptive nature, is their ability to induce high rates of host autoimmunity. We theoretically analyze the ecological and evolutionary dynamics of such a costly defense mechanism in simplified models of prokaryote-phage coevolution. We show that by allowing for regulated post-infection activation, CRISPRs can function by exploiting a dual defense strategy of abortive infection and anti-viral resistance. Additional statistical and analytic extensions for some related questions on cluster- ing and multi-resolution analysis also appear. SOME STATISTICAL AND DYNAMICAL MODELS FOR THE ANALYSIS OF MICROBIAL ECOSYSTEMS AND THEIR GENOMIC DATA by Senthilkumar Muthiah Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2019 Advisory Committee: Professor Héctor Corrada Bravo, Chair/Advisor Professor Eric V. Slud Professor Sridhar Hannehalli Professor Mihai Pop Professor Doron Levy ©c Copyright by Senthilkumar Muthiah 2019 Foreword Material from parts I and II of this thesis, respectively appear in the following re- search publications. 1. Kumar, M.S., Joshua B. Plotkin, and Sridhar Hannenhalli. "Regulated CRISPR Modules Exploit a Dual Defense Strategy of Restriction and Abortive Infection in a Model of Prokaryote-Phage Coevolution." PLoS Computational Biology (2015). † Authors’ contributions: Conceived and designed study: MSK SH. Performed the experiments: MSK. Analyzed data: MSK. Contributed analytical tools: MSK JBP. Wrote the paper: MSK JBP SH. 2. Kumar, M.S., Eric V. Slud, Kwame Okrah, Stephanie C. Hicks, Sridhar Hannen- halli, and Héctor Corrada Bravo. "Analysis and Correction of Compositional Bias in Sparse Sequencing Count Data." BMC Genomics (2018). † Authors’ contributions: Conceived and designed study: MSK HCB. Performed the experiments: MSK. Contributed analytical tools: MSK EVS KO HCB. Data analysis and interpretation: MSK EVS KO HCB. Wrote the paper: MSK SH SCH EVS HCB. ii Acknowledgments Deeply satisfiying scientific research often starts off with a well defined question and a fragment of an idea to address it, and ultimately burgeons to a full fledged scientific paper with rigorous logical discipline. As fascinating as that process sounds, working through it, a graduate student can experience sky-highs and hellish-lows that cast severe self-doubts. Looking back now, as I witness the winter migration of a giant flock of ca. 8000 arctic snow geese along the Delaware Bay, I know, for sure, I am able to navigate through them because of a loving family, encouraging friends and wisdomatic mentors. I wish to thank my mentors Mukund Thattai and Russell S. Schwartz for inspiring confidence in scientific research; for me, it all started with them. Profs. Kevin Chen, Eduardo Sontag, and P.S. Thiagarajan made sure that that enthusiasm did not plateau off. I am indebted to Profs. Markus Deserno, Sridhar Hannenhalli, Doron Levy, Joshua Plotkin, Niranjan Nagarajan, and Eric V. Slud for their insightful lessons, thorough brainstorming sessions, and fantastic collaborations that altogether elevated the quality of my research experience beyond what I had wished for. Finally, Prof. Mihai Pop and my otherworldly gift of a PhD advisor, Prof. Héctor Corrada Bravo taught and patiently guided me the rest of the way ensuring that I graduate happily, all the while helping me lead a very comfortable life with enriching advice and uninterrupted funding. Thanks also to all my friends Mathieu Almeida, Rajat Anand, Lokheshvar Balaku- mar, Claudia Bancila, Tristan Bereau, Denis Bertrand, Manikandan Chandran, Faezeh Dorri, Mohamed Geunady, Luis Guardado, Broc Gullet, Sarika Hegde, Joyce Hsiao, Keith Hughitt, Asif Javed, Debashis & Subhra Kar, Aparna Lakshmanan, Byoungkoo iii Lee, Rajarajan Loganathan, Piötr Mardziel, Lee Mendelowitz, Vasanth P. Murari, Vinutha Nagaraj, Sanjanaa Nagarajan, Satyajeet Ojha, Kwame Okrah, Nathaniel Olson, Joseph N. Paulson, Constantine Pop, Elisabet Pujadas, Navneet Rai, Anand Babu Rajendran, Vi- gneshwar Ramakrishnan, Sathish Babu Shanmugam, Prasanth Selvarajan, John Smith, Gautam Singh, Gao Song, Arvind & Sharanya Suresh, Hisham Talukder, Kun Wang, Andreas Wilm, Chengxi Ye and others for their critical comments, encouragement, and memorable life-events. No words can express the gratitude I feel toward my beloved parents, wife, sister, and grandmother as the love they have shown, and the sacrifice they have endured over the years for my livelihood, betterment, and happiness, will forever be unmatched. It is only fitting that I dedicate this thesis to my family, friends and mentors. M. Senthil Kumar College Park, MD 2018 iv v Table of Contents Foreword ii Acknowledgements iii List of Tables ix List of Figures x I Sequencing technology induced systematic biases 1 1 On the fundamental role of DNA sequencing in modern biology, and its troubling output characteristic. 2 2 An analysis of sequencing technology induced compositional bias in generating confounded concentration inferences. 10 2.1 A sequencing experiment . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.1.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2 When can we hope to reconstruct X0 from Y with compositional bias correction? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3 On the generality of compositional correction factors, and some strategies to esti- mate them. 23 3.1 The generality of compositional correction factors in explaining technical variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2 Estimation strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.3 Simulation analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4 A scaling normalization technique for estimating compositional bias from sparse relative frequency data. 44 4.1 Classic scale normalization techniques suffer with sparse 16s count data . 48 4.2 The proposed technique (Wrench) reconstructs precise group-wise com- positional factor estimates . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.2.1 Wrench has better normalization accuracy in experimental data . 55 vi 4.3 Inferences following compositional correction show improved coherence with experimental data . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.4 Compositional scale factor estimates imply substantial technical biases, indicating importance of further experimental studies . . . . . . . . . . . 65 4.5 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.5.1 An approach (Wrench) for compositional correction of sparse, genomic count data . . . . . . . . . . . . . . . . . . . . . . . . 67 4.6 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 II Adaptive immunity in prokaryotes 86 5 The curious case of prokaryotic adaptive immunity. 87 6 Ecological dynamics of autoimmune CRISPR induced prokaryote-phage coevo- lution. 90 6.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 6.1.1 Behavior of a simple prokaryotic immune system with regulated autoimmunity . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 6.1.2 A detailed model for CRISPRs incorporating their adaptive abil- ity and regulation . . . . . . . . . . . . . . . . . . . . . . . . . . 101 6.2 Population dynamics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 6.3 Spacer and protospacer concents in free and infected cells . . . . . . . . . 106 6.3.1 Simulations and bifurcation analysis. . . . . . . . . . . . . . . . 111 6.3.2 SND absence is extremely lethal in the absence of regulation . . . 111 6.3.3 A simple constraint determines CRISPR maintenance in the model 113 6.3.4 Coevolutionary dynamics under the assumption of equilibrated spacer levels over CRISPR evolutionary time scales . . . . . . . . 114 6.3.5 Four characteristic regimes of CRISPR activity . . . . . . . . . . 115 6.3.6 Elimination of abortive infection improves coexistence of phages 117 6.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 III Appendix 129 7 Ecological equivalence as a modeling strategy for metagenomic count data. 130 7.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 7.2 Data likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 7.3 Posteriors for ψ and Z . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 7.3.1 Conditional posterior for ψ . . . . . . . . . . . . . . . . . . . . 136 7.3.2 Conditional posteriors for Z . . . . . . . . . . . . . . . . . . . . 137 7.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 vii 8 Evolutionary invasion analysis of altruistic post-infection suicidal genotypes in a well-mixed epidemiological model. 150 8.1 Evolutionary Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 9 Multi-resolution analysis with bifurcation analysis of smoothing spline models. 157 9.1 Smoothing Splines Models . . . . . . . . . . . . . . . . . . . . . . . . . 158 9.2 Two specific instances of the problem . . . . . . . . . . . . . . . . . . . 160 9.2.1 Ridge regression . . . . . . . . . . . . . . . . . . . . . . . . . . 161 9.2.2 Cubic smoothing splines . . . . . . . . . . . . . . . . . . . . . . 162 9.2.3 Deriving the solution of the cubic smoothing spline problem . . . 163 9.3 Proposed strategy for multi-resolution analysis of case-control longitudi- nal data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 9.4 Model construction for longitudinal case-control data . . . . . . . . . . . 165 9.4.1 Estimation and Notation . . . . . . . . . . . . . . . . . . . . . . 167 9.5 Bifurcation analysis of γ(t,λ ) with λ as the control parameter . . . . . . 168 9.5.1 Confidence intervals for t̂ given λ . . . . . . . . . . . . . . . . . 168 9.6 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169 9.6.1 Metagenomic time series . . . . . . . . . . . . . . . . . . . . . . 169 9.6.2 Genome-Wide DNA Methylation Signals . . . . . . . . . . . . . 170 Bibliography 175 viii List of Tables 3.1 Scaling normalization approaches derive their technical bias estimates from ratio of proportions. . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.1 Example simulations illustrate the limitations of current techniques. . . . 55 4.2 Correlations of compositional scales with orthogonal measurements on concentrations/technical biases. . . . . . . . . . . . . . . . . . . . . . . . 61 6.1 Descriptions of variables and parameters in model 1. . . . . . . . . . . . 95 6.2 Description of the different variables used in the detailed model. . . . . . 102 ix List of Figures 1.1 Probing the central dogma with a DNA sequencer. . . . . . . . . . . . . . 5 1.2 Compositional bias: Contrasting relative frequencies lead to confounded concentration inferences . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1 Compositional bias introduced by sequencing technology. . . . . . . . . . 11 3.1 Scaling normalization techniques in genomics from the perspective of compositional bias correction. . . . . . . . . . . . . . . . . . . . . . . . 26 3.2 Simulation strategy for evaluating current normalization and differential expression analysis toolkits for compositional correction. . . . . . . . . . 34 3.3 Total sum based normalization, like RPKM/Rarefication, under a Uni- form fold change distribution. . . . . . . . . . . . . . . . . . . . . . . . 37 3.4 Total sum based normalization, like RPKM/Rarefication, under a Gaus- sian fold change distribution. . . . . . . . . . . . . . . . . . . . . . . . . 38 3.5 Confounded inference with total sum and reference normalization strategies. 39 3.6 Reference normalization (TMM/DESeq/Median) under a Uniform fold change distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.7 Reference normalization (TMM/DESeq/Median) under a Gaussian fold change distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.1 Importance of compositional bias correction in sparse metagenomic data. 46 4.2 Estimation of compositional correction scales from sparse count data. . . 49 4.3 Adding pseudocounts leads to biased normalization. . . . . . . . . . . . . 50 4.4 Ignoring zeroes can introduce bias in normalization, when zeroes pre- dominantly arise from under-sampling. . . . . . . . . . . . . . . . . . . . 52 4.5 Wrench scales outperform competing approaches in reconstructing com- positional changes and in differential abundance testing. . . . . . . . . . 56 4.6 Simulation performance in a balanced design. . . . . . . . . . . . . . . . 57 4.7 Simulation performance in an unbalanced design. . . . . . . . . . . . . . 58 4.8 Wrench scales lead to reduced false positive calls. . . . . . . . . . . . . . 60 4.9 Wrench normalized data lead to better downstream inferences. . . . . . . 65 4.10 Importance of compositional correction in common bulk RNAseq studies. 68 4.11 Wrench retains potential biological information, and indicates importance of compositional correction in general practice. . . . . . . . . . . . . . . 84 x 4.12 Benchmarking analysis of the small scale, high coverage Argyropolous et al., miRNA dataset for deviation from expected fold changes in the clustered symmetric DE without global changes in expression ratiometric A versus B. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 6.1 Bifurcation analysis of a simple model of a prokaryotic immune system with regulated autoimmunity side effect. . . . . . . . . . . . . . . . . . . 94 6.2 A detailed model of CRISPR dynamics. . . . . . . . . . . . . . . . . . . 103 6.3 Reactions influencing total spacer and protospacer densities. . . . . . . . 107 6.4 SND absence is lethal due to accumulation of self-targeting spacers. . . . 124 6.5 The (δ ,β ,GC) space. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 6.6 Elimination of ABI allows for improved phage densities. . . . . . . . . . 126 6.7 Qualitative behavior of regulated CRISPR modules. . . . . . . . . . . . . 127 6.8 Decoupled behavior of a spacer deletion system. . . . . . . . . . . . . . . 128 7.1 A plate model illustration of the proposed generative process underlying metagenomic counts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 7.2 Prior distributions based on a tree of relationships among taxa. . . . . . . 141 7.3 Tree priors improve taxonomy enrichments. . . . . . . . . . . . . . . . . 145 7.4 Equivalence classes capture environmental gradients. . . . . . . . . . . . 146 7.5 Equivalence classes of OTUs as better hypotheses generators. . . . . . . . 148 8.1 Evolution of host abortive infection potential. . . . . . . . . . . . . . . . 154 9.1 Long term and short-term differences in microbial time series pre- and post- travel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 9.2 Scale specific genome-wide differences in DNA methylation in lung can- cer tissue relative to controls. . . . . . . . . . . . . . . . . . . . . . . . . 173 9.3 Enrichment of transcription factor binding sites in hypo-methylated regions.174 xi Part I Sequencing technology induced systematic biases 1 Chapter 1 On the fundamental role of DNA sequencing in modern biology, and its troubling output characteristic. That phenotypic variance in biological traits is a consequence of underlying ge- netic changes was suggested concretely in the early 1900s by the work of G. Mendel, T.H. Morgan, R.A. Fisher and others [10–14]. Much of this phenotypic manifestation of genetic information is attributed to the central dogma of molecular biology [15–18], a foundational principle based on three key players: cellular genes in deoxy-ribonucleic acid (DNA) forms are first transcribed to their corresponding ribonucleic-acid (RNA) forms, which are subsequently translated to protein products. Molecular biologists have continued to disentagle the mechanistic basis of the central dogma, and in doing so, have not only specified new roles to existing players, but have also added new players to the story [19–22]. RNA and protein mediated regulation are examples of the former, while the epic epigenetic machinery and their growing list of potential consequences are exam- ples of the latter [23–27]. The players and the interactions among them are, then, well poised to generate variability and stabilize organismal phenotypes over ecological and evolutionary time scales [14, 28–31]. 2 While measurements on phenotypes are more easily obtained, the inverse problem of identifying the underlying genotypic determinants have continued to be challenging to this day [32]. Key experimental techniques and technologies have been developed along the way to help researchers address their questions on the cental dogma, the genes and their interactions efficiently. Knockout experiments are perhaps the most revealing, in making the first steps toward causal characterizations [33–37]. By carefully generating mutant organisms that are deficient in a target gene, and contrasting their behavior against wild-type controls, significant progress can be made in isolating the key functional roles played by the gene. This approach has been very effective in prokaryotes (bacteria and archaea), flies and small animals with shorter generation times, and with phenotypes that are largely determined by a single gene/locus in the genome [38–40]. In larger organ- isms like humans, although derived cell lines from specific tissues still allow for effective implementation of knockout designs, a more general approach to identifying multi-locus traits can be envisioned if one can access the underlying genomic sequence accurately in its entirety and measure the corresponding phenotypes. If such a procedure can be estab- lished, it can be viewed as a natural multi-variate knockout experiment that exploits the observed stochastic genetic variation in extant populations. Such is the utility, offered by the remarkable Nobel-prize winning DNA sequencing technology [41–46]. Briefly, accessing genotypes with DNA sequencing works as follows. The input genomic DNA is broken into short random pieces. Each piece is amplified at an average gain, save some technical artifacts, and its nucleotides read off. Most laboratory machines produce a few million such short sequences, each around three hundred base pairs length. However, what we seek is a full description of the genomes present in our input. So one 3 resorts to algorithmically stiching the output sequences based on their overlaps, as the set of overlaps of a given output sequence signifies its possible local neighborhoods in the input. In this way, one obtains more relevant larger genomic segments like genes or even entire genomes. More recent, expensive, pocketable machines can produce roughly full length bacterial genomes, in the order of of mega base pairs of DNA. Interestingly, a sequencer’s output is not only useful for identifying distinct se- quences in the input, but it also allows quantification of their relative frequencies. Re- gardless of their output statistics and costs, the fundamental input-output characteristic of high throughput DNA sequencing machines remain the same: they produce distinct sequences in abundances proportional to their input relative frequencies, in the sense that if T total short sequences are generated by the machine, an input DNA sequence with a relative frequency q is represented qT times in the output, on average [3, 47–55]. It is this fundamental fact of sequencing based quantification that allows for much mischief from derived technologies that exploit DNA sequencing protocols. We revisit this point on quantification after introducing a few derived technologies next. Remarkably, the ability to sequence DNA allows creative opportunities to identify not only a cell’s genomic sequence, but also get a snapshot of other macromolecules and their behavior within a cell. This is illustrated in Fig. 1.1. All the biochemist has to do is encode the entity of interest in a DNA form so that the signal can be read off by the DNA sequencer. For instance, the enzyme reverse transcriptase catalyzes the conversion of RNA to DNA. If we can manage to transform the entire set of expressed RNAs in a cell to their corresponding DNA forms with such an enzymatic reaction, then with the aid of a DNA sequencer, one can expect to sequence and quantify the entire set of expressed 4 Regulation by binding RNA Proteins Bisulfite Enrichment for binding signals treatment Reverse Transcriptase ChipSeq RNASeq DNA Sequencer Figure 1.1: Probing the central dogma with a DNA sequencer. The dashed lines map a gene expression program regulated with a positive feedback motif. A few genomic tech- niques that allow researchers to investigate the pathway’s distinct steps are illustrated. Methylated cytosine residues along the gene body is indicated with tiny filled circles, and unmethylated residues with an open circle. RNAs in a single cell, measuring only a couple of microns. This is the idea that drives RNAseq technologies that aim to quantify gene expression, perhaps the most exploited technology built around DNA sequencing [3, 56–62]. Similarly, if one manages to enrich an input sample with only segments in human DNA that are bound by a particular protein molecule, sequencing the enriched sample with a DNA sequencer then yields information on the identity of the protein’s binding sites. This is the idea behind the ChipSeq technol- ogy [63–67]. In fact, one can go further and aim to identify signals at a single nucleotide 5 Gene Methylation Signal level in the DNA! Millions of cytosine residues in the human DNA are found harboring an extra methyl group. A particular treatment allows transformation of such methylated residues to uracils, while unmethylated cytosines are retained as cytosines, leaving the rest of the DNA material more or less intact [68]. The resulting position specific methy- lation information is then read off by processing/assembling the output sequences from the DNA sequencer. This is the engineering feat behind the bisulfite sequencing technol- ogy [68–70]. Needless to say, other variations of such derived technolgies exist, each with its own target measurements of interest. At the time of this writing, many consortia like ENCODE [71, 72], TCGA [73–76], GTex [77–82], HMP [83–86], and MetaHit [87–91] with several million dollars in public funds have been established with contributors from all over the world. Their purpose, for the most part, is to exploit sequencing based tech- nologies to produce and analyze associated (genotypic) data for diverse phenotypes of public health interest. The data produced is publicly available for the world’s researchers to use. We hope it is clear to the reader that sequencing in various disguises have and will play a fundamental role in helping researchers identify and measure signals of diverse biological origins. We indicated in the paragraph before last that sequencing technology allows only relative frequency, and not absolute abundance/concentration1, quantifications of the in- put molecules2. While there are questions in biology where relative frequency measure- ments are useful (e.g., quantitative geneticists have traditionally tracked allele frequencies in characterizing their long term evolution and fixation in a target population [92, 93], al- 1Henceforth, the term concentration is used to mean the absolute abundance of a molecule in the units of number of molecules per unit volume of the input. We contrast this with relative frequency/relative abundance, which is used to mean the fraction molecules of a given type in the input. 2More generally, input molecules that are measured in an experiment will be termed features. 6 though one could argue that effective population size measures are still needed to track the fates of low relative frequency mutants), there are at least three fundamental reasons for why feature-wise concentration measurements are attractive and should be sought. Wild-type Knock-outs A case-control experiment Induced change in knock-outs Absolute abundance of S A S AB B genes What a S S sequencing A machine sees A B B What a sequencing B machine S A B S A outputs Figure 1.2: Compositional bias: Contrasting relative frequencies lead to confounded concentration inferences. Suppose we want to compare gene expression measurements from wild-type and knock-out genotypes of a particular cell type. Suppose genes S and A have similar expressed RNA concentrations (number of molecules per unit volume of the cell) in wild-types and knock-out cells, while B has increased in its concentration in the knock-outs due to biological reasons. Because a sequencing machine’s output allows relative frequency quantification only, an increase in B leads to reduced abundance measurements from other genes. An analyst might reason A and S to be significantly reduced in abundance, while, in reality they did not. 7 First, it must be irrevocably emphasized that concentrations introduce far less ambi- guities both in generating sound hypotheses, and in deriving sound scientific conclusions. We shall illustrate this with a few examples. Consider the thought experiment in Fig. 1.2, where knock-out genotypes experience an increased concentration of gene B’s RNAs alone. If we run the derived RNAseq samples through a sequencer (which only provides relative frequency measurements), we will find that the output relative frequencies cor- rectly indicate that gene B’s expression has increased. But the output would also suggest that genes S and A have decreased in their expression. The latter conclusion is false, ir- relevant, and is purely caused because of the bias ( hereafter, referred to as compositional bias ) induced by the relative frequency based quantification system. Compositional bias is caused solely because relative frequencies by definition are constrained to sum to 1, and are therefore anti-correlated. Had the experimentalist tracked concentrations, such confounded inferences would not have arisen in the first place. Some well known micro- bial markers of Crohn’s disease based on host intestine associated microbial abundance markers turned out to be artifacts of relative frequency based quantification, and had no immediate relevance to the underlying biology [94]. In an RNASeq experiment contrast- ing genes’ expression values in mice liver and kidney, a decreased expression of house keeping genes were attributable to the increased concentration of a few dominant genes in the liver tissue samples [52]. In fact, in any RNAseq experiment, genes with shorter lengths can appear to be lowly expressed simply because their longer length counterparts contribute more to the sequencing machine’s output [50]! In the era of modern biology, where we attempt to base hypotheses and conclusions on millions of molecular features that are quantified using a DNA sequencer, how can we attribute any measured relative 8 frequency change to an underlying biologically relevant concentration change? In appen- dices A and C, we outline a couple of our own research programs where such concerns limit serious progress. Second, key biological phenomena exist in which the absolute concentrations of the players have more meaningful roles, than their relative frequency descriptions. For instance, intracellular gene expression kinetics and their noise characteristics are largely driven by absolute RNA and protein numbers, not their relative frequencies [95–99]. Sim- ilarly, cystic fibrosis patients can exhibit very stably associated microbial relative frequen- cies in the same way as healthy controls, yet suffer from increased absolute total microbial loads [100, 101]. Finally, we must acknowledge that absolute measures are attractive simply be- cause they are more general, allowing immediate access to relative frequency measures if needed. This generality is bound to be favorable in questions involving as yet unexplored biological systems and mechanisms. Given the major impact sequencing and its derived technologies have in modern biology, sequencing machine induced compositional bias in molecular quantification is certainly a major cause for concern in designing experiments. In chapters I.2, I.3 and I.4, we aim to analyze, and estimate the sequencing machine induced compositional bias under certain assumptions. 9 Chapter 2 An analysis of sequencing technology induced compositional bias in generating confounded concentration inferences. In the previous chapter, we mentioned that sequencing technology has been instru- mental in measuring diverse biological signals. We also stressed that this remarkable flex- ibility comes with atleast one tradeoff: output from a DNA sequencer only retain relative frequencies of the input molecular features, and not their absolute concentrations. When contrasting feature-wise relative frequencies across distinct biological sample sources1, truly null features can exhibit non-zero apparent contrasts. This artifact is shared by all relative frequency quantification systems, and the DNA sequencer is no exception. We illustrated the artifact in Fig. 1.2, and gave it the name compositional bias. In this chapter, we will analyze compositional bias in significant generality. In par- ticular, our interest would be in deriving the conditions under which a relative frequency measurement system like the DNA sequencer would yield unbiased concentration infer- ences. We will also identify the compositional correction factor, which when estimated correctly would remedy the compositional bias problem. It is not surprising that this 1an analysis known as differential abundance analysis or differential expression analysis in the biology literature. 10 quantity is a measure of the total feature load in the input sample. 2.1 A sequencing experiment DE Test 3 DE Test 2 (on Absolute (on “Absolute” abundances) abundances) Contamination, X 0 extraction, gj amplification Xgj & other technical biases q If technical biases perturb gj Sequencing Ygj feature abundances similarly across conditions, compositional correction yields DE inference on X0 (green). Otherwise, on X (blue). DE Test 1 (on Relative abundances) Figure 2.1: Compositional bias introduced by sequencing technology. As a sample j from group g of interest is prepared for sequencing, its true internal feature concentra- tions (organized as a vector) X0g j is transformed by various technical biases to Xg j. A sequencing machine introduces compositional bias by generating counts Yg j proportional to the input absolute abundances in Xg j according to proportions qg j = [Xg ji/(∑k Xg jk)], i and k indexing features. Directly performing a differential abundance test on Y (DE Test 1), by using normalization factors (discussed in text) proportional to that of total se- quencing output (e.g., R/FPKM/subsampling in metagenomics [3–6]) amounts to testing for changes in relative abundances (frequencies) of features in X, in general (not X0). For inferring differences in absolute abundances (concentrations), we need to reconstruct X0 from Y to perform our inference (DE Test 3). For compositional bias correction in partic- ular, we care about reconstructing X j from Y (DE Test 2). We show more formally later that compositional correction can reconstruct X0 if technical biases (including contami- nation) are comparable across treatment and control groups. Fig. 2.1 illustrates a general sequencing experiment and sets up the problem of com- 11 positional bias correction. We imagine a set of samples/observations j = 1 . . .ng arising from biological conditions g = 1 . . .G (e.g., cases and controls). The true concentra- tions of features in every input sample is organized as a vector X0g j·, are perturbed by various technical sources of variation as the sample is prepared for sequencing. These technical sources of variation include any unforeseen contaminants, and/or specific bi- ases introduced in a measurement pipeline [51, 60, 61, 102–107]. For instance, when surveying microbial taxa by sequencing 16S ribosomal RNAs, taxonomy specific biases in the relative frequencies can arise by variation in the ribosomal RNA extraction effi- ciencies [108, 109], binding preferences of DNA amplification agents and even the target ribosomal RNA’s Guanine-Cytosine content [107]. All these cause systematic, differen- tial amplification across the surveyed microbial taxa. The end result after all such nui- sance perturbations is a transformed concentration vector Xg j·, the net total concentration of which is denoted by Tg j = ∑i Xg ji = Xg j+, where the + indicates summing over that subscript. This is the input to the sequencer, which introduces compositional bias by pro- ducing sequencing reads proportional to the absolute feature abundances represented in Xg j·. The output short sequences are processed and organized as counts for each feature in a vector Yg j·, which now retain only relative abundance/relative frequencies of features in Xg j· as q̂g ji = Yg ji/Yg j+ = Yg ji/τg j. Here τg j = Yg j+ is the total number of sequences produced by the machine ( sample depth ) for sample g j. We discuss the question of recovering X0g j· for all g and j later in the text. For now, we shall restrict our attention to reconstructing X from Y , as it is in this step, the sequencing machine induces the compositional bias we are interested in. Because we are ignoring all other technical biases inherent to the experiment/technology (i.e., the 12 process from X0→ X), our discussions apply to all derived technologies based on DNA sequencing. 2.1.1 Analysis Given only the feature-wise relative frequencies output by a sequencer (Y ), our goal is to identify the conditions under which we can achieve both (a) unbiased estimates of true underlying concentration fold changes (contrasts), and (b) unbiased inferences on the estimated concentration contrasts for all features i, in Fig. 2.1 ), when using classical general linear models often exploited in genomics. We briefly summarize the steps in our analysis below. • Lemma 2.1 provides the condition for obtaining unbiased concentration fold change (or contrast) estimates from relative frequencies. It serves to define the composi- tional correction factor. • Conditions for achieving unbiased inferences with independent feature-wise gen- eral linear models are derived in two steps as follows: – Lemma 2.2 uses the idea that for any given feature i in the input, contrasting its frequencies with a linear model (between two experimental conditions, say) would yield accurate concentration inferences for the feature, when the rest of the features do not undergo any concentration change. This fact is reflected as a linear constraint that relates the feature-wise proportions to the compositional correction factor. – Theorem 2.3 generalizes Lemma 2.2, and asks when the constraint derived in 13 Lemma 2.2 would apply to all features in the input. We thus recover condi- tions to obtain accurate inferences for all feature-wise concentration contrasts with relative frequencies. • Finally, Therorem 2.4 combines Lemma 2.1 and Theorem 2.3 to recover the con- ditions for achieving both unbiased contrast estimates and their inferences with relative frequencies. Theorem 2.5 generalizes the model dealt with in the afore- mentioned lemmas and theorems in a straightforward fashion. Model For simplicity, we shall first consider the generative process in eqn. 2.1, and derive some consequences. Xg j· ∼Multinomial((Tg j,qg·) ) (2.1) X Y |X τ ∼Multinomial τ g j·g j· g j·, g j g j, Xg j+ We will note later that the conclusions also hold when the assumption on a fixed proportions vector qg· for all samples at sage X is relaxed by requiring very general mo- ment conditions. The Multinomial assumption on X follows for example from a Poisson assumption on the expression of features Xg ji [47, 99, 110]. For our analysis, we only consider features truly expressed in the control group (g = 1, regardless of them being observed or not in a sequencing experiment) as we can only estimate fold changes for features occurring in the control group, and index them with i = 1 . . . p. Let φg be the summed proportion of features internally expressed only in group g but not in the control group (regardless of whether they are observed or not). For 14 interestingness, we assume p > 1. Clearly, 0 < q1i < 1 for all i. Fold changes are defined as ratios of marginal expectations. Define feature-wise concentration fold changes at stage X , ν E[X= g1i]gi E[X ] . The corresponding apparent contrasts ξgi from relative abundances11i at stage Y is defined as: ξ E[q̂g+i]gi = . Denoting E[Tg1] as the marginal average of the totalE[q̂1+i] abundances Tg j, from model 2.1, we have E[Xg ji] = E[Tg1] ·qgi for all j = 1 . . .ng. Under model 2.1, the fold changes can be re-written as: νgi = E[Xg1i]/E[X11i], and ξgi = qgi/q1i. In the entire process, we only get to observe Yg j· for all j = 1 . . .n and g = 1 . . .G. Lemma 2.1. Under assumptions 2.1, for all features i, νgi = ξgi, if and only if Λg = E[Tg1] = 1. Λ−1E[T ] g will be termed as the compositional correction factor.11 Proof. The proof follows directly from the definition of fold changes νgi associated with the ith feature’s concentrations. E[Xg1i] E[Tg1]qgi qgi E[q̂ ]ν g+igi = = ≡ Λg · = Λg = Λgξgi (2.2)E[X11i] E[T11]q1i qi1 E[q̂1+i] which is equal to νgi iff Λg = 1. Lemma 2.2. Under assumptions 2.1, when applying the standard log-linear mean model on the total sum normalized data independently for each feature i, logE[Yg ji/τg j] = µi + αgi with µi quantifying logged control group proportions, logq1i, and αgi quantifying the log-fold change of relative abundances, logξgi, there is a necessary and sufficient condition under which αgi = 0 ⇐⇒ logνgi = 0, the log- fold change associated with 15 concentrations. Furthermore, this condition is given as: [ ] 1 Λgφ1−q g + ∑ νgkq1k = 1 1i k,k=6 i Proof. Following Lemma 2.1, re-write the proportion in group g as: −1 Λ −1 −1 q Λ g νgiq1i Λg νgiq1i νgiq1i gi = g νgiq1i = = =1 φg +∑k qgk Λgφg +νgiq1i +∑k,k 6=i νgkq1k ≡ 1ν (2.3) 1+ g\i (1−q1i)νgi q1i where we have set: [ ] 1 νg\i = Λgφg + ∑ νgkq1k (2.4)1−q1i k,k=6 i Substituting eqn. 2.3 in the assumed linear model: logE[Yg ji|τg j] = logqg ji + logτg j = µi+α ν gi+ logτg j, and noting µ = logq , α = log qgi i 1i gi q , it is clear that α g\i 1i gi = 0 ⇐⇒ ν =gi 1. It is thus seen that νg\i = 1, is a necessary and sufficient condition for the statement αgi = 0 ⇐⇒ νgi = 1 to hold. Theorem 2.3. Under the model above, there exists a unique vector of fold changes ν∗g· under which ∀i = 1 . . . p, αgi = 0 ⇐⇒ νgi = 1. Furthermore, each i = 1 . . . p entry of ν∗g· is given as: [( )( ) ( ) ( )( )] 1−q η 1 1−q η ν∗ 1i g 1k ggi(Λg,φg,q1·) = −1 + −1q1i 1−q1i 1− p ∑k q1k 1−q1i (2.5) 16 with ηg = Λg ·φg. Proof. We want to study the conditions under which νgi = 1∀i = 1 . . . p. Substituting this in equation 2.4 from lemma 2.2, and stacking the constraints for all i, we get a linear system: Qν = γ where, Q is a p× p matrix with Q q(i, j) = 1 j1−q if j 6= i and 0 otherwise. ν =1i [|νgi| η ]pi=1, a p× 1 vector, and γ = [|γ p g gi|]i=1, a p× 1 column vector with γgi = 1− 1−q ,1i where ηg = Λgφg, a non-dimensional parameter. A unique solution for this equation is obtained directly as ν∗ = [|ν∗ p −1gi|]i=1 = Q γ if Q is invertible. We now show that Q is invertible by observing that the column vectors of the p× p square matrix Q are linearly independent. If we denote the columns from left to right as K1, . . .K p p, for linear dependence, we want the statement ∑ j=1 α jK j = 0 =⇒ α j = 0 to hold. Identifying each of the column’s projections on all p dimensional unit vectors e j, and noting all q1i ∈ (0,1), we can write: { } 1 1 1 1 K j = q1 j{ e + · · ·+ e + e + · · ·+ e1−q 111 1−q } j−1 1−q j+1 p1, j−1 1, j+1 1−q1,p p = q1 j ∑ 1 1e − e i=1 1−q i 1i 1−q j 1 j 17 Generating the required linear combination of these column vectors Ki, we find: p p p ∑ 1 p 1 α jK j = 0 =⇒ ∑ α jq1 j ∑ ei = ∑ α jq1 je j j=1 j=1 i=1 1−q1i j=1 1−q1 j 1 p =⇒ ∀ i, 1−q ∑ 1 α jq1 j = αiq1i 1i j=1 1−q1i p =⇒ ∀ i, ∑ α jq1 j = αiq1i j=1 Summing the last equation over all i = 1 . . . p, we get: p p 2 ∑ αiq1i = ∑ αiq1i i=1 i=1 Because all q1i ∈ (0,1), the above equation can only be true if αi = 0 ∀i. Hence, the vectors are lineary independent; Q is full rank, and invertible. Indeed, we can go further and derive the solution analytically. Notice that Q = rqT1·−D where r is a p× 1 vector with the ith component equal to 1 1−q and q1· is a1i p× 1 vector of control proportions. Notice all 0 < q1i < 1. D is a p× p diagonal ma- trix with diagonal entries given by 1−q1 jq ∀ j = 1 . . . p. If we set F = D− rq T 1·, we then1 j want Q−1 = −F−1. Denoting U = −r, and V = qT1·, we can write, F−1 = (D+UV )−1. Woodbury identity then(yield)s F−1 = D−1−D−1U( (I+V D)−1U)−1V D−1, a p× p matrix with F−1 1−q(i, j) = 1 j 1 if i =6 j, and 1−q1i 1+ 1q1i 1−p q1i 1−p if i = j. The exact solution for the fold changes satisfying the linear constrains are then given by ν∗g· =−F−1γ , with ν∗gi given by eqn. 2.5 above. Theorem 2.4 (Validity of Total Sum Normalization in Reconstructing X). Under the model above, the vector of feature-wise fold changes under which relative frequencies 18 (total sum normalized data) can yield unbiased inferences ( correct fold changes and non-zero significance ) of concentrations of all i = 1 . . . p features in group g at stage X is given by ν∗ ∗g (1,φg,q1), where νg (Λg,φg,q1) is defined in Theorem 2.3. Proof. Proof follows directly from Lemma 2.1 and Theorem 2.3. Theorem 2.5 (Relaxing the fixed group-specific proportions assumption). The results derived in lemmas 2.1, 2.2, and theorems 2.3 and 2.4 hold under the following more general model as well. In this model, qgi and φg are defined as marginal expectations of sample-wise relative frequencies, which are themselves assumed to be independent of Tg j. {q̃g j·, φ̃g j} ∼ f (·) such that φ̃g j ∈ (0,1), q̃g ji ∈ (0,1), φ̃g j +∑ q̃g ji = 1, i with E[q̃q ji] := qgi, E[φ̃g j] := φg, Tg j independent o f q̃g ji, φ̃g j. (2.6) Xg j·|Tg j, q̃g ji ∼Multinomial((Tg j, q̃g j·) ) X Yg j·|Xg j·,τg j ∼Multinomial τ g j· g j, Xg j+ Here f is some unspecified distribution function (e.g., Dirichlet) that allows con- strained sampling of observation-specific relative frequencies such that they sum to 1, with finite feature-wise marginal expectations. Proof. One only needs to note that with E[q̃g ji] = qgi, and E[φ̃g j] = φg forall j = 1 . . .ng 19 samples in group g, we obtain: E[q̂g+i] = qgi. (2.7) E[Xg1i] = E[E[Xg ji|Tg j, q̃g ji]] = E[Tg j q̃g ji] = E[Tg j]qgi. (2.8) φg +∑pi=1 qgi = E[φ̃g1]+∑ p i=1 E[q̃g1i] = E[φ̃g1 +∑ p i=1 q̃g1i] = 1. (2.9) Equations 2.7 and 2.8 above are needed for lemma 2.1, and equation 2.9 is needed for the lemma 2.2 and theorem 2.4 to go through. The result in theorem 2.4 was also verified numerically. As an example, suppose q1· = [0.25,0.25,0.1,0.1,0.3]T . For Λg = 1, and φg = 0.05, the fold changes that need to be achieved for unbiased inference is given by: ν∗g· = [0.95,0.95,0.88,0.88,0.96]T implying that downregulation across features can be detected well as the unique features will compete for sequencing output. For Λg = 1, and φg = 0.4, no feasible solution exists. For the case φg = 0, the optimal solution is trivial: νgi = 1 for all i i.e., no perturbation in any of the features. Providing additional constraints by fixing at least one of the fold changes yields the single, constrained solution on the rest of the fold changes: the solution vector ν∗ is obtained by replacing ηg =Λgφg in the above equation with ηg =∑ ∗k∈F νgkqkg where F is the set of features for which the fold changes are fixed apriori to ν∗gk, and restricting i to the rest of the features (other than those in F) present in the control group in the above derivation. Notice that there is an uncountable number of values (non-negative real values) the fold changes of features in the constraint set F can take. They will impose a particular value of ηg, and conditioned on this value, the fold changes the rest of the features can take in group g so that the linear model above achieves unbiased contrast 20 estimates and inferences are unique. The conclusion of theorem 2.4/its generalized version in 2.5 is an unfortunate result as it says that to obtain unbiased concentration inference across all features with relative frequencies alone, the feature-wise concentration fold changes must behave in a unique fashion, and therefore appears unlikely to occur in practice. Notice that fold-change ν∗gi can never be < 0. Thus, a feasible solution need not exist for arbitrary parameter values of ηg = Λgφg implying that unbiased inference may not always be possible. It is also interesting to note that unless the fold change of total feature content in group g (Λg) is somehow maintained the same across conditions despite contaminants present at propor- tion φg, achieving unbiased inference with normalization techniques based on the total sum is not possible. Group-specific expression of features are a major source of com- positional bias and their sufficiently high expression can effectively wash out the signal. In metagenomic surveys, it is often the case that a large number of features are observed with a positive count in very few samples. Although this does not necessarily mean they are actually present in only a few observations, we can expect this to be the case with samples arising from diverse ecosystems. In summary, strict unbiased inference with a DNA sequencer’s output relative fre- quencies may or may not be possible depending on the underlying value of ηg; when possible, it can only occur under a unique set of fold changes. In practice, RNAseq exper- iments are performed across diverse tissues of various origins, and metagenomic surveys are constantly carried out across ecosystems. Thus, unbiased-ness in inference need not hold. 21 2.2 When can we hope to reconstruct X0 from Y with compositional bias correction? So far, we have concerned ourselves only with characterizing the conditions that enable accurate characterization of feature-wise concentrations in the sequencing input (X in Fig. 2.1), given only feature-wise frequencies. We can now ask when compositional correction is guaranteed to recover the true concentrations X0 in the original source before it was marred by technical variation. In the next chapter, simple algebraic derivations reveal that as long as a feature i’s unwanted technical variation in group g is comparable, on average, to that of the control group, compositional bias correction reconstructs its true concentration, i.e., E[X0g1i] (eqn. 3.1). With this result, we recover a slightly more general condition than the often cited assumption on some familiar genomic data normalization techniques [4, 6]. We not only want the technical biases to affect all the features the same way within a sample, but if any contamination is introduced we want those biases to also behave appropriately according to the above condition. We also emphasize that in-silico post-processing of sequencing count data for contaminants (for example, by excluding sequencing reads mapping to potential cotaminant reference sequences) does not help in compositional bias correction because they have already caused information loss by competing with other native features for being sequenced. 22 Chapter 3 On the generality of compositional correction factors, and some strategies to estimate them. In the previous chapter, we addressed the question of inferring feature-wise concen- trations with feature-wise relative frequencies output from a DNA sequencing machine. We found that in such a problem, a single, linear bias term denoted as Λ−1, called the compositional correction factor, underlies all the confounded feature-wise concentration inferences. Among others, we arrived at two important conclusions: 1. That by transforming concentrations to frequencies, sequencing machine introduces one of the many unforeseen technical biases in our truly intended experiment to measure concentrations. 2. That by appropriately measuring or estimating the compositional correction factor, one can arrive at accurate feature-wise concentration inferences. 23 3.1 The generality of compositional correction factors in explaining technical variation We now note that compositional factors are far more general in their utility than merely serving to describe compositional bias induced by the sequncer. Indeed, they can very well account for other sources of technical variation as well. To see this, we refer the reader back to Fig. 2.1 and its notation, and notice that the process X0g j·→ Xg j· succinctly accounts for all unwanted, technical perturbations in the concentration of any feature i, in sample g j. For each feature i, on average, these perturbations are described by the corresponding fold changes defined relative to the true concentrations in the original sample source: µ E[X= g1i]gi 0 .E[Xg1i] We already saw that the concentration fold change at the time of input to sequencer (stage X in Fig. 2.1) is given as: ν E[X= g1i]gi E[X ] . With some algebra below, we observe how11i this apparent fold change is correlated with all technical perturbations abstracted away in the process X0g j·→ Xg j· : E[X 0 ν g1i ] E[Xg1i] gi = · E[X0g1i] E[X11i] E[X0g1i] = µgi · µ1i ·E[X011i] µgi E[X 0 (3.1) · g1i ] = µ1i E[X011i] µgi = ×ν0gi =:µ1i ︸ζ︷g︷i︸ × ︸ν︷0g︷i,︸ true technical fold change true biological fold change. It is correctly observed above that if technical biases are comparable across cases 24 and control conditions, the first factor is one, and the apparent fold change measured at stage X equals the true biological fold change at stage X0. Our formula for compositional correction factors is then altered correspondingly as Λ = νT q = (ζ ◦ν )Tg g· g· g· g· qg·. Here ◦ denotes element-wise product. Thus we see that compositional correction factors can account for more general technical biases introduced in a sequencing experiment. Given this significance, it is only fitting that we consider their estimation in detail. 3.2 Estimation strategies Strategy 1: Measure total feature load We had indicated in the previous chapter that the compositional correction factor for each experimental group g, Λ−1g , is inversely re- lated to the group’s average total feature content. So a clear strategy is to measure, if possible, the total DNA in the input sample could serve to estimate Λ−1g ; subsequently multiplying the estimates to the DNA sequencer’s output relative frequencies should then restore feature-wise concentrations. However, this strategy only reconstructs concentra- tions at the time of input to the sequencer (the vector Xg j· in Fig. 2.1), and completely ignores all other technical variation introduced in the data. So, unless techinical biases are comparable across conditions (as described in the previous subsection), it will be lim- ited in its practical utility. Here is a more concrete approach. Suppose we know that some feature k is un- changed in its concentration across conditions. From eqn. 2.2, we see that for any feature i, q −1gi = Λg νgiq1i. Because the fold change for the unperturbed feature (at stage X) 25 Technical/ A Biological Absolute Abundance S A S AB B Condition 1 Condition g What a S S Sequencing A RPKM Machine Sees A ÷ Total B CPM B Sum FPKM What a Rarefaction(metagenomics) Sequencing B Machine S A B S A Outputs B Sparse Approximate Non-Sparse this by assuming ÷ S a large fraction CSS of features remain MEDIAN unchanged. Scran DESeq Wrench TMM A S A S B B Condition 1 Condition g Figure 3.1: Scaling normalization techniques in genomics from the perspective of com- positional bias correction. (A) Features S and A have similar absolute abundances in two experimental conditions, while B has increased in its absolute abundance in condition g due to technical/biological reasons. Because of the proportional nature of sequencing, increase in B leads to reduced read generation from others (compositional bias). An an- alyst would reason A and S to be significantly reduced in abundance, while, in reality they did not. (B) Knowing S is expressed at the same concentration in both conditions allows us to scale by its abundance, resolving the problem. DESeq and TMM, by exploit- ing rerefence strategies across feature count data (described below), approximate such a procedure, while techniques that are based only on library size alone like RPKM and rarefication/subsampling can lead to unbiased inference only under very restrictive con- ditions. Currently available approaches for sparse data settings are indicated. Wrench is the proposed technique in the next chapter. νgk = 1 for all groups g = 1 . . .G, we obtain q −1gk = Λg q1k, which immediately suggests Λ−1g can be computed as ratios of proportions of this internal control feature. Further- E[Y ] q Λ−1more, if we calculate the transformation log g ji ν q E[Y ] = log gi q = log g gi 1i −1 = µi +αgk Λ q gi,g jk g 1k where with appropriate side conditions on the contrasts, the intercept estimates µi = 26 (logq1i− logq1k). Our contrast variable then estimates αgi = logνgi, which is 0 only under the null νgi = 1. Thus, the traditional data normalization idea of "dividing by a feature that does not change across conditions" automatically corrects for compositional bias induced through sequencing technology [4]. This is discussed further below. No- tice that we do not necessarily need the internal control feature to have the same internal concentration across conditions. As long as we know their sample-wise absolute concen- trations, their fold changes across conditions are also known, and these simply enter the above formulation as known constants that simply offset the linear models. (That is, we can write: q −1gk = Λg ν̂gkq1k, where ν̂gk is the now known fold change associated with the feature in group g. ) These insights lead us to the following two estimation strategies: Strategy 2: Introduce spike-in control features If all we need is a feature that is expressed at known abundances across conditions, why not inject it ourselves at the time of sequencing? Two potential techniques exist in the experimental literaure, one of which cannot protect us against compositional bias. In the ERCC spike-in protocol [111], widely used in various bulk tissue and some single cell RNAseq studies [112], a fixed amount of total RNA extract is obtained, and subsequently suspended in solution along with known concentrations of a chosen control feature (spike-ins). Because this procedure adds the spike-ins to the extract, an already compositional source, our inferences are limited to questions on relative abundances; a statement about differences in absolute abundances cannot be made unless the samples themselves behave according to the narrow conditions established in the previous chapter. An alternative, more effective strategy is to add known concentrations of barcodes/spike-in to the entire sample’s suspension [113] (Fig. 3.1B). 27 This problem has also been noted by Stegle et al., in the context of designing scRNAseq experiments [114]. Strategy 3: Post-process abundance data with reference normalization strategies In the absence of internal control features like the spike-ins, effective correction for com- positional bias can still be hoped for [52]. Here, is the central idea, which is so significant that it will appear repeatedly in our discussions: If most features do not change in their absolute abundances relative to the control group, the fundamental eqn. 2.2 should hold true for most features with νgi = 1. Thus, an appropriate summary statistic of these ratios of proportions could serve as an estimate of Λ−1g . With this idea in place, a normalization procedure for deriving sample-specific com- positional scale factors Λ−1g j can be devised. One only needs to carry out the above proce- dure by pretending that every sample arises from its own experimental group. Indeed, as illustrated in Table 3.1, scale normalization methods in genomics can be viewed in this light, where some control set of proportions ("reference") is defined, and the Λ−1g j estimate is derived for every sample j based on the ratio of its proportions to that of the reference. This central idea being the same, the robustness of these methods are dependent on how well the assumptions hold with respect to the chosen reference, and the choice of the estimation algorithm. To illustrate this idea further, we present the following derivation of a DESeq-like normalization strategy (refer Table 3.1 and Fig. 3.1). We use the same notation as in the last chapter and Table 3.1. Because each sample is considered to arise from its own group, the index j does not play any role here. We can fix j = 1, and let g = 1 . . .n index the 28 samples. Let i = 1 . . . p index the features. For a given sample g j then, Yg ji indicates the measured sequencing count of the ith feature in the sample; νg ji, the feature’s true concentration fold change in the sample; τg j = Yg j+ the sample’s sequencing depth; qg ji the feature’s proportion in the sample. Finally, let q1 ji denote the proportion of feature i in a control sample indexed with g = 1. If one assumes that feature-wise count distributions follow a log-normal distribu- tion, we obtain a DESeq-like estimator for compositional correction factors as below. Al- ter eqn. 2.2 with a multiplicative log-normal error term, and write for feature i in sample g j, Y −1g ji ∼ Λg j · τg j ·νg jiq1 ji ·LN(0,σ 2 i ), i = 1 . . . p, j = 1 . . .n ≡ µ 2g ji ·LN(0,σi ) where, LN(0,σ2i ) refers to a log-normal random variate that when logged has a mean of 0 and a variance of σ2i . Then: ( ) ∏Yg ji ∼ ∏µg ji LN(0,nσ2i )[g ] g1 ( ) n 1/n ∏Yg ji ∼ ∏µg ji LN(0,σ2i ) g g Y µ =⇒ d g jig ji = ( ) 1 ∼ ( g ji ) 1 ·LN(0,σ2i ) ∏ Y ng g ji ∏ ng µg ji ( Λ−1g j ) τ ν= 1 · ( g j ) · ( g ji1 ) 1 ·LN(0,σ2i ) Λ−1 n ∏ τ n ν n∏g g g j ∏g g jig j = kΛΛ−1g j · kττg j · kννg ji ·LN(0,σ 2 i ) 29 in which we have collected the constant denominator (independent of g j) terms separately into three k terms with corresponding subscripts. Now, d̃ d= g jig ji k τ ∼ kΛΛ −1 g j ·τ g j kννg ji ·LN(0,σ2 2 2 i ), with expectation given by k −1 σ /2 −1 ΛΛg j · kννg ji · e i ∝ Λg j ·νg ji · eσi /2. So if atleast a median fraction of features do not change on average relative to the reference sample, setting νg ji = 1 should hold for those features. We then arrive at: d̃ s̃ g jig j = median ∝ Λ−1i eσ 2/2 g j (3.2) i and so s̃g j serves as an estimator of Λ−1g j . This is simply DESeq normalization factors presented in table 3.1, altered only by feature-wise variances. In summary, the fact that compositional factors are linear technical biases shared by all measured features, makes it possible to take advantage of the class of scale normaliza- tion techniques in the genomics literature to estimate them [4, 52, 115, 116]. All of these approximate the aforementioned spike-in strategy by assuming that most features do not change on average across samples/conditions (Fig. 3.1). For the same reason, we have given such an interpretation to approaches like centered logarithmic transforms (CLR) from the theory of compositional data, which many analysts favor when working with relative abundances [117–123]. We must note that scaling normalization techniques have the same limitation as strategy 1 described above. Reconstructing X0 from Y It is worth emphasizing again that the aforementioned reference normalization strategies do not restrict compositional factors to only reflect biology-induced global abundance changes; in reality, if feature-wise perturbations (νgi) 30 Technique Proposed Abundance Measure, Scale factor Signal for Compositional Scale in yg ji , Total Sum τ ·Λ −1 g j g j Λ−1g j = 1 y [ g ji−1 ,τg j·Λg jTMM ( )] qg jiqg ji , ratio of proportions − ∑i:y >0 ∩ i∈trimmed set for j wi j log q q 1 jiΛ 1 = e i j 1 jig j yg ji yg ji C·τg j· −1 ∝ Λ −1 , DESeq g j τg j·Λg j qg ji Λ−1 qg ji 1 , ratio of proportionsg j = mediani 1 [∏k q ] n [∏k qik] n ik yg ji τ ·Λ−1 ,g j qMedian g j g ji− , ratio of proportionsΛ 1g j = mediani qg ji ∝ median qg ji 1/p i 1/p yg ji τ ·Λ−1 , qUpper quartile g j g j g ji− , ratio of proportionsΛ 1g j = upper quartilei qg ji ∝ upper quartile qg ji 1/p i 1/p ( ) ( ) ( ) qg ji log yg ji ≡ log qg ji ≡ log yg ji1 1 −1 , 1/p , CLR Transformation [∏ y ] p [ [∏ q] ] p τg j·Λi g ji i g ji [ ] g j closely tracks Median factors above;1 1 with Λ−1 p= ∏ q p ∝ ∏ qg ji ratio of proportionsg j i g ji i 1/p yg ji , τ −1 Scran g j ·Λg j { } qg ji − q q p q , ratio of proportionsΛ 1 1 ji n ji ++ig j = fit linear models to q , . . . ,++i q++i i=1 yg ji τ −1 , Wrench g j·Λg j qg ji−1 1 q q , ratio of proportionsΛ g ji ++ig j = p ∑i wi j q++i Table 3.1: Scaling normalization approaches derive their technical bias estimates from ratio of proportions. For each scaling normalization technique, we present the transfor- mation they apply to the raw sequencing count data (second column) to produce normal- ized counts. The third column shows how all techniques use statistics based on ratio of proportions to derive their scale factors. i = 1 . . . p indexes features, each sample is con- sidered to arise from its own singleton group: g = 1 . . .n and j = 1, τg j the sample depth of sample j, qg ji the proportion of feature i in sample j, wi j represents a weight specific to each technique, and q++i is the average proportion of feature i across the dataset. In the second column, the first row in each cell represents the transformation applied on the raw count data by the respective normalization approach. They all adjust a sample’s counts based on sample depth (τg j) and a compositional scale factor Λ−1g j . Continued on the following page. 31 Table 3.1: Continued from previous page. As noted in the third column, the estima- tion of Λ−1g j is based on the ratio of sample-wise relative abundances/proportions (qg ji) to a reference that are all some robust measures of central tendency in the count data. The logarithmic transform accompanying CLR should not worry the reader about its rel- evance here, in the following sense: the log-transformation often makes it possible to apply statistical tests based on normal distributions for the rescaled data; this is in-line with applying log[-norma]l assumptions on the rescaled data obtained with the rest of the techniques. C −1/n= ∏ j τg j is a constant factor independent of sample, and its pres- ence does not matter. For the same reason, Median and Upper Quartile scalings and CLR transforms, can be thought to base their estimates on a reference that assigns equal mass to all the features or if the reader wishes, a more complicated reference that behaves proportionally. When most features are zero, values arising from classical scale factors can be severely biased or undefined as we shall illustrate in the next chapter. Wrench is the scale normalization strategy we propose to overcome this problem. are also of technical origin, they can well be correlated with other sources of technical variation, and can be seen to estimate technical variation beyond what is accounted for by sample depth adjustments. This was described below eqn.3.1. Thus, it is interesting to ask under what conditions compositional factors arising from scaling techniques (in- cluding our proposed technique in this work) can reconstruct X0, the true concentrations in the source samples. From eqn. 3.1, it is clear that accurate compositional correction techniques can reconstruct true average concentration for any feature i when technical biases perturbing the feature is comparable between the treatment and control groups. 3.3 Simulation analyses In this section, we naturally ask how several genomic normalization techniques fare in estimating compositional correction factors. Our analysis below is limited to methods that provide interpretable estimates of fold-changes. We therefore do not consider differ- 32 ential abundance inferences arising from rank-based methods. We also leave the analysis of non-linear normalization techniques for future work. We note that traditional genomic normalization techniques [3, 4] like library size scaling, total sum/total count, reads per kilobase of transcript, per million mapped reads (RPKM), fragments Per kilobase of transcript per million mapped reads (FPKM), Counts per million (CPM), subsampling, rarefication based approaches are simple arbitrarily rescalings of relative frequencies, and for the purpose of this part of the thesis, one and the same. References for other genomic normalization techniques discussed will appear as appropriate. Figure 3.2 illustrates our simulation strategy. Given the set of control proportions q1i for features i = 1 . . . p, and the fraction of features that are perturbed across the two conditions (1−π), we sample the set of true log fold changes ( logνgi ) from a fold change distribution for the random (1−π) fraction of features that have been chosen to be per- turbed. The fold change distribution (FCD) is a two-parameter distribution chosen either as a two-parameter Uniform or a Gaussian. Based on the expressions from eqn. 2.3, the target proportions were then obtained as q ν= giq1igi ∑k ν . Conditioned on the total numbergkq1k of sequencing reads τ , the sequencing output Yg j for all i were obtained as a multinomial with proportions vector q = [|q |]pg· gi i=1. We set the control proportions q1· from various experimental sequencing datasets. With this setup, we can vary π , and the two parameters of the FCD, and ask, how various normalization and testing procedures compare in terms of their performance. Performance was quantified based on the sensitivity and specificity values in detecting truly perturbed features at a Benjamini-Hoschberg false discovery rate of .1. 33 A B Some experimental dataset π zg a µ q ν a b1 b σ qg σ Yg τ µ C Figure 3.2: Simulation strategy for evaluating current normalization and differential expres- sion analysis toolkits for compositional correction. (A) Simulation set up. q1·,qg· represent the control and case proportion vector of all the features. q1· is obtained from a given ex- perimental dataset. π represents the fraction of features that do not change across conditions. Zgi ∼ Bernoulli(π) for all i represents the set of indicator variables that denote if a feature is not differentially expressed. Conditioned on Zg·, the logged vector of fold changes logν is sampled from a two-parameter fold change distribution, with νgi set to 1 whenever Zgi is 1. Here i indexes the individual entries of the vector. 34 Figure 3.2: Continued from previous page. The sampled fold changes and control pro- portions are normalized to yield the case proportions. A multinomial draw for a fixed sample depth τ (20M reads) then yields the desired simulated sequencing output. The two fold change distributions, Uni f (aν ,bν) and a N(µν ,σ2ν ), considered in our study are shown in (B). Example M-A plots resulting from simulations when 75% (i.e., π = 0.75) of the features are fixed across conditions, with the rest perturbed according to log fold changes sampled from Normal(0,1) and Uni f (−4,4) fold change distributions respec- tively are shown in (C). Each point in the M-A plot corresponds to a feature, and plots its grand average (A axis), against their empirical fold changes. Both are in log2-scale. With the above setup, we do not strictly enforce constant average total feature abun- dance across simulated cases and controls. We would like to keep the parameter variations sufficiently general that this condition roughly holds under some settings, while letting us appreciate the relative merits of reference normalization strategies under others. In summary, for a given set of control proportions, we vary i) the fraction of features that change across conditions, ii) the shape, iii) mean and iv) variance of the fold change distribution that underlies the perturbation of features in the case-group, v) normalization approach and vi) testing technique. We also varied the control proportions themselves from various experimental datasets, and our results were similar. Our simulations are fairly general and should allow us to robustly characterize the performance of the current normalization and differential expression analysis practices in genomics. Library size/Subsampling based approaches Figure 3.3 plots the performance mea- sures for a negative binomial based testing suite (edgeR software [124]) for a uniform fold change distribution after total sum normalization. Sensitivity values in detecting true un- derlying concentration changes never go beyond 65%, and heavy false positive rates are incurred even when 95% of the features remain unchanged across conditions. Figure 3.4 35 shows the performance under the Gaussian fold change distribution. In contrast to the uni- form case above, we find sensitivities go up to 85%, but false positives are also accrued at higher rates. It would appear that higher variances and means lead to better performance, but as Figure 3.5 shows, many of these truly significant features were called significant for the wrong reason: wrong signs of fold changes. Higher means and variances of fold change distributions are therefore conditions that lead to heavily confounded inference under proportion based normalization strategies. These results were similar across testing platforms, and across testing techniques. It is useful to also summarize the relevant results from previous chapter here. Total count/library size normalized data is equivalent to relative frequencies. We devoted the previous chapter to ask under what conditions, inferences made with relative frequencies alone would continue to reflect concentration changes in an unbiased manner. We for- mally analyzed its influence within the framework of linear models, a widely used statis- tical framework within several count data packages commonly used in genomics. Under the most natural adjustments based on the total count (e.g., unaltered reads per kilobase of transcript, per million mapped reads (RPKM)/ fragments Per kilobase of transcript per million mapped reads (FPKM)/ Counts per million (CPM)/subsampling/rarefication based approaches), we found that these conditions can be precisely characterized mathe- matically and are extremely limited in their applicability in general experimental settings. It may be tempting to argue that one can resort to total count-based normalization if total feature content is the same across conditions. However, as was shown in the last chapter, it is easy to see that this assumption is only valid when strict constraints on the levels of tech- nical perturbation of feature abundances and sequence-able contaminants are respected, 36 TotSUM TotSUM ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● π ● ● ● ●● ● ● ● ●● ● 0.95 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● TotSUM ● ● ● −12 −10 −8 −6 −4 −2 −12 −10 −8 −6 −4 −2 −12 −10 −8 −6 −4 −2 log(aν) log(aν) log(aν) TotSUM TotSUM ● ● ●● ●● ● ●● ●● ●● ● ●● π● ● ● ● ● ● ● ●● 0.75 ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● TotSUM −12 −10 −8 −6 −4 −2 −12 −10 −8 −6 −4 −2 −12 −10 −8 −6 −4 −2 log(aν) log(aν) log(aν) TotSUM TotSUM ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● π ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● 0.5 ● ● ● ● ● ● TotSUM −12 −10 −8 −6 −4 −2 −12 −10 −8 −6 −4 −2 −12 −10 −8 −6 −4 −2 log(aν) log(aν) log(aν) log(bν) 1 2 4 8 12 Figure 3.3: Total sum based normalization, like RPKM/Rarefication, under a Uniform fold change distribution. The figure plots various performance metrics of the edgeR package as a function of the fraction of features that remain unchanged across conditions (π), and the lower (aν ) and upper bounds (bν ) of a Uniform fold change distribution. Control proportions (q1·) were obtained from rat liver tissue of the rat bodymap [7]. Extremely high false positive rates result with higher variance and asymmetrically located fold change distributions (i.e., with positive or negative means) due to compositional bias. The results were similar across commonly used dif- ferential abundance testing platforms, and for the Gaussian fold change distribution (Fig. 3.4). 37 Sensitivity Sensitivity Sensitivity 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 False Positive Rate False Positive Rate False Positive Rate 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Precision Precision Precision 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 A TotSUM TotSUM ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● TotSUM ● 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 π π π log(σν) 0.25 0.5 1 2 4 B TotSUM TotSUM ●● ● ● ● ●●● ● ● ● ● ● ●● ● ● ●● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● TotSUM 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 π π π log(µν) 0 0.5 1 2 4 Figure 3.4: Total sum based normalization, like RPKM/Rarefication, under a Gaussian fold change distribution. The figure plots various performance metrics of the edgeR package as a function of the fraction of features that remain unchanged across conditions (π), and the mean (µν ) and standard deviation (σν ) of the Gaussian fold change distribution for the same control proportions (q1·) as in figure 5. (A) (σν ,π) variations at µν = 0. (B) (µν ,π) variations at σν = 1. It would appear that higher fold change distribution variances and means lead to better perfor- mance, but these are also associated with higher false positive rates and as Fig. 3.5 shows, large fraction of these calls had wrong signed fold changes. Higher means and variances of fold change distributions are therefore instances that lead to heavily confounded inference. The results were similar across commonly used differential abundance testing pipelines. 38 Sensitivity Sensitivity 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 False Positive Rate False Positive Rate 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Precision Precision 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 A π 0.95, µ 4, σ 1 π 0.75, µ 4, σ 1 B π 0.95, µ 0, σ 4 π 0.75, µ 0, σ 4 Sampled LFCs TotSum TMM Figure 3.5: Confounded inference with total sum and reference normalization strategies. For all features whose reconstructed fold changes had wrong signs when called significant, together with false negatives, we plot the sampled fold changes (first column) and deviations in the edgeR reconstructed fold changes from those of the true values after total sum (second column) and TMM (third column) normalizations. The corresponding parameter values for the simulations are shown alongside the plots. Larger deviations from the horizontal line at 0 imply higher confounding in inference. Asymmetric FCDs, which give rise to feature specific fold changes biased to be more positive or negative, can easily trick inference based on total sum based normalization approaches. TMM and other voting based strategies behave in a more robust fashion. However, when larger fraction of features (25%) varies across conditions, their performance becomes highly sensitive to the underlying FCDs. The color indicates the density of points, with blue, green and red indicating low, medium and high densities respectively. 39 an assumption that can be very easily violated in metagenomic experiments [125–127], which usually feature high intra- and inter-group feature diversity. Reference normalization and robust fold-change estimation techniques We now compare and contrast the aforementioned total count/library size adjustments (i.e., relative frequency measurements) with a few reference based techniques (reviewed in Table. 3.1) in overcoming compositional bias at high sample depths. Also, many widely used ge- nomic differential abundance testing toolkits enforce prior assumptions on reconstructed fold changes, and moderate their estimation. This made us wonder about the robustness of these testing techniques in overcoming the false positives that would otherwise be cre- ated without compositional bias correction. With an exhaustive set of simulations at high coverage sample depths (similar to bulk RNAseq) with 20M reads per sample, by and large, we found that all testing packages behaved the same way, and the key ingredient to overcome compositional bias always was an appropriate normalization technique . We also found that reference based normalization procedures outperformed library size based techniques significantly, re-emphasizing the analytic insights we mentioned previously. Figures 3.6 and 3.7 demonstrate the performance of TMM normalization, a refer- ence based normalization strategy. In contrast to the above total sum-based normalization procedure, the false positive rates with TMM were maintained low, if not at zero, for a variety of parameter settings. At higher FCD means and variances, they also lead to wrong reconstruction of fold change signs but with a highly desirable twist: as long as the fraction of perturbed features across conditions is small, the fold change distribution is correctly centered throughout the abundance distribution except for those features with 40 very low abundances leading to very low false positive rates Fig. 3.5. For all normaliza- tion techniques, as the amount of features that change across conditions increases, false positive rates increase. In the next chapter, we consider applying these techniques to the problem of com- positional bias correction in metagenomic survey data. The data pose an interesting chal- lenge, as the microbial abundance measurements resulting from them can be extremely sparse. 41 TMM TMM ● ● ●● ●● ● ● ● ● ● ● ● π ● ● ● ● ● ● ●● ● ●● ● ● 0.95● ● ● ● ● ● ● ●● ● TMM● ● −12 −10 −8 −6 −4 −2 −12 −10 −8 −6 −4 −2 −12 −10 −8 −6 −4 −2 log(aν) log(aν) log(aν) TMM TMM ●● ●● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● π● ● ●● ● ● ● 0.75 ● ● ● ● ● ● ● ● ●● ● TMM −12 −10 −8 −6 −4 −2 −12 −10 −8 −6 −4 −2 −12 −10 −8 −6 −4 −2 log(aν) log(aν) log(aν) TMM TMM ● ●● ●● ● ●● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ●● ● ● ●● ● π ● ● ● ●● ● ● ● ● ● 0.5 ● ● ● ● ● ●● ● ● ●● TMM −12 −10 −8 −6 −4 −2 −12 −10 −8 −6 −4 −2 −12 −10 −8 −6 −4 −2 log(aν) log(aν) log(aν) log(bν) 1 2 4 8 12 Figure 3.6: Reference normalization (TMM/DESeq/Median) under a Uniform fold change distribution. The figure plots various performance metrics of the edgeR package with TMM nor- malization as a function of the fraction of features that remain unchanged across conditions (π), and the lower (a) and upper bounds (b) of a Uniform fold change distribution. Control propor- tions (q1·) were obtained from rat liver tissue of the rat bodymap [7]. In contrast to what was observed with total sum approaches, the false positive rates are maintained at low levels for a larger range of parameters. Sensitivity values still remained low. High false positive rates result with higher variance and asymmetrically located (with respect to 0) fold change distributions. The results were similar across testing platforms, for median based normalization techniques like DESeq/Median scaling, and for the Gaussian fold change distribution. 42 Sensitivity Sensitivity Sensitivity 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 False Positive Rate False Positive Rate False Positive Rate 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Precision Precision Precision 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 A TMM TMM ●● ●● ●● ●● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● TMM ● ● ● 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 π π π log(σν) 0.25 0.5 1 2 4 B TMM TMM ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● TMM● 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 π π π log(µν) 0 0.5 1 2 4 Figure 3.7: Reference normalization (TMM/DESeq/Median) under a Gaussian fold change distribution. The figure plots various performance metrics of the edgeR package as a function of the fraction of features that remain unchanged across conditions (π), and the mean (µν ) and standard deviation (σν ) of the Gaussian fold change distribution for the same control proportions (q1·) as in figure 3.6. (A) (σν ,π) variations at µν = 0. (B) (µν ,π) variations at a constant σν = 1. When the fraction of unperturbed features is large, in contrast to what was observed with total sum approaches, higher fold change distribution variances and means lead to better performance. Figure 3.5 shows, many of these calls had wrong signed fold changes. Higher means and variances of fold change distributions are therefore cases that lead to heavily confounded inference. The results were qualitatively similar across commonly used differential abundance testing pipelines. 43 Sensitivity Sensitivity 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 False Positive Rate False Positive Rate 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Precision Precision 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Chapter 4 A scaling normalization technique for estimating compositional bias from sparse relative frequency data. From previous chapters, we recall that: 1. The output of a sequencing machine only retain relative frequencies, and not the concentrations of the measured features. We call this unwanted technical bias in- troduced in our experiment as compositional bias. Compositional bias is present in the output of all derived technologies in genomics like RNAseq, ChipSeq etc., which exploit a DNA sequencing machine for quantification purposes. 2. Compositional bias can be corrected by estimating compositional correction fac- tors. These factors are more general in that they correlate with other unwanted technical variation infused in the data, beyond compositional bias, as well. 3. Compositional correction factor is a linear technical bias shared by all features mea- sured in a sequencing experiment. This fact makes it possible to exploit various scale normalization techniques in genomics to estimate them. In this chapter, we consider the problem of estimating compositional correction factors for metagenomic 16s surveys, another derived technology based on sequencing. 44 Recognizing that 16s ribosomal RNAs (rRNA) are relatively specific for every prokary- otic Genus, Carl Woese and George Fox suggested that a simple strategy for identifying prokaryotic genera in a microbial sample is to sequence and catalogue the 16s rRNA sequences in it [128–133]. In fact, Woese & Fox demonstrated the promise of such a technology, rather dramatically, by adding a whole new domain of life – the archaea – to the phylogenetic tree of life! [129, 134] The number of times a given prokaryotic genera’s 16S sequence is found in the sequencing output serves as a measure of its frequency. This is the idea behind 16s marker gene surveys [135–138], which have now found widespread utility in biomedical research [139] and natural history studies involving large-scale oceanic microbial ecosys- tems [8, 138]. Like with other derived technologies, compositional effects are observable in the count data from the large-scale Tara oceans metagenomics project [8], (Fig. 4.1), in which a few dominant taxa are attributable to global differences in the between-oceans fold-change distributions. We demonstrate that our strategy of adapting traditional genomic normalization techniques (discussed in the previous chapter) for estimating compositional bias fail with 16S survey data. This is mainly because a large fraction of features (the distinct 16s sequences) in 16S count data are very sparsely observed in the output. Given that all reference based normalization techniques base their compositional scale factor estimates on ratios of proportions, the large fraction of zeroes in the 16s survey data lead to mostly zero valued compositional correction factor estimates: DESeq failed to provide a solution for all the samples in a 16s survey of our interest, and TMM based its estimation of scale factors on very few features per sample (as low as 1). The median approach simply re- 45 A TR: (SO) Southern Ocean [MRGID:1907] TR: (SAO) South Atlantic Ocean [MRGID:1914] −16 −14 −12 −10 −8 −6 −14 −12 −10 −8 −6 −4 A A B BR: (SO) Southern Ocean [MRGID:1907] BR: (SO) Southern Ocean [MRGID:1907] (SAO) South Atlantic Ocean [MRGID:1914] (SAO) South Atlantic Ocean [MRGID:1914] −16 −14 −12 −10 −8 −6 −4 −16 −14 −12 −10 −8 −6 −4 A A Figure 4.1: Importance of compositional bias correction in sparse metagenomic data. (A) M-A plots of 16S reconstructions (from high sequencing depth, whole metagenome shotgun sequencing experiments) from two technical replicates each from the Tara oceans project [8] generated for the Southern and South Atlantic Oceans. In all subplots, x-axis plots for each feature, its average of the logged proportions in the two compared samples; y-axis plots the corresponding differences. The red dashed line indicates the median log fold change, which is 0 across the technical replicates. (B) M-A plots of the same repli- cates but plotted across the two oceans. The median of the log-fold change distribution is clearly shifted. A few dominant taxa in the South Atlantic Ocean (circled) are attributable for driving this overall apparent differences in the observed fold changes. The Tara 16s dataset, reconstructed from very deep whole metagenome shotgun experiments of oceanic samples, albeit boasting of an average 100,000 16S contributing reads per sample, still encourages a median 88% feature absence per sample. turned zero values. CLR transforms behaved similarly. When one proceeds to avoid this problem by adding pseudo-counts, owing to heavy sparsity underlying these datasets, the 46 M M −5 0 5 10 −6 −4 −2 0 2 4 M M −10 −5 0 5 −5 0 5 transformations these techniques imposed mostly reflected the value of pseudocount and the number of features observed in a sample. A recently established scaling normalization technique, Scran [6], tried to overcome this sparsity issue in the context of single cell ri- bonucleic acid sequencing (scRNAseq) count data – which also entertains a large fraction of zeroes – by decomposing simulated pooled counts from multiple samples. That ap- proach, developed for relatively high coverage single cell RNAseq, also failed to provide solutions for a significant fraction of samples in our datasets (as high as 74%). Further- more, as we illustrate later, compositional bias affects data sparsity, and normalization techniques that ignore zeroes when estimating normalization scales (like CSS [5], and TMM) can be severely biased. The relatively low sequencing depth per sample (as low as 2000 reads per sample), large number of features and their diversity across samples thus pose a serious challenge to existing normalization techniques. In this chapter, we develop a compositional bias correction technique (Wrench) for sparse count data based on an empirical Bayes approach that borrows information across features and samples. We demonstrate its improved performance in metagenomic 16S survey data. Based on the distribution of compositional scale factor estimates arising from several publicly available large scale 16S count datasets, we argue that detailed experiments specifically addressing the influence of compositional bias in metagenomics are needed. 47 4.1 Classic scale normalization techniques suffer with sparse 16s count data In Fig. 4.2, we plot the feature-wise compositional scale estimates (i.e., ratio of sample proportion to that of the reference; third column entries in Table. 3.1), obtained from TMM and DESeq for a sample in two different 16S microbiome datasets. TMM computes a weighted average over these feature-wise estimates, while DESeq proposes the median. The first column corresponds to a bulk RNAseq study of the rat body map [7]; the second corresponds to those from a 16S metagenomic dataset [139]. Strikingly, while a large number of features agree on their scale factors for a sample arising from bulk RNAseq for both TMM and DESeq strategies, the sparse nature of metagenomic count data makes robust estimation of their scale factors extremely difficult. Furthermore, large variance is also observed across the scale factors suggested by the individual features. Clearly, a moderated estimation procedure is warranted. One might wonder if adding pseudocounts to the orginal count data (a common pro- cedure in metagenomic data analysis [118,140]) effectively deals away with the problem. However, as shown in Fig. 4.3, with large number of features absent per sample, these scale factors roughly reflect the value of the pseudocount, and are systematically scaled down in value as sequencing depth, which is strongly correlated with feature presence, increases. This result suggests that addition of pseudocounts to data need not be the right strategy for deriving normalization scales based on CLR [141] or other similar methods, especially when the data is sparse. The alternate idea of only deriving scale factors based on positive values alone, are also associated with problems as we will see later in the text. 48 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Adrenal Diarrhea ● 0 5000 15000 25000 Features ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 5000 15000 25000 Features Figure 4.2: Estimation of compositional correction scales from sparse count data. On the left column, we plot the feature-wise ratio (Λg ji) estimates adjusted for sample depth from each feature i in one of the samples from the Adrenal tissue of the rat body map dataset (bulk RNAseq), and on the right column, we plot the same values arising from a sample in the Diarrheal dataset (16S metagenomics). The top and bottom rows corre- spond to the scales estimated using TMM and DESeq respectively. In the case of bulk RNAseq data, large numbers of individual feature estimates agree on a compositional scale factor. Simple averaging, or some robust averaging would help us obtain the scale factor exactly. Continued on next page. 4.2 The proposed technique (Wrench) reconstructs precise group-wise compositional factor estimates To overcome the issues faced by existing techniques, we devised an approach based on the following observations and assumptions. First, aggregated group/condition-wise feature count distributions are less noisy than sample-wise feature count distributions, 49 DESeq log(Λ̂ig) TMM log(Λ̂ig) ^ −2 0 2 4 −6 −4 −2 0 2 Figure 4.2: Continued from previous page. A similar robust behavior is observed with all the tissues available in the bodymapRat dataset (considered later in text). On the second column, we plot the feature-wise ratio values from a metagenomic 16S marker gene survey of infant gut microbiota. There is no general agreement among the features on the scale factors, and simple averaging will not work. We note that what we have shown are fairly good cases. Several samples entertain only a few tens of shared species with an arbitrary reference sample within the dataset. In this work, we aimed to model this variability and estimate the scale factors robustly by borrowing information across features and samples. A B C MOUSE MOUSE MOUSE −0.09 −0.06 10.0 11.0 12.0 −0.09 −0.06 10.0 11.0 12.0 LUNG LUNG LUNG −0.20 −0.10 0.00 8 10 14 −0.20 −0.10 0.00 8 10 14 DIARRHEA DIARRHEA DIARRHEA −0.08 −0.04 10.0 11.5 13.0 −0.08 −0.04 10.0 11.5 13.0 TARA TARA TARA −0.30 −0.20 −0.10 15.5 16.5 17.5 −0.30 −0.20 −0.10 15.5 16.5 17.5 Pseudo Log2(Fraction Ftrs. Zero) Log2(Sample Depth) ExperimentalLog2(Fraction Ftrs. Zero) Log2(Sample Depth) Figure 4.3: Adding pseudocounts leads to biased normalization. For each of the four microbiome count datasets (rows: Mouse, Lung, Diarrheal and Tara Oceans ), we plot (A) CLR and (B) DESeq compositional scales obtained after adding a pseudo count value of 1, as a function of fraction of features that are zero in the samples (first column) and the sample depth (second column). The observed behavior was not sensitive to the value of pseudocount used. A similar plot was alos generated for a pseudocount value of 10−7. (C) shows the total number of pseudocounts added, which is essentially the number of features observed in a dataset, and the total actual counts observed in the dataset divided by their sum i.e., the total implied sequencing depth after pseudocounts addition. Continued on next page. 50 Log2(CLR Compositional Scale w. Pseudocount) −18 −16 −16 −14 −22 −18 −14 −15.0 −13.5 −12.0 −18 −16 −16 −14 −22 −18 −14 −15.0 −13.5 −12.0 Log2(DESeq Compositional Scale w. Pseudocount) −0.5 0.0 0.5 −0.3 −0.1 0.1 −1.2 −0.6 0.0 −0.3 −0.1 0.1 −0.5 0.0 0.5 −0.3 −0.1 0.1 −1.2 −0.6 0.0 −0.3 −0.1 0.1 Fraction Count Observations Arising 0.2 0.6 1.0 0.2 0.6 1.0 0.2 0.6 1.0 0.2 0.6 1.0 Figure 4.3: A large fraction of sequencing depth in the new pseudocounted dataset is now arising from pseudocounts than the true experimental counts, when the data is excessively sparse. Indeed, if the pseudocount value is altered to a very low positive fraction value, the boxplots will reflect reversed locations, but this plot is only used to stress the level of alteration made to a dataset. Only in the Tara Oceans project, where the sample depth is 100K reads, do the boxplots shift. However, at a roughly median 90% features absent, that data when altered by pseudocounts, also leads to biased scaling factors as seen in (A) and (B). and it may be useful to Bayes-shrink sample-wise estimators towards that of group-wise global estimates. Second, zero abundance values in metagenomic samples are predom- inantly caused by competition effects induced by sequencing technology (illustrated in Fig. 3.1), and therefore can be indicative of large changes in underlying compositions1 with respect to a chosen reference. Indeed, ignoring sterile/control samples, the median fraction of features recording a zero count across samples in the mouse, lung, diarrheal, human microbiome project [142] and (the very high coverage) Tara oceans [8] datasets were: .96, .98, .98, .98 and .88. These respectively had median sample depths of roughly 2.2K, 4.5K, 3.3K, 4.4K and 100K reads. In direct contrast, this value for the high cov- erage bulkRNAseq rat body map across 11 organs at a median sample depth of 9.7M reads, is .33. Large number of features, extreme diversity, and time-dependent dynamic fluctuations in microbial abundances can result in such high sparsity levels in metage- nomic datasets. When working within the fundamental assumption that most features do not change across conditions, such extraordinary sparsity levels can then be attributed, by and large, to competition among features for being sequenced. As we illustrate in Fig. 4.4, zero observations in a sample are correlated with compositional changes, and truncated 1the idea being that in the limit Λg → ∞, feature-wise relative frequency ratios that reflect Λ−1g , → 0. Ref table 3.1 for discussions. 51 analyses that ignore them (as is done with TMM / DESeq / metagenomic CSS normal- ization techniques) effectively leads to loss of information and results that are opposite to what is expected. True Control ~50X absolute growth (upward change) in the first feature Proportions results in this set of True Case proportions ..1 ..0910 .1 .01 .1 .01 .1 .01 Count > 0 Observed .1 .01 Observed Sequencing Counts Sequencing Counts Count = 0 from Control .1 .01 from Case Samples Samples .1 .01 .1 .01 .1 .01 .1 .01 .1 .01 1. Choose reference set of proportions as that from controls. Λ-1 2. “Positive-only” ratio of proportions (used by TMM/DESeq/CSS) in Case group, case (roughly, ratio of proportions in first feature, as that is the only one that estimates is mostly expressed) = .9/.1 = 9 3. Zero strategy, naive averaging of positive ratios over zeros as well: ( (.9/.1) + 0)/10=.9 Then: X TMM/DESeq/CSS prediction = 1/9 =.11X (downward) change in absolute abundance X Scran fails to reconstruct for case group samples owing to heavy occurrence of zeros. Zero strategy’s prediction = 1/.09 = 1.11X (upward) change in absolute abundance Figure 4.4: Ignoring zeroes can introduce bias in normalization, when zeroes predomi- nantly arise from under-sampling. An artificial example with 10 features and two groups ("controls" and "cases"), when one of the features undergoes a roughly 50X expansion (a log2 fold change of 5.64) in cases compared to controls. This drives the relative frequen- cies of the rest of the 9 features relatively low in the case group. As a result features that are largely present in the controls are not observed in the case group at moderate sequencing depths. Scaling normalization strategies that derive scales based only on the positive count values, can underestimate compositional changes as shown. We now give a brief overview of the technique (Wrench) proposed in this work. 52 More details are presented in the Methods section at the end of this chapter. With average proportions across a dataset as our reference, we model our feature-wise proportion ratios as a hurdle log-normal model2, with feature-specific zero-generation probabilities, means and variances. For the purpose of metagenomic applications, and analytic convenience, we slighty relax the standard assumption that most features do not change across condi- tions by assuming that the feature-wise log-fold changes arise from a zero mean Gaussian distribution, a common assumption in differential abundance analysis [5, 143, 144]. The analytical tractability of the model allows us to standardize the feature-wise values within and across samples, and derive the compositional scale estimates by basing heavy weights on less variable features that are more likely to occur across samples in a dataset. In ad- dition, to make the computed factors robust to low sequencing depths and low abundant features, we employ an empirical Bayes strategy that smooths the feature-wise estimates across samples before deriving the sample-wise factors. Such situations are rather com- mon in metagenomics, and some robustness to overcome heavy sampling variations is desirable. Table. 4.1 succinctly illustrates where current state of the art fails, while more com- prehensive simulations illustrating the effectiveness of the proposed approach presented in Fig. 4.5. To generate table 4.1, roughly, we simulated two experimental groups, with 54K features whose proportions were chosen from the lung microbiome data, and let 35% of features change across conditions (see Methods for details on simulations). The net true compositional change resulting from each simulation, and their corresponding 2the random variable assumes a value of zero with probability π and a positive value based on its specific log-normal distribution with probability (1−π) 53 reconstructions by the various techniques when the count data are generated at different sequencing depths are shown. The following observations form the theme of these, and the more elaborate simulations summarized in Fig. 4.5: 1) TMM/CSS, because they fo- cus on positive-valued observations only, are restricted in the range of scales they can reconstruct. 2) Scran can yield accurate estimators at very large sequencing depths when high feature-wise coverages are achieved. Unfortunately, this behavior is highly depen- dent on the underlying feature proportions and their diversity. 3) Wrench estimators offer better alternatives for under-sampled data, and as we shall observe below in their em- pirical performances, they can still offer robust protection against compositional bias at higher coverages. Similar results were obtained when Wrench was compared to pseu- docounted CLR. In addition, Figs. 4.6, and 4.7 explore simulation performance as a function of group-wise sample size in balanced and unbalanced designs, where we find the performance to stabilize between roughly 10−20 samples, depending on the fraction of features that change across conditions. We briefly note a key ingredient about our simulation procedure. Simulating se- quencing count data as independent Poissons / Negative Binomials – as is commonly done in benchmarking pipelines – does not inject compositional bias into simulated data. From the perspective of performance comparisons for compositional correction, doing so is therefore inappropriate. A renormalization procedure after assigning feature-wise fold- changes is necessary. Alternatively, if absolute concentrations are generated, subsampling to a desired sample depth needs to be performed. 54 Net Compositional Average Λ CLR TMM CSS Scran W W W WChange ( g) Sample Depth 0 1 2 3 36.86X 1M 1.36 1.45 5.41 22.57 19.32 31.44 30.65 32.01 12.08 7.75X 10K .95 3.05 1.47 5.30 6.32 6.31 6.70 (14/40 samples failed) Table 4.1: Example simulations illustrate the limitations of current techniques. Shown are the group-wise true and reconstructed compositional scales from the methods com- pared on two simulated examples, each at different sequencing depths and at different to- tal true concentration changes for a roughly 54K features with control group proportions derived from the Lung microbiome. Low-coverage and/or high compositional changes are problematic for current techniques due to the sparsity they cause in the count data. W1, . . .W3 are Wrench estimators proposed in the Methods section that adjust the base estimator W0 for feature-wise zero-generation properties. All are presented here for com- parison purposes. Our default estimator is W2. 4.2.1 Wrench has better normalization accuracy in experimental data Below, we show five different results illustrating the improvements Wrench offers over existing techniques in experimental data. The first two show that Wrench leads to reduced false positive calls in differential abundance inference, while the other three demonstrate the improved quality of positive associations. Reduction of false positives We used two approaches to compare the performance of Wrench in reducing false positive calls in differential abundance inference. Each of these analyses was performed across all biological groups with atleast 15 samples in the mouse (2 diet types), Diarrheal (2 groups), Tara (5 oceans), HMP (JCVI, 16 body sites), and HMP (BCM, 16 body sites) and averaged the results across these 41 experimental groups. We ignored the lung microbiome for these analyses as Scran had particular difficulty making direct comparisons hard. Owing to the heavy sparsity in these datasets, Scran failed to provide scales for 53 out of 72 samples of the lung microbiome, 10 out of 132 55 A B 0 0 0.0 Sensitivity Specificity FDR −1 −0.5−1 −1.0 1.0 1.0 1.0 −2 −1.5 f =.1 −2 −2.0 0.8 0.8 0.8 −3 −3 −2.5 0.6 0.6 0.6 −4 −3.0 −4 0.4 0.4 0.4 0.2 0.2 0.2 0 0 0 0.0 0.0 0.0 −1 −1 −1 1.0 1.0 1.0 −2 f=.25 −2 −2−3 0.8 0.8 0.8 −3 −4 −3 0.6 0.6 0.6 −4 −5 −4 0.4 0.4 0.4 0.2 0.2 0.2 0 0 0 0.0 0.0 0.0 −1 −1 −1 1.0 1.0 1.0 f=.35 −2 −2 −2 0.8 0.8 0.8 −3 −3 −3 0.6 0.6 0.6 −4 −4 −4 0.4 0.4 0.4 −5 0.2 0.2 0.2 0.0 0.0 0.0 Average Total Reads 4K Average Total Reads 4K 10K 100K Sensitivity Specificity FDR Sensitivity Specificity FDR 1.0 1.0 1.0 1.0 1.0 1.0 0.8 0.8 0.8 0.8 0.8 0.8 0.6 0.6 0.6 0.6 0.6 0.6 f =.1 0.4 0.4 0.4 0.4 0.4 0.4 0.2 0.2 0.2 0.2 0.2 0.2 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 0.8 0.8 0.8 0.8 0.8 0.8 0.6 0.6 0.6 f=.25 0.6 0.6 0.60.4 0.4 0.4 0.4 0.4 0.4 0.2 0.2 0.2 0.2 0.2 0.2 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 0.8 0.8 0.8 0.8 0.8 0.8 f=.35 0.6 0.6 0.6 0.6 0.6 0.6 0.4 0.4 0.4 0.4 0.4 0.4 0.2 0.2 0.2 0.2 0.2 0.2 0.0 0.0 0.0 0.0 0.0 0.0 Average Total Reads 10K Average Total Reads 100K Figure 4.5: Wrench scales outperform competing approaches in reconstructing compositional changes and in differential abundance testing. Multiple iterations of two group simulations are simulated with various fractions of features perturbed across conditions (rows, f in figures), total number of reads. Their average accuracy metrics in reconstruction and differential abundance testing are plotted. The control proportions were set to those obtained from the mouse microbiome dataset. Continued on next page. observations of the mouse microbiome, 6 out of 992 samples of the diarrheal dataset. Notice that Wrench not only recovers compositional scales for these samples, but also at 56 Log2( Reconstructed Abundance Change/True Abundance Change ) Tmm Css Scran Tmm W0 Css W1 Scran W2 W0 W3 W1 W2 W3 Tmm Tmm Css Css Scran Scran W0 W0 W1 W1 W2 W2 W3 W3 Figure 4.5: Continued from previous page. (A) Average log ratios of reconstructed to true concentration changes. Each row corresponds to a particular setting of f , and each column a particular setting of average sequencing depth. Scran also suffered from being unable to provide scales for samples in each simulation set (sometimes as high as 60% of the samples at 4K and 10K average reads). (B) Average sensitivity, specificity and false discoveries at FDR .1 of detecting true differential concentration abundances. W0 is the regularized Wrench estimator without sparsity adjustments and W1, ..W3 are various adjusted estimators compared here. For details on this and simulations, see Methods. Behavior was similar for other parameteric variations (variances of global and sample- wise fold change distributions, number of samples) of simulations. f=.1 f=.25 f=.35                                               5 15 25 35 5 15 25 35 5 15 25 35 Sample Size Sample Size Sample Size                                                 5 15 25 35 5 15 25 35 5 15 25 35 Sample Size Sample Size Sample Size                                               5 15 25 35 5 15 25 35 5 15 25 35 Sample Size Sample Size Sample Size Figure 4.6: Simulation performance in a balanced design. We plot the performance metrics as a function of sample size and fraction of features f that are perturbed in cases. The sample depth was fixed to 10K reads on average per sample. TMM is provided for reference. Legend: Red, Wrench; Black: TMM. 57 FDR Specificity Sensitivity 0.2 0.6 0.80 0.90 1.00 0.4 0.6 0.8 FDR Specificity Sensitivity 0.2 0.4 0.6 0.8 0.80 0.90 1.00 0.4 0.6 0.8 FDR Specificity Sensitivity 0.2 0.4 0.6 0.8 0.75 0.85 0.95 0.3 0.5 0.7 f=.1 f=.25 f=.35                                          20 40 60 80 20 40 60 80 20 40 60 80 Controls Sample Size Controls Sample Size Controls Sample Size                                           20 40 60 80 20 40 60 80 20 40 60 80 Controls Sample Size Controls Sample Size Controls Sample Size                                         20 40 60 80 20 40 60 80 20 40 60 80 Controls Sample Size Controls Sample Size Controls Sample Size Figure 4.7: Simulation performance in an unbalanced design. We plot the performance metrics as a function of sample size and fraction of features f that are perturbed in cases. The total number of case samples were fixed to 20, and the number of control samples were varied to simulate unbalanced designs. So in the plot, a sample size of 20 corresponds to a sample size of 20 for the case sample, and therefore reflects a balanced design. The rest represent unbalanced designs. The sample depth was fixed to 10K reads on average per sample. TMM is provided for reference. Legend: Red, Wrench; Black: TMM. 58 FDR Specificity Sensitivity 0.2 0.4 0.6 0.8 0.86 0.92 0.98 0.5 0.7 0.9 FDR Specificity Sensitivity 0.2 0.4 0.6 0.8 0.85 0.90 0.95 1.00 0.4 0.6 0.8 FDR Specificity Sensitivity 0.2 0.4 0.6 0.8 0.85 0.90 0.95 0.4 0.6 0.8 magnitudes that were coherent with other samples from similar experimental groups (see next subsection) indicating some validity for the computed normalization factors. First, a standard resampling analysis was performed. For every given experimental group, two artificial groups are repeatedly constructed via resampling (without replace- ment), and the total number of significant calls made during differential abundance anal- ysis is recorded in each repetition. For each iterate, we compute the log2(FOther/FWrench) ratio, where FOther is the total number of significant calls made by a competing method (Total Sum / TMM / Scran / CSS ) and FWrench is the total number of significant calls made by Wrench. If Wrench is superior these logged ratios should be > 0. The average of these ratios across all the experimental groups mentioned above is plotted in Fig. 4.8A, and we find Wrench meeting the goal. Although total sum does not show a significant difference in this analysis, as illustrated next, it is insufficient in capturing the null variation in the data. We next exploited the offset-covariate approach introduced in [6]. For every fea- ture/OTU within a homogenous experimental group, two generalized linear models are fitted: in model (a) Wrench normalization factors as offset, and those of a competing method as covariate. In model (b), normalization factors from a competing method as offset, and those of Wrench as covariate. The number of features for which the covariate term was called significant is recorded in both (a) and (b). We will denote them respec- tively as CWrench and COther. If Wrench sufficiently captures the variation in data, the number of times the covariate term from a competing method is called significant will be low. That is: the logged ratio log2(COther/CWrench) must be > 0. The average of these values across all the experimental groups mentioned above is plotted in Fig. 4.8B, and we 59 A B Resampling Offset−Covariate Total Sum Tmm Css Scran Total Sum Tmm Css Scran Figure 4.8: Wrench scales lead to reduced false positive calls. (A) The average of log2(FOther/FWrench) values obtained over artificial two group splits of homogeneous experimental group data is shown and (B) the average of log2(COther/CWrench) values across 41 metagenomic experimental groups are shown. Standard error bars are shown. In both plots, positive values for a method imply reduced accuracy relative to Wrench. FOther: total number of diffferentially abundant features found by a competing method (total sum, TMM, CSS or Scran). FWrench: to- tal number of differentially abundant features found by Wrench. COther: total number of features where the covariate term for Wrench normalization factors were found to be significant when com- peting method is used as offset. CWrench: total number of features where the covariate term for a competing method’s normalization factors were found to be significant, when Wrench is used as covariate. find Wrench to improve upon other techniques. Improved association discoveries To compare the quality of associations achieved with the various normalization methods, we re-analyzed the Tara Oceans 16S microbiome dataset. Even though the contribution of true compositional changes and other technical biases are not identifiable from the compositional scales without extra information, we asked if the reconstructed scales correlate with orthogonal information on absolute abun- dances, and other measures of technical biases. The results are summarized in Table 4.2. 60 Average Relative Normalization Bias 0.00 0.05 0.10 0.15 0.20 0.25 0.30 Average Relative Normalization Bias 0.0 0.5 1.0 1.5 2.0 2.5 Dataset Type CLR TMM CSS Scran W0 W1 W2 W3 Tara Oceans [8] 16s (from Whole Metagenome) 0 (−2.65×10−6) 0.26 0.15 0.52 .58 .54 .53 .53 Rat BodyMap [7] Bulk RNAseq -0.36 0.22 0.16 0.18 .20 .19 .20 .26 Embryonic Stem Cells [62] UMI/scRNAseq -0.70 .70 .67 .67 .71 .70 .70 .68 Table 4.2: Correlations of compositional scales with orthogonal measurements on con- centrations/technical biases. Correlations of logged reconstructed abundance factors (1/compositional correction factor) with logged total flow cytometry cell counts is shown for the Tara project. Correlations of logged normalization factors with logged total ERCC counts are shown in the case of the rat body map and embryonic stem cells datasets. Given the high sparsity in these datsets, CLR factors computed by adding pseudocounts, essen- tially had no information on technical biases. W1, . . .W3 are estimators proposed in the Methods section that adjust the base estimator W0 for feature-wise zero-generation prop- erties. All are presented here for comparison purposes. The default Wrench estimator (W2) compares well at low and high coverage settings. For more details on these and the distinction in terminology between compositional correction factors and normalization factors, refer Materials and Methods. Interestingly, in the very high coverage Tara Oceans metagenomics project, Wrench and Scran estimators achieve comparable correlations (>50%) with absolute flow cytometry measurements of microbial counts from the Tara Oceans project. Scran failed to recon- struct the scales for 3 samples. TMM and CSS had substantially poor correlations. Sim- ilarly, Wrench normalization factors had comparable/slightly better correlations to the total ERCC spike-in counts in bulk and single cell RNAseq datasets. In direct contrast, CLR scale factors (the geometric means of proportions) computed with pseudocounts were either uncorrelated or highly anti-correlated with the aforementioned measurements reflecting technical biases. These results reaffirm that there are advantages to exploiting specialized compositional correction tools even with microbiome datasets teeming with microbes of extraordinary diversity. We next analyzed the quality of differential abundance inference arising from com- peting normalization techniques, by performing two sets of enrichment analyses. 61 Detailed tables presenting the following results are provided as a supplement addi- tional file 2 in our related publication [145]. In the first procedure, we extracted broad genus-level functional annotations from the Faprotax database [146], and tested for their enrichment in positively associated genera in the deep chlorophyll (DCM) and the mesopelagic layer (MES) samples of the oceans relative to the surface layer. The total number of significantly differentially abundant OTU calls were widely different across techniques: Wrench and Scran made roughly 30% fewer calls compared to total sum, TMM, and CSS. Given the relatively general nature of the annotations, all methods yielded ex- pected annotations in the DCM and MES layers based on previous studies, although there were a few differences (additional file 2). Nitrite respiration/reduction/anoxygenic pho- totropy, oil bioremediation were found enriched in mesopelagic layer by all methods, while methanogenesis, a function that is usually associated with mesopelagic and deep sea microbes [8, 146–149] was not found enriched in MES by total sum. Both Wrench and Scran did not find xylanolysis to be enriched in the mesopelagic layer, while other methods did. We were unable to find literature evidence supporting this call, and the re- sult could potentially be due to the higher number of OTUs called differentially abundant by the other methods. Aerobic ammonia/nitrite oxidation and fixation were found to be enriched in DCM by all methods. Total sum and TMM found a methanogenesis related module enriched in DCM, while other methods did not. To evaluate the methods in a more fine-grained setting, we devised the following validation approach. The design of the Tara oceans experiments - where 16S reconstruc- tions are obtained from whole metagenome shotgun sequencing data - makes the follow- ing analysis feasible. Because the Tara project’s functional (gene content summarized as 62 Kegg Modules, KMs) and 16S data arise from the same input DNA samples, the same compositional factors should apply for both datatypes. We therefore estimated compo- sitional factors from 16S data using the different normalization methods and applied the resulting estimates to the KM abundance data from the corresponding matched samples. Next, we computed Spearman rank correlation between OTU and KM normalized abun- dances and annotated OTUs with those KMs which showed correlation of at least 0.75. Finally, we identified OTUs that were positively associated with each layer using dif- ferential abundance analysis. With the KM annotations in place, we performed Fisher exact tests to compute the enrichment scores in the identified OTUs. In mesopelagic sam- ples, Scran finds enrichment in only 30 KMs, while other methods recovered at least 100 KMs. Specifically, ureolysis, motility, several denitrification/methanogenesis processes and aminoacid biosynthetic/transport mechanisms (functions that have been attributed to microbes in the mesopelagic layer and deep sea) [8, 146, 150, 151], were missed by Scran, while Wrench finds them. On the other hand, Total sum, TMM and CSS found more varied and general processes including various ribosomal, transcription/translation components to be enriched in both MES and DCM layers. Notice that the first analysis gives a broad sense of the genera identified by the competing methods in light of existing annotations, while the second gives a sense of the quality of annotations one might confer on the OTUs based on the normalized expression levels of OTUs and the measured functional content themselves. In both cases, Wrench is shown to retain relevant information, and the relatively more specific nature of the latter analysis reveals that Wrench demonstrably improves upon other methods. 63 4.3 Inferences following compositional correction show improved co- herence with experimental data We further demonstrate the impact of compositional bias in downstream inference below. The experimental cell density measurements in the Tara Oceans project show a highly significant overall reduction in the mesopelagic samples when compared the surface layer (see Fig. 3 in ref [8]). Thus, we expect an overall negative change in the reconstructed fold changes, when performing a differential abundance analysis of the OTUs across these two ocean layers. Summing the log-fold changes of significantly associated OTUs (both positive and negative) serves as a measure of a net change experienced by a community. If a given method produces fold change inferences that track the above mentioned empirical cell density measurements, we expect it to yield an overall negative net change value for the significantly differentially abundant OTUs in the mesopelagic community. As illus- trated in Fig. 4.9A, this value for total sum normalized data is +10577.99, while that for Wrench is −8919.65, showing that differential abundances arising from Wrench agrees more appropriately with the underlying community change. Fig. 4.9B and C, show how these values distribute across the major phyla focussed in the Tara oceans article. These plots demonstrate that the two approaches lead to markedly different conclusions on the net change experienced by a phylum. In particular, Proteobacteria, Actinobacteria, Eur- yarchaeota were predicted to have drastically high positive changes by total sum (while Wrench predicts a marked decrease in the negative direction), and sizable differences were apparent in the values obtained with the rest of the phyla. 64 A B 4000 Total Sum 0 3000 −1000 −2000 2000 −3000 1000 −4000 0 −5000 Wrench −1000 −6000 Total Wrench Sum Figure 4.9: Wrench normalized data lead to better downstream inferences. (A) The sum of log-fold changes of differentially abundant OTUs is used as a measure of net change experienced by a community. This value is plotted for the differentially abundant OTUs in the mesopelagic ocean layer relative to the surface layer in the Tara oceans 16S data, for Total Sum and Wrench normalization. (B) The same metric plotted for various major phyla of interest in the Tara oceans project. 4.4 Compositional scale factor estimates imply substantial technical biases, indicating importance of further experimental studies We next analyzed the phenotypic integrity of the compositional scales reconstructed by the various methods. In the absence of technical biases, following our discussion in the previous subsection, compositional factors should hover around 1 (upto some arbitrary scaling). This is not what we observe in samples from metagenomic datasets. All scale normalization techniques resulted in group-wise integrity in the scales they reconstructed within and across related phenotypic categories, potentially indicating the general impor- tance of correcting for confounding induced by compositional bias in general practice. Total sum normalization is oblivious to these biases, making further experimental stud- 65 Net Change Community Wide −5000 0 5000 10000 Net Change undef Proteobacteria Bacteroidetes Deferribacteres Planctomycetes Verrucomicrobia Cyanobacteria Chloroflexi Actinobacteria Thaumarchaeota Euryarchaeota undef Proteobacteria Bacteroidetes Deferribacteres Planctomycetes Verrucomicrobia Cyanobacteria Chloroflexi Actinobacteria Thaumarchaeota Euryarchaeota ies on compositional bias important. For instance, in the microbiome samples arising from the Human Microbiome Project [142], as shown in Fig. 4.11A, we noted system- atic body site-specific global deviations in the fold change distributions. This is similar to what was illustrated with the Tara project in Fig. 4.1. We found the reconstructed compositional scales to largely organize by body sites, across normalization techniques (Fig. 4.11B), behind-ear and stool samples were distinctly located in terms of their com- positional scales from the oral and vaginal microbiomes (notice the log scale in these plots). This behavior was also recapitulated in scales reconstructed from other centers. Similar results were obtained for samples arising from the J. Craig Venter Institute. In the case of the mouse microbiome samples, most normalization techniques predicted a mild change in differential feature content across the two diet groups (Fig. 4.11C, and ). In the lung microbiome, the lung and oral cavities had roughly similar scales across smokers and non-smokers , while scales from the probing instruments had relatively higher variability, which we found to directly correlate with the high variability of feature presence in the count data arising from these samples. In the diarrheal datasets of children, however, no significant compositional differences were found across the various country/health-status populations (Fig. 4.11D). For completeness, we also attach similar results from all the 11 organs of the rat body bulk RNASeq dataset in Fig. 4.10. We noted that the rat body map samples also showed systematic tissue-specific global deviations in the expressed features’ fold change distribution. Fig. 4.10 shows this result and the general behavior of compositional scales across various methods compared and a few related statistics of the dataset. Given that these samples arise from a well designed series of experiments, the similarity in the scales 66 within and across related tissues, and across normalization methods, is striking; the ob- served trend in the reconstructed scales could indeed reflect underlying true compositional differences for the most part. TMM and CSS ascribe substantially deviated scales to mus- cle, heart and liver tissues, when compared to Scran and Wrench estimators. This effect may be due to the truncated estimation strategy which biases the scales for a relatively fewer but highly expressed genes in these tissues. Nevertheless, these results indicate po- tentially heavy compositional bias injected into downstream differential abundance anal- ysis that compare tissues of different types. Compositional bias can be costly not only in metagenomics, but even in common bulk-RNAseq studies. 4.5 Methods 4.5.1 An approach (Wrench) for compositional correction of sparse, ge- nomic count data Briefly, our normalization strategy can be described as follows. Based on eqn. 2.2, for a chosen reference vector q0·, accounting for sample depth τg j, the mean model for the [observe]d positi[ve count o]f the i th ( feature c)an be written as: logE[Yg ji|Yg ji > 0] = log qg jiτg j = log qg ji q τ ≡ log θ q τ , where θ = Λ−1q 0i g j g ji 0i g j g ji g j νg ji. Thus the true0i ratio of proportions θg ji encapsulate both the constant Λ−1g j and the concentration fold changes νg ji, and can be viewed as the net fold change experienced by feature i in sample j from group g. For the purpose of metagenomic applications, and analytic convenience, we slighty relax the standard assumption that most features do not change across condi- tions by assuming that the feature-wise log-fold changes logνg ji arise independently from 67 A 0.45 1.4 24.5 0.07 1.2 0.40 24.0 0.06 1.0 23.5 0.05 0.8 0.35 23.0 0.04 0.6 0.30 22.5 0.03 0.4 0.25 22.0 0.020.2 21.5 0.0 0.01 B 1.0 2.0 0.5 1 1.50.5 0 1.00.0 0.0 0.5 −0.5 −1 0.0−0.5 −2 −0.5 −1.0 −1.0 −1.0 Figure 4.10: Importance of compositional correction in common bulk RNAseq studies. (A) Application of scaling techniques to the rat body map data across tissues. Median positive ratio: median of the positive ratios of group-averaged proportions to that of Adrenal chosen as the reference. Subsequent figures in the top row indicate higher sparsity levels in the heart, muscle and liver samples, although at sequencing depths that are comparable/slightly higher to those from other tissue groups. (B) Reconstructed scales from several normalization techniques. If one were to perform a differential expression analysis between Testes and Heart, the fold changes are roughly 4X (ratio of medians) inflated as predicted by Scran/Wrench, which can lead to high false positive rates especially if most features are not changed across the two tissues. Notice the similarity in scales for closely related tissues, across techniques; for these tissues, the influence of compositional bias in the related differential abundance tests will be low. a zero mean Gaussian distribution, a common assumption in differential abundance anal- ysis [5, 143, 144]. Assuming independence across features i, it then follows that logθg ji follows a Gaussian distribution with a mean parameter logΛ−1g j . Thus, a robust location estimate of θg ji for every sample leads us to the desired compositional scale estimate Λ̂g ji. Below, we first illustrate how the θg ji are estimated, and subsequently discuss the robust averaging procedure. 68 log2(Tmm) Median Positive Ratio Muscle Heart Heart Muscle Liver Liver Kidney Kidney Adrenal Adrenal Brain Brain Testes Spleen Spleen Thymus Thymus Testes Lung Lung Uterus Uterus log2(Css) Fraction Ftrs. Undetected/Absent Muscle Testes Heart Lung Liver Brain Testes Uterus Adrenal Thymus Kidney Spleen Thymus Kidney Brain Adrenal Spleen Heart Lung Muscle Uterus Liver log2(Scran) Log2(Sample Depth) Heart Thymus Muscle Uterus Liver Liver Kidney Spleen Adrenal Brain Spleen Lung Thymus Muscle Lung Adrenal Uterus Testes Brain Kidney Testes Heart log2 W2 Total Spike−ins Proportion Heart Heart Muscle Testes Liver Adrenal Adrenal Kidney Kidney Lung Spleen Brain Thymus Muscle Lung Uterus Uterus Thymus Brain Spleen Testes Liver Model We assume the following hurdle log-normal model for the counts Yg ji: 0 with probability πg ji Yg ji ∼ ,eZg ji with probability (1−πg ji) Zg ji = l︸og︷︷q0︸i + ︸log︷︷τg︸j + ︸logζ0g +︷µ︷g j +ag j︸i +εg ji, log-reference log-sample depth =logθg ji, log net fold change relative to reference ag ji ∼ N(0,η20g), g = 1 . . .G, ( εg)ji ∼ N(0,σ20i), i = 1 . . . p, π log g ji = βi1 +βi2 logτg j +possibly other covariates1−πg ji (4.1) The model assumes the following. For each sample j from group g, the ith feature’s count value is sampled from a hurdle log-normal distribution, in which with probability πg ji, a value of 0 is realized; and with probability 1−πg ji a positive count is observed. The probabilities πg ji are determined by sample covariates, including the total sequencing depth. The positive count value is realized as an exponential of a Gaussian random vari- able Zg ji the mean of which is determined (in accordance with the eqn. 2.2) by the chosen reference value q0i, sample-depth τg j, and the net fold change θg ji = ν −1g ji ∗Λg j , the log of which has been modeled in the above equation as a sum of group-wise effect (logζ0g), two-way group-sample interaction (µg j), a three-way group-sample-feature interaction random effect ag ji and a noise term. 69 Estimation of regularized ratios θ̂g ji: In the model, the 0 subscripted parameters are considered known, and are determined the following way. τg j = Yg j+ is the to- tal count of sample g j. The reference value for each feature i, q0i, is set to the aver- age proportion value q̂++i, where q̂g ji is the observed proportion of feature i in sample g j, i.e., q̂g ji = Yg ji/Yg j+ = Yg ji/τg j . The raw ratio of proportions are then given as: r qg jig ji = q . The mean and variance parameters logζ0g and η 2 0g of the Gaussian random ef-0i fects distribution on the logθg ji are determined based on the moments of the correspond- ing empirical distribution of the group-wise pooled raw ratios of proportions. Specif- ically, we fix the group-wise compositional scale ζ0g = rg+i i.e., as the average of the raw ratios including the zero values (following discussions in Fig. 4.4). We set the variance parameter η2 10g = ∑ I ∑i:Yg ji>0(logrg ji− logrg+i) i.e., as the empirical vari-i [Yg ji>0] ance of the logged-ratios. Finally, the feature-specific expression variances σ20i are fixed with values obtained from Limma/Voom. With the above fixed, the unknown parame- ters µg j(and ag ji are estim)ated/predicted using standard random(effects estimators: µ̂g j =2 ) ∑i wg ji logr 1 σ0i g ji− logζ0g with wg ji ∝ σ2+η2 , and âg ji = σ2+η2 logrg ji− logζ0g− µ̂g j .0i 0g 0i 0g The identifiability of these terms is ensured as the other variance components are fixed. The π̂g ji are estimated with logistic regression. The regularized ratios are then calculated as: θ̂g ji = exp(logζ0g + µ̂g j + âg ji). Robust averaging of the θ̂g ji: While averaging over the regularized ratios W0 =: 1 p ∑i θ̂g ji would be one estimation route to Λ −1 g j , better control can be achieved by tak- ing the variation in the feature-wise zero generation into account. We shall notice that E[rg ji|rg ji > 0] = θg ji · σ 2 e 0i/2 2 , and so a robust averaging over θ̂ σ /2g ji/e 0i , can serve as an 70 estimator of Λ−1g j . One might choose the weights for averaging to be proportional to that of the inverse hurdle/inclusion probabilities (as is done in survey analysis) ∝ 1/(1− π̂g ji) or on the inverse marginal variances ascribed by our model above ∝ 1 2 2 . −π̂ σ +η(1 g ji)(π̂g ji+e 0i 0g−1) σ20i/2 An estimator that we also found to work well empirically is a weighted average of θ̂g ji/e1−π̂ withg ji weights proportional to 1σ2 . The next subsection sketches the derivations for the weights.0i An advantage of these weights (and hence the model) is that the weighting strategies proceed smoothly for features with zero expression values as well, unlike the binomial weights employed in the TMM procedure. Furthermore, when constructing averages, the weights have a favorable property of downweighting zeroes at higher sample depths relative to those in samples at lower sample depths. In summary, we explored the performance of the following estimators for sample- wise compositional factors: 1 W0g j =: ∑ θ̂g ji = θ̂g+ j,p i 1 W1g j =: ∑wg jiθ̂g ji, with wp g ji ∝ 1/(1− π̂g ji)i 1 ∑ 1W2g j =: wg jiθ̂g ji, with wp g ji ∝ ( )( )2 2 (4.2)i 1− π̂g ji π̂g ji + eσ0i+η0g−1 1 θ̂g ji 1W3g j =: ∑wg ji , with w ∝p i 1− π̂ g jig ji σ20i The compositional bias corrected data is then obtained by dividing each sample’s count data with its corresponding estimated compositional correction factor. For instance, if W2 is the choice of estimator, the bias corrected data for sample g j is Yg j/W2g j. 71 We have found W1,W2 and W3 to work comparably well in simulations and empiri- cal comparisons, and W0 slightly less so at high sparsity levels at low sample depths. We prefer W2 as it systematically integrates both the hurdle and positive component varia- tions. In our software implementation, users have the option for other weighted variants, and whether weighted averaging over zeroes is necessary as they see fit. Software docu- mentation for Wrench embarks on further discussions on these ideas. 2 Derivation of marginal variance weights 2Setting φ σ0i = e 0i/2, and γ0g = eη0g/2, we have: Varθ (E(Yg ji|θg ji)) =V(arθ ((1−πg ji)θg jiτ)g jq0iφ0i)2 = (1−π 2 2 2g ji)τg jq0iφ0i ︸(γ0g−︷1︷)γ0gζ0︸g (4.3) group specific contribution Now, if we let Z to be an indicator random variable denoting whether a feature was zero or positive: Var(Yg ji|θg ji) = EZ(Var(Yg(ji|θg ji,Z))+V)a[rZ(E(Yg ji|θg ji,Z])) (4.4) = (1−πg ji) θg jiτg jφ0iq 2 2 0i πg ji +(φ0i−1) Similarly, E(θ 2g ji) =V(ar(θg ji))+E(θ 2 g ji) = γ2 −1 γ2 ζ 2 ( ) ( 0g ) 0g 0g + ζ 2 0gγ0g (4.5) = ζ γ 2 γ20g 0g 0g 72 Together, eqns. 4.3 and 4.4 lead to: [ ] ( )2 E(Var(Y 2 2 2g ji|θg ji)) = (1−πg ji) πg ji +(φ0i−1) (q0iτg jφ0i) γ ζ0g (4.6)0g Eqns. 4.3 and 4.6 then imply: [ ] Var(Yg ji) = (1−π 2 2 2 2 2g ji)(q0iτg jφ0i) [πg ji +φ0iγ0g−1] γ0 ζg 0g (4.7) ∝ (1−π 2g ji)(q0iτg jφ0i) π 2 2g ji +φ0iγ0g−1 The variances for the adjusted ratios then follows from straightforward calculations, the inverse of which take the weight forms shown in in the previous subsection. Data We principally demonstrate our results with five datasets from metagenomic sur- veys. A smoking study (n = 72) where the lung microbiome of smokers and non-smokers were surveyed (along with the instruments that were used to sample the individual). A diet study in which the gut microbiomes (n = 139) of carefully controlled laboratory mice fed plant-based or western diets were sequenced [152]. A large scale study of human gut microbiomes (n = 992) from diarrhea-afflicted and healthy children from various developing countries [139]. 16S metagenomic count data corresponding to all these studies were obtained from the R/Bioconductor package metagenomeSeq [5]. The Tara Oceans project’s 16S reconstructions from whole metagenome shotgun sequenc- ing (n = 139) deposited in http://ocean-microbiome.embl.de/data/ was obtained from file miTAG.taxonomic.profiles.release.tsv.gz. The flow cytometry counts for autotrophs, bac- teria, heterotrophs, picoeukaryotes were obtained from TaraSampleInfo_OM.CompanionTables.txt 73 from the same website and summed to serve as a rough measure of total cell count that cor- relates with sequence-able DNA material. The Human Microbiome Project count data in file HMQCP/otu_table_psn_v35.txt.gz was downloaded from http://downloads.hmpdacc.org/data/, and the associated metadata are from v35_map_uniquebyPSN.txt.bz2 under the same website. The processed bulk-RNAseq data corresponding to the rat body map from [7] was obtained from [153]. The Unique Molecular Identifier (UMI) single cell RNAseq data from Islam et al., [62] was downladed from GEO under accession GSE46980. Implementation of normalization and differential abundance techniques All anal- ysis and computations were implemented with the R 3.3.0 statistical platform. EdgeR’s compNormFactors for TMM, DESeq’s estimateSizeFactors, Scran’s computeSumFactors (with positive=TRUE in sparse datasets) and metagenomeSeq’s calcNormFactors for CSS were used to compute the respective scales. Implementation of CLR factors used a pseudo-count of 1 following [140], and were computed as the denominator of column 3 in table 3.1. Limma’s eBayes in combination with lmFit, edgeR’s estimateDisp, glmFit and glmLRT, DESeq2’s estimateDispersionsGeneEst and nbinomLRT were used to perform differential abundance testing [144]. Welch’s t-test results were obtained with t.test. Simulations Given a set of control proportions q1i for features i = 1p, and the frac- tion of features that are perturbed across the two conditions f , we sample the set of true 74 log fold changes ( logνgi ) from a fold change distribution (fold change distribution) for those randomly chosen features that do change. The fold change distribution is a two- parameter distribution chosen either as a two-parameter Uniform or a Gaussian. Based on the expressions from the first subsection of the results section, the target proportions were then obtained as q ν= giq1igi ∑ ν q . Conditioned on the total number of sequencingk gk 1k reads τ , the sequencing output Ygi· for all i were obtained as a multinomial with propor- tions vector q pg· = [|qgi|]i=1. We set the control proportions from various experimental datasets (specifically, mouse, lung and the diarrheal microbiomes). With this setup, we can vary f , and the two parameters of the fold change distribution, and ask, how various normalization and testing procedures compare in terms of their performance. For bulk RNAseq data, as illustrated in the previous chapter, we simulated 20M reads per sample. For comparison of Wrench scales with other normalization approaches, we altered the above procedure slightly to allow for variations in internal abundances of features in observations arising from a group g. We used νgi ( where the bar indicates this value will now assume the role of an average) generated above as a prior fold change for observation- wise fold change generation. That is, for all samples j ∈ 1 . . .ng for all g, where ng represents the number of samples in group g, for all i (including the truly null features), sample νg ji from LN(logν 2 2gi, σ̃ν ) for a small value of σ̃ν = .01. This induces sample specific variations in the proportions within groups. Notice that this makes the problem harder and more realistic, as feature marginal count distributions now arise from a mixture of distributions. Based on empirically observed MA plots for our metagenomic datasets, we set the mean and standard deviation of prior log-fold change distribution to 0 and 3 respectively. For generating 16S metagenomic-like datasets, logged sample depths were 75 sampled from a log-normal distribution with logged-standard deviation of .25 and logged- means corresponding to log(4K), log(10K) and log(100K) reads. These parameters were chosen based on comparisons with M-A plots, the sparsity levels and total sample depths observed in current experimental datasets. We repeated simulations for 20 iterations. In both versions of simulations, the total induced abundance change relative to that of the control is Λ = νTg j g j·q1·, where νg j· is the vector of fold changes for sample j in group g, and q1· is the average vector of feature-wise control proportions. As it can be seen from the expression for Λg j, notice that perturbing features with very low relative frequencies do not demonstrably induce compositional bias at low sample depth settings (unless perturbed by very high fold changes). So for every simulation iteration, the frac- tion f of features that were perturbed in cases were chosen randomly according to their control proportions. We apply the term compositional correction factor for Λ−1g j and the term normalization factor for a sample as the product of its compositional correction fac- tor with something that is proportional to that of its sample depth. Thus, all technical artifacts like total abundance changes, but sample depth, are incorporated into the defini- tion of compositional factors. Performance comparisons For simulations, we used edgeR as the workhorse fitting toolkit. The compositional scale factors provided by all normalization methods were provided to edgeR as offset factors. We define detectable differential abundance in our simulated count data as follows. For each simulation, as we know the true compositional factors, we input them as normalization factors in edgeR, and the detectable differences in abundances are recorded. All the performance metrics are then defined based on this 76 ground truth. Because we are interested in fold changes and their directions, the perfor- mance metrics we report are redefined as follows: Sensitivity as the ratio of the number of detectable true-positives with true sign over the total number of positives, False discovery as the ratio of the number of detectable true positives with false sign and false positives, over the total number of significant calls made. The offset-covariate analysis followed the procedure in [6]. For resampling anal- ysis, samples from each experimental group (with atleast 15 samples) were split in half randomly to construct two artificial groups. Normalization factors from each method were then used to perform differential abundance analysis, and the total number of differ- entially abundant calls were recorded. The procedure was repeated for ten iterations for each group, and the results were averaged across 41 experimental groups. Those samples for which Scran fails to reconstruct normalization scales were discarded from differen- tial abundance analyses to avoid any power differences while testing. The normalization scales however, were obtained with all data for each method. Fisher exact tests were used to perform functional enrichment analyses for posi- tively associated OTUs. A Genus level functional enrichment analysis was first performed by aggregating annotations from the FAPROTAX1.1 database [146] at the Genus level. A more specific OTU level functional enrichment analysis was devised as follows. Because the Tara Oceans Kegg module (KM) abundance data (downloaded from TARA243.KO- module.profile.release.gz, under http://ocean-microbiome.embl.de/data/) and the 16S re- constructions are obtained from the same input DNA through whole metagenome shot- gun, the same compositional factors apply to both datatypes. Each normalization ap- proach’s compositional factors for 16S data was used to rescale the KM relative frequen- 77 cies. This normalized KM data was used to annotate each OTU by (normalized) KMs that Spearman correlate at a value of atleast .75. 4.6 Discussions For some researchers, statistical inference of differential abundance is a question of differences in relative frequencies; for others, it is a matter of characterizing differ- ences in absolute abundances/concentrations of features expressed in samples across con- ditions [54,154]. In this work, we took the latter view and aimed to characterize the com- positional bias injected by sequencing technology on downstream statistical inference of concentrations of genomic features. It is clear that the probability of sequencing a particular feature (ex: mRNA from a given gene or 16S RNA of an unknown microbe) in a sample of interest is not just a function of its own fold change relative to another sample, but inextricably linked to the fold changes of the other features present in the sample in a systematic, statistically non- identifiable manner. Irrevocably, this translates to severely confounding the fold change estimate and the inference thereof resulting from generalized linear models. Because the onus for correcting for compositional bias is transferred to the normalization and testing procedures, we reviewed existing spike-in protocols from the perspective of composi- tional correction, and analyzed several widely used normalization approaches and differ- ential abundance analysis tools in the context of reasonable simulation settings. In doing so, we also identified problems associated with existing techniques in their applicability to sparse genomic count data like that arising from metagenomics and single cell RNAseq, 78 which lead us to develop a reference based compositional correction tool (Wrench) to achieve the same. Wrench can be broadly viewed as a generalization of TMM [52] for zero-inflated data. We showed that this procedure, by modeling feature-wise zero gen- eration, reduces the estimation bias associated with other normalization procedures like TMM/CSS/DESeq that ignore zeroes while computing normalization scales. In addition, by recovering appropriate normalization scales for samples even where current state of the art techniques fail, the method avoids data wastage and potential loss of power during differential expression and other downstream analyses Some practically relevant notes on the application of proposed method to metage- nomic datasets follow. First, our choice of methodology and simplifying assumptions were principally determined by the scale and sparsity of the 16s metagenomic datasets and estimation robustness. While fully joint parameter inference algorithms will certainly be more accurate, they can be unwieldy and computationally intensive with large scale datasets boasting a large number of features with high sparsity. A case in point is the neat GAMLSS methodology [9], which improved over the proposed pipeline (Wrench nor- malization coupled with edgeR differential abundance analysis) in a small scale equimo- lar miRNA benchmarking dataset, but could not run to completion in the simplest of our metagenomic datasets, the mouse gut microbiome. In Fig. 4.12, we present the same benchmarking analysis as in Fig. 7 of Argyropolous et al., [9] for DeSeq2, GAMLSS, Wrench normalization + EdgeR and Scran normalizaiton + EdgeR pipelines for differen- tial abundance. Second, our simulation results indicate that the performance of Wrench stabilizes by 10−20 samples per group depending on sample depth and the fraction of features that 79 change across conditions. While in our experience, this is very well within the limits of practically realized sample sizes in metagenomic experiments, at very low sample sizes and very low sample depths (less than a few thousand reads per sample), some care might be necessary. For instance, coherence of the reconstructed sample-wise compositional scales within groups relative to the experimental design can be checked and deviations from expectations analyzed/corrected. Third, our current implementation exploits cate- gorical group information/factors alone (e.g., cases and controls), and extension to con- tinuous covariates (e.g., age, time) underlying the sampling design are planned for future work. If a continuos covariate is present, converting it to factors by discretizing its range in to non-overlapping windows is an option that the analyst can entertain. Furthermore, because group information is exploited during normalization, our proposed methodology is not immediately applicable for classification purposes. In such applications, imme- diate extensions of the proposed empirical Bayes formalism by assuming priors on the unknown-sample’s group membership (based vaguely, for example, on clustering dis- tances) can be done, and is planned for future work. A few important insights on compositional bias emerge from our theory, simulation and experimental data analyses. In our simulations, we found reference based normal- ization approaches to be far superior in correcting for sequencing technology-induced compositional bias than library size based approaches. From a more practically relevant perspective, we found that in all the tissues from the rat body map bulk RNAseq dataset, the scale factors can be robustly identified. We expect that in other bulk RNAseq datasets, the assumptions underlying compositional correction techniques to hold well. These re- sults reinforce trust in exploiting such scaling practices for other downstream analyses of 80 sequencing count data apart from differential abundance analysis; for example, in esti- mating pairwise feature correlations. In the regimes where assumptions underlying these techniques are met, an analyst need not be restricted to scientific questions pertaining to relative frequencies alone. The fundamental assumption behind all the aforementioned techniques is that most features do not change across conditions (or the closely related assumption that the log-fold change distribution is centered at 0). As we illustrated, these assumptions appear to hold rather well in bulk RNAseq. Do we expect these to hold in arbitrary microbiome datasets as well? This question is not easy to address without more experiments, but the relatively high correlations obtained with orthogonal measurements of technical biases, the similarity in the compositional scales obtained within samples arising from biological groups, and their sometimes highly significant shifts preserved across normalization techniques and across sequencing centers in large scale studies cer- tainly reinforce the critical importance of characterizing compositional biases, if any, in metagenomic analyses by establishing carefully designed spike-in protocols. In particular, given the inverse dependence of compositional correction factors on the total feature con- tent in the absence of technical biases, the large compositional scale estimates obtained for stool samples (across all normalization techniques) is suspect. Compositional effects can amplify even when a few features experience adverse technical perturbations, and only carefully designed experiments can isolate these effects to inform further normal- ization approaches. Finally, our results also emphasize the tremendous care one needs to exercise before applying the most natural normalizations based on total sequencing depth or by applying pseudocounts when the data is excessively sparse (CLR, RPKM, CPM, rarefication are a few examples). 81 This brings us to the question of how effective spike-in strategies are in enabling us to overcome compositional bias. It is immediately clear that the widely used ERCC recommended spike-in procedure for RNAseq cannot help us in overcoming confounded inference due to compositional bias for the simple reason that it already starts with an extract, a compositional data source. If one is able to add the spike-in quantities at a prior stage during feature extraction, we would have some hope. Lovén et al., [113] demon- strate a procedure for RNAseq that precisely does this, in which the spike-ins are added at the time when the cells are lysed and suspended in solution [114]. One can perhaps extend these solutions to metagenomics, where we may expect confounding due to composition- ality to be heavy by adding barcoded 16S RNAs during feature extraction. We expect similar problems to arise in other genomic and epigenetic measurement techniques that exploit sequencing technology, and the need for the development of appropriate spike-in procedures should be addressed. Finally, it is imperative that we enforce new tools and techniques for normalization and differential abundance analysis of sequencing count data be benchmarked for com- positional bias at least in the simulation pipelines. Data analyses based on large-scale integrations of different data types for predicting clinical phenotypes is increasingly com- mon, and care should be taken to include effective normalization techniques to overcome compositional bias. We hope the results and ideas presented and summarized in this work enables a researcher to do just that. 82 4.7 Conclusions Compositional bias, a linear technical bias, underlying sequencing count data is in- duced by the sequencing machine. It makes the observed counts reflect relative and not absolute abundances. Normalization based on library size/subsampling techniques can- not resolve this or any other practically relevant technical biases that are uncorrelated with total library size. Reference based techniques developed for normalizing genomic count data thus far, can be viewed to overcome such linear technical biases under reasonable assumptions. However, high resolution surveys like 16S metagenomics are largely under- sampled and lead to count data that are filled with zeroes, making existing reference based techniques, with or without pseudocounts, result in biased normalization. This warrants the development of normalization techniques that are robust to heavy sparsity. We have proposed a reference based normalization technique (Wrench) that estimates the overall influence of linear technical biases with significantly improved accuracies by sharing in- formation across samples arising from the same experimental group, and by exploiting statistics based on occurrence and variability of features. Such ideas can also be exploited in projects that integrate data from diverse sources. Results obtained with our and other techniques, suggest that substantial compositional differences can arise in (meta)genomic experiments. Detailed experimental studies that specifically address the influence of com- positional bias and other technical sources of variation in metagenomics are needed, and must be encouraged. 83 A 1.00 16 4 HMP, Baylor 0.99 15 3 0.98 2 140.97 1 0.96 13 0 0.95 12 −1 0.94 11 −2 0.93 10 B 2 4 3 2 0 2 0 1 −2 0 −2 −1 −4 −4 −2 −6 −6 −3 C D Mouse Diarrhea 3 3 4 4 1 1 0 0 −1 −1 −4 −4 Figure 4.11: Wrench retains potential biological information, and indicates importance of compositional correction in general practice. We plot some statistical summaries and the com- positional scale factors reconstructed by a few techniques for various Human Microbiome Project samples, sequenced at the Baylor College of Medicine. (A) On the top-left, we plot the logged median of the positive ratios of group-averaged proportions to that of Throat chosen as the ref- erence group. Stool samples show considerable deviation from the rest of the samples despite having comparable fraction of features detected and sample depths to other body sites. Notice the log scale. (B) The similarity in the reconstructed scales across techniques (second row) for closely related body sites are striking; although minor variations in the relative placements were observed across centers potentially due to technical sources of variation, the overall behavior of highly significant differences in the scales of behind-ear and stool samples were similar across sequencing centers and normalization methods. Continued on next page. 84 Log2(Tmm) Log2(Median Positive Ratio) Log2 (Scran) Log2 (Scran) Western Right Retroauricular crease Right Retroauricular creaseLeft Retroauricular crease Left Retroauricular crease Right Antecubital fossa Anterior nares Anterior nares Vaginal introitus Left Antecubital fossa Right Antecubital fossa Buccal mucosa Posterior fornix Vaginal introitus Buccal mucosa BK Hard palate Hard palate Attached Keratinized gingiva Left Antecubital fossa Palatine Tonsils Palatine Tonsils Mid vagina Throat Posterior fornix Subgingival plaque Saliva Tongue dorsum Log2 (W2) Tongue dorsum Supragingival plaque Throat Attached Keratinized gingiva Supragingival plaque Saliva Subgingival plaque Mid vagina Stool Stool Western Log2(Scran) Fraction Ftrs. Undetected/Absent BK Right Retroauricular crease Saliva Left Retroauricular crease Subgingival plaque Anterior nares Supragingival plaque Posterior fornix Palatine Tonsils Attached Keratinized gingiva Hard palate Vaginal introitus Tongue dorsum Hard palate Throat Log2 (Scran) Buccal mucosa Buccal mucosaRight Antecubital fossa Stool Mid vagina Attached Keratinized gingiva Throat Right Antecubital fossa Left Antecubital fossa Left Antecubital fossa Tongue dorsum Anterior nares Palatine Tonsils Right Retroauricular crease Control Saliva Left Retroauricular crease Supragingival plaque Vaginal introitus Subgingival plaque Mid vagina Stool Posterior fornix Case Log2(W2) Log2(Sample Depth) Log2 (W2) Left Retroauricular crease Left Antecubital fossa Right Retroauricular crease Vaginal introitus Anterior nares Attached Keratinized gingiva Hard palate Throat Buccal mucosa Saliva Control Tongue dorsum Mid vagina Throat Posterior fornix Attached Keratinized gingiva Supragingival plaque Palatine Tonsils Anterior nares Right Antecubital fossa Left Retroauricular crease Left Antecubital fossa Hard palate Case Posterior fornix Buccal mucosaSaliva Stool Mid vagina Tongue dorsum Vaginal introitus Palatine Tonsils Subgingival plaque Subgingival plaque Supragingival plaque Right Antecubital fossa Stool Right Retroauricular crease Figure 4.11: Continued from previous page. These techniques predict a roughly 4X-8X (ratio of medians)inflation in the Log2-fold changes when comparing abundances across these two body sites. (C) Wrench and scran compositional scale factors across the plant- based diet (BK) and Western diet (Western) mice gut microbiome samples. (D) Compo- sitional scale factors for healthy (Control) and diarrhea afflicted (Case) children. Slight differences in the compositional scales are predicted in the diet comparisons with t-test p-values < 1e-3 for all methods except TMM, but not as much in the diarrheal samples. DESeq2 gamlss 1.2 0.9 0.6 0.3 0.0 0.73 0.32 scran wrench 1.2 0.9 0.6 0.3 0.0 1.02 0.46 −3.0 −1.5 0.0 1.5 −3.0 −1.5 0.0 1.5 Fold Change (log10) Figure 4.12: Benchmarking analysis of the small scale, high coverage Argyropolous et al., miRNA dataset for deviatioin from expected fold changes in the clustered symmetric DE without global changes in expression ratiometric A versus B. Same as Fig. 7 in [9]. The shown numbers measure deviation of the reconstructed fold changes from the true ex- pected fold changes by experimental design, for the pipeline. Lower is better. Refer [9], Fig. 7 for details on experimental design. The data was downloaded from authors’ repository: https://bitbucket.org/chrisarg/rnaseqgamlss. 85 Density Part II Adaptive immunity in prokaryotes 86 Chapter 5 The curious case of prokaryotic adaptive immunity. In vertebrates, adaptive immunity against infectious agents is special [155]. It pro- vides hosts with immediate adaptations to counteract infections, over ecological time scales. This is in direct contrast to any innate defense novelties that may arise through natural selection over much longer evolutionary time scales. To fully appreciate the util- ity of adaptive immunity, one only needs to look back to the pathological complications of Measles morbillivirus or Varicella zoster viral infections, which respectively cause measles [156, 157] and chickenpox [158]. These examples are often quoted and perhaps, the most relatable. In general with higher organisms, adaptive immunity operates in several steps. The first phase involves a rapid combinatorial production and selection of specialized cells that synthesize proteins (antibodies) with high specificity to bind an appropriately pro- cessed target infectious agent. The resulting stably bound complexes activate a series of host innate responses, ultimately clearing host infection. Perhaps owing to the complex nature of the system and the variety of components involved [155, 159, 160], until about a decade ago, it was generally thought only higher organisms like vertebrates entertain adaptive immunity. As far as prokaryotes were concerned, two distinct classes of rel- 87 atively rudimentary immune systems were known to operate, especially against viruses (phages) that invade them. The first makes the host resistant to phage infection. Examples of this type include mutations in the cell surface receptors that prevent phage adsorption (envelope resistance), and the variety of restriction-modification enzymes that recognize and cleave the intracellular phage DNA introduced during a phage infection. The second class induce altruistic cellular suicide of infected hosts, thus limiting the spread of infec- tions to other concomitant hosts. Toxin/anti-Toxin and abortive infection (Abi) systems are examples of the latter. In the late 2000s, a very simple yet highly effective adaptive immune system was documented in natural prokaryotic populations. Owing to their genomic architecture in bacterial DNA, the system was named CRISPR – clustered regularly interspersed palin- dromic repeats [161]. It is instructive to summarize a decade long story behind this dis- covery [162, 163]. The history also serves to emphasize the pivotal role sequencing and bioinformatics played in generating effective, testable hypotheses. The first steps towards the identification of CRISPRs were laid down by a series of papers that documented the presence of roughly equally spaced repeats, interleaved with some spacer DNA in the genomes of a few archaea and bacteria [164–171]. Continued public funding allowed the growth and maintenance of sequencing databases, along with effective algorithms to search them. These tools revealed that the aforementioned spacer DNA that interleave the repeat segments had extensive similarities to seemingly random segments of the ge- nomic DNA of invading phages, among other things [161, 172–175]. Putative genes and promoters upstream of the CRISPR locus were also identified, and relating these find- ings back to the nobel prize winning RNA interference mechanism of gene expression 88 inhibition, several authors speculated that spacer DNA when expressed could function as anti-sense RNAs [176, 177]; when bound to their complementary phage targets, they could induce an RNA interference like pathway resulting ultimately in the destruction of target. By 2010, these speculations turned to biological facts through careful experi- ments [178–183]. But what makes CRISPRs adaptive? Sequencing CRISPR loci from laboratory cultures of CRISPR hosts before and after coevolution with invading phages showed that CRISPR spacer DNA reflect new segments in phage DNA over ecological time scales [178, 180, 184]. Furthermore, host resistance was correlated with the fraction of spacers that match segments in the invading phage genomic DNA [178, 180]. Thus, a very fundamental and highly significant advancement in microbiology was made in the last decade: prokaryotic CRISPR is an effective adaptive resistance mechanism. Our interest in CRISPRs is piqued by another set of observations. With powerful experiments and bioinformatics analyses, various authors have demonstrated that autoim- munity is a major side-effect experienced by CRISPR hosts [185–188]. This is caused because the CRISPR machinery could, with very high error rates, exploit DNA segments derived from the host genome itself as spacer DNA. As a result, CRISPRs are config- ured to consider host DNA as foreign, adversely affecting host health. We analyze the influence of autoimmunity in CRISPR mediated prokaryote-phage coevolution over eco- logical time scales, and discuss some evolutionary implications. This is the subject of chapter 6. 89 Chapter 6 Ecological dynamics of autoimmune CRISPR induced prokaryote- phage coevolution. Prokaryotes have evolved diverse molecular defense systems over billions of years of co-evolution with phages [189, 190]. Clustered Regularly Interspersed Palindromic Repeats (CRISPRs), found in roughly 40% of sequenced bacteria and 90% of archaea, are peculiar in that they confer adaptive immunity against invading phages [183, 191–194]. CRISPR, as a defense mechanism, works via targeted acquisition of 26-72bp fragments (called protospacers) from the target DNA, and subsequently use of acquired fragments (spacers) for target restriction through an RNAi-like mechanism [176, 178]. Acquisition events appear to concentrate around short 2 – 5bp motifs (protospacer adjacent motifs, or PAMs) in the target DNA [180, 183, 195]. CRISPR loci are organized as cassettes in which short repeats interleave spacers, and are located adjacent to highly diverse genes that code for the CRISPR associated protein machinery [183] [187]. Intriguingly, in addition to acquiring phage fragments, CRISPR systems can also acquire spacers from the host genome. This has been experimentally demonstrated in two model systems: first, selective induction of the acquisition machinery (in the ab- 90 sence of interference) in laboratory strains of Escherichia coli resulted in the accumula- tion of a large number of self-targeting spacers [185]; second, abolition of interference activity (and not the acquisition machinery) in wild type Streptococcus thermophilus re- sulted in unbiased acquisitions of self-targeting spacers alongside phage-targeting spac- ers [186]. However, a large-scale survey of CRISPR cassettes in microbial genomes iden- tified that only about 0.4% of the spacers are self-targeting, which, considering the rel- ative size of prokaryotic genomes over phages, suggests some mechanism of selection against self-targeting spacers, perhaps to avoid autoimmunity [186,187,196]. Indeed, di- rected experiments have conclusively shown that self-targeting can result in severe lethal- ity [180, 197–201]. We therefore face a conundrum: how do prokaryotes maintain functional CRISPR systems [202]? Despite the conceptual similarities with restriction-modification systems that avoids autoimmunity by methylating the host genomes’ target restriction sites [203], no analogous genome wide self- vs. non-self-discrimination (SND) mechanism is known for CRISPR systems. In fact, as noted above, the evidence thus far suggests that an effi- cient SND may not exist (The SND mechanism described by Marrafini and Sontheimer explains the evasion of self-destruction of CRISPR locus only and does not confer genome wide protection [204]). But there are other routes to avoiding autoimmunity. Toxin/anti- toxin or abortive infection systems restrict the scope of autoimmunity to infected pop- ulations via infection-induced activation [205]. Indeed, upregulation of CRISPRs upon phage infection has been demonstrated experimentally [206–208]. This makes it possi- ble that the accumulated self-targeting spacers may function as "toxins", which can be activated upon infection. We therefore address the following two questions in this study: 91 1. Does infection-induced activation allow CRISPRs to function as an abortive infec- tion (ABI) system? If so, what is the relative contribution of ABI in determining coevolving host and phage densities? 2. If CRISPR suppression in uninfected host populations is required to avoid host extinction, how strong should this suppression be? Clearly, the answers to these questions depend on key ecological and CRISPR kinetic parameters. For instance, while CRISPRs are highly active against phages in wild type S. thermophilus (a lactic acid bacteria widely used in industrial production of cheese) [180], artificial induction is essential to activate the system in E.coli [209]. To this end, we develop and analyze a dynamical model that integrates prokaryote-phage coevolutionary dynamics, with regulated, infection-induced CRISPR acquisition and in- terference activity. Several models of CRISPR-mediated prokaryote-phage coevolution- ary dynamics have been previously reported [1, 2, 200, 210–213]. While refs. [211–213] account for an abstract CRISPR-associated cost, they do not include the specifics of au- toimmunity kinetics/the regulatory aspect of CRISPRs. The model we develop here is detailed enough to incorporate the adaptive aspects of CRISPR, and general enough to al- low intuitive (analytic) interpretations of the resulting qualitatively distinct steady states. We interrogate the model using simulations and bifurcation analyses, and we find that as a function of key host, ecological, and CRISPR evolutionary parameters, the operational behavior of CRISPRs (and the resulting host densities) decomposes into four qualitatively distinct regimes. In those regimes where CRISPR is advantageous to the host, both re- striction and abortive infection operate; the latter dominates restriction in SND absence. 92 Crucially, CRISPR maintenance is determined by an upper bound on the activation level of CRISPRs in uninfected populations. This critical limit of activation – beyond which host extinction is inevitable – is determined by a simple dimensionless combination of parameters. We compare the current experimental data on CRISPR kinetics with these qualitative observations, which helps to explain the spacer deletion mechanism and ab- sence of CRISPR activity in highly virulent and multi-drug resistant clinical isolates. 6.1 Results 6.1.1 Behavior of a simple prokaryotic immune system with regulated autoimmunity Before proceeding to model the complexity of CRISPR dynamics in general, we start by considering the case of a simple prokaryotic immune system with regulated au- toimmunity. The goal here is to analyze the influences of the regulation, immunity and autoimmunity on the resulting coevolutionary dynamics. Fig 6.1 illustrates a simple coevolutionary model in which the immune system, apart from conferring immunity, also induces autoimmunity that is regulated in a cell state (infected / uninfected) specific manner. Dynamic variables are denoted with Roman letters, and parameters are denoted with Greek symbols. Any parameter associated with production of an item i is denoted as αi and that with its degradation is denoted by γi. Free cells (p), grow exponentially at a rate of αp, under a carrying capacity constraint of Φp. Phages (v) infect free cells to produce infected cells at a rate of αq. Infected cells can lyse to release phages at a rate of γq→v or undergo immunity to become a free cell at a rate 93 1 A δγp→φ 0.8 p γq→p 0.6 α Pp q γq→φ 0.4 0.2 v γq→v 0 0 0.2 0.4 0.6 0.8 1 δ B 10 8 1 0.5 0.8 C 6 Host 0.4 E 0.6 GP→φ Extinction 0.3 0.4 4 0.2 0.2 0.12 0 5 10 15 20 25 0.5 1.5 2.5 3.5 0 0.2 0.4 0.6 0.8 1 δ C 10 2 1.6 HE 8 1.2 0.8 G 6Q φ Complete→ 2 0.4 4 Phage Evasion 1.6 LE 0 10 20 30 40 1.2 τ 2 0.8 2 0.4 1.6 LH 0 0.2 0.4 0.6 0.8 1δ 0 10 20 30 40 1.2 τ 0.8 2 2 0.4 1.6 LL1 1.6 LL2 1.2 1.2 0 10 20 30 τ 0.8 0.8 0.4 0.4 0 10 20 30 40 0 10 20 30 40 τ τ Figure 6.1: Bifurcation analysis of a simple model of a prokaryotic immune system with reg- ulated autoimmunity side effect. (A) p, q and v denote densities of uninfected, infected cells, and phage respectively. q undergoes autoimmunity at a rate of γq→φ , while p undergoes autoimmunity at a suppressed rate determined by δγp→φ . The second figure shows the bifurcation behavior of the free cell densities with respect to the control parameter δ , beyond a certain critical value of which one of the steady states vanishes. Continued on next page. 94 Figure 6.1: Continued from previous page. (B,C) Two-parameter bifurcation diagram revealing coexistence (C) and host extinction (E). Each plot instance is denoted by a tuple where A and B can indicate low (L) or high (H) values or extinction (E) of prokaryote (free–solid, infected–dotted) and phage (dashed) respectively. GP→φ and GQ→φ denote the rescaled free cell autoimmunity rate and infected cell autoimmunity (abortive infection) rates respectively. High values of GQ→φ lead to complete phage evasion. Parameter values αp = 1 hr−1, Φ = 108 cells ml−1, γv = 5 hr−1, αv = 50, α −9q = 5×10 ml phage−1 hr−1. Variable Description Value, Units p,q Cell densities cells ml−1 v Phage density phages ml−1 αp Free cell replication rate hr−1 αq Phage adsorption rate ml phage−1 hr−1 γp→φ , γq→φ , γq→p, γq→v Autoimmunity, Immunity, and Lysis rates hr−1 γ Phage death rate hr−1v αv Phage burst size phages Φp Environmental carrying capacity cells ml−1 δ Scale factor (0≤ δ ≤ 1), determines CRISPR activity in free cells. µv Phage mutation rate per protospacer protospacers−1 Table 6.1: Descriptions of variables and parameters in model 1. Dynamic variables are denoted with Roman letters, and parameters are denoted with Greek symbols. Any parameter associated with production of an item i is denoted as αi and that with its degradation is denoted as γi. Steady state value of an item i will be donted by i∗. of γq→p, or undergo autoimmunity at a rate of γq→φ . Free cells undergo autoimmunity at a suppressed rate of δγp→φ , (0 ≤ δ ≤ 1). Note γp→φ need not necessarily equal γq→φ , for reasons that will become clear later when we discuss the detailed CRISPR model. The condition δ = 0 implies complete repression of autoimmunity in free cells, whereas δ = 1 indicates no difference in repression across the two cell states. The burst size of phages is αv. Phages also die at a rate of γv. Table 6.1 describes the variables and model parameters. 95 The dynamical equations for this model can be written as: ( ) p+q ṗ = αp p 1− −δγΦ p→φ p−αq pv+ γq→pq p q̇ = αq pv− (γq→φ + γ (6.1)q→p + γq→v)q v̇ = αvγq→vq− γvv−αq pv [ ] Measuring all the state variables in units of Φp, and time in units of τ = α Φ −1 q p t, and denoting all the transformed variables and parameters with their corresponding Ro- man alphabets, we obtain: Ṗ = Ap p(1−P−Q)−δGP→φ P−PV +GQ→PQ Q̇ = pv− (GQ→φ +GQ→P +GQ→V )Q (6.2) V̇ = αvGQ→V Q−GVV −PV We can study the influence of regulation (determined by the parameter δ ), immunity and autoimmunity rates (GQ→V ,GQ→φ , and GP→φ ) on the above dynamical system using a bifurcation analysis. These results are summarized in Fig. 6.1. Fig. 6.1A shows that, as a function of δ , two fixed points collide at a critical value of δ (which we denote by δ1), beyond which one of them ceases to exist. Fig. 6.1B shows that in the (δ ,GP→φ ) space, beyond a critical curve that falls roughly as G−1P→φ , hosts go extinct. Fig. 6.1C reveals in the (δ ,GQ→φ ) space, beyond a line of critical points, phages go extinct. Behavior in the (δ ,GQ→P) space is similar. We provide an analytical treatment below. Bifurcations occur when the number of fixed points or their stability properties change in response to a dynamical parameter. Our system can approach three qual- 96 itatively distinct steady states: the first corresponds to host extinction, which we de- note by E∗ = (P∗e ,Q ∗ e ,V ∗ e ) = (0,0,0). The second corresponds to a phage free system, which occurs with pure cultures where phages have not been introduced, or when hosts completely evade phage infection, which we denote by F∗ = (P∗,Q∗ ,V ∗f f f ) = (P ∗ f ,0,0). The third corresponds to the case of prokaryote-phage coexistence, which we denote by C∗ = (P∗c ,Q ∗ ∗ c ,Vc ). In the phage free situation, the system evolves along the curve Ṗ = AP(1−P)− δGP→φ P, towards the fixed point P∗ = 1− δGP→φ f A . Non-extinction/positivity conditionP on this expression reveals a criticality condition on δ for maintenance of hosts carrying our simple immune system in the phage free case: δ < APG = δ . This is preciselyP→φ 1 the curve mapped out in Fig. 6.1B beyond which the hosts go extinct; when δ = δ1, F∗ = E∗, and when δ > δ ∗1, F is infeasible. Hence, as long as the immune system (with an autoimmunity side effect) is suppressed below a critical nondimensional ratio of the free cell reproduction rate to that of its autoimmunity potential, the phage free steady state is feasible. The non-trivial fixed point for the case of coexistence, C∗, is given by: P∗  GVc = αv  G G  −1  Q→φ + Q→P1+︸ GQ︷→︷ V ︸ immune advantage (6.3) ∗ AP(1−P∗c )−δGV P→φc = A P∗ G+ Q→φ+GQ→PP c GQ→V P∗ ∗ Q∗ = c Vc c GQ 97 Here GQ = (GQ→φ +GQ→P +GQ→V ) denotes the overall removal rate of infected cells. In this coexistence regime, the steady state expression for P∗c decomposes into the two parts: steady-state value when the dynamics is phage limiting and the advantage of- fered by the immune system in overcoming phage lysis. This advantage is given by the ratio of the sum of immunity and autoimmunity rates conferred by the immune system in infected cells to that of the phage specific lysis rates. Thus inducing autoimmunity, alongside immunity, in infected cells (abortive infection) is beneficial to the prokaryotic population when coevolving with phages. As is the case with predator-prey models, P∗c is independent of the cell’s own growth rate [214], and is completely determined by the immunity and autoimmunity parameters, along with the phage specific parameters. Fur- thermore, positivity conditions on the steady state values yields the feasibility conditions ∗ for the existence of this steady state: (0 < P∗c < 1), and (0≤ δ δ with δ AP(1−P )< 2) c2 = GP→φ (as V ∗c ≤ 0 otherwise), giving us a tighter constraint on δ for coexistence. Notice that δ2 < δ1. So regardless of the presence or absence of phages, a free cell autoimmunity suppression level of δ < δ1 is required for the population to avoid losing the immune system altogether. When free cells completely repress the immune system (δ = 0), or when there is no autoimmunity (GQ→φ = 0), V ∗c and Q ∗ c achieve their maximum values. As δ → δ2, the values of V ∗ and Q∗c c are reduced progressively. The form of these equillibria implies that by increasing the net autoimmunity rate in free cells, lower net viral abundance is achieved. However, by doing so the range of δ that supports coexistence is narrowed. When δ > δ2, the coexistence steady state C∗ is infeasible, and the system operates in the phage free regime, at which point, the condition δ < δ1 has to be satisfied to avoid host 98 extinction. The bifurcation diagram in Fig. 6.1A maps this behavior: C∗ continues to be stable until δ < δ2, whereas beyond δ2 the otherwise unstable F∗ becomes stable. The stability of the steady states ascertained by the Routh-Horwitz criteria [214]. To analyze the influence of abortive infection on coevolution, we produced a two- parameter bifurcation diagram for the (δ ,GQ→φ ) space (Fig. 6.1C). Two distinct regimes are clear: a coexistence regime, and a regime where hosts evade phages. A third regime corresponding to host extinction also occurs for autoimmunity suppression exceeding the value δ1 (for the parameters in this figure, it occurs along the line δ = 1). The bifurcation diagrams are similar for a variety of other parameter combinations tested. Coexistence occurs for low values of GQ→φ , and are progressively lost as δ is increased. We can trace the line of critical points analytically as follows. Recall that the switch from coexistence to ∗ phage evasion is principally determined by the equality δ δ AP(1−P= = c )2 G . If we let GQ =P→φ (GQ→V + GQ→P + GQ→φ ) and substituting for P∗ GP→φ GV c , we obtain 1− δ A = .P αvGQ→V G −1Q When αvGQ→VG >> 1 , as a function of GQ→φ and δ , this condition spans the line:Q δ GQ→φ + = 1 (6.4) K1 K2 [ ( )] w[here t(he intercep)ts]are given by K = AP 1− GV 1 G+ Q→P1 G α G , and K =P→φ v Q→V 2 G α→ v − 1 G + Q→PQ V G G . For the parameters in Fig. 6.1C, Routh-Horwitz criteria [214]V Q→V reveals that the achieved C∗ values are stable. Beyond this boundary, coexistence is infea- sible, and cells assume a density determined completely by δ , and independent of GQ→φ : P∗ = 1− δAPf G . Clearly, both K1 and K2 are reduced with increasing values of GP→φ Q→P (immune rate), the net effect being reduction of the area under the line resulting in loss 99 of coexistence. To map the influence of immunity, one can similarly establish the critical line determining the boundary of coexistence explicitly as a function of (δ ,GQ→P). In summary, our bifurcation analysis of this simple model (i) reveals the precise regimes for the three possible fates of a prokaryotic immune system with regulated au- toimmunity (complete evasion of phages, coexistence with phages, or extinction) (ii) shows that infected cell autoimmunity (alongside restriction) is beneficial to the prokary- otic population, and (iii) reveals a strict limit on the free cell autoimmunity levels above which host extinction occurs. Perhaps the most characteristic feature of CRISPRs is their adaptive ability for con- tinued novelty resulting from spacer acquisitions and deletions. The model above does not incorporate spacer turnover kinetics or its regulation. Neither does it allow us to explicitly determine the influence of host protospacer levels on the interval of autoimmu- nity regulation 0 ≤ δ < δ1; the larger this window, the higher the cellular tolerance for CRISPRs. We will therefore proceed to incorporate CRISPR specific reactions into the simple model described above. We will show that (i) the simple model arises as a particular limit of a more general model, and (ii) by thwarting the accumulation of self-targeting spacers through an SND (whose existence/absence is hard to ascertain from existing data), and/or through a highly active spacer deletion mechanism, the range of free-cell CRISPR activity levels, δ , is widened. Furthermore, the general model will reveal other idiosyncratic features of CRISPR and its maintenance in populations over ecological time scales. 100 6.1.2 A detailed model for CRISPRs incorporating their adaptive ability and regulation In this section we develop a more detailed model of CRISPR dynamics, which gen- eralizes the simple model discussed above. Our modeling strategy in this section (Fig 6.2 ) is intermediate to models that fix a constant rate of immunity (as in [1]) and agent- based models that describe strain-specific immunity (as in [2]). Briefly, we track spacer accumulations over time and use linear mass action kinetics to model the CRISPR reac- tions and the resulting ecological dynamics due to immunity and autoimmunity. Such an approach offers the computational advantage to model growing populations while simul- taneously accounting for the underlying regulatory dynamics of CRISPR and its kinetics. While this model cannot capture strain-specific behavior, we can nonetheless make qual- itative and even quantitative predictions for the average spacer accumulation kinetics re- sulting from the adaptive nature of CRISPR dynamics. The key variables in this detailed model are described in Table 6.2 and discussed below. We let πv denote the total number of phage protospacers per phage genome. The amount of self-targeting spacers per prokaryotic genome is defined relative to the phage protospacer amount as βπv. Thus β = 0 implies no self-targeting protospacers per prokary- otic genome, which can also be interpreted as the absence of self-targeting protospacers due to the presence of an SND. At any time, both the free and infected cell populations (denoted as p and q respectively) have an associated CRISPR spacer content, the "per- cell" quotas of which are completely specified by ypA,ypI,ypS and yqA,yqI,yqS respectively (table 6.2). Here y·A denotes the active spacer quota per cell (i.e., phage reactive), y·I de- 101 Variable Description Value, Units p Free cell density cells ml−1 q Infected cell density cells ml−1 v Phage density phages ml−1 (ypA,ypI,ypS) Average Active, Inactive, and Self-targeting spacer quota per free cell spacers cell−1 (yqA,yqI,yqS) Average Active, Inactive, and Self-targeting spacer quota per infected cell spacers cell−1 xA Average Active phage protospacer quotea per infected cell protospacers cell−1 αp Free cell replication rate 1 hr−1 α Phage adsorption rate 5×10−9 ml phage−1q hr−1 αv Phage burst size 50 phages γv Phage death rate 5 hr−1 Φp Environmental carrying capacity 108 cells ml−1 αc Acquisition rate of new spacers 10−6 cells hr−1 γ Deletion rate of new spacers varied hr−1c γq→p Immune rate per active spacer per infected cell 1−10−6( spacers −1 −1cell ) hr γq→φ Autoimmunity rate per self-targeting spacer per infected cell varied ( spacers)−1cell hr −1 γq→v Lysis rate of infected cells 1 hr−1 πv Total number of protospacers per phage genome phage−1 β ×πv Total number of self-targeting protospacers per prokaryotic cell. varied protospacers cell−1 δ Scale factor (0≤ δ ≤ 1), determines CRISPR activity in free cells. µv Phage mutation rate per protospacer 30×10−8 protospacers−1 Table 6.2: Description of the different variables used in the detailed model. Dynamic variables are denoted with Roman letters, and parameters are denoted with Greek sym- bols. Any parameter associated with production of an item i is denoted as αi and that with its degradation is denoted as γ ∗i. Steady state value of an item i will be donted by i . Parameter values were obtained from [1, 2]. 102 PPhhaaggee Lysis Infection Free Cell Infected Cell Population Population CRISPR Immunity Auotimmunity Auotimmunity Prokaryotic Population-Specific Spacer types Protospacer types Spacer Active Inactive Self-targeting Phage Host acquisition Figure 6.2: A detailed model of CRISPR dynamics. The infected cell population (and its as- sociated CRISPR spacer content) is created from the growing free cell population (and its cor- responding CRISPR content) through phage infections. The overall CRISPR spacer content in each cell population is abstractly partitioned into active, inactive and self-targeting. Active spac- ers elicit phage restriction, while self-targeting spacers cause cell death (autoimmunity). While both the free and infected cell populations have genomic protospacers that contribute to the cre- ation of self-targeting spacers, only the infected cell population has access to the released phage protospacers for the creation of active spacer content. Continued on next page. 103 Figure 6.2: At any given time, the CRISPR induced rate of immunity for an infected cell is proportional to its per capita quota of active spacer content associated with the population at that time. Similarly, we use the corresponding self-targeting spacer content to define the rates of autoimmunity for both the infected and free cell populations. In our equations, we directly model these per capita quotas. Thus the rates of CRISPR induced immunity and autoimmunity for a cell population are reflective of its associated spacer content at any given time, which in turn is determined by the kinetics of CRISPR and prokaryote-phage interaction. notes the inactive spacer quota per cell (i.e., phage inactive, due to mutations in the cor- responding PAMs in phages) and y·S denotes the self-targeting spacer quota per cell. The average phage protospacer quota per infected cell available for its new spacer acquisitions is denoted by xA. The per capita quotas of the various types of CRISPR spacer content are used to model the rates of acquisition and interference reactions in each subpopulation. Let γq→p be the rate of immunity conferred per active spacer; then at any given time the immunity rate per infected cell is assumed to be γq→pyqA. Similarly, if γq→ denotes the rate of autoimmunity conferred per self-targeting spacer, the autoimmunity rate per infected cell is then γq→φ yqS. To obtain the corresponding term for the free cell population we will first need to model infection-mediated CRISPR activation. As the operonic structure of CRISPR/Cas genes lends itself to regulation based on free/infected cell states [208,209,215–218], we simply scale the rates of all the CRISPR reactions (acquisition, deletion and interference) by δ (0≤ δ ≤ 1), in the free cell popula- tion relative to that of the infected population. So δ = 0 implies that all CRISPR reactions in free cells are switched off whereas δ = 1 implies that there is no differential CRISPR expression between the free and infected cell populations. Note that, only infected cells 104 can acquire novel phage protospacers, while both infected and free cell populations can acquire self-targeting protospacers. The latter events occur when δ > 0. Under these modeling assumptions, the corresponding autoimmunity rate per self-targeting spacer is given by δγq→φ ; this is scaled by the per capita free cell quota of self-targeting spacers to calculate the autoimmunity rate per free cell, δγq→φ ypS. 6.2 Population dynamics. We now describe how the above reactions are coupled with prokaryote-phage co- evolution. Free cells (p) replicate at a rate αp under the constraint imposed by the carrying capacity Φp. Free cells are also produced from infected cells (q) due to immune evasions of phage lysis at a rate of γq→pyqA (as described above). Thus the net rate of infected cells that undergo immunity is given by γq→pyqAq. Phages (v) infect free cells with an adsorption rate constant αq to produce q. In addition, free cells undergo autoimmunity at a rate of γq→φ ypS, which is determined by the amount of self-targeting spacers (ypS) in free cells and the degree of CRISPR activity in free versus infected cells (δ ). Phages with a burst size of αv are produced from lysis of infected cells at rate γq→v and removed a t a rate of γv. q can undergo autoimmunity at a rate of γq→φ yqS, or switch to free cells with rate γq→pyqA. The differential equations are then given as: 105 ( ) ṗ = αp p 1− p+q −δ︸ γp→︷φ︷ypS p︸−αq pv+ γΦp ︸q→︷p︷yqAq︸ autoimmunity immunity q̇ = α︸︷q︷p︸v −(γq→φ yqS + γq→pyqA + γq→v)q (6.5) infections v̇ = α︸ vγ︷q︷→vq︸−γvv−αq pv lysis For convenience in exposition below, we will let Γp = (αqv+δγq→φ ypS) and Γq = (γq→pyqA + γq→φ yqS + γq→v), which denote the overall removal rates of cells in the free and infected populations respectively. 6.3 Spacer and protospacer concents in free and infected cells Fig. 6.3 presents the set of reactions influencing the total spacer and protospacer contents of different types. These give rise to the following derivatives when q(t) =6 0 and p(t) 6= 0. pv ẋA = αq [(1−µv)πv− xA]q pv [ ] ẏqA = αcxA +αq ypA− yq qA − γcyqA pv [ ] ẏqI = αq ypI− yqI − γcyq qI pv [ ] ẏqS = αcβπv +αq yq pS − yqS − γcyqS (6.6) [ ] yqAq [ ]ẏpA = µv ypI− ypA + γq→p yqA− ypA −δγ yp c pA[ ] y ẏ qA q [ ] pI = µv ypA− ypI + γq→p yp qI − ypI −δγcypI y q [ ] ẏpS = δαcβπ qA v + γq→p y − y −δγ yp qS pS c pS 106 Cell & Cell & Cell & Cell Removal CRISPR Kinetic CRISPR Kinetic CRISPR Kinetic Removal Removal Removal q(t)•x (t) Acquisition AcquisitionA q(t)•yqA(t) q(t)•yqI(t) q(t)•yqS(t) Infection Immunity Infection Immunity Infection Immunity Infection Phage p(t)•y (t) Mutations p(t)•y (t) p(t)•y (t) AcquisitionpA pI pS Cell Cell Cell growth growth growth Cell & Cell & Cell & CRISPR Kinetic CRISPR Kinetic CRISPR Kinetic Removal Removal Removal Figure 6.3: Reactions influencing total spacer and protospacer densities. The inflow and out- flow of different species are indicated. The figure shows the reactions influencing the total spacer and protospacer contents at any given time in the population. We use this reaction set to derive the rates of average spacer quota change over time. Squares in the top row correspond to the total protospacer and spacer content in the infected cell population; those in the bottom correspond to those in the free cell population. Note that while we model average spacer quotas this figure illustrate all the reactions that influence total spacer contents. We derive the aforementioned per cell quotas of protospacer (xA), and various spacer contents as follows. Active phage protospacer quota per infected cell Because we track the per capita quotas of protospacer contents per infected cell, any expression for its derivative has to account for the current spacer density in the infected cell population, the influx due to the newly infected cells, weighted by their corresponding population sizes (refer Fig. 6.3). At any time instant t, the total amount (density) of phage protospacers associated with 107 the entire infected population is xA(t)q(t) . The total amount of newly released phage protospacers is given by the product of total amount of infections and the expected amount of native phage protospacers per phage as αq p(t)v(t)× (1−µv)πv . The total amount of protospacers leaving the infected pool is proportional to the removal rate of infected cells and is equal to Γq(t)q(t)xA(t). For any small time interval ∆t then, we can write xA(t+∆t) as: xA(t)q(t)+∆tαq(1−µv)πv p(t)v(t)−∆tΓx t ∆t q (t)q(t)xA(t) A( + ) = (6.7)q(t)+∆tαq p(t)v(t)−∆tΓq(t)q(t) where the denominator is the expected infected cell density at time t+∆t. Thus xA(t+∆t) is precisely the average protospacer content per infected cell at time t +∆t. In a straight- forward fashion, when q(t) 6= 0, we can compute the limit dxA(t)dt = lim xA(t+∆t)−xA(t) ∆t→0 ∆t to obtain our derivative: αq p(t)v(t) [(1−µẋ v )πv− x(t)] A = (6.8)q(t) We now follow a similar procedure to calculate the average spacer contents in the free and infected cell populations. CRISPR spacer quota per infected cell New additions to the active and self-targeting spacer content associated with infected cell population can occur upon infection due to acquisition reactions. In addition, they are also inherited from free cells that are infected. Inactive spacers in the infected cell population, however, can only be inherited. Further- more, we will also account for the removal of spacers due to CRISPR kinetics through a 108 spacer deletion parameter γc . Given the current phage protospacer levels available per infected cell (xA) for spacer acquisitions and the acquisition rate of αc, the total rate of new active spacer acquisitions is computed as αcxA. Similarly, given the current genomic protospacer density of βπvq, the rate of newly acquired self-targeting spacer content is given by αcβπvq . For a given spacer type, the inflow due to inheritance is determined by the amount of infections (αq pv) and the spacer density of that type in the free cell population (e.g., αqvp× ypA for active spacers). Finally, all three spacer types within an infected cell are also removed at a rate proportional to the removal rate of infected cells and spacer deletion. Taken together this results in the following equations for the different spacer contents at time t +∆t: ( ) q(t)yqA(t)+∆tαcxA(t)q(t)+∆tαq p(t)v(t)ypA(t)−∆t Γq + γc q(t)y (t)yqA(t +∆t qA ) = q(t)+∆tαq p(t)v((t)−∆tΓ)q(t)q(t) q(t)yqI(t)+∆tαq p(t)v(t)ypI(t)−∆t Γq + γy t ∆t c q(t)yqI(t) qI( + ) = q(t)+∆tαq p(t)v(t)−∆tΓq(t)q(t) ( ) q(t)yqS(t)+∆tαcβπvq(t)+∆tαq p(t)v(t)ypS(t)−∆t Γq + γc q(t)yqS(t)yqS(t +∆t) = q(t)+∆tαq p(t)v(t)−∆tΓq(t)q(t) (6.9) When q(t) =6 0, we obtain the corresponding derivatives for the variables by com- puting the limits lim yqA(t+∆t)−yqA(t) , lim yqI(t+∆t)−yqI(t) and lim yqS(t+∆t)−yqS(t)∆t→0 ∆t ∆t→0 ∆t ∆t→0 ∆t . CRISPR spacer quota per free cell New additions to the self-targeting spacer content in free cells is determined by the differential activation rate of acquisition in free cells (δαc), and the current available pool of genomic protospacers βπv p . All three spacer types are inherited from the infected cells at a rate proportional to the amount of infected cells undergoing immunity. Further, at a rate determined by per protospacer spacer mu- 109 tation rate µv , mutated phage protospacers can switch to being native (and vice versa); at this rate then, this effect is also reflected in the corresponding CRISPR content as the transition of inactive spacers to active states (and vice versa). For simplicity, we do not consider the difference in the rates of forward and backward mutation rates. Finally, all three spacer types within a free cell replicate at a rate proportional to the effective free cell duplication rate, and are removed at a rate proportional to the removal rate of free cells and spacer deletion (which, as mentioned before, is scaled by the CRISPR activation rate δ ). Taken together this results in the following equations for average spacer contents at time t +∆t (for clarity, we ignore mentioning time dependence explicitly): ( ) ypA p+∆tµvypI p−∆tµvypA p+∆ty2qAq−∆tΓp pypA−∆tδ( γc pypA +) ∆tαp p 1− p+qy t ∆t Φ ypApA( + ) = p+∆tγq→pyqAq−∆tΓp p+∆tαp p 1− p+qΦ ( ) ypA p+∆tµvypA p−∆tµvypI p+∆ty p+qy t ∆t qA qyqI−∆tΓp pypI−∆(tδγc pypI)+∆tαp p 1− Φ ypIpI( + ) = p+∆tγq→pyqAq−∆tΓp p+∆tα p 1− p+qp Φ ( ) ypS p+∆tαcβπv p+∆tγq→pyqAqyqS−∆tΓp pypS−∆tδ(γc pypS +)∆tα p 1− p+q yypS(t +∆t p Φ pI) = p+∆tγq→pyqAq−∆tΓp p+∆tαp p 1− p+qΦ (6.10) When p(t) 6= 0, we obtain the derivatives as lim ypA(t+∆t)−ypA(t) , lim ypI(t+∆t)−ypI(t)∆t→0 ∆t ∆t→0 ∆t and lim ypS(t+∆t)−ypS(t)∆t→0 ∆t . We non-dimensionalize our equations by choosing to measure our cell density vari- ables in units of the carrying capacity Φp, and phage density in units of αvΦp, spacer and protospacer variables in units of the number of native phage protospacers πv, and time in the non-dimensional units of τ = α−1c t (CRISPR evolutionary time scales). This leads to the following set of equations, with effective parameters A α= vαqV α Φp, G γ = q→p c Q→P α πc v, and G γ= q→φQ→φ α πv, while the rest of the rate parameters get scaled by α −1 c . Non-c 110 dimensionalization, apart from reducing the number of parameters in the model, also simplifies analysis of relative parameter sizes. Ṗ = APP(1− (P+Q))+GQ→PYQAQ−δGQ→φYPSP−AV PV Q̇ = AV PV − (GQ→φYQS +GQ→PYQA +GQ→V )Q V̇ G Q−G V − AV= Q→V V PVαv PV ẊA = AV [(1−µv)−XA]Q PV ẎQA = XA +AV [YPA−YQA]−GCYQ QA (6.11) PV ẎQI = AV [YPI−YQ QI ]−GCYQI PV ẎQS = β +AV [YPS−YQS]−GCYQ QS Y Q ẎPA = MV [YPI−Y QA PA]+GQ→P [YP QA −YPA]−δGCYPA Y Q ẎPI = MV [Y QA PA−YPI]+GQ→P [YQI−YPI]−δGCYP PI Y Q ẎPS = δβ G QA + Q→P [YQS−YPS]−δG YP C PS 6.3.1 Simulations and bifurcation analysis. All numerical simulations were performed with Matlab 2013b. Numerical bifurca- tion analyses were performed with XPPAUT (AUTO) [219]. 6.3.2 SND absence is extremely lethal in the absence of regulation In the absence of SND, given the large host genome size relative to that of phage (e.g. E.coli genome is roughly 100x the length of phage λ )and short PAM demarcating 111 protospacers, we expect an abundant host protospacer pool. In our model, this would imply a large host to phage protospacer ratio (β > 1). On the other hand, if SND is present, then its efficiency determines the β value, with higher efficiencies implying lower β values and vice versa. Similarly, the parameter δ determines the activation level of CRISPRs in free cells relative to that of infected cells; thus δ = 0 represents complete repression, and δ = 1 signifies no difference in CRISPR activation between free and infected cell populations. To study the influence of host protospacers levels and regulation on prokaryotic densities, we vary δ and β across a large range of biologically feasible values (Fig. 6.4). Remarkably, as we observed in the case of our simple model, the steady state prokaryotic densities show a sharp, threshold-like behavior as a function of the degree of CRISPR regulation δ : hosts switch from maximal densities to complete extinction as the degree of free-cell CRISPR activity, δ , increases (Fig. 6.4A). Even in the case of comparable levels of host and phage protospacer (β = 1), greatly reduced levels of activation in free versus infected cells (δ < 0.01) are required to guarantee host existence. While this tight window of prokaryotic existence is relaxed slightly at lower host protospacer levels, these results indicate that tight regulatory control is necessary for a wide range of host protospacer levels. It is therefore clear that the presence or absence of an SND is a crucial determinant of CRISPR maintenance in populations. Fig. 6.4B shows the time course of several typical simulations for various (β ,δ ) combinations, to illustrate the effects of these two key parameters on intracellular spacer contents. For a wide range of parameters and initial conditions we find that the system approaches a steady state. 112 6.3.3 A simple constraint determines CRISPR maintenance in the model We now work to derive an analytical understanding of the critical limit on δ (de- noted by δ1) that permits population survival. As in the simplified model, exact conditions for the threshold-like behavior of the system in the δ and β space can be obtained by con- sidering the phage free system, in which case, the full system reduces to: Ṗ = APP(1−P)−δGQ→φYPSP ẎPA = MV [YPI−YPA]−δGCYPA (6.12) ẎPI = MV [YPA−YPI]−δGCYPI ẎPS = δβ −δGCYPS These values give rise to the following fixed point: {P∗ = 1− δGQ→φYPSA ,Y ∗ PA =P 0,Y ∗PI = 0,Y ∗ β PS = G }. In the absence of any feedback from infections, and in the presenceC of an active spacer deletion mechanism, the active and inactive spacer contents are pro- gressively lost from the population. The influence of CRISPR induced autoimmunity on free cell density is manifest in the steady state expression for free cells. For a population to not completely lose their CRISPR activity, the condition P∗ > 0 must be satisfied. This leads us to the condition required for sufficient suppression of CRISPR in free cells: AP APGδ C< ∗ = , (6.13)GQ→φYPS GQ→φ β For values of δ exceeding this upper bound, the system goes extinct. The same constraint holds for a system with phage, as non-negativity of the net cellular growth rate 113 is essential to avoid the only steady state of extinction. Note that, in the presence of a perfect SND, β = 1 and so the constraint on δ is effectively removed altogether. But in the absence of such a mechanism (β > 0), the internal steady state level of self-targeting spacers determines an upper limit on the free-cell CRISPR activity, δ . The role of another crucial parameter is also apparent from this analysis: the spacer deletion rate. High spacer deletions can effectively remove self-targeting spacer accumu- lations, thus suppressing autoimmunity. So in addition to CRISPR regulation, the spacer deletion rate can also be increased to maintain CRISPR+ hosts in a population with larger host protospacer levels. We will use simulations below to determine how large this rate should be relative to the spacer acquisition rate. 6.3.4 Coevolutionary dynamics under the assumption of equilibrated spacer levels over CRISPR evolutionary time scales For a wide variety of parameters and initial conditions tested, we found that the sys- tem converged to steady states (see Fig. 6.4B for an example). Let (Y ∗ ∗QA,YQS,Y ∗ PS) denote the resulting steady state levels of intracellular spacer contents over CRISPR evolutionary time scales. These can then determine fixed rates of immunity (G Y ∗Q→P QA) and autoim- munity (G Y ∗ ,G Y ∗Q→φ QS P→φ PS). To do so, we use the simplified model shown in Fig. 6.1, which replaces all immunity and autoimmunity rates (which were originally functions of the spacer variables) by fixed rate constants. In such a limit, a thorough analysis of the co- evolutionary dynamics is feasible. These results indicate that as long as the constraint on δ is met and the steady state intracellular levels of self-targeting spacers in infected cells 114 is non-zero, CRISPRs can exploit the abortive infection strategy alongside restriction. In the absence of SND, by contrast, the levels of self-targeting spacers will be much higher than phage reactive spacers. Under these conditions, the model predicts that CRISPRs will function principally as an abortive infection system. We stress that we are not considering the situation that individual spacer sequences themselves are fixed in the population, but rather, the total number of them. 6.3.5 Four characteristic regimes of CRISPR activity Given the importance of the dimensionless parameters {δ ,β ,GC} in determining the evolutionary maintenance of CRISPR+ hosts, we now focus on understanding the influence of these parameters on the general model. Free cell densities in the {β ,GC} space for a given value of δ reveal a characteristic four-regime pattern. Fig. 6.5 shows the free cell densities achieved (first column) and phage densities (second column) for various values of (β ,GC) values under two cases of δ : δ = 10−2 and δ = 10−4. Regime I occurs at low β and very high GC values. Here both free cells and phages coexist; while the former assume significantly low levels (but never extinct), the latter achieve their highest densities. Regime II occurs at low β and low GC values. Here hosts achieve their highest densities driving phage densities to very low values, if not extinction. In regime III, which occurs at high but still plausible β values, host extinction occurs. Regime IV is an extension of regime II’s behavior, but at high GC and high β values. Hints to explain the existence of these four qualitative regimes, and their bound- aries, are provided by the corresponding intracellular steady state spacer levels and the 115 constraint on δ we derived in the previous section. As we proceed to higher β values, the active spacer levels decrease and self-targeting spacer levels increase (see for exam- ple Fig. 6.4B). Higher β values lead to larger steady state levels of self-targeting spacers, effectively increasing the autoimmunity rate of infected cells. This inhibits immune me- diated feedback of active spacers to the free cell population (through inheritance) and causes a reduction in the overall active spacer levels. Self-targeting spacers, on the other hand, can be independently acquired in free cells at a rate determined by δ . According to this basic intuition, we can now derive rough conditions for falling in each of the four qualitative regimes. (Regime I) At high GC values (GC→ ∞) CRISPR cassettes are empty and the im- munity and autoimmunity reactions are overwhelmed by phage lysis. Under these con- ditions, both the steady state spacer levels and their derivatives become zero, making G ∗ ∗ the factor Q→P YQA+GQ→φYQS G = 0 , resulting in no net growth advantage to CRISPR hostsQ→V (compare to steady state of the simple model). In this regime, the coevolutionary dy- namics is phage limiting, resulting in steady state free cell levels of GVα −1 in terms of thev simple model. (Regime II) At lower GC values, and when the existence condition on δ is satisfied, both immunity and autoimmunity operate, allowing prokaryotes to evade phage lysis at significant rates. In this regime, phages are driven to very low densities or extinction. (Regime III) At lower GC values, progressing to higher β values increases steady-state levels of self-targeting spacers, thereby increasing the risk of not satisfying the constraint on δ . In such cases, regime III operates for all higher values of β , and ex- tinction is inevitable. (Regime IV) This regime operates in the region where high levels of β are matched by corresponding high GC values that are sufficient to reduce self-targeting 116 spacer levels so as to satisfy the δ constraint. In this regime, host extinction occurs. Here no active spacer mediated immunity occurs, but CRISPRs transform to a full-fledged abortive infection system. When δ = 0, regime III does not occur, and regime IV ex- tends into regime III. Thus the boundaries between regimes I and II, IV can be mapped G ∗ ∗ by Q→P YQA+GQ→φYQS G = 0, and that between II, IV and III can be mapped by the criticalQ→V condition on δ . 6.3.6 Elimination of abortive infection improves coexistence of phages To study how ABI influences the coevolutionary dynamics in the general model, we remove the autoimmunity term from the model and compare the resulting prokaryotic and phage densities across several host protospacer and CRISPR activation levels (Fig. 6.6). We find that while removing ABI in infected cells increases the size of the coexistence regime and allows for improved phage densities. Indeed, this is the same effect predicted by our bifurcation analysis of the simplified model, where lower abortive infection rates lead to increased coexistence owing to higher phage turnover. 6.4 Discussion A handful of prokaryote-phage experimental systems for studying CRISPR dynam- ics have been established. However, the extreme diversity of CRISPRs [190] makes it dif- ficult to draw broad conclusions from any one biological model system. Computational models, which allow exploration over a wide range of feasible parameters, provide an attractive alternative. 117 In this work, we analyzed the influence of infection-induced activation of CRISPRs and their autoimmunity side effect on prokaryote-phage coevolutionary dynamics. Our model integrates the classical ingredients of the prokaryotic CRISPR immune system, along with aspects of regulation and autoimmunity. Our analysis suggests that CRISPRs exploit both restriction and abortive infection. Moreover, we identified a key constraint that determines the growth advantage associated with CRISPRs as a prokaryotic immune system. As summarized in Fig. 6.7, our model reveals a characteristic four-regime pattern determined principally by three effective parameters: the activation level of CRISPRs in uninfected population, the host to phage protospacer ratio, and spacer deletion to acqui- sition rate ratio in CRISPRs. In the presence of SND, the host to phage protospacer ratio is close to zero, and CRISPRs operate exclusively by exploiting restriction, while in the absence of SND, they tend to principally exploit the abortive infection route. Several previous models have also studied CRISPR associated fitness costs, al- though as abstract functions. Nevertheless, these models reproduce and help to explain some of the key experimental and comparative genomics findings on CRISPRs. Levin and colleagues exploited classical density dependent ecological models to numerically analyze the invasion of costly CRISPR genotypes in the presence of innate (envelope) re- sistance and conjugative plasmids [1, 200, 220], and showed that selection due to contin- uous phage exposure and absence of less costly resistance mechanisms improve CRISPR maintenance in the population. Similar in spirit, Gandon and Vale make general discus- sions based on their analysis of general epidemiological models on the evolution of a CRISPR-like resistance mechanism, when the side effect associated is that of beneficial horizontal gene transfer impedance [213]. Childs et al., established a multiscale agent- 118 based simulation model to characterize CRISPR spacer and viral diversity during coevo- lution, and conclude that population dynamics is more sensitive to spacer acquisition rates than interference rates [2]. Weinberger et al., derive a critical threshold on CRISPR asso- ciated cost as a function of coevolving viral diversity, innate resistance and spacer acqui- sition rate and conclude that high viral diversity selects against CRISPRs [212]. Iranzo et al., used numerical simulations of a general agent based simulation model that addition- ally accounted for CRISPR loss and horizontal transfer, to exhaustively study CRISPR maintenance as a function of various kinetic parameters in their model [211]. They also concluded that CRISPR loss is encouraged at high prokaryote/phage population sizes. Our analyses complement these studies summarized above, and they advance our understanding of CRISPR mechanisms in general. We have delineated the precise condi- tions under which CRISPRs can be lost even at low viral diversities. The level of com- plexity in our model, intermediate to previous simulations of agent-based models and models requiring radical simplifications and that do not account for the adaptive nature of CRISPR kinetics, provides an opportunity for mathematical analysis and intuitive under- standing of the results. We have presented an analytical treatment of a particular limit of our model (which empirically hold for wide parameter regimes), summarizing qualitative behavior of the CRISPR system as a function of the underlying parameters. It is also worthwhile to re-examine previous experimental and bioinformatic studies of CRISPRs, in light of the insights gained from our modeling analyses. We found that for CRISPRs to be maintained in a population, free-cell CRISPR activity must be sufficiently suppressed. This upper bound on free-cell activity is determined by a nondimensional ratio of free cell growth rate to that of its autoimmunity potential due to the accumulated 119 self-targeting spacers. An immediate consequence is that CRISPRs are likely to be lost from populations or cell types with reduced growth rates. This result helps to explain well-known empirical trends. For example, in general it is known that drug resistance or virulence is associated with moderate to high fitness costs; under these conditions cells often assume low growth rates [221]. According to our model, then, such strains should lack functional CRISPR elements, as has been confirmed for multi-drug resistant Escherichia coli [222]and for highly virulent Francisella sp. [223]. Furthermore, clini- cal isolates of Pseudomonas aeruginosa lack CRISPR resistance despite crRNA expres- sion, and several virulent clinical isolates of pathogenic Vibrio parahaemolyticus [224], Shigella [225], pathogenic Clostridium jejuni [226] and Mycoplasma gallisepticum [227] seem to lack CRISPR resistance. While these studies have suggested a causal role played by CRISPR inactivity in the gain of virulence of clinical isolates, we propose an alter- native mechanism: reduced growth rate in virulent strains induces selection for reduced CRISPR activity. Under the assumptions of our model we can make approximate quantitative state- ments about the kinetic parameters underlying CRISPR function. In the absence of SND, our results suggest that CRISPRs can be maintained in a prokaryotic population only under high repression in free cells and/or high deletion rates (> 102 times the spacer ac- quisition rate in the absence of complete repression, as obtained in Fig. 6.5). But while high repression is possible through crosstalk with specialized pathways that detect phage invasion or foreign DNA element, as is often the case with toxin/anti-toxin or abortive infection systems [208,215,216,216–218,228], how can such high deletion to acquisition ratio be achieved? One possibility is a spacer deletion mechanism [180–182,229,230] but 120 we still lack sufficient biochemical characterization of this process. Our model assumed that the spacer deletion system is coupled with the rest of the CRISPR machinery, because it is likely that such a system must be expressed from the same operon as the rest of the CRISPR genes. We tested two hypothetical deletion systems that relax the requirement for high spacer deletion rates (Fig 6.8). The first is constitutively expressed regardless of the cell state. The second is regulated in a direction opposite to that of the rest of the CRISPR machinery – it is repressed when infected, and fully activated when uninfected. The reason these strategies work is because of the fundamental reduction they produce in the steady state expressions of the self-targeting spacers. Notice however that neither of these alterations guarantee CRISPR maintenance for arbitrarily large host protospacer levels. They still must respect the required constraint of reduced CRISPR activity in free cells. A thorough biochemical characterization of the spacer deletion mechanism is re- quired for advancing our understanding of CRISPRs. Stern et al. [187], in their large scale survey of CRISPR cassettes in microbial genomes, remarked that deactivated self- targeting spacers are found throughout the CRISPR array. This is in contrast to experi- mental conclusions that, in most systems, more recent acquisitions appear in the leader proximal end [181, 182, 229, 230]. In fact, Stern et al. found that self-targeting spacers with no signs of deactivation were limited to the leader proximal end, indicating that their acquisition followed immediate lethality. It is therefore tempting to suggest that the spacer deletion machinery was likely impaired, resulting in continued acquisitions alongside ad- vantageous coevolving phage targeting spacers; and the continued selection pressure to evade self-targeting activity but retain phage targeting activity persisted and selected for 121 loss-of-function mutations in the self-targeting spacers. While this manuscript was in re- view, Levy et al., demonstrated that artificially induced CRISPR systems in laboratory populations of E.coli tend to exploit degradation products from the enzyme RecBCD, which processes double strand breaks resulting from replicating DNA and through the processing of exposed linear phage genomes after infection [231]. Because this bias re- duces the effective number of self-targeting spacer acquisitions, this can be seen as a po- tential self- vs. non-self detection mechanism resulting in a relaxed constraint on CRISPR regulation. It is however crucial that the spacer deletion system is still in check so as to avoid the loss of effective antiviral spacers, thereby encouraging CRISPR maintenance in the population. The rapidly growing empirical literature on CRISPR molecular and cellular biology will surely suggest further refinements to our model. Several avenues for model improve- ment are already apparent. First, the impact of the most commonly occurring alternative resistance mechanisms (such as envelope resistance) in laboratory populations was ne- glected. Second, our activation model where all CRISPR reactions are scaled uniformly in free cells is simplistic, as differences in activation levels among the acquisition and interference genes may occur. Third, assignment of equal autoimmunity rate constants for all the genomic protospacers is a rough approximation and it is known that the genetic sequences vary in their essentiality. Fourth, the current analytic cannot describe multiple CRISPR genotypes with diverse spacer configurations, in contrast to agent-based mod- els [2, 212]. While we have presented some theory for explaining the maintenance of altruistic CRISPR hosts over ecological time scales, a clear characterization of factors determining their long-term evolutionary stability in well-mixed conditions continues to 122 be an open question (ref. Appendix B). Nevertheless, despite these simplifications, our analysis clarifies the effects of CRISPR autoimmunity in a general setting – a problem that is difficult to address experimentally, due to the lethality of self-targeting. 123 A 1 β=0.1 0.8 β=1β=10 0.6 P 0.4 0.2 0 −5 −4 −3 −2 −1 0 log10δ B 0.02 P 80 YQA 80 YPA Q YQI YPI 0.01 V 40 Y Y β=100, δ=1QS 40 PS x 10−3 1.2 80 4 0.8 2 β=0.01, δ=140 0.4 −3 0.02 x 10 0.8 8 0.01 β=100, δ=0 0.4 4 x 10−3 x 10−3 0.8 3 3 0.4 β=0.01, δ=0 1 1 2 6 10 2 6 10 2 6 10 x 10−3 0.8 1.2 8 0.8 β=0.01, δ=0.01 0.4 4 0.4 0 400 800 0 400 800 0 400 800 τ τ τ Figure 6.4: SND absence is lethal due to accumulation of self-targeting spacers. (A) A sharp threshold-like behavior is observed with steady state prokaryotic densities in the (δ ,β )space. Without a sufficient amount of CRISPR suppression in free cells, determined by δ , cells go extinct. (B) Time course trajectories of the species and spacer variables for several parameter settings. In the absence of strong regulation of auto-immunity, high host protospacer levels are extremely toxic and cause population extinction. 124 A 6 0.030.8 6 I 4 IV 0.6 4 0.02 0.4 2 2 0.01 0.2 II III 0 0 0 −4 −2 0 2 4 −4 −2 0 2 4 log10 β log10 β B 1 6 6 0.03 I 4 IV 4 0.020.5 2 2 0.01 II III 0 0 0 0 −4 −2 0 2 4 −4 −2 0 2 4 log10 β log10 β Figure 6.5: The (δ ,β ,GC) space. We plot steady state free cell densities (the first column) and the phage densities (the second column) for various values of (β ,GC) values under two cases: (A) moderately suppressed free-cell CRISPR activity, δ = 10−2 and (B) strongly suppressed free- cell CRISPR activity, δ = 10−4. GC is the dimensionless parameter indicating the ratio of spacer deletion rate to spacer acquisition rate. β is a dimensionless parameter indicating the ratio of host to phage protospacer levels. (Regime I) Very high GC values effectively reduce CRISPR content to very low levels (phage lysis rates are relatively overwhelming) offering no immune advantage to the hosts, resulting in free cell levels of GQ→VA . (Regime II) Both abortive infection and immunityV operate with the available intracellular steady state levels of active and self-targeting spacers. (Regime III) The constraint on δ is not satisfied and the hosts are extinct. (Regime IV) CRISPRs behave as full-fledged abortive infection systems exploiting only the accumulated self-targeting spacers, with phage reactive spacers eliminated due to high GC values. 125 log10 GC log10 GC A β=1 β=1000 1 1 0.8 0.8 DefaultWithout Abi 0.6 0.6 P 0.4 0.4 0.2 0.2 0 0 −6 −4 −2 0 −6 −4 −2 0 B x 10−4 x 10−5 6 4 5 3 4 V 3 2 2 1 1 0 0 −6 −4 −2 0 −6 −4 −2 0 log10δ log10δ Figure 6.6: Elimination of ABI allows for improved phage densities. Steady state densities of free cells (A) and phages (B) for various values of free-cell CRISPR activity, δ . Both coexistence and phage densities are improved without ABI. Above a critical value of δ , the system goes to extinction. 126 Figure 6.7: Qualitative behavior of regulated CRISPR modules. Depending on the activation level of CRISPR activity in free cells (δ ), the host to phage protospacer ratio (β ), and the CRISPR specific spacer deletion to acquisition rate ratio (GC), regulated CRISPR cassettes can fall in one of the four regimes: no advantage (regime I), advantageous to hosts by offering immune resistance and abortive infections (regime II and IV), or causing host extinction (regime III). Because per- spacer immune rates have been experimentally measured to be high, we do not study its influence specifically here. When CRISPR activity is completely repressed in free cells (δ = 0), regime III vanishes, and regime IV expands into its place. Notice that a low β value corresponds to efficient SND during acquisition process. 127 A CRISPR/Cas active spacer inactive Infection self-targeting coding genes Delete B 6 6 0.03 0.8 4 0.6 4 0.02 0.4 2 2 0.01 0.2 0 0 0 0 −5 0 5 −5 0 5 log10 β log10 β C 6 0.030.8 6 4 0.6 4 0.02 0.4 2 2 0.01 0.2 0 0 0 0 −5 0 5 −5 0 5 log10 β log10 β Figure 6.8: Decoupled behavior of a spacer deletion system. (A) A schematic of the decoupled model of CRISPR regulation. Arrow indicates activation, and a blunt arrow indicates repression. The dashed arrow can be active (suppression when infected) or inactive (constitutive expression). We plot the steady state free cell densities (the first column) and the corresponding phage densities (the second column) for various values of (β ,GC) values at δ = 10−4. Comparison with Fig. 6.5B illustrates that decoupled spacer deletion systems as in (B) no regulation or (C) regulation in a direction opposite to that of the rest of the CRISPR system can tolerate higher host protospacer levels without requiring extremely high GC values. Note that log10 β = 2 corresponds to 100× the corresponding phage protospacer levels, a realistic condition in the case of E.coli vs. phage λ , where the expected number of host protospacers is a hundred fold. 128 log10 GC log 10G C log G log10 C 10G C Part III Appendix 129 Chapter 7 Ecological equivalence as a modeling strategy for metagenomic count data. In chapter 4 of part I, we mentioned statistical inference of taxas (OTUs) observed in large-scale 16S metagenomic surveys is of considerable biological interest. The large number of taxa thus discovered (albeit with only a few dominating/abundant ones) and excess zeroes in the taxa count distributions, however, make it a challenge for performing statistical analyses. In this appendix, we present a strategy that aims to mitigate these issues by aggre- gating counts of carefully chosen taxa that behave similarly to latent ecological factors and environmental processes. It is well known that the relative abundances of such eco- logically equivalent/nearly-equivalent species are not necessarily influenced by changes in environmental conditions across local and regional scales, but their summed total abun- dance, however, is [232–235]. To this end, we aim to cluster the 16S metagenomic taxa into equivalence classes, and create a reduced dataset that presents these clusters of taxa as new units of analysis interest (termed "Equivalence Class Units") and their summed counts as new measurements across observations in an experiment. We suggest 130 a Bayesian nonparametric model of ecological equivalence, and establish posterior infer- ence algorithms, for inferring equivalent classes of metagenomic 16s features, which are simply clusters of OTUs. Interesting prior probability distributions also appear which al- low for both unknown number of ECUs in a dataset, and known relationships among the species to be clustered (example, a taxonomic tree). Our approach is applicable to datasets with few thousands of species, and we demon- strate these ideas with metagenomic data arising from a few simple ecosystems. While several clusters of taxa showed significant enrichment of taxonomic identities there were also many clusters that did not demonstrate this behavior suggesting cross-taxonomic equivalence in these ecosystems. Examples illustrating the coherence of the clusters in terms of reflecting known biology are indicated. We present the models and derive the inference algorithms, before presenting these preliminary results with publicly available experimental datasets. 7.1 Model We consider metagenomic features (OTUs) i = 1 . . . p, measured across different experimental conditions g = 1 . . .G, in samples s = 1 . . .Ng. Here Ng is the number of samples in group g. For now, we shall assume the number of equivalence classes (the OTU clusters we want to infer) to be fixed to K, and let k = 1 . . .K index clusters. Z is a p length vector where each entry Zi indicates which equivalence class k ∈ {1 . . .K}, feature i is a member of. Given a configuration of Z, Xgsk j will denote the metagenomic count of the jth feature in cluster k in sample s from experimental condition g. We let 131 τgs Class proportion in group g Y ψ αψgsK gK Class indicators αφ T η αηgsk p φK Zi Xgskj nk K αθ Ng G Figure 7.1: A plate model illustration of the proposed generative process underlying metagenomic counts. The entire process is specific to every group g with Ng samples and p metagenomic features (OTUs). The total number of classes is fixed at K. The total number of OTUs in an equivalence class k ∈ 1, . . .K is given by nk. Orange nodes indicate observed data, and the blue and green nodes indicate our target variables, the posterior of which is needed. The equivalence classes built are conditional on the available data, and with respect to the distinct groups a researcher has. nk = ∑i I[Zi==k], where I is the indicator variable, be the total number of features assigned to equivalence class k. The abundance count corresponding to any given equivalence class k is simply defined as the summed total count of all features assigned to the equivalence class according to the configuration Z. Specifically, Ygsk = ∑i:Zi=k Xgski denotes the count of the kth equivalence class in sample s from experimental condition g. We will use the · notation to denote vectorized quantities. So for instance, Ygs· is a K length vector desribing the net count of K different equivalence classes in sample s from experimental condition g. We will also use represents a product / element-wise product (which will be clear from the context). 132 With this notation, we now present the baseline Bayesian hierarchical model below in 7.1, and illustrate it in Fig. 7.1. Our goal is to infer Z. We leave the prior on Z unspec- ified for now. We shall first derive the likelihood of the data, conditioned on the cluster assignments and the parameters. We then consider two distinct priors in the subsequent sections, and derive the resulting posteriors in each case. Z |αθi . . .∼ p(Zi|αθ , . . .)← prior on equivalence class assignments for each OTU ψ |K,αψg· ∼ Dirichlet(αψ 1K)← prior on relative abundances of K equivalence classes Ygs·|ψg· ∼Multinomial(τs,ψgK)← total summed abundance of equivalent species. ηgsk|Z,αη ∼ Dirichlet(αη 1nk)← models drift Xgsk·|Z,ηgsk,Ygsk ∼Multinomial(Ygsk,ψgk×ηgsk)← observed data (7.1) For convenience, we describe the dimensions of the variables in the model above. Z is a p length vector, each entry i holding the equivalence class of feature i. ψg· is K length equivalence class proportions vector, where the kth entry describes the proportion of equivalence class k of 1 . . .K classes. Ygs· is a K length vector describing the count of each of the K equivalence classes in a sample s from condition g. ηgsk is an nk length drift proportions vector where each jth entry models assigns a proportion for the jth member feature of cluster k in sample s from condition g. Finally, Xgsk· is an nk length vector describing the count of all member features of cluster k in sample s from condition g. 133 7.2 Data likelihood We now derive the likelihood p(X |Z,ψ,η). First, we shall restrict ourselves to describing the conditional likelihood of the fea- ture count data in a single sample s from an experimental condition g. The key is to first observe from the last line in the generative process 7.1 that, the feature count data for any given equivalence class (cluster) k in sample s from condition g follows: Xgsk·|Z,ψ,η ∼Multinomial(τgs,ψgk×ηgsk) (7.2) So the entire feature count data for a given sample s in group g has the conditional distri- bution: Xgs··|Z,ψ,η ∼Multinomial(τgs,ψg·⊗ηgsk) (7.3) Here ⊗ is used to denote a Hadamard product as follows: each entry of the K length equivalence class proportions vector ψg·, ψgk multiplies the corresponding nk length drift proportion vector ηgsk. The result is a p length vector of feature proportions that describe the average relative abundance of the feature count data in sample gs. 134 Writing eqn. 7.3 explicitly, we observe: τ ! p ( ) p(Xgs··|Z gs X ,ψ,η) = gsk j ∏K ∏n ψ η k k=1 j=1 X ∏ gk gsk j gsk j! j=1 τ ! K ngs k ( ) = ∏Y ! ∏ ψ η Xgsk jK Y nk gsk gk gsk j∏k=1 gsk!∏ j=(1 Xgsk j! k=1 j=1τ K ngs! )∏ ψ Ygsk Ygsk! k ( )X= gsk j∏K gk ηk=1Ygsk! k=1( ) ∏ nk j=1 Xgsk j! ∏ gsk j j=1 τ ! K ngs Y k = K ∏ ψ gskgk ∏ Multinomial(XY ! gsk·|Ygsk,Z,ηgsk)∏k=1 gsk k=1 j=1 In the above derivation, Y = ∑nkgsk j=1 Xgsk j, where the entries in Xgsk j are organized based on the equivalence class membership vector Z. Now, in a straightforward fashion, we can integrate out ηgsk yielding the likeilihood distribution per sample: ∫ τ ! K ( ) nk p(X |Z,ψ,αη gs Y) = ∏ ψ gskgs·· K gk ∏ Multinomial(Xgsk·|Y ηY ! gsk,Z,ηgsk)p(ηgsk|α )dηgskηgsk ∏k=1 gsk τgs! K (k=1) ∫ j=1n∏ ψ Y k= gskK [ gk ∏ Multinomia]l(Xgsk·|Ygsk,Z,ηgsk)p(η |α η gsk )dηY ! gsk∏k=1 gsk k=1 ηgsk j=1 τ Kgs! ( )Y = gsk η ∏K ψ DM(X |Y Y ∏ gk gsk gsk,α )k=1 gsk! k=1 where DM(X ηgsk·|Ygsk,α ) represents a Dirichlet-Multinomial distribution of vector Xgsk with nk features of total count Ygsk, and concentration parameter αη 1nk . Here 1nk represents an nk length vector of 1s. The condition data likelihood, for all independent samples from all experimental 135 conditions g = 1 . . .G, is then given as: G Ng K ( ) p(X |Z Y,ψ,αη) ∝ ∏ ∏ ∏ ψ gsk ηgk DM(Xgsk|Ygsk,α ) g=1 s=1 k=1 G K N Ng g = ∏ ∏ (ψk)∑s=1 Ysk ∏DM(X ηsk|Ysk,α ) (7.4) g=1 k=1 s=1 G K ( ) N Ng ∏ ∏ ψ ∑ g = s=1 Ygsk gk ∏ DM(X ηgsk·|Ygsk,α ) g=1 k=1 s=1, Ygsk>0 The last line arises because Ygsk = 0 =⇒ Xgsk j = 0 ∀ j = 1 . . .nk, and therefore, the corresponding Dirichlet-Multinomial evaluates to 1. This is seen easily using the Gamma representation of a Dirichlet Multinomial distribution. 7.3 Posteriors for ψ and Z We shall now derive the posterior for the two classes of variables of interest. The first class of variables is the equivalence class proportions vector for each experimental group g, ψg·, whose posterior is based on a standard Dirichlet-Multinomial hierarchy. The second variable of interest is the equivalence class membership vector Z, whose posterior is based on priors derived from relational data among the metagenomic features to be clustered. Additionally, we present the Z posteriors in both cases of fixed and unknown number of clusters (K). 7.3.1 Conditional posterior for ψ While the cluster proportions vector ψ can in principle also be integrated out, our inferential interest centers also in the posterior estimate of ψ . We use the likelihood 136 prescribed by eqn. (7.4) to derive the conditional posterior of ψg· given the rest of the variables and parameters. p(ψg·|Z,X , ...) ∝ [p(X |Z,ψ, ...)p(ψ|αψ) ] G K ( ) N Ng ∝ ∏ ∏ ψ ∑ g s=1 Ygsk gk ∏ DM(Xgsk|Y η ψgsk,α ) p(ψ|α ) g=(1 k=1) s=1, Ygsk>0K Ng ∝ ∏ ψ ∑s=1 Ygskgk p(ψ|αψ) k=1( ) (7.5)K Ng Y αψ∝ ∏ ψ ∑s 1 gsk ψ k −1=gk gk k=1 K ( ) Ng ψ ∝ ∏ ψ ∑s=1 Ygsk+αk −1gk k=1 ≡ Dirichlet(ψ ψg·|(α 1nk)+Yg··) where the last line describes the parameter of the posterior Dirichlet distribution as an nk length vector of αψ added to the total count of each of the K components across all samples s from group g. Thus, we arrive at the posterior of one of the two target variables we are interested. We next derive the posterior for the equivalence class feature assignments vector Z. 7.3.2 Conditional posteriors for Z To derive the posterior for Z, we will first need to specify its prior distribution. We take two routes below. The first is a standard move in Bayesian hierarchical clustering based on a Dirichlet distribution for component membership probabilities. The second is a more natural route for modeling metagenomic microbial features, a prior based on 137 taxonomic trees. It reflects the prior belief that metagenomic features belonging to similar taxonomic categories should are part of the same equivalence class. In each of the above two cases, we first provide the posterior when the number of equivalence classes in the data K can be considered as known, fixed quantity. We then provide a non-parametric extension, which assumes a small, but unknown number of clusters in the data. This is based on the theory of infinite mixture models [236]. Case 1: Classic Dirichlet priors Known K Suppose in our baseline model (7.1), we assume the following prior distribu- tion for p(Z| . . .). φ |αφK ,K ∼ Dirichlet(αφ 1K) (7.6) Zi|φK ∼Multinomial(φK,n = 1) ∀i = 1 . . . p which leads to the conditional prior: n +αφ p(Zi = k|Z αφ k\i , ) = k\i ∑ αφ (7.7) k k + p−1 The posterior for Zi is then given using standard calculations as: [ ] p(Zi|Z\i,X ,ψ, . . .)∼ p(X |Z, . . .) n φk\i +α (7.8) where the likelihood term is given by (7.4), and Z\i denotes the membership vector of all features except the ith feature whose membership gets sampled with the above 138 posterior. Similarly, nk\i denotes the number of features assigned to clsuter k except the ith feature. We have also abused the notation above by mentioning αφk to indicate the kth component of the Dirichlet prior parameter, which is derived in eqn. 7.6 as a K length vector of a single value αφ repeated. Unknown K. A non-parametric extension We now want to generalize to arbitrary K, with the idea being that we would like to generate a small number of clusters. Consider α̃φ = [|αφ/K|]Kk=1. Notice that a Dirichlet prior with a paramer < 1 for all entries favors fewer categories. We consider the limit K → ∞. Then the conditional prior in eqn. 7.7 becomes: 1. For sampling a represented equivalence classes with at least one OTU: n p Z k\i( i = k|Z φ\i, α̃ ) = ∑ α̃φk k + p−1 2.(For creating a new equivalence class: (7.9) new [ ] ) ( )Kp Z new φ φi = k ∩ k =6 Z j ∀ j 6= i |Z\i, α̃ = 1− ∑ p(Zi = k|Z\i, α̃ ) k,k 6=knew αφ = ∑k α̃ φ k + p−1 When applying the above non-parametric prior to derive the posterior for Z, we do not need to worry about the divergence of the inner sum in the prefactor Γ(∑K φk=1 αk ) terms in the Dirichlet−Multinomial likelihood, as the inner sum (from our definition of αφ above ) is always given by α̃φ . This mathematical convenience for convergence is reflected as the "sparse number of custers" assumption above. 139 Case 2: Tree priors for equivalence class memberships Often times, a metage- nomic data analyst has additional relational information about the metagenomic features that one may wish to account for in the above clustering procedure. For instance, ge- nomically closely related OTUs share the same cluster component. These similarlities are reflected in the taxonomic relationships among the microbial features or the edit- distances among the 16S RNA sequences themselves. These measures render themselves conveniently for a tree representation. The nodes T and αθ in Fig. 7.1 precisely corre- spond to a process that accounts for such a prior. If not immediately available, such a tree can be constructed using the Cho-Liu/Edmond’s algorithm based on other relational data among the metagenomic features. We consider prior distribution generated by the model in Fig. 7.2 below. For any given undirected(/directed) tree prior, we would like a given taxon at the leaf assume a cluster membership similar to other taxons closer to it in the tree. We con- sider the following generative model, a simplified caricature of an evolutionary process: p(Zi = k|wi,Pi,φ)∼ Discrete((φi wi) ) h−1 wi|Pi ∼ Discrete [|θih ∏ |P |(1−θiu)|] ih=1 u=1 (7.10) θih ∼ Beta(a,b) ∀ h ∈Pi φih ∼ Dirichlet(αφ 1K) ∀ h = 1 . . .H For a given taxon i, and a prior tree, there exists a path Pi from it to the root of the tree. For all internal nodes that lie in the path (denoted as h ∈Pi), we associate a multinomial parameter φih, and a Bernoulli switching parameter θih. The taxon i sends a 140 Root h Th i Figure 7.2: Prior distributions based on a tree of relationships among taxa. For a given taxon i, a given tree of relationships implies a particular distance metric between the chosen node and all others. We consider a rooted tree of relationships among the taxa to be clustered. All the taxa are positioned in the leaves, and are colored brown. This is very similar to a phylogenetic tree. Blue nodes indicate distinct internal nodes in the tree, the deepness of color indicate the positition/height level in the tree. Each internal node h is associated with a specific Dirichlet-Multinomial probability distribution φh of size K for components. Every taxon i is at the leaft the tree, and there exists a path Pi from it to the root that passes through a subset of internal nodes. A given taxon i chooses its cluster membership according to the φh prescribed by an h∈Pi node that it randomly chooses to stop at (red signal), as it visits each internal node h ∈Pi from the bottom up (green signal). Such tree based priors closely reflects the commonly available tree of relationships in metagenomics, and to an extent, evolutionary divergence in real world systems. 141 variable wi up the tree which chooses to stop at h ∈Pi with probability θih (or jump one level up with probability 1−θih). Stopping at a level h (if no internal h ∈Pi was chosen, it stops at the root of the tree), a taxon chooses to derives its component membership (Zi) according to φih prescribed by node h ∈ Pi with the Geometric probability θ h−1ih ∏u=1(1− θiu). Ac(cordi)ng to the above model then, on average, at a given level h, this product is close to a h−1 aa+b ∏u=1(1− a+b). If a > b, this product −→ 0 as h grows, thus capturing our prior belief. In summary, taxa decide on an internal node to sample their component assignments from, and conditioned on these choices, the entire vector of component assignments Z is generated independently across all these internal nodes, as prescribed by the Multinomial mixture at each node. Posterior Equations for Z and W We can integrate out φih, and θih yielding: H p(Z|w,P)∼∏ DM(Z(h)[|α φ 1nkh) h=1  ( ) ( ) ] h−1 |Pi|a awi|Pi ∼ Discrete | 1− |  ∀ i = 1 . . . pa+b ∏u=1 a+b h=1 Here, DM is the Dirichlet-Multinomial distribution, Z(h) is the vector of taxons that derive their membership from internal node h, nkh is the total number of taxon for which the component indicator is k, and derived with probability at level h. Notice hs in the set of internal nodes from which no taxon derives its membership from can be safely ingored as the DM simply evaluates to 1. Also notice, while in principle marginalizing over w is possible, the posterior for Z is easy to sample conditioned on w. We therefore consider 142 sampling both variables. At a given level h in path, a single Dirichlet-Multinomial determines component assignments for all taxa that sample their component assignments from that level. Thus, we can straightforwardly write the conditional prior as: p(Zi = k|wi = h,Z\i,w φ (h) (h) \i,P,α ) = p(Z φ φi = k|wi = h,Z\i ,w\i ,P,α ) ∝ nkh\i +α The posterior for w becomes: p(wi = h|Z,X) ∝ p(X |Z,wi)p(Z|wi)p(wi) ∝ (p(Z|wi))p(wi)( )[ ] a h−1 n +αφ (7.11) ∝ a(+b ∏ − a1 kh\i )[ a+b n αφu=1 kh ]+nkh−1h−1 a nkh\i +αφ∝ ∏ 1− u=1 a+b nkhαφ +nkh−1 The posterior for Z|w becomes: p(Zi = k|X ,Z\i,w\i,wi = h) ∝ p(X |Z)p[(Zi|wi,Z\i)] (7.12) ∝ p(X |Z) n φkh\i +α Notice the trade off in these posterior equations. Suppose if a given taxon is truly from component k; while the posterior is proportional to the number of features that arise from the component, as we go higher up in the tree, the number of features that arise from the same component is likely to grow, giving rise to higher posterior probablities for Zi = k. However, our prior on wi can force the taxon to stay low in the tree. 143 Incorporating phylogenetic distance If one wants to take branch lengths in a (phy- logenetic) tree into account, a simple and straightforward model would be to model the up-wise jumping probabilities (1− (a/(a+b))) terms for each level h in the the wi pos- terior equation above, as a decreasing function of the to(tal dis)tances spanned upto level h+1. This can be done by making them a solution of log 1−θihθ =−δih h,h+1, where δh,h+1 is the distance spanned from internal node h to h+1 both ∈Pi. With this expression, the posterior for w becomes: [ ] h−1 n +αφ p(wi = h|Z,X) ∝ (θih)∏ (1−θ kh\iih) φ u=1 [ nkhα +nkh]−1 (7.13) h−1 nkh\i +αφ∝ ∏ (1−θih) u=1 nkhαφ +nkh−1 – Non-parametric extension Solution is straightforward and looks similar to that de- rived in the Dirichlet case of previous section, with nk\i replaced by nkh\i. 7.4 Applications The non-parametric inference algorithm with the tree prior was applied to the mouse microbiome data [152] with roughly 1600 taxa. Relative to a standard Dirichlet-Multinomial model for the count data, the equivalence class model lead to a > 300X increase in the data likelihood. Roughly 110 equivalent clusters were found, and some were found to be differentially abundant between cases (mice fed a "western" diet) and controls (mice fed a plant based "BK" diet). While the use of a taxonomy tree prior for the metagenomic features lead to more stable enrichments of taxa among the clusters identified (Fig. 7.3), this was not always the case, suggesting that ecologically equivalent clusters need not be 144 A B Abundance Increase Enrichment of clusters for Abundance Decrease taxonomy with taxonomic tree priors. 0 1 2 0 0.5 1 1.5 2 -log10(p) -log10(p) Firmicutes Clostridia Parabacteroides Ruminococcus Alistipes RuminococcaceaeIncertaeSedis Betaproteobacteria Veillonellaceae Sutterella Enterococcus Hydrogenophaga Roseburia IncertaeSedisXIII Ruminococcus Betaproteobacteria Anaerovorax Anaerovorax Enterococcaceae Subdoligranulum ErysipelotrichaceaeIncertaeSedis Coriobacteriaceae Turicibacter Prevotella Enterococcus ErysipelotrichaceaeIncertaeSedis Coprobacillus Eggerthella RuminococcaceaeIncertaeSedis Faecalibacterium PeptostreptococcaceaeIncertaeSedis Coprobacillus LachnospiraceaeIncertaeSedis Anaerofilum Bacteroidales Mogibacterium Prevotella Bacilli Collinsella Dorea Clostridiales Enterobacter Lactococcus Lactobacillales Anaerostipes Anaerotruncus Prevotellaceae Proteobacteria Anaerotruncus Bryantella Clostridia Anaerofustis Anaerofustis PeptostreptococcaceaeIncertaeSedis Akkermansia Anaerostipes Bacteria Bacteroidetes Clostridium Roseburia Bacilli Erysipelotrichaceae Holdemania Bacteroides Ruminococcaceae Dorea Bacteroidales Mogibacterium Hydrogenophaga Enterobacter Enterobacteriaceae Subdoligranulum Alistipes Lactobacillales Eubacterium Bryantella Clostridium Faecalibacterium Enterococcaceae Coriobacteriaceae Sutterella Proteobacteria LachnospiraceaeIncertaeSedis Veillonellaceae Lachnospiraceae Lachnospiraceae Prevotellaceae Erysipelotrichaceae Collinsella Eubacterium Bilophila Anaerofilum Firmicutes Bacteroidetes Bacteria Bacteroides Turicibacter IncertaeSedisXIII Akkermansia Eggerthella Lactococcus Ruminococcaceae Holdemania Bilophila Clostridiales Enterobacteriaceae Parabacteroides Enterobacteriaceae Prevotella & Lachnospiraceae Clostridia increase Lachnospira, Bacteroides Other ECUs, mostly increased slightly, potentially reflecting the relative decrease abundance measurements. Figure 7.3: Tree priors improve taxonomy enrichments. −log10(p−value) from Fisher exact tests for taxonomic categories in each EC without (A) and with tree priors (B). taxonomically closely related. Applying the non-parametric inference algorithm with the tree prior to the Tara Oceans mirobiome data [8], about 450 equivalent clusters were found across the different oceans categories. Even though the clusters were built based on the different oceans categories alone, they recapitulated several OTU level properties. For instance, as shown in Fig. 7.4, given the general negative correlation between temperature and pressure in ocean layers, OTUs that were found to correlate positively with pressure, showed negative correlation with temperature. This behavior was retained when summarizing at the level of ECs as well. We next sought to identify ECs, with interesting community level properties. The Tara oceans dataset [8] also has sample-specific relative abundance information on the en- 145 Count 0 200 500 7 1 12 15 14 4 9 11 3 13 2 6 8 16 5 17 10 Count 0 200 500 10 8 3 7 16 11 1 14 9 4 6 15 17 13 5 12 2 A Measured Covariates B Behavior of Potential Ecotypes ● ● ● ● ●●● ●● ●● ●● ●● ●● ● ●●● ● ●●●● ● ● ●●●● ● ●● ●● ●●●●●●●●●●●●●● ● ● ●● ●●●●● ●●●● ●●● ● ● ●●● ●●● ●● ● ● ● ●●● ● ● ●● ● ●● ● ●● ●●●● ● ● ●●●●●●● ● ● ● ●●● ●● ● ● ● ●● ● ●● ●● ●●● ● ● ●● ● ●●●●●●● ●●● ●●● ● ●● ● ●● ● ●● ● ●● ●●● ● ● ● ● ● ●●●● ●●●● ● ● ● ● ●● ●●●●● ● ● ● ● ●● ● ●● ●●●● ● ● ●●● ●●● ● ● ●● ● ● ● ●●●●●●●●●● ●●●● ●● ●●● ●● ● ●● ● ●●● ●● ● ●● ● ● ●● ●● ●● ●●●● ● ● ●●● ● ● ●●●●● ●● ●●● ●● ●●● ● ●● ● ● ● ●● ● ● ● ●●●●●● ● ● ● ●●●● ● ● ● ●● ● ●● ● ●● ● ● ● ●●●●●●●● ●● ●●● ●●●●●● ● ●● ●●● ●● ●●● ●●●●● ●● ● ● ●● ● ●●●●●● ● ● ●● ●●●●●●●●● ●● ● ● ● ● ●●● ● ●●● ● ● ● ● ● ●●●● ●●● ● ● ● ●●●● ● ●●● ● ● ● ● ●●● ●● ● ● ● ●●● ● ●●●● ● ●● ● ●●●● ●●● ●●●●● ● ●●●●●●●●●●●●●●●●●●●●● ● 0 5 10 15 20 25 30 Temperature ( C ) Temperature ( C ) C Behavior of OTUs ●●●●●●●●●●●● ●●●●● ●●●● ● ●●● ●●●● ●●●●● ●● ●● ●●●●● ● ●●●●● ●●●● ● ● ●●●●●● ●●● ●● ● ●●● ●●●●●●●●●●●●● ● ● ●●●●●●● ●●●●●●●● ● ●●● ●●●●●● ●●● ●● ● ● ● ●●●●●●●●●●●●●● ● ● ●●●● ● ● ●● ● ●●●●●●●●● ●●● ●●●● ● ●●●●●● ●●●●●●●● ●●●●●● ● ●●●● ●●●● ●●●●●●●●●●● ● ● ●● ●●●● ●●●●● ● ●●●●●●●●● ●●●●●●● ●●●●●● ● ●● ● ●● ●●●●●●●●●● ●●●●●●●●● ● ●● ●●●●●● ●●●●●●●●●●●●● ●● ● ● ● ●● ● ●●●●●●●●● ●●●● ●● ● ● ● ● ●●●● ●●●●●● ●●●●● ●●● ●● ●●●●●● ● ●●●●●●●●●● ●●●●● ● ● ●●● ● ●●●●●●●●●●●●●●●●●●●● ● ●● ●● ●●●● ●●●● ●●● ●● ● ●● ● ●●●● ●●●●●●● ●● ● ●●● ● ● ●● ●● ●● ●● ● ● ● ● ●●●●●●●●●●● ● ●●●●● ●●●●●●●● ● ● ●●●●●●●● ● ●●●●●●●●●● ●● ●●● ● ● ● ● ● ● ●●●●● ●●●●●●●●●●●●● ●●●● ●●●●●● ●● ●●● ●● ●●●● ● ● ● ●●●●●●● ●●●●●●●●●●●●●●●●●●● ● ●● ●● ● ●●●●●● ● ● ●●●●●● ●●●●●●●● ●●●● ● ● ● ● ●● ●●●●●●●●●●●● ● ●● ●● ● ● ●●●●●●●●●●●● ●●●● ● ● ●●● ● ●●●●●● ●●●●●●●● ●● ●●●● ●●●●●●●●●●●●● ● ● ● ● ● ● ●●● ●●●●●●●●●●●●●●●● ●●● ● ● ● ●● ● ●●●●●●●●●●●●●● ●●●● ● ● ● ● ●● ●●●●●● ●● ● ● ● ●●●●●● ●●● ● ● ● ●● ● ●●● ●●●●●●●●●● ●●● ●●●●●●●●●●●●● ● ●●● ●●● ●●● ● ●● ● ●●● ●● ●●● ●●●●●●●●●●●●●● ●● ● ● ● ●● ● ●● ●● ● ●●●●●●● ●●●●●●●●●●●●●●●●●●●● ●● ●● ● ●● ● ●●●● ●●●●●●●●●●●●●●●● ●●●●●●●●●●●● ●●●●●●●●●●● ● ● ●●● ●●●●●● ●●● ●●●●●●●●●●●●●●●●●●●● ●● ● ●●●●● ● ● ●● ● ● ● ● ●● ●● ● ● ● ● ● ●●●●●●●●●●●●●●●● ●● ● ●●●●●● ●●● ●●●●●●●●●● ● ● ● ● ●●●●●●●●●● ●●●●●●●● ●●●●●●●●●●● ●●●●●●●●●●●●●● ●●●● ● ●● ● ●● ● ●●● ● ● ●● ● ●●●●●● ● ●● ● ● ●● ●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●● ● ● ● ● ●●● ●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●● ● ● ● ●● ● ● ● ● ● ●●●● ● ● ● ●●● ● ●● ● ● ● ● ● ● ●●●●●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ●●●●● ●●● ● ●●●●●●● ●● ● ●●●● ● ● ● ●●● ●●●●●● ●●●●●●●●●●● ●●●●●●●●● ●●●●●●● ●●● ● ● ● ● ● ●●●●●●●●●●●●●●●● ●●●●●●●● ● ●● ●●●●● ●●●● ●●●● ●●●●●●●● ●●●●●●●●●●●●●●●●●●●●● ●●●●● ●●● ● ● ●●● ●● ●● ● ● ●●●●● ●●●●●●●●●●●●●●●●● ●●●●●● ●● ● ● ● ● ●● ●● ●●●●● ●●●●●●●●●●●●●●●●●●● ● ●●●●● ●● ●●●●●● ● ● ● ●●●●●●●●●● ● ●●●●●● ●●●●● ● ● ● ●●●●●● ●●●●●●●●●●●● ●●●●●● ●● ●● ● ● ●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●● ● ●●●●● ● ● ● ● ● ●● ● ●● ● ● ● ●●● ● ● ●●●●●● ●●●●●●●●●● ●●●●●●●●●● ●●●●●●●●●●● ● ●●●●●● ● ●●●●● ● ● ● ●●●●●● ●●●●●●●●●●●● ●●●●●●●●●●●●●● ● ● ● ●● ●● ●●●●●●●●●●●●●● ●●●●● ●● ●●● ●●●●●● ●● ● ● ● ● ●● ●● ●● ● ●●●●●● ●●●●●●● ● ●●●●●●●●●●● ● ●● ●●●●●● ●● ●●● ●●● ● ●●●●● ● ● ●● ●●●●●●● ●●●●●●●●●●● ● ●●●●●● ●●●●●● ●●●●●●●● ● ●● ●● ● ● ● ● ● ● ●● ● ●●● ●● ●● ●●●●●●●●●● ● ●●●● ● ● ● ● ● ● ●● ●●●● ●●●●●●●●●●● ● ● ● ● ● ● ● ●●●● ●●●●●●●●●●●●●●●●●● ● ● ●● ●●●●●●●● ●● ●●● ●●●●●●●● ●●● ●●●●●●●●●●●●●● ●● ●●● ● ● ● ● ●● ●●●●●●●●●●●●●●●●●●●●●● ● ● ●● ●●●●●●●●●● ●● ●●● ● ● ●●● ●● ● ● ●●●●●● ● ●●●●●●●●●●●●● ●●● ●●● ● ● ● ● ● ● ●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ● ●●●●●●● ●●●●●●●●●●●● ●●●● ●●●●●●●●●●●●●●● ●● ● ●●●●● ●●●●●●● ●● ●●●●●●● ●● ●●●● ●●●●●●●●●●●●●●●●●●●●●● ● ●● ● ● ● ● ● ●● ●●●●● ●●●● ●● ● ●● ● ● ●●●●●●●●●●● ●●● ●●●●●●●●●●●●●●●●●●●●●●●●● ●● ● ● ●●●●●●●●●● ● ●●●●●●●● ●●● ● ●● ●●●●●●●●● ●● ●●●●●●●●●●● ●● ● ●● ●●●●● ●●● ●●● ● ● ●●● ●●●● ●● ●●●●●●●● ●●●●●●●●●●●●●●●● ●● ● ● ●●● ●●●●●●●●● ● ●●●● ●●●●● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ●●●●●●●●●●● ● ● ●● ● ●● ● ● ● ● ● ● ●● ● ● ●●●●● ● ● ●●●● ●●●●●●●●●●●●●● ●●●●● ● ● ● ●●●● ● ● ● ● ● ●●●●●●●● ● ●●●●●●●●●●●●●● ● ●●● ●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ●●●●●● ● ●●●●●●●●●●●● ●●●● ●●●●●●● ●●●●●●●●●●●● ● ●● ●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●● ●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●● ●●● ●●●● ● ●●●●●● ● ● ●● ●● ●● ● ●●●●● ● ● ●●●● ●●● ●● ●●●●●●● ●●●●●●●●●●● ●●●●●●●●●●●●●●●●● ●● ● ● ● ● ● ●● ●●●●●●●●●● ●●●● ● ●●● ● ● ●● ● ●●●●●●●●● ●●●●●●●●●●●●●●● ●● ●●●●●●●●●●●●●●●●●●●●●● ●●●●●●●●● ● ●● ● ● ●●●● ● ● ● ●● ● ●● ● ● ●●●●●● ● ● ● ●●●●●●●●●●●●● ●●●●●●● ●●●●●●● ● ●● ●●●●●●●●● ●●●●●● ●●●●●● ●●●●●●● ●●●●● ●●●●●●●●●● ● ● ● ●● ● ● ●●●● ●●●●●●● ● ● ● ●●●●●● ● ●● ●●●●●●● ●●●● ●●●●●●●●●● ● ●●● ●● ● ● ● ●● ●●●●●● ●●● ●●●●●●●● ●●●●●●● ●●●●●●●●●●● ● ● ● ● ● ●●●●●●●●●●●●●●●●●●● ● ● ●●● ● ●●●●●●●●● ● ● ● ● ● ●●●● ●●●●●●● ●● ●●● ●●● ● ● ● ● ● ● ● ● ●●● ●●● ●●●●●●●●●●●●●●●●● ●●●● ● ●● ●●●● ● ●●●●● ● ● ● ● ● ● ● ●● ● ●● ●●●●● ● ●●●●● ●● ● ● ●●●● ●●●●●●● ● ● ● ● ● ● ●● ●● ●●●●● ● ● ● ● ●● ●● ●●●●● ●●●●● ●● ● ● ● ●●● ●● ●●●●●●● ● ●●● ●●● ● ● ● ●● ● ● ● ●●●●●●●●● ●●● ● ● ●● ● ● ● ● ● ● ● ●●●●●●● ● ●● ●●● ● ● ●● ●●●●● ● ●●●●●●● ●●●● ●●● ●●●●●●● ●●●●●●●●● ●●●●●●●●● ●●● ●● ● ●● ● ● ● ● ● ● ● ●● ●●●● ●●●● ●● ●●● ● ● ● ●● ● ●●●● ● ● ● ● ● ● ● ●●● ● ●●●●●● ●●●● ●●●●● ● ● ●● ● ●● ●● ●● ●●●●● ●●● ●●●●●●●●●●●●●●●●●● ● ● ● ●●● ●● ●● ●●●●● ●● ●● ● ● ●● ● ●●● ● ●● ● ●●● ● ●●● ●●●●● ● ●●●● ●●●●●●● ●●●●● ●●●●●● ●● ●● ●● ●● ● ●●●● ●●●● ●● ●● ●● ●●● ● ●●●●●●●●●●● ● ●●●●●●●●● ●●● ● ●●● ●●●● ●●● ● ●● ●●●●● ● ●●●●●● ● ●●● ● ●●●● ● ● ●●● ● ●● ●●● ●●● ●●●●●● ● ● ●● ● ●● ●●● ●●●● ● ● ● ●● ●●●●● ●●●●●●● ● ●●●●●●●● ●●●●●●●●●●● ●● ●●● ●● ● ● ●●● ●● ●●●●● ●●●●●●●●● ●●● ●●●●● ● ●●● ● ●● ● ●●●● ● ● ●●●● ●● ●●●●●●●●●●● ●● ● ●● ●●●●●● ●●●●●●●● ● ●● ● ●● ● ● ●● ● ●● ●●●● ● ● ● ● ●●●●●●●●●● ●●● ●● ● ● ●● ●●● ●●●● ● ●●●● ● ●● ●●●●● ●● ●● ●● ● ● ●●●●●● ●●●●● ●●●●●●● ●●● ●●●●●●●●● ● ● ●● ● ● ● ●●●●●●● ● ●● ●● ●●● ●●●●● ● ● ● ●●● ● ●●● ●● ●●●● ● ● ● ●●● ● ●●● ●●●●●●● ●●●●●● ● ●● ● ● ●●●●● ●●●●● ●● ●● ● ● ● ● ●●●● ●●●● ●● ● ●●● ●●● ● ● ● ● ● ●● ● ●●● ●●●●●●●●● ● ● ●● ●● ●● ● ● ● ● ● ●● ● ●●●●● ●●●●●●●●●●●●● ● ● ● ● ● ● ● ● ●● ●● ● ●●● ●● ●●● ●●●●●● ● ●● ● ●● ●●● ●●●● ●●●●●●●●●●●● ●●●● ● ● ● ● ● ● ● ● ●● ● ●● ●●●●●●●● ●● ●●●●●●●●●● ● ● ●● ● ● ● ●●●● ●● ● ● ● ●● ● ●●●●● ●● ●● ●●●●● ● ●●● ● ●●● ● ●● ●●● ●●● ●● ●●●● ● ●●●●● ● ● ● ● ● ●●●● ● ● ●● ●● ● ●● ●● ● ●●● ●●● ●● ●●●●● ● ●● ●●●●● ● ●●● ●●●●●●● ●●●●● ●● ●●●●●● ● ● ● ●● ● ● ● ●● ● ●●● ●●●●● ●● ● ● ● ●● −0.5 0.0 0.5 ρTemperature ( ° C ) Figure 7.4: Equivalence classes capture environmental gradients. (A) The depth and local temperature measurements for the various Tara Ocean’s samples. (B) A plot of the correlations of each EC’s abundance profile with depth and temperature. (C) Same as B, except the correlations are computed with OTU abundance profiles. We observe that ECs recapitulate the anti-correlations in temperature and pressure, similar to OTUs. 146 Depth (m) ρDepth (m) 4 6 8 10 −0.5 0.0 0.5 1.0 Depth (m) coded functional/gene content (e.g., various ion transport channels, glycolysis pathways) by the sample’s microbiome. So after ranking the ECs based on the p-values obtained from a differential abundance test [124] across oceans, we chose ECs that showed strong correlations to at least one of the functional categories. For each such chosen EC, we built a hierarchical clustering tree of the constituent OTUs, split the EC to finer clusters depend- ing on the tree topology manually, and visualized the count profiles for each of these ECs across various categories samples. In several cases, we not only found a coherent set of OTUs that behave equivalently in their count profiles, but they also had sound potential for generating testable hypotheses. For instance, as illustrated in Fig. 7.5, cluster EC163 was depressed in its relative abundance in the samples from (deep ocean) mesopellagic layer, but had comparable relative abundances in the samples from surface (SRF) and deep chlorophyll maximum (DCM) oceanic layers. The EC was highly correlated with the relative abundances of the Iron (III) transport systems, which also exhibited higher abundances in DCM and SRF layers compared to the samples from MES layer across oceans. The EC was a multi-phyla cluster with member taxa from 9 different Phyla, all consisting of members with known roles in Iron metabolism and chelation with the aid of iron transport channels. In particular, several Proteobacteria is known to convert Iron(II) to Iron (III), which are further metabolized by the other members of phyla associated with EC163. These associations and correlations lead us to hypothesize that the decrease in Proteobacteria (for e.g., in MES) leads to reduced levels of circulating Iron(III); this is limiting for the growth of strains with a higher preference for Iron(III). The MES relative abundance of the EC163’s Deferribacteres and Proteobacteria member taxa could explain the observed reduction in the iron (III) transport channels’ relative abundances. 147 A B C Iron(III) transport system EC163 EC163: sub−groups DCM SRF MES (DCM) (MES) (SRF) (DCM) (MES) (SRF) 0 20 60 100 140 Sample Index D E F Proteobacteria Proteobacteria Proteobacteria Firmicutes Firmicutes Firmicutes Euryarchaeota Euryarchaeota Euryarchaeota Deferribacteres Deferribacteres Deferribacteres Cyanobacteria Cyanobacteria Cyanobacteria Chloroflexi Chloroflexi Chloroflexi Bacteroidetes Bacteroidetes Bacteroidetes Actinobacteria Actinobacteria Actinobacteria Acidobacteria Acidobacteria Acidobacteria Number of Taxa in (MES) in (SRF) and (DCM) G Fe II Actinobacteria Proteobacteria Fe III Acidobacteria ? MES Deferribacteres Figure 7.5: Equivalence classes of OTUs as better hypotheses generators. (A,B) The count profile of EC163, a cluster of 16S OTUs, was found to highly correlate with Iron(III) transport system frequencies from the whole metagenome shotgun sequencing from the Tara Oceans project [8]. (C) Observing the count profiles of the two distinct subgroups within this cluster showed the coherent average abundance changes of distinct phyla in different depth layers of the oceans surveyed in the experiment. (D,E,F) indicate the dis- tribution and relative abundances of the EC163 member taxa. (G) a potential hypothesis to explain these results. 148 Rel. Ab 13.5 14.5 15.5 16.5 0 2 4 6 8 10 12 Rel. Ab 200 400 600 800 0 10 20 30 40 50 60 70 Rel. Ab 0 200 400 600 800 0 20 40 60 80 100 As future work, we aim to integrate appropriate statistics on convergence of the Gibbs samplers, introduce additional structure in the model to constrain clustering choices, and incorporate compositional correction factors to offer some prection for technical vari- ation. While the inferences and hypotheses generated above appear meaningful, given the discussions in part I of this thesis, the fact that such inferences were based only on relative abundance information does not offer much confidence in them, and additional evidence must be sought. With rapidly growing dataset sizes, compute time will limit the applica- tion of the proposed approach, and faster algorithms based on other approximate inference methods (like variational inference and stochastic gradient descent) will be preferable. 149 Chapter 8 Evolutionary invasion analysis of altruistic post-infection suicidal genotypes in a well-mixed epidemiological model. In this chapter, we sketch the conditions for the evolution of altruistic, post-infection suicidal mechanisms in a simplified well-mixed epidemiological model, using an adaptive dynamics approach. The derivation serves to illustrate the difficulty of explaining the surprising evolutionary origins, and the continued maintenance of an altruistic trait in well-mixed systems, in general. As with chapter II, our inspiration for analysis arises from the following set of observations in microbiology. Prokaryotic Toxin/Anti-Toxin (TA) and Abortive Infec- tion (Abi) defense systems work against parasite invasion by inducing dormancy/cellular suicide following infection [189, 205, 237]. To avoid excess population loss when unin- fected, such systems are regulated so that their activation is restricted to infected popu- lations [238]. Given the widespread occurrence and the high effectiveness of phage tar- geting machinery that clear a variety of phage infections without inducing host cell death (like that of restriction-modification (R-M) and CRISPR systems) [239,240] or the widely occurring envelope resistance mechanisms [241–243], the prevalance of suicidal defense 150 systems in prokaryotes is surprising [190, 237, 244]. Indeed, their complete absence in most endosymbionts is perhaps a testament to their associated fitness cost [244]. Thus the search for an explanation for their evolutionary origins and ecological maintenance in natural prokaryotic populations is quite interesting. Several investigations have previously addressed the evolution of altruistic defense systems by incorporating epidemiological feedback in the context of evolutionary game theory or agent-based simulation models and extensively concluded that spatial structure is necessary to allow for the stable evolution of altruistic hosts [245–249]. Favorable spatial constraints can limit the dispersal of infectious agents, and post-infection suicide of altruistic hosts can reduce the local densities of infectious agents. Such effects can ultimately reduce the propensity of infections among locally dense altruists, providing the hosts with a fitness advantage through selective assortment [250]. Similar results for the evolution of altruism have been proposed in social game theory [251–253]. Kin selection and inclusive fitness theory offer another route [254, 255]. Epidemiological Model We consider a simplified epidemiological model, with extensive similarities to the CRISPR model analyzed in Part II. Here, susceptible hosts (with density S) grow at a rate of b under a carrying capacity constraint of K, the intensity of which is measured by α . Susceptible hosts acquire infections from infecteds (with density I) at a rate of β . Infected hosts clear infections through background resistance mechanisms at a rate of ρ . The back- ground host mortality rate is assumed to be γ . Infected hosts additionally undergo suicide 151 at a rate of ξ , and susceptible hosts regulate this autoimmunity through a suppression factor of 0 ≤ δ ≤ 1: a value of 1 implies no difference in autoimmunity rates between the susceptible and infected host states, while a value of 0 implies complete suppression in the susceptible host state. The excess mortality induced by the parasite (virulence) is indicated by λ . Denoting κ = αK , we can write the following non-dimensional system for the resident population (in which the parameters and variables are rescaled accordingly but we preserve the same notation): Ṡ = b [1−κ (S+ I)]S+ρI−βSI− (δξ + γ)S (8.1) İ = βSI− (ρ +λ +ξ + γ) I 8.1 Evolutionary Dynamics As a defense mechanism, both regulation by suppression and abortive infection mediated resistance can be costly to a host [188, 213, 256–264]. Our goal is to perform a very simplified analysis and get a few insights on mech- anisms that could lead to the evolution of altruist defense. In the above model, let us consider the evolution of altruist defense (ξ ) at a fixed δ under the special case where there is no recovery ρ = 0 (but notice that the condition ρ = 0 can also reflect an in- stantaneous recovery/resistance where only a fraction of parasite-adsorbed susceptibles actually proceed to the infected stage). Assume that our resident host population is at a stable endemic equilibrium (S > 0, I > 0), encoding a no-suicide strategy (ξ = 0). We ask when a rare mutant encoding a strategy ξm = 0+ε,ε > 0 small can invade this no-suicide 152 resident host population. Under the assumptions of the theory of adaptive dynamics, a mutant encoding a strategy ξm will invade a resident strategy ξ when its per-generational invasion fitness φm > 0. In our model, the invasion fitness for a mutant encoding a strategy ξm is given by: φm = bm(1−κm(S+ I))−βmI− (δξm + γ) Here the m−subscripted parameters (other than ξm) indicate that they are as yet unspecified smooth functions of the mutant strategy ξm, and possibly also of the encoded resident strategy ξ . For instance, bm = b(ξm,ξ ). Usually, transmission coefficient (β ) is considered to be a parasite related phenotype but for now, we let that to be an as-yet unspecified smooth function of the encoded suicidal strategy as well. For evolution to lead away from the resident no-suicide resident defense phenotype, the fitness gradient at ξ = 0 must be positive. That is, dφmdξ |ξm=ξ=0 > 0. This, by the chain rule, means:m dbm [ ]· 1−κm(S+ I) |dξ ξm=ξ=0m − · dκb mm · (S+ I)|dξ ξm=ξ=0m (8.2) − dβm · I| dξ ξm=ξ=0m > δ ∈ [0,1] Now, the terms (1−κm(S+ I)) > 0,S+ I > 0, and I > 0 under conditions needed for stable endemic equilibrium. So whether the above inequality holds ultimately rests on the interplay of other (derivative) terms. We will consider a series of simple cases; each of these will ultimately inform potential mechanisms that one can explore for the evolution 153 A 1.00 1.00 2.5 2.0 0.95 0.99 1.5 0.90 0.98 0.97 1.0 0.85 0.96 0.5 0.80 0.95 0.0 0.0 0.5 1.0 1.5 2.0 2.5 0.0 0.5 1.0 1.5 2.0 2.5 0.0 0.2 0.4 0.6 0.8 1.0 ξ ξ Suppression Level, δ B 1.00 1.00 2.5 0.98 2.0 0.95 0.96 1.5 0.90 0.94 1.0 0.85 0.92 0.90 0.5 0.80 0.88 0.0 0.0 0.5 1.0 1.5 2.0 2.5 0.0 0.5 1.0 1.5 2.0 2.5 0.0 0.2 0.4 0.6 0.8 1.0 ξ ξ Suppression Level, δ Figure 8.1: Evolution of host abortive infection potential. Trade off curves and their shapes influence the evolutionary stability (A) and bistability (B) of the evolved suicidal defense strategy. Green and brown regions indicate regions of positive and negative se- lection respectively; black dark line indicates evolutionary stable states (ESS), dotted line indicates unstable states. of suicidal defense. Case A If both the intraspecific and the transmission coefficients ar[e fixed consta]nts, and not a function of ξm (κm ≡ κ , βm ≡ β ), then ∂bm∂ξ |m ξm=ξ=0 > δ/ 1−κ(S+ I) |ξm=ξ=0 ≥ 0 is needed for suicidal defense to evolve. This means, that the hosts must ex- perience a sufficient reduction in intrinsic growth rates (i.e., its gradient must be sufficiently positive) at ξ = 0. It is easy to see that opposite conclusions arise for the two other cases below. Case B When bm≡ b,βm≡ β but κ still a function of ξm and possibly also of ξ , ∂κm∂ξ |m ξm=ξ=0 needs to be sufficiently negative for the inequality to hold. That is, the resident hosts must suffer increased intraspecific competition at ξ = 0. 154 b(ξ) b(ξ) β(ξ) β(ξ) Autoimmunity, ξ Autoimmunity, ξ Case C Similarly, for bm ≡ b,κm ≡ κ but when β still left to vary with ξm and possibly also of ξ , ∂βm∂ξ |ξm=ξ=0 needs to be sufficiently negative for the inequality to hold.That is,m the resident hosts must suffer increased disease transmission rates at ξ = 0. This simple catalog then reveals that there are at least three distinct mechanisms by which adaptive evolutionary dynamics can lead to potential altruistic defense against parasites in a well-mixed SIS model. Indeed, each of these mechanisms can be viewed as a punishment mechanism that enforces cooperation among defecting individuals that do not support altruistic defense. The above analysis addressed the question of whether a mutant encoding a suici- dal strategy can invade a resident host population encoding no suicidal defense strategy, under the model assumptions. We can also ask when evolutionary singular points, in particular, attractors, occur in the interior of the ξ parameter space; in this case, the in- equality in (8.2) sign above must be replaced with equality. It is then clear that under the classical const-benefit trade-off assumption of monotonically decreasing birth rates with higher resistance (in this case, suicidal defense ξ ), unless one of the other gradients is sufficiently positive in (8.2), there is no way for altruism to evolve. This is illustrated in Fig. 8.1, where dependence of intrinsic growth rate and transmission rates on the encoded suicidal strategy allows stable evolution of altruist defense. Similar results are obtained numerically in the presence of background resistance and when the parasites are allowed to coevolve their virulence strategy. Thus, for post-infection suicidal hosts to evolve in an evolutionarily stable man- ner, unless helped by other structural changes in the model, complete loss of suicidal 155 systems must be assumed to be costly. Such a helping mechanism for the evolution of altruistic traits [253,265] operates by punishment [266–271], and is mimicked by natural prokaryotic toxin/anti-toxin systems that induce post-seggregational killing, or when they take part in essential cellular pathways, or when parasites specialize to infect less suicidal hosts. 156 Chapter 9 Multi-resolution analysis with bifurcation analysis of smoothing spline models. Modern biological data are often measured along time and spatial coordinates. And relative to some baseline reference, they reveal trends at various scales. Arctic and antarc- tic sea ice cover vary by month, season and years. Relative to healthy controls, cancer associated epigenetic signals have been recorded over small chunks of DNA, yet one can also view them to be organized coherently over large blocks of the genome. The goal of this study is to develop some analytics to identify such trends at various scales in a systematic fashion. We exploit smoothing spline ANOVA models to model case-control longitudinal data, and propose carrying out bifurcation analysis of the fitted spline’s roots (zero crossings) as a function of the regularization parameter λ (which determines the wiggliness of the fit) to identify qualitative changes in the spline topology. We illustrate the potential of the proposed approach in revealing sound inferences in the case of a few biological applications. 157 9.1 Smoothing Splines Models We now give a brief introduction to the theory of smoothing splines, which is needed for our model construction below. For more mathematical details, we direct the readers to refs. [272–274]. Suppose we want to study the association of predictors xi to outcome yi, given the observations i = 1 . . .n. The predictors xi could arise from a discrete set of treatment groups∈G= {1, . . . ,K} or a continuos covariate like time or age∈R. More generally, the predictor variables could be multi-dimensional, arising from a product space of covariates. As with other chapters in this thesis, we shall use the · notation to indicate vector- ized quantities. Given the data (x·,y·), the smoothing spline technology aims to fit general functions f (xi), xi ∈ χ, f ∈H to describe the mean outcome, by solving a penalized op- timization problem: argminL( f ;y·,x·)+λJ( f ) (9.1) f∈H Here, χ is the input domain, and H is a reproducing Kernel Hilbert space (RKHS) of functions defined on χ . L(·) is a loss function, and arises usually from the likelihood model that one prescribes for the data. For instance, if the data generative process is a Gaussian process, one recovers the root mean square deviations of the predicted from the observed values, as L(·). λ ∈ R is a penalty parameter that acts on the "roughness" (wiggliness) measure of the function J( f ). Thus, roughly speaking, the goal of the above optimization problem is to find a reasonably smooth function whose predictions deviate minimally from the observations, while at the same time satisfying some constraint on its 158 roughness. The key advantage of restricting ourselves to fitting functions from an RKHS H 2 is that any function f ∈ H can be decomposed linearly in terms of their orthogonal projections based on the given data points i= 1 . . .n (like a standard finite K− dimensional RK vector space) as: n n f (x) = ∑ αiR(x,xi)+ν(x),with ν ∈H {q : q = ∑ αiR(·,xi)} i=1 i=1 n (9.2) = ︸∑ ciR︷A︷(x,xi︸)+ ︸ζ︷(︷x︸) , with H = HA⊕HBi=1 ∈HB⊂H ∈HA⊂H Here α·,ξ· are scalars. R(·, ·) is a bivariate, symmetric non-negative definite func- tion called the reproducing kernel (RK) and is unique for every RKHS. The first line in the above equation decomposes the function into two orthogonal pieces, the first term based on the space’s RK, and a residual term ν . In the second line, the overall space H is decomposed in to two orthogonal closed subspaces HA and HB, which are themselves RKHSs, and therefore have their own RKs; by RA(·, ·), we mean the RK associated with the RKHS subspace HA. The term orthogonal is used in the standard sense: the space’s associated inner product ( f ,g)H = 0 ∀ f ∈HA, g ∈HB. So because all our decomposi- tions above are orthogonal, it should be clear (R(·,xi),ν)H = 0, and (RA(·,xi),ζ )H = 0. Notice we could write ζ in terms of the RK of HB as well; for our purposes, that is immaterial. In fact, one can go further and exploit this decompositional convenience to build 2a special type of Hilbert space, which is defined as a complete vector space endowed with an inner product. 159 a classical ANOVA like procedure as follows. Consider decomposing the subspace HB further: n f (x) = ∑ αiR(x,xi) i=1 n = ︸∑ ciR︷A︷(x,xi︸)+ ζ︸︷(︷x︸) , with H = HA⊕HBi=1 ∈HB⊂H ∈HA⊂H n m−1 m−1 = ∑ ciRA(x,xi)+ ︸∑ d︷j︷φ j(xi=1 j=1 ︸)+ ρ︸︷(︷x︸) , with ρ ∈HB {q : q = ∑ d jφ j(·)}j=1∈HB1⊂H ∈HB0⊂H (9.3) where φ j, j = 1 . . .m− 1 are some set of basis functions chosen such that they are orthogonal to the space spanned by RA(·,xi) ∀i = 1 . . .n. For instance, this basis can be chosen to include constant and linear order terms (like a general linear model). The higher order terms are then left intact in subspaces HA and HB1. We shall come back to this point in the next subsection. In summary, we have decompsed our function space of interest, the RKHS H , into three distinct orthogonal subspaces: H = HA +HB0 +HB1. 9.2 Two specific instances of the problem We now briefly illustrate two example smoothing spline models. These will serve to simplify our discussions for case-control longitudinal data next. Each such instance of the general problem considered in eqn. 9.1 involves (a) defining the loss function, (b) the space of functions we need to search over as an RKHS, and their corresponding RKs, and (c) the roughness penalty we would like to impose. Usually, the roughness penalty 160 measures the (inner product) induced squared norm of the function’s projection onto the penalized H1 subspace, where higher order terms live. 9.2.1 Ridge regression The standard one-way ANOVA can be cast as an instance of the penalized opti- mization problem mentioned in eqn. 9.1, by fitting discrete functions f : χ → R, where χ = {1, . . .K} represents the experimental groups. It turns out the function space H de- fined this way is an RKHS with an RK R(x,y) = I[x==y]. Let 1 be a K dimensional vector of 1s. We can decompose H 3 f into two orthogonal subspaces with individual RKs (that add to give the original space’s RK as): ( ) 11T 11T I[x==y] = ︸︷︷︸+︸ I−K ︷︷K ︸ (9.4) R0 R1 . The RKHS subspaces H0 and H1 of H spanned by each of these RKs can be rea- soned about by studying their respective linear combinations {∑i αiR j(·,xi),αi ∈ R,xi ∈ χ} for j ∈ {0,1} corresponding to each of the RKs R0, and R1. This corresponds to spaces H0 = { f : f (1) = f (2) = · · ·= f (K)}, and H1 = { f : ( f ,g) = 0, f ∈H0,g ∈H1}, where the inner product ( f ,g) = f TH g. Then, when one considers the following instance of the optimization problem 9.1: 1 n argmin 2 T f∈H ︸ ∑(Yi−n i=1 ︷︷ f (xi))︸+ ︸λ ︷f︷ ︸f (9.5)=λJ( f ) =L( f ;x·,y·) 161 we are effectively solving the ridge regression problem. 9.2.2 Cubic smoothing splines Consider fitting functions f : χ → R, where χ = [0,1] to a continuos covariate x∈ χ . Let us restrict our attention to fitting functions whose second derivatives are square integrable: i.e., for f ∈ {g : g(2) ∈L2[0,1]}. It turns out one can decompose this RKHS for f as H = H0⊕H1, where: { } H ={g : g(2)0 = 0 ,and, ∫ ( ) } (9.6)2 H1 = g : g(0) = g(1) = 0, g(2) dx < ∞ with corresponding RKs: R0(x,y) = 1+ k1(x)k1(y) R1(x,y) = k2(x)k2(y)− k4(|x− y|), where : (9.7) k1(x) = x−(0.5 ) 1 1 k2(x) = k2(1(x)−2 12 ) 1 k2(x) 7 k4(x) = k41(x)− 1 +24 2 240 are the Bernoulli polynomials. The point here is to note that the zero and first order terms (in x) are attached to H0 and the non-linear higher order terms are restricted to H1. So one choice of basis 162 functions for H0 are given by φ0(x) = 1,φ1(x) = x− .5. With such a setup, one can write f (x) as: 1 n f (x) = ︸∑ d jφ j+∑ R1(x,xi)+ρ(x)j=0︷︷ ︸ i︸=1 ︷︷ ︸ (9.8) ∈H0 ∈H1 where as before ρ ∈H1 g : ∑ni=1 αiR1(·,xi). Due to orthogonality (R1(·,xi),ρ) = (φ j,R1(·,xi)) = (φ j,ρ) = 0. One then considers the penalized problem for obtaining cubic smoothing splines f : 1 n ∫ 1( )2 argmin︸ ∑(Yi︷−︷ f (x )) 2+λ f (2)i dx f∈H n i=1 ︸ ︸ 0 ︷︷ ︸ (9.9) =L( f ;x ,y ) =λJ( f )· · where the penalty mesures the roughness/curvature of the function. 9.2.3 Deriving the solution of the cubic smoothing spline problem Substituting the decomposition of f derived abo∫ve, and noting the orthogonalities of ρ , R1(·,xi), and φ j, together with the identity that 1 (2) (2) 0 R1 (x,xi)R1 (xi,x) = R1(xi,x j) we find that our specific cubic spline optimization problem in 9.9 is given in matrix terms as: argmin(Y −Sd−Qc)T (Y −Sd−Qc)+nλcT Qc+nλ (ρ,ρ) (9.10) c,d . Here Y is the n length response vector, Q is an n×n matrix with Q(i, j) = R1(xi,x j), 163 S is the n× 2 matrix with row-wise entries (φ1(xi),φ2(xi)). It should be clear that ρ appears only through its square norm term, which being independent of the parameters, merely serves to introduce a non-negative shift in the objective’s location away from zero; clearly then, the objective is minimized when ρ = 0. One then exploits linear algebraic techniques to solve for c and d to estimate our cubic smoothing smoothing splines. As one would expect, these estimates are functions of the roughness penalty λ . We thus arrive at the celebrated Kimmeldorf-Wahba result [272] that polynomial smoothing splines reside in a closed, finite dimensional space H ⊕{g : g = ∑n0 i=1 αiR1(·,xi),αi ∈ R}. 9.3 Proposed strategy for multi-resolution analysis of case-control lon- gitudinal data As we noted in the general formulation of the smoothing splines models in eqn. 9.1, by increasing the penalty parameter λ , one trades off data fit for model simplicity. For instance, in the case of cubic splines models, λ → ∞ chooses a linear fit with no wiggli- ness, while λ → 0 retrieves a cubic spline within the function space that interpolates all the data. We make three key observations that lead us to the proposed algorithm. First, vary- ing λ leads us to fitting models with varied complexities. This means, in terms of a difference (contrast) function that one might build for longitudinal data (e.g., time series measurements of a bacterium’s abundance in cases and controls, or DNA coordinate-wise epigenetic measurements from cancer cells relative to healthy controls), lower values of λ would reveal finer blocks of changes, while higher values of λ will restrict our attention to 164 larger organizational blocks of changes in DNA. Second, given the continuous nature of functions we fit our data with, the only way by which a bigger block of change can arise at a higher value of λ , is by loss of one or more roots from the splines fitted at smaller λ values (that is via a reduction in the number of points where the difference function attains a value of zero). Thus by cataloging the number of roots obtained as we vary λ (and the function gradients at these points, discussed later), one obtains a quantitative picture of the special points in the λ space, where qualitative changes in the topology of the fitted splines occur. Finally, as assumptions behind the Implicit Function Theorem apply to spline solutions almost everywhere in the domain, one can go further, exploit numerical continuation theory to obtain the location of the roots smoothly as we vary λ . 9.4 Model construction for longitudinal case-control data In the two simplified problem instances described above, we constructed functions for categorical and continuous data separately. Our goal now is to exploit smoothing spline technology to construct a contrast function that describes a continuous change in outcome along one continuous coordinate t ∈ τ (e.g., DNA coordinate or time) between two discrete experimental conditions x ∈ χ = 1,2 (e.g., controls and cases). It turns out one can construct RKHS model spaces for multi-variate functions as easily as constructing them for univariate functions. The main result we exploit here is that given two marginal RKHSs and their decompositions: χ χH χ = H0 +H1 and 165 H τ = H τ0 +H τ 1 , a (tensor) product RKHS space can be constructed as : χ χ H χ⊗τ = {H0 ⊕H1 }⊗{H τ 0 ⊕H τ1 } = {︸ χH0 ︷⊗︷H τ0 }︸⊕{︸ χH ︷⊗︷H τ χ0 1 }︸⊕{︸H1 ⊗H τ0 }︷⊕︷{ χH1 ⊗H τ1 }︸ (9.11) χ⊗τ ︸ χ⊗τ ︷︷ χ⊗τ=H0 =H10 =H11 ︸ χ⊗τ =H1 where we have hierarchically decomposed our new product input space into two orthogonal RKHS [272–274]. The only remaining step needed to utilize the algorithms outlined in the previous section is to identify the RKs associated with these spaces. Con- veniently, it turns out that RKs assigned to the marginal spaces also add and multiply accordingly! Rχ⊗τ = {Rχ0 +R χ 1 }×{R τ 0 +R τ 1} ︸{Rχ= 0 ︷×︷Rτ χ τ χ τ χ0}︸+︸{R0 ︷×︷R1}︸+︸{R1 ×R0}︷+︷{R ×Rτ1 1}︸ (9.12) χ⊗τ χ⊗τ χ⊗τ =R0 ︸ =R10 ︷︷ =R11 ︸ Rχ⊗τ1 From the hierarchical decomposition of RKs and the marginal basis functions, it is clear that Hχ×τ0 is spanned by the basis {φ1(x, t) = 1,φ2(x, t) = t − 0.5} and models the grand mean and linear main effect of τ; Hχ×τ10 describes the smooth main effect due to τ , and the third, interesting for our purposes, space Hχ×τ11 models the smooth-linear interaction and smooth-smooth interactions of x and τ . Thus, it is this subspace whose contributions to the fitted function completely specify overall treatment effects that we care for in this work. Specifically, we can write, for every function f ∈ H χ×τ , and 166 denoting zi = (xi, ti) 1 n f (z) = ∑ d jφ j(z)+∑ ciRχ⊗τ1 (z,zi)+ρ(z) j=0 i=1 1 n n χ⊗τ χ⊗τ (9.13) = ∑ φ j(z)+∑ ciR10 (z,zi)+ ︸∑ ciR1︷1︷ (z,zi︸)+ρ(z)j=0 i=1 i=1 =γ(z) where ρ ∈ χ⊗τH1 −{∑ n χ⊗τ i=1 ciR1 (·,zi)}, and we have exploited the hierarchical decomposition of the RK corresponding to χ⊗τH1 . We emphasize that γ(·) is the overall effects function whose roots we are after, as a function of λ in the rest of this work. 9.4.1 Estimation and Notation With this model space construction, for every λ , we can estimate c(λ ) from the lin- ear algebraic algorithms (Algorithm 3.4.2 [273]) available for smoothing spline models, and compute the contrast function γ (z = (case, t)) as: γ (z = (case, t);λ ) = R T11 (t)c(λ ) (9.14) Here R (t) = [|R (t , t)|]n11 11 i i=1. Henceforth, we drop the superscript χ ⊗ τ and denote by c the entire estimated vector of ci s instead of a bold typeface. Furthermore, because we restrict our analysis to x = case, we will make the estimation functions’ dependence on it implicit, and simply use γ(t,λ ) and R11(t). 167 9.5 Bifurcation analysis of γ(t,λ ) with λ as the control parameter Noting that it is the roots of the contrast function γ(t,λ ) we are after, and that the function is continuously differentiable1, Implicit Function Theorem guarantees the existence of a smooth solution to the equation γ(t,λ ) = 0 in the open neighborhood of a given root (t∗,λ ∗) whenever ∂γ∂λ (t ∗,λ∗) =6 0. In fact, this is the central theory underlying numerical bifurcation analysis in dy- namical systems theory, where qualitative behavior about steady states are mapped as a function of some control parameter. We had exploited such a technique in Part II of this thesis for the analysis of a CRISPR model. For our purposes here, we developed the equivalent numerical continuation algo- rithms and implemented them in the R software language. Fold points were detected by simultaneously asking for γ(t∗) = 0 and γ̇(t∗) = 0. 9.5.1 Confidence intervals for t̂ given λ For every λ , the confidence intervals in the fitted roots can be obtained with a linearization calculation as below. For a given value of λ , let the root along τ axis be given as t̂. Expand around the true root t0, when γ̇(t̂) 6= 0: γ(t̂) = γ(t0)+ γ̇(t̂− 1 t0)+O(|t̂− t |20 ) =⇒ t̂ ≈ t0 + γ(t̂), (9.15)γ̇(t̂) 1to be precise, the function is continuously differentiable almost everywhere, given the non- differentiable nature of the RK R11(z) at data points xi, i = 1 . . .n. But we do not worry about this com- plication in this work 168 which leaves us with the following approximate variance on the estimated root: [ ] 1 2 Var(t̂) = Var(γ(t̂)) (9.16) γ̇(t̂) where Var(γ(t̂)) is easily available as the posterior variance of the fitted function γ at t̂ form the Bayesian calculations of Wahba [272] for polynomial smoothing splines. Based on that theory, a Gaussian process prior with a mean zero and a covariance function proportional to the RK R11(·, ·) can be assumed for γ . Whenγ̇(t̂) = 0, which is the case at a fold point, a second order treatment is made in the above variance calcula√tions. A 100(1− α2 )% confidence interval can then be approximately obtained as: t̂± z α Var(t̂),2 which is an interesting overlay to the classical bifurcation analysis methodology exploited for deterministic systems. 9.6 Applications 9.6.1 Metagenomic time series To illustrate the potential of the proposal above in identifying multi-resolution changes in longitudinal data, we first considered the 16s metagenomic feature data from David et al., [275]. In this work, the authors measured microbial frequencies over time as an in- dividual travelled abroad and returned back home. We performed a bifurcation analysis on the contrast function for a few dominant genera, where samples post-travel were con- sidered as "cases" and those pre-travel were considered as "controls". The results are presented in Fig. 9.1. One of the main results in from David et al., is clearly recapitulated 169 in this plot: Bacteroides and Blautia undergo major long-term changes post- travel before they settle back to the pre-travel state. Bifurcation points along the time axis indicate at which points changes in the (relative) abundances started to occur. These changes are located at roughly similar time points for several features, indicating correlated factors underlying their observed changes. As to whether they are purely due to compositional effects discussed in Part I of the manuscript or truly owing to underlying biological rea- sons, we cannot conclude from this result alone. 9.6.2 Genome-Wide DNA Methylation Signals We next applied the technique to characterize long and short-term changes in high resolution methylation signals throughout the genome in lung cancer tissue relative to healthy controls. In contrast to the metagenomic time series datasets above, which con- sist of a few hundred to a few thousand observations, we are now faced with the problem of analyzing millions of methylation intensity values averaged and recorded throughout the genome in 150 bp nucleosome sized windows [276]. This is a major computational challenge, which we currently address using the following modifications to the more ac- curate algorithm outlined in the previous section. First, we observed that changes in methylation often spanned over several thousands of base-pairs. Second, given a value of λ , finding its roots involve solving an one-dimensional root finding problem. Although non-linear, sound computational techniques exist that are very fast for this problem [277]. Finally, we exploited the faster estimation algorithm by Kim and Gu [278] (also see sec- tion 3.5.3 in [273]), which by only using a fewer q < n set of observations to describe the H1 space (instead of all of the n observations to form ∑i c q i=1R1(,xi) ), improved the speed 170 ● ● +/− ●● ● +/− ● ● ● ●● ● + ● ● ● ● ● ● ● ● ● ● ● ● ● ● + ● ● ● ● ● ● ● ● − −● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● Bacteroides Faecalibacterium● ● −14 −12 −10 −8 −6 −4 −14 −12 −10 −8 −6 −4 log(nλ) log(nλ) +/− +/− ● + ● + ●● ● − ● ●● ● ● ● ● ● ● ● ●● − ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Bifidobacterium Blautia −14 −12 −10 −8 −6 −4 −14 −12 −10 −8 −6 −4 log(nλ) log(nλ) Figure 9.1: Long term and short-term differences in microbial time series pre- and post- travel. Plotted are the roots of the contrast functions for various microbial genera γ(x,λ ) for various values of n ∗λ ; n is the sample size, and x is rescaled time. Purple and red points indicate roots where the contrast function has positive and negative gradients with respect to the longitudinal coordinate (in this case, x) respectively. For any given value of λ then, the region between a consecutive (blue, red) pair is a region of positive difference, while a region between a consecutive (red, blue) pair of points indicate regions with a negative difference in cases relative to controls. Blue squares indicate fold points. of the algorithm several folds. In summary, if one does not care about fold points, these modifications to the original algorithm prescribed above, lead to deriving plots similar to Fig. 9.1 genome-wide in non-overlapping 1 megabase pair windows in less than 3 hours, with 3 parallel processes on a Macintosh laptop. With the more accurate algorithm, this 171 x x 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 x x 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 problem would have taken more than two weeks to solve, with 16 parallel processes. After obtaining the contrast function for various values of the roughness penalty λ , we computed the lengths of differential regions/segments suggesting a negative differ- ence (hypo-methylated) and positive change (hyper-methylated) relative to controls. In Fig. 9.2, bottom panel, we plot the growth in the median lengths of the hypo- and hyper- methylated regions vary (over 8x) as a function of λ . In general, for all resolutions( spec- ified here by the value of log(nλ )), we found higher median lengths for hypo- methylated regions than in hyper- methylated regions. The distributions of length values obtained at fine- (log(nλ ) =−1) and large-scale (log(nλ ) =−14) changes are shown in the top two panels. Interestingly, we also found that across all resolutions, transcription factor bind- ing sites, as measured with ChipSeq by the ENCODE consortia, were enriched in hypo- methylated regions genome-wide in lung cancer tissues. To illustrate this, we have plotted the binding site fraction in hypo-methylated regions in Fig. 9.3. These findings were con- firmed by Fisher exact tests for >85% of the transcription factors as well. Here is one possible explanation for the aforementioned results. The significantly longer hypo-methylation blocks in lung cancer and the enrichment of transcription factor binding sites in these regions could be caused by competitive binding of over-expressed transcription factors, or other DNA binding agents, preventing stable methylation estab- lishment. A simple kinetic / stochastic process model of such binding events will indicate that this is a genuine possibility. If one has access to absolute concentration measure- ments of protein molecules/mRNA expression from genes, such a hypothesis can quickly be tested as well. As described in Part I of this thesis, relying on frequency based measure- 172 log(nλ)=−1 log(nλ)=−14 hyper hypo hyper hypo                                                                   −14 −10 −8 −6 −4 −2 −14 −10 −8 −6 −4 −2 log(nλ) log(nλ) Figure 9.2: Scale specific genome-wide differences in DNA methylation in lung cancer tissue relative to controls. Plotted are the median lengths of differentially methylated regions γ(x,λ ) estimated from lung cancer data relative to healthy controls for various resolutions (as measured by log(n∗λ )); n is the sample size. ments from RNAseq (unless resolved effectively with internal spike-in control features) need not always allow stable biologically relevant conclusions. Although scale normal- ization approaches for RNAseq can lead to effective inferences when most genes do not change in their expression values, cancer tissues exist where such an assumption is heav- ily violated [113]. 173 Log2( hypo−Methyl Lengths ) Log2(Lengths of Segments) 16 17 18 19 12 14 16 18 20 22 24 Log2( hyper−Methyl Lengths ) Log2(Lengths of Segments) 14.5 15.5 16.5 17.5 10 15 20 0.8 0.6 0.4 0.2 0.0 Figure 9.3: Enrichment of transcription factor binding sites in hypo-methylated regions. For each transcription factor whose binding sites were characterized by the ENCODE consortia, we plot the fraction of binding sites found in lung cancer’s hypo-methylated regions. Enrichment of transcription factor binding sites in hypo-methylated regions was generally the case for all resolutions as measured by log(nλ ). The red dashed line indicates a value of 0.5. 174 Fraction binding sites in hypo−methylated regions, log(nλ)=−14 ZBTB33 CEBPB CTCF TAF1 GABPA USF1 SP1 EGR1 FOXA1 RUNX3 MAZ RAD21 SMC3 MAFF MAFK BHLHE40 FOSL2 JUND E2F6 MAX POLR2A PAX5 PHF8 PML YY1 SIN3AK20 E2F1 GTF2F1 ATF2 MYC KDM5A MXI1 POU2F2 KDM5B TBP IRF1 EP300 TAF7 ELK1 RFX5 TCF7L2 CHD2 FOXP2 ATF3 BRCA1 NFYA RELA NFYB GRp20 REST JUN E2F4 SRF ELF1 CREB1 ATF1 SIX5 USF2 FOS TBL1XR1 ZNF143 SP2 EBF1 CTCFL TEAD4 THAP1 ZEB1 ZNF263 PBX3 UBTF CBX3 BCLAF1 NR2C2 RBBP5 GATA1 RCOR1 FOSL1 GATA2 TAL1 GATA3 TCF12 BCL3 NFATC1 MEF2A MEF2C CCNT2 BACH1 HDAC2 TCF3 ZNF274 STAT1 BATF SPI1 HMGN3 SETDB1 ETS1 ZBTB7A EZH2 JUNB SP4 TFAP2A TFAP2C NR2F2 ESR1 SIN3A TRIM28 HNF4G RXRA GTF3C2 SUZ12 CTBP2 NR3C1 SAP30 CHD1 KAP1 NANOG STAT5A HDAC1 ELK4 NRF1 STAT3 HNF4A FOXA2 SMARCC1 SMARCB1 ESRRA STAT2 MYBL2 NFIC SREBP1 ARID3A CEBPD IRF4 BCL11A MTA3 FOXM1 ZNF217 HSF1 HDAC8 NFE2 IRF3 WRNIP1 GTF2B HDAC6 SMARCA4 ZKSCAN1 BRF2 IKZF1 POU5F1 RPC155 PPARGC1A BDP1 SIRT6 SMARCC2 MBD4 PRDM1 FAM48A RDBP ZZZ3 POLR3G BRF1 Bibliography [1] Bruce R Levin. Nasty viruses, costly plasmids, population dynamics, and the con- ditions for establishing and maintaining CRISPR-mediated adaptive immunity in bacteria. PLoS genetics, 6(10):e1001171, October 2010. [2] Lauren M Childs, Nicole L Held, Mark J Young, Rachel J Whitaker, and Joshua S Weitz. Multiscale model of CRISPR-induced coevolutionary dynamics: diversifi- cation at the interface of Lamarck and Darwin. Evolution; international journal of organic evolution, 66(7):2015–2029, July 2012. [3] Ali Mortazavi, Brian A. Williams, Kenneth McCue, Lorian Schaeffer, and Barbara Wold. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods, 5(7):621–628, July 2008. [4] Rafael A. Irizarry, Bridget Hobbs, Francois Collin, Yasmin D. Beazer-Barclay, Kristen J. Antonellis, Uwe Scherf, and Terence P. Speed. Exploration, normaliza- tion, and summaries of high density oligonucleotide array probe level data. Bio- statistics, 4(2):249–264, 2003. [5] Joseph N. Paulson, O. Colin Stine, Héctor Corrada Bravo, and Mihai Pop. Dif- ferential abundance analysis for microbial marker-gene surveys. Nature methods, 2013. [6] Aaron T. L. Lun, Karsten Bach, and John C. Marioni. Pooling across cells to nor- malize single-cell RNA sequencing data with many zero counts. Genome Biology, 17:75, 2016. [7] Ying Yu, James C. Fuscoe, Chen Zhao, Chao Guo, Meiwen Jia, Tao Qing, Desmond I. Bannon, Lee Lancashire, Wenjun Bao, Tingting Du, Heng Luo, Zhenqiang Su, Wendell D. Jones, Carrie L. Moland, William S. Branham, Feng Qian, Baitang Ning, Yan Li, Huixiao Hong, Lei Guo, Nan Mei, Tieliu Shi, Kevin Y. Wang, Russell D. Wolfinger, Yuri Nikolsky, Stephen J. Walker, Pene- lope Duerksen-Hughes, Christopher E. Mason, Weida Tong, Jean Thierry-Mieg, Danielle Thierry-Mieg, Leming Shi, and Charles Wang. A rat RNA-Seq transcrip- tomic BodyMap across 11 organs and 4 developmental stages. Nature Communi- cations, 5:3230, February 2014. 175 [8] Shinichi Sunagawa, Luis Pedro Coelho, Samuel Chaffron, Jens Roat Kultima, Karine Labadie, Guillem Salazar, Bardya Djahanschiri, Georg Zeller, Daniel R. Mende, Adriana Alberti, Francisco M. Cornejo-Castillo, Paul I. Costea, Corinne Cruaud, Francesco d’Ovidio, Stefan Engelen, Isabel Ferrera, Josep M. Gasol, Li- onel Guidi, Falk Hildebrand, Florian Kokoszka, Cyrille Lepoivre, Gipsi Lima- Mendez, Julie Poulain, Bonnie T. Poulos, Marta Royo-Llonch, Hugo Sarmento, Sara Vieira-Silva, Céline Dimier, Marc Picheral, Sarah Searson, Stefanie Kandels- Lewis, Tara Oceans Coordinators, Chris Bowler, Colomban de Vargas, Gabriel Gorsky, Nigel Grimsley, Pascal Hingamp, Daniele Iudicone, Olivier Jaillon, Fab- rice Not, Hiroyuki Ogata, Stephane Pesant, Sabrina Speich, Lars Stemmann, Matthew B. Sullivan, Jean Weissenbach, Patrick Wincker, Eric Karsenti, Jeroen Raes, Silvia G. Acinas, and Peer Bork. Structure and function of the global ocean microbiome. Science, 348(6237):1261359, May 2015. [9] Christos Argyropoulos, Alton Etheridge, Nikita Sakhanenko, and David Galas. Modeling bias and variation in the stochastic processes of small RNA sequencing. Nucleic Acids Research, 45(11):e104, June 2017. [10] Gregor Mendel and Paul C. Mangelsdorf. Experiments in Plant Hybridisation. Harvard University Press, 1965. Google-Books-ID: pzSoD55L1W0C. [11] Ronald Aylmer Fisher. The genetical theory of natural selection: a complete vari- orum edition. Oxford University Press, 1930. [12] John Burdon Haldane. The causes of evolution. Number 36. Princeton University Press, 1932. [13] James D. Watson and Francis HC Crick. Molecular structure of nucleic acids. Nature, 171(4356):737–738, 1953. [14] Philip Hedrick. Genetics of populations. Jones & Bartlett Learning, 2011. [15] Oswald T. Avery, Colin M. MacLeod, and Maclyn McCarty. Studies on the chemi- cal nature of the substance inducing transformation of pneumococcal types: induc- tion of transformation by a desoxyribonucleic acid fraction isolated from pneumo- coccus type III. Journal of experimental medicine, 79(2):137–158, 1944. [16] François Jacob and Jacques Monod. Genetic regulatory mechanisms in the synthe- sis of proteins. Journal of molecular biology, 3(3):318–356, 1961. [17] David Baltimore Lodish, et al Harvey. Molecular Cell Biology, 4th Edition. W H Freeman & Co, fourth edition edition edition, 2002. [18] David L. Nelson, Albert L. Lehninger, and Michael M. Cox. Lehninger principles of biochemistry. Macmillan, 2008. [19] Gregory J. Hannon. RNA interference. nature, 418(6894):244, 2002. 176 [20] Lin He and Gregory J. Hannon. MicroRNAs: small RNAs with a big role in gene regulation. Nature Reviews Genetics, 5(7):522, 2004. [21] Hans V. Westerhoff and Bernhard O. Palsson. The evolution of molecular biology into systems biology. Nature biotechnology, 22(10):1249, 2004. [22] Uri Alon. An introduction to systems biology: design principles of biological cir- cuits. Chapman and Hall/CRC, 2006. [23] Peter A. Jones and Peter W. Laird. Cancer-epigenetics comes of age. Nature genetics, 21(2):163, 1999. [24] Andrew P. Feinberg and Benjamin Tycko. The history of cancer epigenetics. Na- ture Reviews Cancer, 4(2):143, 2004. [25] Robin Holliday. Epigenetics: a historical overview. Epigenetics, 1(2):76–80, 2006. [26] Andrew P. Feinberg, Rolf Ohlsson, and Steven Henikoff. The epigenetic progenitor origin of human cancer. Nature reviews genetics, 7(1):21, 2006. [27] Adrian Bird. Perceptions of epigenetics. Nature, 447(7143):396, 2007. [28] Charles Darwin. On the origin of species. Routledge, 1859. [29] Theodosius Dobzhansky and Theodosius Grigorievich Dobzhansky. Genetics and the Origin of Species, volume 11. Columbia university press, 1982. [30] Dan Graur and Wen-Hsiung Li Li. Fundamentals of Molecular Evolution. Sinauer Associates is an imprint of Oxford University Press, Sunderland, Mass, 2 edition edition, January 2000. [31] Andrew P. Feinberg and Rafael A. Irizarry. Stochastic epigenetic variation as a driving force of development, evolutionary adaptation, and disease. Proceedings of the National Academy of Sciences, 107(suppl 1):1757–1764, 2010. [32] David Botstein and Neil Risch. Discovering genotypes underlying human pheno- types: past successes for mendelian disease, future approaches for complex dis- ease. Nature Genetics, 33(3s):228–237, March 2003. [33] Pedro D’Orléans-Juste, Jean-Claude Honoré, Emilie Carrier, and Julie Labonté. Cardiovascular diseases: new insights from knockout mice. Current Opinion in Pharmacology, 3(2):181–185, April 2003. [34] The Comprehensive Knockout Mouse Project Consortium, Christopher P. Austin, James F. Battey, Allan Bradley, Maja Bucan, Mario Capecchi, Francis S. Collins, William F. Dove, Geoffrey Duyk, Susan Dymecki, Janan T. Eppig, Franziska B. Grieder, Nathaniel Heintz, Geoff Hicks, Thomas R. Insel, Alexandra Joyner, Bev- erly H. Koller, K. C. Kent Lloyd, Terry Magnuson, Mark W. Moore, Andras Nagy, Jonathan D. Pollock, Allen D. Roses, Arthur T. Sands, Brian Seed, William C. 177 Skarnes, Jay Snoddy, Philippe Soriano, David J. Stewart, Francis Stewart, Bruce Stillman, Harold Varmus, Lyuba Varticovski, Inder M. Verma, Thomas F. Vogt, Harald von Melchner, Jan Witkowski, Richard P. Woychik, Wolfgang Wurst, George D. Yancopoulos, Stephen G. Young, and Brian Zambrowicz. The Knockout Mouse Project. Nature Genetics, 36:921–924, September 2004. [35] Florence Vignols, Claire Bréhélin, Yolande Surdin-Kerjan, Dominique Thomas, and Yves Meyer. A yeast two-hybrid knockout strain to explore thioredoxin- interacting proteins in vivo. Proceedings of the National Academy of Sciences, 102(46):16729–16734, 2005. [36] Tomoya Baba, Takeshi Ara, Miki Hasegawa, Yuki Takai, Yoshiko Okumura, Miki Baba, Kirill A. Datsenko, Masaru Tomita, Barry L. Wanner, and Hirotada Mori. Construction of Escherichia coli K-12 in-frame, single-gene knockout mutants: the Keio collection. Molecular systems biology, 2(1), 2006. [37] Juraj Gregan, Peter K. Rabitsch, Cornelia Rumpf, Maria Novatchkova, Alexander Schleiffer, and Kim Nasmyth. High-throughput knockout screen in fission yeast. Nature protocols, 1(5):2457, 2006. [38] D. W. Threadgill, A. A. Dlugosz, L. A. Hansen, T. Tennenbaum, U. Lichti, D. Yee, C. LaMantia, T. Mourton, K. Herrup, and R. C. Harris. Targeted disruption of mouse EGF receptor: effect of genetic background on mutant phenotype. Science (New York, N.Y.), 269(5221):230–234, July 1995. [39] L. J. Kurihara, T. Kikuchi, K. Wada, and S. M. Tilghman. Loss of Uch-L1 and Uch-L3 leads to neurodegeneration, posterior paralysis and dysphagia. Human Molecular Genetics, 10(18):1963–1970, September 2001. [40] Marek Drab, Paul Verkade, Marlies Elger, Michael Kasper, Matthias Lohn, Bir- git Lauterbach, Jan Menne, Carsten Lindschau, Fanny Mende, Friedrich C. Luft, Andreas Schedl, Hermann Haller, and Teymuras V. Kurzchalia. Loss of Caveo- lae, Vascular Dysfunction, and Pulmonary Defects in Caveolin-1 Gene-Disrupted Mice. Science, 293(5539):2449–2452, September 2001. [41] F. Sanger, G. M. Air, B. G. Barrell, N. L. Brown, A. R. Coulson, C. A. Fiddes, C. A. Hutchison, P. M. Slocombe, and M. Smith. Nucleotide sequence of bacteriophage phi X174 DNA. Nature, 265(5596):687–695, February 1977. [42] F. Sanger. Sequences, sequences, and sequences. Annual Review of Biochemistry, 57:1–28, 1988. [43] T. Hunkapiller, R. J. Kaiser, B. F. Koop, and L. Hood. Large-scale and automated DNA sequence determination. Science (New York, N.Y.), 254(5028):59–67, Octo- ber 1991. [44] Clyde A. Hutchison. DNA sequencing: bench to bedside and beyond. Nucleic Acids Research, 35(18):6227–6237, 2007. 178 [45] Jay Shendure and Hanlee Ji. Next-generation DNA sequencing. Nature Biotech- nology, 26(10):1135–1145, October 2008. [46] Timothy D. Harris, Phillip R. Buzby, Hazen Babcock, Eric Beer, Jayson Bowers, Ido Braslavsky, Marie Causey, Jennifer Colonell, James Dimeo, J. William Efcav- itch, Eldar Giladi, Jaime Gill, John Healy, Mirna Jarosz, Dan Lapen, Keith Moul- ton, Stephen R. Quake, Kathleen Steinmann, Edward Thayer, Anastasia Tyurina, Rebecca Ward, Howard Weiss, and Zheng Xie. Single-molecule DNA sequencing of a viral genome. Science (New York, N.Y.), 320(5872):106–109, April 2008. [47] Eric S. Lander and Michael S. Waterman. Genomic mapping by fingerprinting random clones: A mathematical analysis. Genomics, 2(3):231–239, April 1988. [48] M. F. Bonaldo, G. Lennon, and M. B. Soares. Normalization and subtraction: two approaches to facilitate gene discovery. Genome Research, 6(9):791–806, Septem- ber 1996. [49] Joshua S. Bloom, Zia Khan, Leonid Kruglyak, Mona Singh, and Amy A. Caudy. Measuring differential gene expression by short read sequencing: quantitative comparison to 2-channel gene expression microarrays. BMC Genomics, 10(1):221, May 2009. [50] Alicia Oshlack and Matthew J. Wakefield. Transcript length bias in RNA-seq data confounds systems biology. Biology direct, 4(1):14, 2009. [51] Matthew D. Young, Matthew J. Wakefield, Gordon K. Smyth, and Alicia Oshlack. Gene ontology analysis for RNA-seq: accounting for selection bias. Genome biol- ogy, 11(2):R14, 2010. [52] Mark D. Robinson and Alicia Oshlack. A scaling normalization method for differ- ential expression analysis of RNA-seq data. Genome Biology, 11(3):R25, March 2010. [53] Alicia Oshlack, Mark D. Robinson, and Matthew D. Young. From RNA-seq reads to differential expression results. Genome Biology, 11(12):220, December 2010. [54] Lior Pachter. Models for transcript quantification from RNA-Seq. arXiv:1104.3889 [q-bio, stat], April 2011. arXiv: 1104.3889. [55] Davide Risso, Katja Schwartz, Gavin Sherlock, and Sandrine Dudoit. GC-Content Normalization for RNA-Seq Data. BMC Bioinformatics, 12(1):480, December 2011. [56] Simon T. Bennett, Colin Barnes, Anthony Cox, Lisa Davies, and Clive Brown. Toward the 1,000 dollars human genome. Pharmacogenomics, 6(4):373–382, June 2005. 179 [57] Marcel Margulies, Michael Egholm, William E. Altman, Said Attiya, Joel S. Bader, Lisa A. Bemben, Jan Berka, Michael S. Braverman, Yi-Ju Chen, Zhoutao Chen, Scott B. Dewell, Lei Du, Joseph M. Fierro, Xavier V. Gomes, Brian C. God- win, Wen He, Scott Helgesen, Chun Heen Ho, Chun He Ho, Gerard P. Irzyk, Szilveszter C. Jando, Maria L. I. Alenquer, Thomas P. Jarvie, Kshama B. Jirage, Jong-Bum Kim, James R. Knight, Janna R. Lanza, John H. Leamon, Steven M. Lefkowitz, Ming Lei, Jing Li, Kenton L. Lohman, Hong Lu, Vinod B. Makhijani, Keith E. McDade, Michael P. McKenna, Eugene W. Myers, Elizabeth Nickerson, John R. Nobile, Ramona Plant, Bernard P. Puc, Michael T. Ronan, George T. Roth, Gary J. Sarkis, Jan Fredrik Simons, John W. Simpson, Maithreyan Srinivasan, Kar- rie R. Tartaro, Alexander Tomasz, Kari A. Vogt, Greg A. Volkmer, Shally H. Wang, Yong Wang, Michael P. Weiner, Pengguang Yu, Richard F. Begley, and Jonathan M. Rothberg. Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437(7057):376–380, September 2005. [58] Tarjei S. Mikkelsen, Manching Ku, David B. Jaffe, Biju Issac, Erez Lieber- man, Georgia Giannoukos, Pablo Alvarez, William Brockman, Tae-Kyung Kim, Richard P. Koche, William Lee, Eric Mendenhall, Aisling O’Donovan, Aviva Presser, Carsten Russ, Xiaohui Xie, Alexander Meissner, Marius Wernig, Rudolf Jaenisch, Chad Nusbaum, Eric S. Lander, and Bradley E. Bernstein. Genome- wide maps of chromatin state in pluripotent and lineage-committed cells. Nature, 448(7153):553–560, August 2007. [59] Ugrappa Nagalakshmi, Zhong Wang, Karl Waern, Chong Shou, Debasish Raha, Mark Gerstein, and Michael Snyder. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science (New York, N.Y.), 320(5881):1344– 1349, June 2008. [60] John C. Marioni, Christopher E. Mason, Shrikant M. Mane, Matthew Stephens, and Yoav Gilad. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome research, 2008. [61] Zhong Wang, Mark Gerstein, and Michael Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics, 10(1):57–63, January 2009. [62] Saiful Islam, Amit Zeisel, Simon Joost, Gioele La Manno, Pawel Zajac, Maria Kasper, Peter Lönnerberg, and Sten Linnarsson. Quantitative single-cell RNA-seq with unique molecular identifiers. Nature Methods, 11(2):163–166, February 2014. [63] Peter V. Kharchenko, Michael Y. Tolstorukov, and Peter J. Park. Design and anal- ysis of ChIP-seq experiments for DNA-binding proteins. Nature biotechnology, 26(12):1351, 2008. [64] Anton Valouev, David S. Johnson, Andreas Sundquist, Catherine Medina, Eliza- beth Anton, Serafim Batzoglou, Richard M. Myers, and Arend Sidow. Genome- wide analysis of transcription factor binding sites based on ChIP-Seq data. Nature methods, 5(9):829, 2008. 180 [65] Peter J. Park. ChIP–seq: advantages and challenges of a maturing technology. Nature Reviews Genetics, 10(10):669–680, October 2009. [66] Axel Visel, Matthew J. Blow, Zirong Li, Tao Zhang, Jennifer A. Akiyama, Amy Holt, Ingrid Plajzer-Frick, Malak Shoukry, Crystal Wright, and Feng Chen. ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature, 457(7231):854, 2009. [67] Terrence S. Furey. ChIP–seq and beyond: new and improved methodologies to detect and characterize protein–DNA interactions. Nature Reviews Genetics, 13(12):840, 2012. [68] Alexander Meissner, Andreas Gnirke, George W. Bell, Bernard Ramsahoye, Eric S. Lander, and Rudolf Jaenisch. Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic acids research, 33(18):5868–5877, 2005. [69] Aaron L. Statham, Mark D. Robinson, Jenny Z. Song, Marcel W. Coolen, Clare Stirzaker, and Susan J. Clark. Bisulfite sequencing of chromatin immunoprecipi- tated DNA (BisChIP-seq) directly informs methylation status of histone-modified DNA. Genome research, 2012. [70] Sébastien A. Smallwood, Heather J. Lee, Christof Angermueller, Felix Krueger, Heba Saadeh, Julian Peat, Simon R. Andrews, Oliver Stegle, Wolf Reik, and Gavin Kelsey. Single-cell genome-wide bisulfite sequencing for assessing epigenetic het- erogeneity. Nature methods, 11(8):817, 2014. [71] ENCODE Project Consortium. The ENCODE (ENCyclopedia of DNA elements) project. Science, 306(5696):636–640, 2004. [72] ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447(7146):799, 2007. [73] Cancer Genome Atlas Network. Comprehensive molecular characterization of hu- man colon and rectal cancer. Nature, 487(7407):330, 2012. [74] Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature, 490(7418):61, 2012. [75] Cancer Genome Atlas Research Network. Comprehensive genomic characteriza- tion of squamous cell lung cancers. Nature, 489(7417):519, 2012. [76] Rehan Akbani, Patrick Kwok Shing Ng, Henrica MJ Werner, Maria Shahmorad- goli, Fan Zhang, Zhenlin Ju, Wenbin Liu, Ji-Yeon Yang, Kosuke Yoshihara, and Jun Li. A pan-cancer proteomic perspective on The Cancer Genome Atlas. Nature communications, 5:3887, 2014. 181 [77] John Lonsdale, Jeffrey Thomas, Mike Salvatore, Rebecca Phillips, Edmund Lo, Saboor Shad, Richard Hasz, Gary Walters, Fernando Garcia, and Nancy Young. The genotype-tissue expression (GTEx) project. Nature genetics, 45(6):580, 2013. [78] Marta Melé, Pedro G. Ferreira, Ferran Reverter, David S. DeLuca, Jean Monlong, Michael Sammeth, Taylor R. Young, Jakob M. Goldmann, Dmitri D. Pervouchine, and Timothy J. Sullivan. The human transcriptome across tissues and individuals. Science, 348(6235):660–665, 2015. [79] Latarsha J. Carithers, Kristin Ardlie, Mary Barcus, Philip A. Branton, Angela Britton, Stephen A. Buia, Carolyn C. Compton, David S. DeLuca, Joanne Peter- Demchok, and Ellen T. Gelfand. A novel approach to high-quality postmortem tis- sue procurement: the GTEx project. Biopreservation and biobanking, 13(5):311– 319, 2015. [80] Greg Gibson. GTEx detects genetic effects. Science, 348(6235):640–641, 2015. [81] GTEx Consortium. The Genotype-Tissue Expression (GTEx) pilot analysis: mul- titissue gene regulation in humans. Science, 348(6235):648–660, 2015. [82] GTEx Consortium. Genetic effects on gene expression across human tissues. Na- ture, 550(7675):204, 2017. [83] Peter J. Turnbaugh, Ruth E. Ley, Micah Hamady, Claire M. Fraser-Liggett, Rob Knight, and Jeffrey I. Gordon. The human microbiome project. Nature, 449(7164):804, 2007. [84] Jumpstart Consortium Human Microbiome Project Data Generation Working Group. Evaluation of 16s rDNA-based community profiling for human micro- biome research. PloS one, 7(6):e39315, 2012. [85] Curtis Huttenhower, Dirk Gevers, Rob Knight, Sahar Abubucker, Jonathan H. Bad- ger, Asif T. Chinwalla, Heather H. Creasy, Ashlee M. Earl, Michael G. FitzGerald, and Robert S. Fulton. Structure, function and diversity of the healthy human mi- crobiome. Nature, 486(7402):207, 2012. [86] Barbara A. Methé, Karen E. Nelson, Mihai Pop, Heather H. Creasy, Michelle G. Giglio, Curtis Huttenhower, Dirk Gevers, Joseph F. Petrosino, Sahar Abubucker, and Jonathan H. Badger. A framework for human microbiome research. Nature, 486(7402):215, 2012. [87] Manimozhiyan Arumugam, Jeroen Raes, Eric Pelletier, Denis Le Paslier, Takuji Yamada, Daniel R. Mende, Gabriel R. Fernandes, Julien Tap, Thomas Bruls, and Jean-Michel Batto. Enterotypes of the human gut microbiome. nature, 473(7346):174, 2011. 182 [88] Junjie Qin, Ruiqiang Li, Jeroen Raes, Manimozhiyan Arumugam, Kristof- fer Solvsten Burgdorf, Chaysavanh Manichanh, Trine Nielsen, Nicolas Pons, Flo- rence Levenez, and Takuji Yamada. A human gut microbial gene catalogue estab- lished by metagenomic sequencing. nature, 464(7285):59, 2010. [89] S. Dusko Ehrlich and MetaHIT Consortium. MetaHIT: The European Union Project on metagenomics of the human intestinal tract. In Metagenomics of the human body, pages 307–316. Springer, 2011. [90] Michael Balter. Taking stock of the human microbiome and disease. American Association for the Advancement of Science, 2012. [91] Andrew B. Shreiner, John Y. Kao, and Vincent B. Young. The gut microbiome in health and in disease. Current opinion in gastroenterology, 31(1):69, 2015. [92] Daniel L. Hartl, Andrew G. Clark, and Andrew G. Clark. Principles of population genetics, volume 116. Sinauer associates Sunderland, 1997. [93] Michael Lynch and Bruce Walsh. Genetics and analysis of quantitative traits, volume 1. Sinauer Sunderland, MA, 1998. [94] Doris Vandeputte, Gunter Kathagen, Kevin D’hoe, Sara Vieira-Silva, Mireia Valles-Colomer, João Sabino, Jun Wang, Raul Y. Tito, Lindsey De Commer, Youssef Darzi, Séverine Vermeire, Gwen Falony, and Jeroen Raes. Quantitative microbiome profiling links gut community variation to microbial load. Nature, 551(7681):507–511, 2017. [95] Mukund Thattai and Alexander Van Oudenaarden. Intrinsic noise in gene regula- tory networks. Proceedings of the National Academy of Sciences, 98(15):8614– 8619, 2001. [96] Michael B. Elowitz, Arnold J. Levine, Eric D. Siggia, and Peter S. Swain. Stochas- tic gene expression in a single cell. Science, 297(5584):1183–1186, 2002. [97] Peter S. Swain, Michael B. Elowitz, and Eric D. Siggia. Intrinsic and extrinsic con- tributions to stochasticity in gene expression. Proceedings of the National Academy of Sciences, 99(20):12795–12800, 2002. [98] Jonathan M. Raser and Erin K. O’shea. Noise in gene expression: origins, conse- quences, and control. Science, 309(5743):2010–2013, 2005. [99] Mukund Thattai. Universal Poisson Statistics of mRNAs with Complex Decay Pathways. Biophysical Journal, 110(2):301–305, January 2016. [100] Joshua R. Stokell, Raad Z. Gharaibeh, Timothy J. Hamp, Malcolm J. Zapata, An- thony A. Fodor, and Todd R. Steck. Analysis of Changes in Diversity and Abun- dance of the Microbial Community in a Cystic Fibrosis Patient over a Multiyear Period. Journal of Clinical Microbiology, 53(1):237–247, January 2015. 183 [101] George A. O’Toole. Cystic Fibrosis Airway Microbiome: Overturning the Old, Opening the Way for the New. Journal of Bacteriology, 200(4):e00561–17, Febru- ary 2018. [102] SunHee Hong, John Bunge, Chesley Leslin, Sunok Jeon, and Slava S. Epstein. Polymerase chain reaction primers miss half of rRNA microbial diversity. The ISME Journal, 3(12):1365, 2009. [103] Jeffrey T. Leek, Robert B. Scharpf, Héctor Corrada Bravo, David Simcha, Ben- jamin Langmead, W. Evan Johnson, Donald Geman, Keith Baggerly, and Rafael A. Irizarry. Tackling the widespread and critical impact of batch effects in high- throughput data. Nature Reviews Genetics, 11(10):733, 2010. [104] Kasper D. Hansen, Steven E. Brenner, and Sandrine Dudoit. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic acids re- search, 38(12):e131–e131, 2010. [105] Yuval Benjamini and Terence P. Speed. Summarizing and correcting the GC con- tent bias in high-throughput sequencing. Nucleic acids research, 40(10):e72–e72, 2012. [106] Nicholas F. Lahens, Ibrahim Halil Kavakli, Ray Zhang, Katharina Hayer, Michael B. Black, Hannah Dueck, Angel Pizarro, Junhyong Kim, Rafael Irizarry, Russell S. Thomas, Gregory R. Grant, and John B. Hogenesch. IVT-seq reveals extreme bias in RNA sequencing. Genome Biology, 15(6):R86, June 2014. [107] J. Paul Brooks, David J. Edwards, Michael D. Harwich, Maria C. Rivera, Jen- nifer M. Fettweis, Myrna G. Serrano, Robert A. Reris, Nihar U. Sheth, Bernice Huang, Philippe Girerd, Vaginal Microbiome Consortium, Jerome F. Strauss, Kim- berly K. Jefferson, and Gregory A. Buck. The truth about metagenomics: quantify- ing and counteracting bias in 16s rRNA studies. BMC microbiology, 15:66, March 2015. [108] Paul I. Costea, Georg Zeller, Shinichi Sunagawa, Eric Pelletier, Adriana Alberti, Florence Levenez, Melanie Tramontano, Marja Driessen, Rajna Hercog, Ferris- Elias Jung, Jens Roat Kultima, Matthew R. Hayward, Luis Pedro Coelho, Emma Allen-Vercoe, Laurie Bertrand, Michael Blaut, Jillian R. M. Brown, Thomas Car- ton, Stéphanie Cools-Portier, Michelle Daigneault, Muriel Derrien, Anne Druesne, Willem M. de Vos, B. Brett Finlay, Harry J. Flint, Francisco Guarner, Masahira Hattori, Hans Heilig, Ruth Ann Luna, Johan van Hylckama Vlieg, Jana Junick, Ingeborg Klymiuk, Philippe Langella, Emmanuelle Le Chatelier, Volker Mai, Chaysavanh Manichanh, Jennifer C. Martin, Clémentine Mery, Hidetoshi Morita, Paul W. O’Toole, Céline Orvain, Kiran Raosaheb Patil, John Penders, Søren Pers- son, Nicolas Pons, Milena Popova, Anne Salonen, Delphine Saulnier, Karen P. Scott, Bhagirath Singh, Kathleen Slezak, Patrick Veiga, James Versalovic, Lip- ing Zhao, Erwin G. Zoetendal, S. Dusko Ehrlich, Joel Dore, and Peer Bork. To- wards standards for human fecal sample processing in metagenomic studies. Na- ture Biotechnology, 35(11):1069–1076, November 2017. 184 [109] Nathan D. Olson and Jayne B. Morrow. DNA extract characterization process for microbial detection methods development and validation. BMC research notes, 5:668, December 2012. [110] Hui Jiang and Wing Hung Wong. Statistical inferences for isoform expression in RNA-Seq. Bioinformatics, 25(8):1026–1032, April 2009. [111] Lichun Jiang, Felix Schlesinger, Carrie A. Davis, Yu Zhang, Renhua Li, Marc Salit, Thomas R. Gingeras, and Brian Oliver. Synthetic spike-in standards for RNA-seq experiments. Genome research, 21(9):1543–1551, 2011. [112] Philip Brennecke, Simon Anders, Jong Kyoung Kim, Aleksandra A. Kołodziejczyk, Xiuwei Zhang, Valentina Proserpio, Bianka Baying, Vladimir Benes, Sarah A. Teichmann, John C. Marioni, and Marcus G. Heisler. Account- ing for technical noise in single-cell RNA-seq experiments. Nature Methods, 10(11):1093–1095, November 2013. [113] Jakob Lovén, David A. Orlando, Alla A. Sigova, Charles Y. Lin, Peter B. Rahl, Christopher B. Burge, David L. Levens, Tong Ihn Lee, and Richard A. Young. Revisiting global gene expression analysis. Cell, 151(3):476–482, October 2012. [114] Oliver Stegle, Sarah A. Teichmann, and John C. Marioni. Computational and analytical challenges in single-cell transcriptomics. Nature Reviews Genetics, 16(3):133–145, March 2015. [115] James H. Bullard, Elizabeth Purdom, Kasper D. Hansen, and Sandrine Dudoit. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC bioinformatics, 11:94, 2010. [116] Simon Anders and Wolfgang Huber. Differential expression analysis for sequence count data. Genome Biology, 11(10):R106, 2010. [117] John Aitchison. The statistical analysis of compositional data. Journal of the Royal Statistical Society. Series B (Methodological), pages 139–177, 1982. [118] Jonathan Friedman and Eric J. Alm. Inferring Correlation Networks from Genomic Survey Data. PLoS Comput Biol, 8(9):e1002687, September 2012. [119] Karoline Faust and Jeroen Raes. Microbial interactions: from networks to models. Nature Reviews Microbiology, 10(8):538–550, August 2012. [120] Andrew D. Fernandes, Jennifer NS Reid, Jean M. Macklaim, Thomas A. McMur- rough, David R. Edgell, and Gregory B. Gloor. Unifying the analysis of high- throughput sequencing datasets: characterizing RNA-seq, 16s rRNA gene sequenc- ing and selective growth experiments by compositional data analysis. Microbiome, 2:15, 2014. 185 [121] Huaying Fang, Chengcheng Huang, Hongyu Zhao, and Minghua Deng. CCLasso: correlation inference for compositional data through Lasso. Bioinformatics, page btv349, June 2015. [122] David Lovell, Vera Pawlowsky-Glahn, Juan José Egozcue, Samuel Marguerat, and Jürg Bähler. Proportionality: A Valid Alternative to Correlation for Relative Data. PLOS Comput Biol, 11(3):e1004075, March 2015. [123] Kaifu Chen, Zheng Hu, Zheng Xia, Dongyu Zhao, Wei Li, and Jessica K. Tyler. The overlooked fact: fundamental need of spike-in controls for virtually all genome-wide analyses. Molecular and Cellular Biology, pages MCB.00970–14, December 2015. [124] Mark D. Robinson, Davis J. McCarthy, and Gordon K. Smyth. edgeR: a Biocon- ductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139–140, January 2010. [125] Robert Schmieder and Robert Edwards. Fast Identification and Removal of Se- quence Contamination from Genomic and Metagenomic Datasets. PLOS ONE, 6(3):e17288, March 2011. [126] Susannah J. Salter, Michael J. Cox, Elena M. Turek, Szymon T. Calus, William O. Cookson, Miriam F. Moffatt, Paul Turner, Julian Parkhill, Nicholas J. Loman, and Alan W. Walker. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biology, 12:87, 2014. [127] Christopher L. Hemme, Qichao Tu, Zhou Shi, Yujia Qin, Weimin Gao, Ye Deng, Joy D. Van Nostrand, Liyou Wu, Zhili He, Patrick S. G. Chain, Susannah G. Tringe, Matthew W. Fields, Edward M. Rubin, James M. Tiedje, Terry C. Hazen, Adam P. Arkin, and Jizhong Zhou. Comparative metagenomics reveals impact of contami- nants on groundwater microbiomes. Frontiers in Microbiology, 6, October 2015. [128] Carl R. Woese, GEORGE E. Fox, Lawrence Zablen, Tsuneko Uchida, Linda Bo- nen, Kenneth Pechman, Bobby J. Lewis, and David Stahl. Conservation of primary structure in 16s ribosomal RNA. Nature, 254(5495):83, 1975. [129] Carl R. Woese and George E. Fox. Phylogenetic structure of the prokaryotic do- main: the primary kingdoms. Proceedings of the National Academy of Sciences, 74(11):5088–5090, 1977. [130] George E. Fox, Linda J. Magrum, William E. Balch, Ralph S. Wolfe, and Carl R. Woese. Classification of methanogenic bacteria by 16s ribosomal RNA charac- terization. Proceedings of the National Academy of Sciences, 74(10):4537–4541, 1977. [131] George E. Fox, Kenneth R. Pechman, and Carl R. Woese. Comparative cataloging of 16s ribosomal ribonucleic acid: molecular approach to procaryotic systematics. International Journal of Systematic and Evolutionary Microbiology, 27(1):44–57, 1977. 186 [132] Carl R. Woese. Bacterial evolution. Microbiological reviews, 51(2):221, 1987. [133] George E. Fox, Jeffrey D. Wisotzkey, and Peter Jurtshuk JR. How close is close: 16s rRNA sequence identity may not be sufficient to guarantee species identity. International Journal of Systematic and Evolutionary Microbiology, 42(1):166– 170, 1992. [134] Norman R. Pace, Jan Sapp, and Nigel Goldenfeld. Phylogeny and beyond: Scien- tific, historical, and conceptual significance of the first tree of life. Proceedings of the National Academy of Sciences of the United States of America, 109(4):1011– 1018, January 2012. [135] John C. Wooley, Adam Godzik, and Iddo Friedberg. A Primer on Metagenomics. PLOS Comput Biol, 6(2):e1000667, February 2010. [136] Philip Hugenholtz and Gene W. Tyson. Microbiology: metagenomics. Nature, 455(7212):481, 2008. [137] Morgan GI Langille, Jesse Zaneveld, J. Gregory Caporaso, Daniel McDonald, Dan Knights, Joshua A. Reyes, Jose C. Clemente, Deron E. Burkepile, Rebecca L. Vega Thurber, and Rob Knight. Predictive functional profiling of microbial communities using 16s rRNA marker gene sequences. Nature biotechnology, 31(9):814, 2013. [138] Susannah Green Tringe and Edward M. Rubin. Metagenomics: DNA sequencing of environmental samples. Nature Reviews Genetics, 6(11):805–814, November 2005. [139] Mihai Pop, Alan W Walker, Joseph Paulson, Brianna Lindsay, Martin Antonio, M Anowar Hossain, Joseph Oundo, Boubou Tamboura, Volker Mai, Irina Astro- vskaya, Hector Corrada Bravo, Richard Rance, Mark Stares, Myron M Levine, Sandra Panchalingam, Karen Kotloff, Usman N Ikumapayi, Chinelo Ebruke, Mitchell Adeyemi, Dilruba Ahmed, Firoz Ahmed, Meer Taifur Alam, Ruhul Amin, Sabbir Siddiqui, John B Ochieng, Emmanuel Ouma, Jane Juma, Euince Mailu, Richard Omore, J Glenn Morris, Robert F Breiman, Debasish Saha, Julian Parkhill, James P Nataro, and O Colin Stine. Diarrhea in young children from low-income countries leads to large-scale alterations in intestinal microbiota com- position. Genome Biology, 15(6):R76, 2014. [140] Zachary D. Kurtz, Christian L. Müller, Emily R. Miraldi, Dan R. Littman, Martin J. Blaser, and Richard A. Bonneau. Sparse and Compositionally Robust Inference of Microbial Ecological Networks. PLOS Comput Biol, 11(5):e1004226, May 2015. [141] Matthew C. B. Tsilimigras and Anthony A. Fodor. Compositional data analysis of the microbiome: fundamentals, tools, and challenges. Annals of Epidemiology, 26(5):330–335, May 2016. [142] The Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature, 486(7402):207–214, June 2012. 187 [143] Gordon K. Smyth. Linear models and empirical bayes methods for assessing dif- ferential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology, 3:Article3, 2004. [144] Michael I. Love, Wolfgang Huber, and Simon Anders. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12):550, 2014. [145] M. Senthil Kumar, Eric V. Slud, Kwame Okrah, Stephanie C. Hicks, Sridhar Han- nenhalli, and Héctor Corrada Bravo. Analysis and correction of compositional bias in sparse sequencing count data. BMC genomics, 19(1):799, November 2018. [146] Stilianos Louca, Laura Wegener Parfrey, and Michael Doebeli. Decoupling func- tion and taxonomy in the global ocean microbiome. Science, 353(6305):1272– 1277, September 2016. [147] David M. Karl, Lucas Beversdorf, Karin M. Björkman, Matthew J. Church, Asun- cion Martinez, and Edward F. Delong. Aerobic production of methane in the sea. Nature Geoscience, 1(7):473, July 2008. [148] Sara Borin, Lorenzo Brusetti, Francesca Mapelli, Giuseppe D’Auria, Tullio Brusa, Massimo Marzorati, Aurora Rizzi, Michail Yakimov, Danielle Marty, Gert J. De Lange, Paul Van der Wielen, Henk Bolhuis, Terry J. McGenity, Paraskevi N. Poly- menakou, Elisa Malinverno, Laura Giuliano, Cesare Corselli, and Daniele Daffon- chio. Sulfur cycling and methanogenesis primarily drive microbial colonization of the highly sulfidic Urania deep hypersaline basin. Proceedings of the National Academy of Sciences, 106(23):9151–9156, June 2009. [149] Beth N. Orcutt, Jason B. Sylvan, Nina J. Knab, and Katrina J. Edwards. Microbial Ecology of the Dark Ocean above, at, and below the Seafloor. Microbiology and Molecular Biology Reviews, 75(2):361–422, June 2011. [150] Edward F. DeLong, Christina M. Preston, Tracy Mincer, Virginia Rich, Steven J. Hallam, Niels-Ulrik Frigaard, Asuncion Martinez, Matthew B. Sullivan, Robert Edwards, Beltran Rodriguez Brito, Sallie W. Chisholm, and David M. Karl. Com- munity Genomics Among Stratified Microbial Assemblages in the Ocean’s Inte- rior. Science, 311(5760):496–503, January 2006. [151] Brandon K. Swan, Manuel Martinez-Garcia, Christina M. Preston, Alexander Sczyrba, Tanja Woyke, Dominique Lamy, Thomas Reinthaler, Nicole J. Poul- ton, E. Dashiell P. Masland, Monica Lluesma Gomez, Michael E. Sieracki, Ed- ward F. DeLong, Gerhard J. Herndl, and Ramunas Stepanauskas. Potential for Chemolithoautotrophy Among Ubiquitous Bacteria Lineages in the Dark Ocean. Science, 333(6047):1296–1300, September 2011. [152] Peter J. Turnbaugh, Vanessa K. Ridaura, Jeremiah J. Faith, Federico E. Rey, Rob Knight, and Jeffrey I. Gordon. The effect of diet on the human gut microbiome: 188 a metagenomic analysis in humanized gnotobiotic mice. Science Translational Medicine, 1(6):6ra14, November 2009. [153] Stephanie C. Hicks, Kwame Okrah, Joseph N. Paulson, John Quackenbush, Rafael A. Irizarry, and Hector Corrada Bravo. Smooth Quantile Normalization. bioRxiv, page 085175, November 2016. [154] Patricio S. La Rosa, J. Paul Brooks, Elena Deych, Edward L. Boone, David J. Ed- wards, Qin Wang, Erica Sodergren, George Weinstock, and William D. Shannon. Hypothesis Testing and Power Calculations for Taxonomic-Based Human Micro- biome Data. PLOS ONE, 7(12):e52078, December 2012. [155] Editorial. Toll Bridges. Nature Immunology, 5(10):969, October 2004. [156] Paul Rossiter, Elizabeth S. Williams, Linda Munson, and Seamus Kennedy. Mor- billiviral diseases. Infectious diseases of wild mammals, pages 37–76, 2001. [157] Yusuke Yanagi, Makoto Takeda, and Shinji Ohno. Measles virus: cellular recep- tors, tropism and pathogenesis. Journal of General Virology, 87(10):2767–2779, 2006. [158] Stephen E. Straus, Jeffrey M. Ostrove, Genevieve InchauspÉ, James M. Felser, Alison Freifeld, Kenneth D. Croen, and Mark H. Sawyer. Varicella-zoster virus infections: biology, natural history, treatment, and prevention. Annals of internal medicine, 108(2):221–237, 1988. [159] Kasper Hoebe, Edith Janssen, and Bruce Beutler. The interface between innate and adaptive immunity. Nature Immunology, 5:971–974, October 2004. [160] Akiko Iwasaki and Ruslan Medzhitov. Regulation of Adaptive Immunity by the Innate Immune System. Science, 327(5963):291–295, January 2010. [161] Ruud Jansen, Jan D. A. van Embden, Wim Gaastra, and Leo M. Schouls. Identi- fication of genes that are associated with DNA repeats in prokaryotes. Molecular Microbiology, 43(6):1565–1575, March 2002. [162] Luciano A. Marraffini and Erik J. Sontheimer. CRISPR interference: RNA- directed adaptive immunity in bacteria and archaea. Nature Reviews Genetics, 11(3):181–190, March 2010. [163] Eric S. Lander. The Heroes of CRISPR. Cell, 164(1):18–28, January 2016. [164] Y. Ishino, H. Shinagawa, K. Makino, M. Amemura, and A. Nakata. Nucleotide sequence of the iap gene, responsible for alkaline phosphatase isozyme conversion in Escherichia coli, and identification of the gene product. Journal of Bacteriology, 169(12):5429–5433, December 1987. [165] A. Nakata, M. Amemura, and K. Makino. Unusual nucleotide arrangement with repeated sequences in the Escherichia coli K-12 chromosome. Journal of Bacteri- ology, 171(6):3553–3556, June 1989. 189 [166] P. W. Hermans, D. van Soolingen, E. M. Bik, P. E. de Haas, J. W. Dale, and J. D. van Embden. Insertion element IS987 from Mycobacterium bovis BCG is located in a hot-spot integration region for insertion elements in Mycobacterium tuberculosis complex strains. Infection and Immunity, 59(8):2695–2705, August 1991. [167] F. J. Mojica, C. Ferrer, G. Juez, and F. Rodríguez-Valera. Long stretches of short tandem repeats are present in the largest replicons of the Archaea Haloferax mediterranei and Haloferax volcanii and could be involved in replicon partitioning. Molecular Microbiology, 17(1):85–93, July 1995. [168] B. Masepohl, K. Görlitz, and H. Böhme. Long tandemly repeated repetitive (LTRR) sequences in the filamentous cyanobacterium Anabaena sp. PCC 7120. Biochimica Et Biophysica Acta, 1307(1):26–30, June 1996. [169] H. P. Klenk, R. A. Clayton, J. F. Tomb, O. White, K. E. Nelson, K. A. Ketchum, R. J. Dodson, M. Gwinn, E. K. Hickey, J. D. Peterson, D. L. Richardson, A. R. Kerlavage, D. E. Graham, N. C. Kyrpides, R. D. Fleischmann, J. Quackenbush, N. H. Lee, G. G. Sutton, S. Gill, E. F. Kirkness, B. A. Dougherty, K. McKenney, M. D. Adams, B. Loftus, S. Peterson, C. I. Reich, L. K. McNeil, J. H. Badger, A. Glodek, L. Zhou, R. Overbeek, J. D. Gocayne, J. F. Weidman, L. McDon- ald, T. Utterback, M. D. Cotton, T. Spriggs, P. Artiach, B. P. Kaine, S. M. Sykes, P. W. Sadow, K. P. D’Andrea, C. Bowman, C. Fujii, S. A. Garland, T. M. Ma- son, G. J. Olsen, C. M. Fraser, H. O. Smith, C. R. Woese, and J. C. Venter. The complete genome sequence of the hyperthermophilic, sulphate-reducing archaeon Archaeoglobus fulgidus. Nature, 390(6658):364–370, November 1997. [170] C. J. Bult, O. White, G. J. Olsen, L. Zhou, R. D. Fleischmann, G. G. Sutton, J. A. Blake, L. M. FitzGerald, R. A. Clayton, J. D. Gocayne, A. R. Kerlavage, B. A. Dougherty, J. F. Tomb, M. D. Adams, C. I. Reich, R. Overbeek, E. F. Kirkness, K. G. Weinstock, J. M. Merrick, A. Glodek, J. L. Scott, N. S. Geoghagen, and J. C. Venter. Complete genome sequence of the methanogenic archaeon, Methanococ- cus jannaschii. Science (New York, N.Y.), 273(5278):1058–1073, August 1996. [171] Y. Kawarabayasi, Y. Hino, H. Horikawa, S. Yamazaki, Y. Haikawa, K. Jin-no, M. Takahashi, M. Sekine, S. Baba, A. Ankai, H. Kosugi, A. Hosoyama, S. Fukui, Y. Nagai, K. Nishijima, H. Nakazawa, M. Takamiya, S. Masuda, T. Funahashi, T. Tanaka, Y. Kudoh, J. Yamazaki, N. Kushida, A. Oguchi, and H. Kikuchi. Com- plete genome sequence of an aerobic hyper-thermophilic crenarchaeon, Aeropy- rum pernix K1. DNA research: an international journal for rapid publication of reports on genes and genomes, 6(2):83–101, 145–152, April 1999. [172] F. J. Mojica, C. Díez-Villaseñor, E. Soria, and G. Juez. Biological significance of a family of regularly spaced repeats in the genomes of Archaea, Bacteria and mitochondria. Molecular Microbiology, 36(1):244–246, April 2000. [173] Alexander Bolotin, Benoit Quinquis, Alexei Sorokin, and S Dusko Ehrlich. Clus- tered regularly interspaced short palindrome repeats (CRISPRs) have spacers of ex- 190 trachromosomal origin. Microbiology (Reading, England), 151(Pt 8):2551–2561, August 2005. [174] Francisco J M Mojica, César Díez-Villaseñor, Jesús García-Martínez, and Elena Soria. Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements. Journal of molecular evolution, 60(2):174–182, February 2005. [175] C. Pourcel, G. Salvignol, and G. Vergnaud. CRISPR elements in Yersinia pestis acquire new repeats by preferential uptake of bacteriophage DNA, and provide additional tools for evolutionary studies. Microbiology (Reading, England), 151(Pt 3):653–663, March 2005. [176] Reidun K Lillestøl, Peter Redder, Roger A Garrett, and Kim Brügger. A putative viral defence mechanism in archaeal cells. Archaea (Vancouver, B.C.), 2(1):59–72, August 2006. [177] Kira S Makarova, Nick V Grishin, Svetlana A Shabalina, Yuri I Wolf, and Eu- gene V Koonin. A putative RNA-interference-based immune system in prokary- otes: computational analysis of the predicted enzymatic machinery, functional analogies with eukaryotic RNAi, and hypothetical mechanisms of action. Biology direct, 1:7, 2006. [178] Rodolphe Barrangou, Christophe Fremaux, Hélène Deveau, Melissa Richards, Patrick Boyaval, Sylvain Moineau, Dennis A Romero, and Philippe Horvath. CRISPR provides acquired resistance against viruses in prokaryotes. Science (New York, N.Y.), 315(5819):1709–1712, March 2007. [179] Luciano A Marraffini and Erik J Sontheimer. CRISPR interference limits horizon- tal gene transfer in staphylococci by targeting DNA. Science (New York, N.Y.), 322(5909):1843–1845, December 2008. [180] Hélène Deveau, Rodolphe Barrangou, Josiane E Garneau, Jessica Labonté, Christophe Fremaux, Patrick Boyaval, Dennis A Romero, Philippe Horvath, and Sylvain Moineau. Phage response to CRISPR-encoded resistance in Streptococcus thermophilus. Journal of bacteriology, 190(4):1390–1400, February 2008. [181] Philippe Horvath, Dennis A. Romero, Anne-Claire Coûté-Monvoisin, Melissa Richards, Hélène Deveau, Sylvain Moineau, Patrick Boyaval, Christophe Fremaux, and Rodolphe Barrangou. Diversity, Activity, and Evolution of CRISPR Loci in Streptococcus thermophilus. Journal of Bacteriology, 190(4):1401–1412, Febru- ary 2008. [182] Gene W. Tyson and Jillian F. Banfield. Rapidly evolving CRISPRs implicated in acquired resistance of microorganisms to viruses. Environmental Microbiology, 10(1):200–207, 2008. 191 [183] F. J. M. Mojica, C. Díez-Villaseñor, J. García-Martínez, and C. Almendros. Short motif sequences determine the targets of the prokaryotic CRISPR defence system. Microbiology, 155(3):733–740, March 2009. [184] Rodolphe Barrangou and Philippe Horvath. CRISPR: new horizons in phage re- sistance and strain identification. Annual review of food science and technology, 3:143–162, 2012. [185] Ido Yosef, Moran G Goren, and Udi Qimron. Proteins and DNA elements essential for the CRISPR adaptation process in Escherichia coli. Nucleic acids research, 40(12):5569–5576, July 2012. [186] Avital Brodt, Mor N. Lurie-Weinberger, and Uri Gophna. CRISPR loci reveal networks of gene exchange in archaea. Biology Direct, 6(1):65, December 2011. [187] Adi Stern, Leeat Keren, Omri Wurtzel, Gil Amitai, and Rotem Sorek. Self- targeting by CRISPR: gene regulation or autoimmunity? Trends in genetics: TIG, 26(8):335–340, August 2010. [188] Pedro F. Vale, Guillaume Lafforgue, Francois Gatchitch, Rozenn Gardan, Syl- vain Moineau, and Sylvain Gandon. Costs of CRISPR-Cas-mediated resistance in Streptococcus thermophilus. Proc. R. Soc. B, 282(1812):20151270, August 2015. [189] Adi Stern and Rotem Sorek. The phage-host arms race: shaping the evolution of microbes. BioEssays: news and reviews in molecular, cellular and developmental biology, 33(1):43–51, January 2011. [190] Kira S Makarova, Yuri I Wolf, and Eugene V Koonin. Comparative genomics of defense systems in archaea and bacteria. Nucleic acids research, 41(8):4360–4377, April 2013. [191] Rotem Sorek, Victor Kunin, and Philip Hugenholtz. CRISPR — a widespread system that provides acquired resistance against phages in bacteria and archaea. Nature Reviews Microbiology, 6(3):181–186, March 2008. [192] M. Senthil Kumar and Kevin C. Chen. Evolution of animal Piwi-interacting RNAs and prokaryotic CRISPRs. Briefings in Functional Genomics, 11(4):277–288, July 2012. [193] David Bikard and Luciano A. Marraffini. Control of gene expression by CRISPR- Cas systems. F1000Prime Reports, 5, November 2013. [194] Eugene V Koonin and Kira S Makarova. CRISPR-Cas: evolution of an RNA- based adaptive immunity system in prokaryotes. RNA biology, 10(5):679–686, May 2013. [195] Christine L Sun, Rodolphe Barrangou, Brian C Thomas, Philippe Horvath, Christophe Fremaux, and Jillian F Banfield. Phage mutations in response to CRISPR diversification in a bacterial population. Environmental microbiology, 15(2):463–470, February 2013. 192 [196] Timothy R. Sampson and David S. Weiss. Alternative Roles for CRISPR/Cas Sys- tems in Bacterial Pathogenesis. PLoS Pathog, 9(10):e1003621, October 2013. [197] Reuben B Vercoe, James T Chang, Ron L Dy, Corinda Taylor, Tamzin Gristwood, James S Clulow, Corinna Richter, Rita Przybilski, Andrew R Pitman, and Peter C Fineran. Cytotoxic chromosomal targeting by CRISPR/Cas systems can reshape bacterial genomes and expel or remodel pathogenicity islands. PLoS genetics, 9(4):e1003454, April 2013. [198] Rotem Edgar and Udi Qimron. The Escherichia coli CRISPR system protects from λ lysogenization, lysogens, and prophage induction. Journal of bacteriology, 192(23):6291–6294, December 2010. [199] David Paez-Espino, Wesley Morovic, Christine L Sun, Brian C Thomas, Ken-ichi Ueda, Buffy Stahl, Rodolphe Barrangou, and Jillian F Banfield. Strong bias in the bacterial CRISPR elements that confer immunity to phage. Nature communica- tions, 4:1430, 2013. [200] Wenyan Jiang, Inbal Maniv, Fawaz Arain, Yaying Wang, Bruce R Levin, and Lu- ciano A Marraffini. Dealing with the evolutionary downside of CRISPR immunity: bacteria and beneficial plasmids. PLoS genetics, 9(9):e1003844, 2013. [201] Ron L Dy, Andrew R Pitman, and Peter C Fineran. Chromosomal targeting by CRISPR-Cas systems can contribute to genome plasticity in bacteria. Mobile ge- netic elements, 3(5):e26831, September 2013. [202] Joseph Bondy-Denomy and Alan R Davidson. To acquire or resist: the complex biological effects of CRISPR-Cas systems. Trends in microbiology, February 2014. [203] Iwona Mruk and Ichizo Kobayashi. To be or not to be: regulation of restriction- modification systems and other toxin-antitoxin systems. Nucleic acids research, 42(1):70–86, January 2014. [204] Luciano A Marraffini and Erik J Sontheimer. Self versus non-self discrimination during CRISPR RNA-directed immunity. Nature, 463(7280):568–571, January 2010. [205] Kenn Gerdes, Susanne K Christensen, and Anders Løbner-Olesen. Prokaryotic toxin-antitoxin stress response loci. Nature reviews. Microbiology, 3(5):371–382, May 2005. [206] Tessa E F Quax, Marleen Voet, Odile Sismeiro, Marie-Agnes Dillies, Bernd Jagla, Jean-Yves Coppée, Guennadi Sezonov, Patrick Forterre, John van der Oost, Rob Lavigne, and David Prangishvili. Massive activation of archaeal defense genes during viral infection. Journal of virology, 87(15):8419–8428, August 2013. [207] Jacque C Young, Brian D Dill, Chongle Pan, Robert L Hettich, Jillian F Banfield, Manesh Shah, Christophe Fremaux, Philippe Horvath, Rodolphe Barrangou, and 193 Nathan C Verberkmoes. Phage-induced expression of CRISPR-associated pro- teins is revealed by shotgun proteomics in Streptococcus thermophilus. PloS one, 7(5):e38077, 2012. [208] Corinna Richter, James T Chang, and Peter C Fineran. Function and regulation of clustered regularly interspaced short palindromic repeats (CRISPR) / CRISPR associated (Cas) systems. Viruses, 4(10):2291–2311, October 2012. [209] Ksenia Pougach, Ekaterina Semenova, Ekaterina Bogdanova, Kirill A Datsenko, Marko Djordjevic, Barry L Wanner, and Konstantin Severinov. Transcription, pro- cessing and function of CRISPR cassettes in Escherichia coli. Molecular microbi- ology, 77(6):1367–1379, September 2010. [210] Pu Han, Liang Ren Niestemski, Jeffrey E Barrick, and Michael W Deem. Physical model of the immune response of bacteria against bacteriophage through the adap- tive CRISPR-Cas immune system. Physical biology, 10(2):025004, April 2013. [211] Jaime Iranzo, Alexander E Lobkovsky, Yuri I Wolf, and Eugene V Koonin. Evolu- tionary dynamics of the prokaryotic adaptive immunity system CRISPR-Cas in an explicit ecological context. Journal of bacteriology, 195(17):3834–3844, Septem- ber 2013. [212] Ariel D Weinberger, Yuri I Wolf, Alexander E Lobkovsky, Michael S Gilmore, and Eugene V Koonin. Viral diversity threshold for adaptive immunity in prokaryotes. mBio, 3(6):e00456–00412, 2012. [213] S. Gandon and P. F. Vale. The evolution of resistance against good and bad infec- tions. Journal of Evolutionary Biology, 27(2):303–312, February 2014. [214] Leah Edelstein-Keshet. Mathematical models in biology, volume 46. Siam, 1988. [215] Umit Pul, Reinhild Wurm, Zihni Arslan, René Geissen, Nina Hofmann, and Rolf Wagner. Identification and characterization of E. coli CRISPR-cas promoters and their silencing by H-NS. Molecular microbiology, 75(6):1495–1512, March 2010. [216] Edze R Westra, Umit Pul, Nadja Heidrich, Matthijs M Jore, Magnus Lund- gren, Thomas Stratmann, Reinhild Wurm, Amanda Raine, Melina Mescher, Luc Van Heereveld, Marieke Mastop, E Gerhart H Wagner, Karin Schnetz, John Van Der Oost, Rolf Wagner, and Stan J J Brouns. H-NS-mediated repression of CRISPR-based immunity in Escherichia coli K12 can be relieved by the transcrip- tion activator LeuO. Molecular microbiology, 77(6):1380–1393, September 2010. [217] Ritsdeliz Perez-Rodriguez, Charles Haitjema, Qingqiu Huang, Ki Hyun Nam, Sarah Bernardis, Ailong Ke, and Matthew P DeLisa. Envelope stress is a trigger of CRISPR RNA-mediated DNA silencing in Escherichia coli. Molecular microbiol- ogy, 79(3):584–599, February 2011. 194 [218] Yoshihiro Agari, Keiko Sakamoto, Masatada Tamakoshi, Tairo Oshima, Seiki Kuramitsu, and Akeo Shinkai. Transcription profile of Thermus thermophilus CRISPR systems after phage infection. Journal of molecular biology, 395(2):270– 281, January 2010. [219] Bard Ermentrout. Simulating, analyzing, and animating dynamical systems: a guide to XPPAUT for researchers and students, volume 14. Siam, 2002. [220] Bruce R Levin, Sylvain Moineau, Mary Bushman, and Rodolphe Barrangou. The population and evolutionary dynamics of phage and bacteria with CRISPR- mediated immunity. PLoS genetics, 9(3):e1003312, 2013. [221] Dan I. Andersson and Diarmaid Hughes. Antibiotic resistance and its cost: is it possible to reverse resistance? Nature Reviews Microbiology, 8(4):260–271, April 2010. [222] Kelli L Palmer and Michael S Gilmore. Multidrug-resistant enterococci lack CRISPR-cas. mBio, 1(4), 2010. [223] Timothy R. Sampson and David S. Weiss. Degeneration of a CRISPR/Cas sys- tem and its regulatory target during the evolution of a pathogen. RNA biology, 10(10):1618–1622, October 2013. [224] Honghu Sun, Yinghui Li, Xiaolu Shi, Yiman Lin, Yaqun Qiu, Jinjin Zhang, Yao Liu, Min Jiang, Zhen Zhang, Qiongcheng Chen, Qun Sun, and Qinghua Hu. Asso- ciation of CRISPR/Cas Evolution with Vibrio parahaemolyticus Virulence Factors and Genotypes. Foodborne Pathogens and Disease, 12(1):68–73, January 2015. [225] Xiangjiao Guo, Yingfang Wang, Guangcai Duan, Zerun Xue, Linlin Wang, Pengfei Wang, Shaofu Qiu, Yuanlin Xi, and Haiyan Yang. Detection and Analysis of CRISPRs of Shigella. Current Microbiology, 70(1):85–90, January 2015. [226] R. Louwen, D. Horst-Kreft, A. G. de Boer, L. van der Graaf, G. de Knegt, M. Hamersma, A. P. Heikema, A. R. Timms, B. C. Jacobs, J. A. Wagenaar, H. P. Endtz, J. van der Oost, J. M. Wells, E. E. S. Nieuwenhuis, A. H. M. van Vliet, P. T. J. Willemsen, P. van Baarlen, and A. van Belkum. A novel link between Campylobacter jejuni bacteriophage defence, virulence and Guillain-Barré syn- drome. European Journal of Clinical Microbiology & Infectious Diseases: Offi- cial Publication of the European Society of Clinical Microbiology, 32(2):207–226, February 2013. [227] Nigel F. Delaney, Susan Balenger, Camille Bonneaud, Christopher J. Marx, Ge- offrey E. Hill, Naola Ferguson-Noel, Peter Tsai, Allen Rodrigo, and Scott V. Ed- wards. Ultrafast evolution and loss of CRISPRs following a host shift in a novel wildlife pathogen, Mycoplasma gallisepticum. PLoS genetics, 8(2):e1002511, February 2012. 195 [228] Tao Liu, Yingjun Li, Xiaodi Wang, Qing Ye, Huan Li, Yunxiang Liang, Qunxin She, and Nan Peng. Transcriptional regulator-mediated activation of adaptation genes triggers CRISPR de novo spacer acquisition. Nucleic Acids Research, 43(2):1044–1055, January 2015. [229] J. D. van Embden, T. van Gorkom, K. Kremer, R. Jansen, B. A. van Der Zeijst, and L. M. Schouls. Genetic variation and evolutionary origin of the direct repeat locus of Mycobacterium tuberculosis complex bacteria. Journal of Bacteriology, 182(9):2393–2401, May 2000. [230] Anders F. Andersson and Jillian F. Banfield. Virus Population Dynamics and Acquired Virus Resistance in Natural Microbial Communities. Science, 320(5879):1047–1050, May 2008. [231] Asaf Levy, Moran G. Goren, Ido Yosef, Oren Auster, Miriam Manor, Gil Amitai, Rotem Edgar, Udi Qimron, and Rotem Sorek. CRISPR adaptation biases explain preference for acquisition of foreign DNA. Nature, 520(7548):505–510, April 2015. [232] Stephen P. Hubbell. The unified neutral theory of species abundance and diversity. Princeton University Press, Princeton, NJ. Hubbell, SP (2004) Quarterly Review of Biology, 79:96–97, 2001. [233] Jerôme Chave. Neutral theory and community ecology. Ecology letters, 7(3):241– 253, 2004. [234] Mathew A. Leibold and Mark A. McPeek. Coexistence of the niche and neutral perspectives in community ecology. Ecology, 87(6):1399–1410, 2006. [235] Stephen P. Hubbell. Neutral theory in community ecology and the hypothesis of functional equivalence. Functional ecology, 19(1):166–172, 2005. [236] Carl Edward Rasmussen. The infinite Gaussian mixture model. In Advances in neural information processing systems, pages 554–560, 2000. [237] Simon J. Labrie, Julie E. Samson, and Sylvain Moineau. Bacteriophage resistance mechanisms. Nature Reviews. Microbiology, 8(5):317–327, May 2010. [238] Aurora M. Nedelcu, William W. Driscoll, Pierre M. Durand, Matthew D. Herron, and Armin Rashidi. On the Paradigm of Altruistic Suicide in the Unicellular World. Evolution, 65(1):3–20, January 2011. [239] Kira S. Makarova, Vivek Anantharaman, L. Aravind, and Eugene V. Koonin. Live virus-free or die: coupling of antivirus immunity and programmed suicide or dor- mancy in prokaryotes. Biology Direct, 7:40, 2012. [240] Edze R. Westra, Angus Buckling, and Peter C. Fineran. CRISPR-Cas systems: beyond adaptive immunity. Nature Reviews Microbiology, 12(5):317–326, May 2014. 196 [241] Angus Buckling and Paul B. Rainey. Antagonistic coevolution between a bac- terium and a bacteriophage. Proceedings. Biological Sciences / The Royal Society, 269(1494):931–936, May 2002. [242] Virginie Poullain, Sylvain Gandon, Michael A. Brockhurst, Angus Buckling, and Michael E. Hochberg. The Evolution of Specificity in Evolving and Coevolving Antagonistic Interactions Between a Bacteria and Its Phage. Evolution, 62(1):1– 11, January 2008. [243] Alex R. Hall, Pauline D. Scanlan, Andrew D. Morgan, and Angus Buckling. Host- parasite coevolutionary arms races give way to fluctuating selection. Ecology Let- ters, 14(7):635–642, July 2011. [244] Deo Prakash Pandey and Kenn Gerdes. Toxin-antitoxin loci are highly abundant in free-living but lost from host-associated prokaryotes. Nucleic Acids Research, 33(3):966–976, January 2005. [245] F. Débarre, S. Lion, M. van Baalen, and S. Gandon. Evolution of host life-history traits in a spatially structured host-parasite system. The American Naturalist, 179(1):52–63, January 2012. [246] Masaki Fukuyo, Akira Sasaki, and Ichizo Kobayashi. Success of a suicidal defense strategy against infection in a structured habitat. Scientific Reports, 2, January 2012. [247] Thomas W. Berngruber, Sébastien Lion, and Sylvain Gandon. Evolution of sui- cide as a defence strategy against pathogens in a spatially structured environment. Ecology Letters, 16(4):446–453, April 2013. [248] S. Lion and S. Gandon. Evolution of spatially structured host-parasite interactions. Journal of Evolutionary Biology, 28(1):10–28, January 2015. [249] Jaime Iranzo, Alexander E. Lobkovsky, Yuri I. Wolf, and Eugene V. Koonin. Im- munity, suicide or both? Ecological determinants for the combined evolution of anti-pathogen defense systems. BMC Evolutionary Biology, 15(1):43, March 2015. [250] Michael Doebeli, Christoph Hauert, and Timothy Killingback. The Evolutionary Origin of Cooperators and Defectors. Science, 306(5697):859–862, October 2004. [251] F. C. Santos and J. M. Pacheco. Scale-free networks provide a unifying framework for the emergence of cooperation. Physical Review Letters, 95(9):098104, August 2005. [252] F. C. Santos, J. M. Pacheco, and Tom Lenaerts. Evolutionary dynamics of social dilemmas in structured heterogeneous populations. Proceedings of the National Academy of Sciences of the United States of America, 103(9):3490–3494, February 2006. 197 [253] Jeffrey A. Fletcher and Michael Doebeli. A simple and general explanation for the evolution of altruism. Proceedings of the Royal Society of London B: Biological Sciences, 276(1654):13–19, January 2009. [254] Jürgen Heinze and Bartosz Walter. Moribund Ants Leave Their Nests to Die in Social Isolation. Current Biology, 20(3):249–252, February 2010. [255] Dominik Refardt, Tobias Bergmiller, and Rolf Kümmerli. Altruism can evolve when relatedness is low: evidence from bacteria committing suicide upon phage infection. Proceedings of the Royal Society B: Biological Sciences, 280(1759), May 2013. [256] Ellen L. Simms and Jim Triplett. Costs and benefits of plant responses to disease: resistance and tolerance. Evolution, pages 1973–1985, 1994. [257] R G Bowers, M Boots, and M Begon. Life history trade-offs and the evolution of pathogen resistance: competition between host strains. Proceedings. Biological sciences / The Royal Society, 257(1350):247–253, September 1994. [258] Wendy L. Fineblum, Mark D. Rausher, and others. Tradeoff between resistance and tolerance to herbivore damage in a morning glory. Nature, 377(6549):517– 520, 1995. [259] Michael Boots and Yoshihiro Haraguchi. The Evolution of Costly Resistance in Host-Parasite Systems. The American Naturalist, 153(4):359–370, April 1999. [260] M. Boots and R. G. Bowers. Three mechanisms of host resistance to microparasites-avoidance, recovery and tolerance-show different evolutionary dy- namics. Journal of Theoretical Biology, 201(1):13–23, November 1999. [261] B A Roy and J W Kirchner. Evolutionary dynamics of pathogen resistance and tol- erance. Evolution; international journal of organic evolution, 54(1):51–63, Febru- ary 2000. [262] Lars Råberg, Derek Sim, and Andrew F. Read. Disentangling genetic variation for resistance and tolerance to infectious diseases in animals. Science (New York, N.Y.), 318(5851):812–814, November 2007. [263] Michael Boots, Alex Best, Martin R. Miller, and Andrew White. The role of ecological feedbacks in the evolution of host defence: what does theory tell us? Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 364(1513):27–36, January 2009. [264] Jessica A. Hill, Teresa R. O’Meara, and Leah E. Cowen. Fitness Trade-Offs As- sociated with the Evolution of Resistance to Antifungal Drug Combinations. Cell Reports, February 2015. [265] Martin A. Nowak. Five Rules for the Evolution of Cooperation. Science, 314(5805):1560–1563, December 2006. 198 [266] Karl Sigmund, Christoph Hauert, and Martin A. Nowak. Reward and punishment. Proceedings of the National Academy of Sciences, 98(19):10757–10762, Septem- ber 2001. [267] James H. Fowler. Altruistic punishment and the origin of cooperation. Pro- ceedings of the National Academy of Sciences of the United States of America, 102(19):7047–7049, May 2005. [268] Stuart A. West, Ashleigh S. Griffin, Andy Gardner, and Stephen P. Diggle. Social evolution theory for microorganisms. Nature Reviews Microbiology, 4(8):597–607, August 2006. [269] Mayuko Nakamaru and Yoh Iwasa. The coevolution of altruism and punishment: role of the selfish punisher. Journal of Theoretical Biology, 240(3):475–488, June 2006. [270] Hannelore Brandt, Christoph Hauert, and Karl Sigmund. Punishing and abstaining for public goods. Proceedings of the National Academy of Sciences, 103(2):495– 497, January 2006. [271] Christoph Hauert, Arne Traulsen, Hannelore Brandt, Martin A. Nowak, and Karl Sigmund. Via Freedom to Coercion: The Emergence of Costly Punishment. Sci- ence, 316(5833):1905–1907, June 2007. [272] Grace Wahba. Spline models for observational data, volume 59. Siam, 1990. [273] Chong Gu. Smoothing spline ANOVA models, volume 297. Springer Science & Business Media, 2013. [274] Yuedong Wang. Smoothing splines: methods and applications. Chapman and Hall/CRC, 2011. [275] Lawrence A. David, Arne C. Materna, Jonathan Friedman, Maria I. Campos- Baptista, Matthew C. Blackburn, Allison Perrotta, Susan E. Erdman, and Eric J. Alm. Host lifestyle affects human microbiota on daily timescales. Genome Biol- ogy, 15(7):R89, July 2014. [276] Garrett Jenkinson, Elisabet Pujadas, John Goutsias, and Andrew P. Feinberg. Potential energy landscapes identify the information-theoretic nature of the epigenome. Nature Genetics, 49(5):719–729, May 2017. [277] Karline Soetaert and yale sparse matrix package authors. rootSolve: Nonlin- ear Root Finding, Equilibrium and Steady-State Analysis of Ordinary Differential Equations, December 2016. [278] Young-Ju Kim and Chong Gu. Smoothing spline Gaussian regression: more scal- able computation via efficient approximation. Journal of the Royal Statistical So- ciety: Series B (Statistical Methodology), 66(2):337–356, 2004. 199