ABSTRACT Title of Document: NOVEL METHODS FOR METAGENOMIC ANALYSIS James Robert White, Doctor of Philosophy, 2010 Directed By: Profesor Mihai Pop, Department of Computer Science, Applied Mathematics, Statistics, and Scientific Computation Program Afiliate By sampling the genetic content of microbes at the nucleotide level, metagenomics has rapidly established itself as the standard in characterizing the taxonomic diversity and functional capacity of microbial populations throughout nature. The decreasing cost of sequencing technologies and the simultaneous increase of throughput per run has given scientists the ability to deeply sample highly diverse communities on a reasonable budget. The Human Microbiome Project is representative of the flood of sequence data that wil arive in the coming years. Despite these advancements, there remains the significant chalenge of analyzing masive metagenomic datasets to make appropriate biological conclusions. This disertation is a collection of novel methods developed for improved analysis of metagenomic data: (1) We begin with Figaro, a statistical algorithm that quickly and acurately infers and trims vector sequence from large Sanger-based read sets without prior knowledge of the vector used in library construction. (2) Next, we perform a rigorous evaluation of methodologies used to cluster environmental 16S rRNA sequences into species-level operational taxonomic units, and discover that many published studies utilize highly stringent parameters, resulting in overestimation of microbial diversity. (3) To asist in comparative metagenomics studies, we have created Metastats, a robust statistical methodology for comparing large-scale clinical datasets with up to thousands of subjects. Given a collection of annotated metagenomic features (e.g. taxa, COGs, or pathways), Metastats determines which features are diferentialy abundant betwen two populations. (4) Finaly, we report on a new methodology that employs the generalized Lotka-Voltera model to infer microbe-microbe interactions from longitudinal 16S rRNA data. It is our hope that these methods wil enhance standard metagenomic analysis techniques to provide beter insight into the human microbiome and microbial communities throughout our world. To asist metagenomics researchers and those developing methods, al software described in this thesis is open-source and available online. NOVEL METHODS FOR METAGENOMIC ANALYSIS By James Robert White Disertation submited to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfilment of the requirements for the degre of Doctor of Philosophy 2010 Advisory Commite: Profesor Mihai Pop, Chair Profesor Steven Salzberg Profesor James A. Yorke Profesor Najib El-Sayed Profesor Charles Delwiche (Dean?s Representative) ? Copyright by James Robert White 2010 i Preface The four works in this thesis have either ben published in per-reviewed journals or are currently in preparation for publication. I am indebted to each of my co-authors on these studies ? their dedication and expertise resulted in much stronger scientific papers both in statistical rigor and biological scope. At the time of this writing, Chapters 2 and 4 have appeared in print and are slightly modified here. Chapter 3 has been acepted for publication. Permision for republication of this material has been granted and is available upon request. Chapter 2: White, J.R., Roberts, M., Yorke, J.A. and M. Pop, Figaro: a novel statistical method for vector sequence removal. Bioinformatics, 2008. 24(4): p. 462-7. We thank Art Delcher and Aleksey Zimin for their suggestions and fedback. We are grateful to Steven Salzberg for help naming our method. M.P. was supported in part by grant R01-LM006845 from NIH and grant HU001-06-1-0015 from the Uniformed Services University of the Health Sciences administered by the Henry Jackson Foundation. J.Y. and M.R. were supported in part under NSF grant DMS 0616585 and NIH grant 1R01HG0294501. J.W. was supported in part by University of Maryland NSF VIGRE felowship DMS0240049. Chapter 3: White, J.R., Navlakha, S., Nagarajan, N., Ghodsi, M., Kingsford, C. and M. Pop. Alignment and clustering of phylogenetic markers ? implications for microbial diversity studies. BMC Bioinformatics, 2010. 11(152). Chapter 4: White, J.R., Nagarajan, N. and M. Pop, Statistical methods for detecting diferentially abundant features in clinical metagenomic samples. PLoS Comput Biol, 2009. 5(4): p. e1000352. ii Funded in part by Bil and Melinda Gates Foundation and the Henry Jackson Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. We are very grateful to Jef Gordon, Peter Turnbaugh, and Ruth Ley for their input and assistance with data. We thank Frank Siewerdt, Aleksey Zimin, Radu Balan, Kunmi Ayanbule, and Kenneth Ryals for helpful discusions. Finally we would like to thank the anonymous reviewers for their constructive comments. Our manuscript was greatly improved by their contribution. Chapter 5: White, J.R., Turnbaugh, P., Paulson, J., Gordon, J.I. and M. Pop. Infering microbial interaction webs from time-series metagenomic data. In preparation. iv Dedication In early 2008, I experienced a profound life-changing epiphany. I dedicate this disertation to the person I was before then. I finished this for you, I mis you, and I wil not forget you. v Acknowledgements First and foremost I wish to acknowledge the dedication of my advisor Mihai Pop. Mihai introduced me to the field of metagenomics and helped to shape me into a meticulous researcher. I am indebted to Jim Yorke, my Coach, who has had a larger impact on my profesional direction than anyone else. Jim guided me from pure to applied mathematics, a path that resulted in more personal satisfaction and excitement than I could have imagined. I must also thank other members of my defense commite (Steven Salzberg, Chuck Delwiche and Najib El-Sayed) for their input and appreciation of this work. I have had many excelent teachers over the years, but Linda Barnes and Dianne O?leary hold a special place ? both of your approaches to teaching matched my learning style perfectly. I undoubtedly would not have been able to finish this thesis if I had not been in such wonderful company. In the throws of this chalenging experience, I met some of the best people I wil ever know. Mike, Adam, Dave, and Saket ? working alongside you has been fantastic, I wouldn?t trade it for anything, and I can?t wait to se where we al end up. Smeds, Tom, Rachel, and Jeremy ? thank you so much for your friendship and support. Andrea Otesen ? my travel buddy and collaborator ? you have made me appreciate the natural world so much more. A special thanks to Denise Cross and Alverda McCoy for helping me sort through sometimes frustrating university policies and keeping me on track to finish. vi I finaly must expres my gratitude to my family. They have supported me through thick and thin, and kept me balanced in times of stres. I?ve learned how lucky I am to have unconditional love from my parents and sisters, and I hope I continue to cherish this throughout the remaining chapters of my life. vii Table of Contents Preface...........................................................i Dedication........................................................iv Acknowledgements..................................................v List of Tables......................................................ix List of Figures......................................................x Chapter 1: Introduction...............................................1 Early microbiology................................................1 16S rRNA gene surveys and metagenomics..............................2 The Human Microbiome Project......................................6 This work.......................................................7 Mathematical and computational contributions.........................8 Chapter 2: Figaro ? a novel statistical method for vector sequence removal.......11 Background.....................................................11 Methods.......................................................15 Detection of vectormers..........................................16 Vector clip estimation...........................................19 Results.........................................................21 Vector triming sensitivity and specificity...........................21 Improving asemblies with Figaro..................................25 Discussion......................................................28 Chapter 3: Alignment and clustering of phylogenetic markers ? implications for microbial diversity studies............................................31 Background.....................................................31 Results.........................................................33 Simulated environments..........................................33 Comprehensive search of OTU methodologies.........................34 OTU variability................................................34 Nonparametric estimators of richnes and diversity.....................40 Partial masking of MSAs.........................................42 Pairwise versus multiple sequence alignments.........................42 Supervised clustering alternatives..................................45 Consistency of methods across multiple datasets.......................48 Materials & Methods..............................................49 Creation of simulated datasets.....................................49 Multiple sequence alignment......................................50 Distance corrections and clustering methods..........................50 Measures of similarity for clusterings...............................51 Computation of the Rand index and variation of information for clusterings..52 VI-cut method for defining OTUs..................................53 ANOVA of methodology components...............................54 Chapter 4: Statistical methods for detecting diferentialy abundant features in clinical metagenomic samples...............................................56 Background.....................................................56 vii Materials & Methods..............................................58 Data normalization.............................................59 Analysis of diferential abundance..................................60 Asesing significance...........................................61 Multiple hypothesis testing correction...............................63 Handling sparse counts..........................................64 Creating the Feature Abundance Matrix..............................66 Data used in this paper...........................................66 Results.........................................................67 Comparison with other statistical methods............................67 Taxa asociated with human obesity................................75 Diferentialy abundant COGs betwen mature and infant human gut microbiomes..................................................78 Diferentialy abundant metabolic subsystems in microbial and viral metagenomes..................................................84 Discussion......................................................85 Chapter 5: Infering microbial interaction webs from time-series metagenomic data87 Background.....................................................87 Materials & Methods..............................................89 odeling microbial communities...................................89 Learning a model from the data....................................95 An inference methodology with confidence values.....................97 Smal interaction network simulation design..........................98 Humanized gnotobiotic mouse gut dataset............................99 Results........................................................101 Prediction of smal interaction webs................................101 Validation of regresion approach.................................104 Microbial dynamics of mice on a Western diet........................108 Discussion.....................................................113 Chapter 6: Conclusions and further study...............................124 Appendices......................................................127 Bibliography.....................................................140 ix List of Tables Table 1...........................................................14 Table 2...........................................................24 Table 3...........................................................25 Table 4...........................................................27 Table 5...........................................................39 Table 6...........................................................49 Table 7...........................................................52 Table 8...........................................................55 Table 9...........................................................74 Table 10..........................................................77 Table 11..........................................................83 Table 12..........................................................85 Table 13..........................................................95 Table 14.........................................................102 Table 15.........................................................111 Table 16.........................................................111 Table 17.........................................................123 Table 18........................................................139 x List of Figures Figure 1..........................................................12 Figure 2..........................................................13 Figure 3..........................................................18 Figure 4..........................................................18 Figure 5..........................................................21 Figure 6..........................................................37 Figure 7..........................................................38 Figure 8..........................................................41 Figure 9..........................................................44 Figure 10.........................................................47 Figure 11.........................................................59 Figure 12.........................................................65 Figure 13.........................................................69 Figure 14.........................................................71 Figure 15.........................................................73 Figure 16........................................................103 Figure 17........................................................107 Figure 18........................................................110 Figure 19........................................................118 Figure 20........................................................122 1 Chapter 1: Introduction Early microbiology As with most new fields of science, microbiology was born out of a major technological innovation ? the microscope. Developed in the early 1600s, microscopes had limited application in biology until Antonie van Leuwenhoek, a clothing merchant and amateur scientist, used a simple magnifying lens of exceptionaly high quality to examine water from a lake near his home. Leuwenhoek discovered a world of ?litle animalcules? many of which were actual bacteria (though this term did not appear for another 150 years [1]). In 1674 he shared his findings and sketches with the British Royal Society [2], revealing a mysterious and complex world hidden from our view. Improvements in microscopy through the late 19 th century (e.g. staining) helped to encourage the field, but observation alone was not sufficient to infer the composition of these organisms or their natural functions. Novel techniques were needed to isolate and study each microorganism independently, but it was impossible to interogate one cel at a time. Rather, the more practical approach was to study large populations of identical cels ? a concept known as pure culture. Louis Pasteur and Robert Koch, two of the founders of modern microbiology, designed methods for the isolation, cultivation, and study of pure cultures, and these techniques have defined the field for more than a century. As researchers discovered many more microbes through cultivation, taxonomic clasification posed a significant chalenge. Morphological characteristics of bacterial cels were simple and limited, the cels reproduced asexualy, and common metabolic 2 properties were not trustworthy indicators of phylogeny. In 1923, the Society of American Bacteriologists published the first edition of Bergey?s Manual of Determinative Bacteriology [3], a reference for the clasification of culturable bacteria using morphological, physiological, and experimental characteristics. This reference has expanded dramaticaly into several large volumes, and now employs some of the molecular techniques I discuss below. Despite the rapid acumulation of information on culturable microbes in the first half of the 20 th century, there was a glaring problem - most of the microbial world could not be cultured. This was evidenced by the ?great plate count anomaly? in which population abundance estimates determined through microscope density measurements and dilution plating difered by several orders of magnitude [4, 5]. These diferences were particularly extreme in soil environments, where it was estimated that les than 1% of the microbial community could be cultured using standard techniques [6]. The chalenge of learning anything about this sizable majority of microbes semed insurmountable, and most scientists focused on microorganisms that could be cultivated. 16S rRNA gene surveys and metagenomics More than 300 years after the first observation of microbes a new technological innovation exposed microbiology to the unculturable majority. In the late 1970s and early 80s, Carl Woese discovered that the 16S rRNA gene was an excelent phylogenetic marker due to its high information content, structuraly conservative nature, and ubiquitous presence among prokaryotes [7-10]. Motivated by this result, Norman Pace and colleagues devised a method for rapidly sequencing 16S rRNA genes to 3 phylogeneticaly clasify organisms [11]. Augmented by the development of universal PCR primers for 16S gene amplification, Pace?s sequencing methodology enabled scientists to sample microbial populations in virtualy any habitat without culturing-bias. Since then, the 16S gene has proven to be one of the most important tools in microbial ecology [12], revealing a vast biodiversity of prokaryotes in many environments such as the ocean [13], soil [14], food products [15, 16], crude oil [17], and even the human gut [18-21]. Analysis of 16S markers now employs high-throughput sequencing technologies (e.g. Sanger and 454 pyrosequencing), which provide deeper sampling to observe community members that make up tiny fractions of the total population. Basic sequence analysis is easily automated, so much so that large computational infrastructures are already in place ? webservers such as the Ribosomal Database Project (RDP) [22], GrenGenes [23], and MG-RAST [24] give researchers superior computing power and informative analysis of their data. After the paradigm shift to 16S rRNA surveys, there were new eforts to obtain more information about these microbes than simply their phylogeny. Researchers now sought to study environmental DNA samples with multiple species using shotgun sequencing. By 1997, the term ?metagenomics? was coined to describe this new approach to environmental microbiology [25]. Over the last decade, several pioneering studies have generated a great deal of interest and set precedents for future projects [26]. Here I discuss thre of these landmark studies. Acid mine drainage. Published in 2004, the Acid Mine Drainage project (AMD) sampled biofilms growing on the acidic outflows located deep in the Richmond mine of Iron Mountain, California [27]. This environment had been carefully studied prior to 4 metagenomic analysis, and preliminary experiments indicated a low-complexity native microbial community with only five dominant species (thre Bacteria and two Archaea). Despite this low diversity and a total over 100,000 Sanger reads generated from shotgun sequencing, asemblies of thre of the species were largely incomplete. The AMD project ilustrated the dificulty in acquiring sufficient genome coverage for organisms with lower relative abundances in a community, but also demonstrated how metagenomic data could be analyzed to infer how microbes potentialy interact biochemicaly in a specific environment. The Sargasso Sea. This study, led by Craig Venter, sought to characterize the microbial diversity in the Sargaso Sea, which represents the middle of the North Atlantic Ocean, east of the Gulf Stream and south of North Atlantic current. Using a series of filters to isolate bacterial and archaeal cels from ocean water, researchers took surface samples at multiple sites and performed extensive shotgun sequencing. Over 1.66 milion reads were generated totaling 1.36 bilion base-pairs of DNA sequence, far more than any other previous metagenomic study [28]. This amazing volume of data resulted in 1.2 milion predicted genes, roughly an order of magnitude greater than the entire SwisProt database at the time. Examining the depth of coverage distribution across the metagenomic asembly, Venter and his team found high phylogenetic diversity in the Sargaso Sea with estimates of at least 1800 species in the environment. Unfortunately, it was later determined that the largest sample taken was contaminated by Shewanela and Burkholderia species, rendering it useles for ecological analysis [29]. This work was the first to perform significant deep sequencing of a high-complexity microbial population, and through its technical problems, afirmed the importance of sampling techniques, 5 quality control experiments and validation. Moreover, the Sargaso Sea served as a pilot study for the Global Ocean Sampling project, an around-the-world voyage that collected ocean samples approximately every 200 nautical miles, resulting in 7.7 milion shotgun sequences, the largest raw metagenomic dataset to date. The Obese Gut Microbiome. While some scientists used metagenomics to investigate traditional environments like soil and water, others were interested in the structure and function of microbial communities within a host. At Washington University in St. Louis, Jef Gordon and Peter Turnbaugh wanted to characterize microbes inhabiting the distal gut of obese and lean mice to determine if and how gut microbiota contribute to the pathology of obesity [30]. Using metagenomic and biochemical analyses to compare samples taken from geneticaly obese mice and their lean litermates, Gordon and Turnbaugh discovered that the obese gut microbiome maintained increased capacity for energy harvest and furthermore, that this trait was transmisible to germ-fre mice. This work established an important application of metagenomics: characterization prokaryotic communities in a clinical seting to study how human diseases correlate with microflora. The field of metagenomics has quickly expanded from microbial ecology to other disciplines including medical microbiology, food safety, and wastewater treatment. The next section details the most comprehensive metagenomics collaboration currently underway. 6 The Human Microbiome Project A pinnacle achievement in human knowledge, the Human Genome Project represented the largest scientific collaboration in biology ever, spanning areas in molecular biology, computer science and statistics, engineering, and biotechnology. 3.3 bilion base-pairs later, as results from the analysis of the genome reached the scientific community, it became clear that the genomic diferences betwen humans and other distantly related eukaryotes were more subtle than anyone had ever anticipated. The human genome contains roughly 20,000-25,000 protein-coding genes [31, 32], remarkably close to the mouse genome [33], and about 40% more than a fruit fly [34]. However, if we think of humans as superorganisms that house thousands of microbial species, then the total number of genes increases to over 100,000. It is estimated that the microbial cels inhabiting a person outnumber somatic and germ cels by an order of magnitude. While bacteria are not present throughout the entire body, they are esential in many of our functions including digestion of complex carbohydrates, synthesis of helpful vitamins, defense against pathogens, and the production of fat cels. Therefore, we must alter our model of human beings to capture the fundamental symbiosis betwen our microbiota and ourselves. This new perspective on human genetic variation combined with advancements in paralel DNA sequencing technology has led to a natural extension to human genomic research: the Human Microbiome Project. Initiated in 2007, the Human Microbiome Project (HMP) is a large interdisciplinary collection of metagenomics projects which as a whole aim to characterize the microbial communities inhabiting humans across the globe. The HMP 7 focuses on five major areas: the gut, oral cavity, sinus cavity, skin, and the female urogenital tract [35, 36]. Each region presents unique chalenges to microbiologists. Some bacterial communities are incredibly diverse, whereas others are smal and dificult to extract from human tisue. The HMP wil help to develop an infrastructure for clinical studies of human microbiota, and it is hoped that we wil find ways to identify bacterial factors asociated with human disease and learn how to modify our microbiota to improve our overal health. Major goals of the project include [35]: I. Developing a reference set of microbial genome sequences and preliminary characterization of the human microbiome I. Examining the relationship betwen disease and changes in the human microbiome II. Developing new technologies and tools for computational analysis IV. Establishing a Data Analysis and Coordinating Center (DAC) V. Asesing ethical, legal and social implications of HMP research This work This disertation is a series of projects targeted toward achieving goal II, the development of new tools for computational analysis of large and complex metagenomic datasets. The direction of my research has ben a function of the HMP and metagenomic datasets currently available and others in production. I have organized these projects based on their location in a sequence analysis pipeline: preprocesing (Chapter 2), procesing (Chapter 3), and post-procesing/modeling (Chapters 4 and 5). It is my hope 8 that these ideas and asociated software packages wil be used to improve the analysis of data not only from the HMP, but future metagenomics studies of any environment. We begin with Figaro, a novel algorithm for triming vector and other contaminant sequence from genomic and metagenomic datasets (generated by Sanger or potentialy pyrosequencing technology) without prior knowledge of the artificial sequence itself (Chapter 2). The second study (Chapter 3) is a rigorous analysis of computational methodologies employed to cluster 16S rRNA sequences into species-like groups caled operational taxonomic units (OTUs). In Chapter 4, we addres chalenges in post-procesing large clinical metagenomic datasets with Metastats, a statistical methodology for detecting diferentialy abundant features betwen two populations. Finaly, in an efort to push HMP data as far as posible, I have designed and validated a method for infering microbe-microbe interactions using only longitudinal 16S rRNA data (Chapter 5). To close, Chapter 6 summarizes these works and discusses future research directions of considerable importance. Mathematical and computational contributions The following outlines my original mathematical and computational contributions made in each study: Chapter 2. I developed Figaro to utilize a novel statistical approach that infers unknown vector sequences from the data by examining overepresented kmers at the beginnings of reads. Specificaly, I model the frequency of each kmer as a Poison proces, and weight kmers acording to the likelihood that they are indeed part of a vector sequence. 9 Moreover, I carefully selected statistics to model which kmers are most likely to represent the end of each vector sequence, providing more acurate trim points and thus maximizing overal read length. I implemented Figaro and supporting scripts in Perl and C+ to run quickly on milions of reads. In testing, Figaro trimed 1.5 milion Sanger reads in ~11 minutes. Chapter 3. I collaborated with Saket Navlakha to extend the semi-supervised clustering algorithm VI-cut in order to improve its performance when clustering 16S sequences into OTUs. Specificaly, we incorporate the concept of forbidden nodes ? nodes in a hierarchical decomposition that cannot be cut to create clusters. In the context of OTUs, this prevents the creation of large ambiguous clusters when parts of a tre lack sufficient taxonomic annotation. Chapter 4. I designed the Metastats statistical methodology to specificaly suit the changing characteristics of annotated metagenomic data. Each component of the methodology is wel known in statistical analysis (the nonparametric t-test, Fisher?s exact test, the false discovery rate), but to my knowledge the unique combination of these tests for large- scale analysis of count data has not been employed, and certainly not in the context of comparative metagenomics. Chapter 5. 10 The generalized Lotka-Voltera model has been used in traditional ecology for many years. However, because the number of parameters in this model scales quadraticaly with the number of taxa, most studies only fit the gLV model to datasets involving two or thre organisms. My original computational contribution is a comprehensive Monte Carlo optimization procedure that finds many gLV model fits of suboptimal quality and infers fundamental ecological interactions betwen members of a community based on culling the resulting distributions of parameter estimates. To my knowledge, this approach has never been taken before in the context of fiting gLV models or in metagenomics. I implemented this computationaly intensive optimization procedure both in Matlab and multi-threaded C+ code to improve eficiency. 11 Chapter 2: Figaro ? a novel statistical method for vector sequence removal Background Even as new sequencing technologies become increasingly available [37], Sanger sequencing remains the most widely used technique for decoding the DNA of organisms [38]. High-throughput Sanger sequencing begins by cloning a DNA fragment into a vector (usualy a plasmid) that is then transfected into Escherichia coli in order to amplify the original DNA fragment. Short adapter sequences are often atached to the ends of the fragment to improve the eficiency of the cloning proces [39]. The sequencing reaction is usualy performed using universal sequencing primers that anneal within the vector in the vicinity of the fragment insertion site (splice site). As a result of this proces (highlighted in Figure 1), each sequence contains a smal section of the vector, as wel as the adapters used during cloning, in addition to the original DNA fragment. For the purpose of this paper, we wil refer to any such artifacts as vector sequence. These sequences must be flagged prior to further analysis of the data, in a proces caled vector triming or vector clipping. Several software tools are available for vector removal: Lucy [40], Crossmatch (ww.phrap.org/phredphrapconsed.html), and VecScren (ww.ncbi.nlm.nih.gov/ VecScren). These programs compare each read to the sequence of the cloning vector, then flag sections of the read that have strong similarity to the vector (Crossmatch replaces vector sequence with Xs, Lucy provides a list of clipping coordinates in the fasta header, and VecScren provides a BLAST-like output). The alignments are performed 12 with relaxed parameters in order to acount for the higher eror rates at the beginning of reads (se Figure 2). Furthermore this approach requires thre sets of information: (i) the sequence of the cloning vector; (i) the splice site used for sequencing; and (ii) the sequence of the cloning adapters (if used?information that is often lost when the sequences are deposited in public databases). Note that the NCBI Trace Archive provides a mechanism for recording the location within the read where the vector ends (vector clip point), however this information is often mising or incorrect. Figure 1. DNA from a sample (black) is cloned into a smal circular piece of DNA caled a vector (light gray). Short adapters (white) are used to improve eficiency of cloning the sample DNA. The molecule is then transfected into E. coli, amplified, and then sequenced from both ends starting from priming sites inside the vector. 13 Figure 2. Raw output from sequencing machines contains poor quality sequence on the ends as wel as vector and adapter sequence, in addition to the DNA being sequenced. As an example, at the beginning of September, 2007, approximately 60% (735 milion out of 1.24 bilion) of al shotgun reads from the NCBI Trace Archive had either no vector clip information, or a vector clip point of 0 or 1, indicating the vector clipping information was not provided (clip_vector_left = 0) or was arbitrarily set to the beginning of the read (clip_vector_left = 1). Even when a vector coordinate is provided it is often incorrect, as described below. We examined the shotgun reads used to asemble the Xanthomonas oryzae px099a genome, a dataset for which both vector and quality clipping coordinates had been submited to the Trace Archive by the sequencing center. We considered al reads whose vector clip coordinate occurred at least 8 base pairs (bp) inside the high-quality region, then talied the final 8 bp (8mer) of the supposed vector sequence. These 8mers should represent the end of the vector sequence; therefore, they should be virtualy identical across al reads with the exception of diferences caused by sequencing erors. We examined 7,997 reads originating from a single sequencing library (library id 14 1041054961988). Furthermore, we separately examined reads sequenced with the ?Forward?, and ?Reverse? trace direction in order to avoid any variability due to diferences betwen the vector sequences flanking the splice site. The results, summarized in table 1, highlight a much higher variability in the set of 8mers than can be explained by sequencing eror alone, suggesting the vector clip points are incorectly asigned. Trace direction Forward Reverse Number of reads 3,687 4,310 Four most frequent 8mers and frequency GCGCAGCG GCGCAGC GATCAT GTGCTGA 40 29 29 26 GCGCAGCG GTGCTGA GCGATCG TGCGAT 46 42 37 35 Number of distinct 8mers 1,679 (45.5%) 1,858 (43.1%) Table 1. Frequency of 8mers extracted upstream from the annotated vector clip point in shotgun reads from Xanthomonas oryzae px099a. We only considered reads from the library where the 5? vector clip point was at least 8 bp to the right of the 5? quality clip point. The reads were further binned by sequencing direction. The four most frequent 8mers are shown together with their frequency. The high level of variability indicates erors in the reported clipping coordinates. 15 In this work, we present an algorithm for detecting and removing the vector sequence from the 5? end of reads without prior knowledge of the vector sequences used. This algorithm can, therefore, be used to correctly identify the vector clipping points for sequences obtained from public databases. The code was implemented as a single streamlined module, named Figaro, which can be easily integrated into a high-throughput computational pipeline. The code is distributed under an open-source license through the AMOS package (http:/amos.sourceforge.net). Below we provide a detailed description of the triming algorithm, and highlight its performance on thre datasets: ~1.5 milion Drosophila pseudoobscura reads; and in the de novo asembly of two bacterial genomes. Methods For a set of shotgun reads, Figaro infers the vector sequence from the frequency of occurrence of kmers (DNA segments of length k). Under the asumption that the vector DNA flanking the inserted sequences is the same for al the sequences in a dataset, the most frequent kmers in the data likely represent vector DNA. This asumption is generaly true for shotgun sequencing data, with the following exceptions: (i) diferent sequencing libraries may use diferent vectors; (i) the vector sequences upstream and downstream the splice-site are often diferent (hence ?Forward? and ?Reverse? reads are prefixed by diferent vector DNA); and (ii) when cloning adapters are used, two diferent strings, corresponding to distinct adapter sequence, may prefix the reads even from a single library and orientation. To improve acuracy, the reads are partitioned by library and sequencing direction, if such information is available. 16 Figaro operates in two phases: (i) identification of frequent kmers likely to represent vector DNA (caled vectormers throughout the text); and (i) estimation of the vector clip point for every read, on the basis of the vectormers identified in step (i). These two components of the algorithm are described in detail below. Detection of vectormers The vector sequence can be recognized by identifying kmers that are more frequent at the beginning of reads than anywhere else. Intuitively, the beginning of reads represents the DNA from the vector which is shared by the majority of reads in a dataset. The remaining section of each read should be randomly sampled from the genome, leading to few commonalities betwen distinct reads in the dataset. A kmer frequency table is created which records the number of ocurences of each word of length k within adjacent windows of length L over the first E bases of al reads (a kmer is asigned to the window in which it starts, thus alowing us to count kmers that cross window boundaries). We truncate al reads to a same length E in order to avoid artifacts due to the increased eror rates at the ends of reads. Given a maximum vector cut length, M, we declare the safe zone of the reads to be the region from base M to E (Figure 3). For each kmer K i , if s i s the number of occurences of K i in the safe zone across al reads, then we define its arival rate ? i to be: ! " i = s i E#M Given ? i , we model the number of occurrences of K i as a Poison proces. Leting X be the frequency of K i in a window of length t, X follows a Poison distribution with parameter ? = t? i . Considering f j , the frequency of occurrence of K i within the j th window 17 of length L (Fig. 4), we can estimate the likelihood of observing at least f j occurrences of K i in L base pairs given ? i . Mathematicaly, ! P(X"f j )=1#P(X 1000. These thre independent methods of validation strongly suggested that there are no spurious annotations in the simulated sample. Multiple sequence alignment Al 1677 were aligned using MUSCLE, ClustalW, and NAST [59-61] using default parameters. ClustalW was run with the ?Fast? option for pairwise alignments. In the NAST alignment, al columns containing only gaps were removed, and each MSA was trimed so that every sequence spanned the entire alignment. Distance corrections and clustering methods Distance matrices with Jukes-Cantor, Kimura 2-parameter, and Felsenstein84 corrections were computed using DNADIST with default parameters from the PHYLIP package [53]. 51 Olsen sand F84 distance-corrected matrices were also generated using the ARB package [75] for additional validation. Al distance matrices served as input to DOTUR [62] which uses nearest neighbor, average neighbor, and furthest neighbor clustering to create OTUs. DOTUR additionaly creates OTUs by varying a constant distance threshold D which is used as a criterion for merging two clusters in one. Distance thresholds ranged from 0, 0.01, 0.02, ? 0.45, resulting in a total of 749 OTU sets created by diferent methodologies. Measures of similarity for clusterings We employed two measures of similarity betwen clusterings: the Rand index [76] and the Variation of Information (VI) metric [56]. Examining the values of al clusterings acording to the Rand index and the VI, we found identical rankings betwen the two metrics. Because the Rand index tends to concentrate near 1 given more clusters, we use the VI as the measure of comparison betwen clusterings. In order to provide a reference set of VI distances for known clusters, we measured the VI betwen the true species clustering and the true phylum, clas, order, family, and genus clusterings (Table 7). True clustering VI Phyla 0.171 Clases 0.109 Orders 0.058 Families 0.026 Genera 0.008 Species 0 52 Table 7. Variation of information (VI) distances of true taxonomic clusterings to the known species clustering. Computation of the Rand index and variation of information for clusterings The variation of information criterion is a measure of similarity betwen two partitions (or clusterings) of a given set [56]. For this study, the set is the 1677 16S sequences selected for the artificial environmental sample. Mathematicaly, a given clustering C, is a partition of a set S into disjoint subsets (clusters) where: ! C={C 1 , 2 ,K,C M }, C i "C j =#, and $ i=1 M C i =S. If there are m elements in set S, and we let m i be the number of elements in cluster C i , then ! m=m i i=1 M " . Given two clusterings, C and D, we can examine al pairs of points in S and se whether C and D agre on whether or not they should be in the same cluster. Any pair of points wil fal exclusively into one of the four following categories: 11 ? The point pair is in the same cluster for both C and D. 00 ? The point pair is in diferent clusters for both C and D. 10 ? The point pair is in the same cluster for C, but not for D. 01 ? The point pair is in the same cluster for D, but not for C. Acordingly, the total number of point pairs faling into each category is N 1 , N 0 , N 10 , and N 01 . Given these values, the Rand index is computed as: ! RC,D () = N 11 +N 00 N 11 + 0010 +N 01 . 53 To compute the Variation of information betwen two clusterings, we first find the probability that a randomly selected sequence is in a particular cluster, that is, ! Pi () = m i . Given this discrete probability distribution, the uncertainty of the random variable i, is the entropy asociated with clustering C, defined as: ! H(C)="P(i)#logP(i). i=1 M $ Now, suppose we have two clusterings ! C={C 1 , 2 ,K,C M }, and D={D 1 , 2 ,K,D " M }. Then we calculate the joint distribution ! P(i,j)= C i "D j m describing the similarity of al pairs of clusters betwen C and D. The mutual information betwen the clusterings C and D is then defined to be ! IC,D () =P(i,j)log P(i,j) (i)(j) j=1 " M # i=1 M , and finaly, the variation of information betwen C and D is defined as the sum of the individual clustering entropies les 2 times the mutual information: ! VIC,D () =H(C)+H(D)"2IC,D () . If C and D are identical clusterings, then H(C)=H(D)=I(C,D), and the VI = 0. The VI distance is a true metric, satisfying symmetry, non-negativity, and the triangle inequality. VI-cut method for defining OTUs VI-cut is a procedure that finds a clustering from a hierarchical tre decomposition T that optimaly matches a partial set of known labels, as defined by the variation of information metric [73]. A clustering is defined by choosing a set of nodes in T. Each 54 chosen node c corresponds to a single cluster consisting of al the leaves (sequences) in the subtre rooted at c. The chosen nodes represent a node-cut in the tre such that each leaf belongs to exactly one cluster. Let D represent the partial clustering of annotated sequences such that sequences with the same label are grouped together. The set of chosen nodes corresponds to a node- cut K, which induces a clustering A K . The VI-cut algorithm finds the A K that minimizes the VI distance to D: ! min K VI(A K ,D) Although there are exponentialy many number of possible node-cuts in T, VI-cut finds the optimal one eficiently using dynamic programing. For this study, we modified the VI-cut algorithm by incorporating forbidden nodes, i.e. nodes in T that VI-cut is not alowed to choose. Specificaly, any node n with a corresponding distance ? 0.07 was forbidden. This means that if the cluster induced by n contains a pair of sequences which have a pairwise distance ? 0.07 then n is not alowed to be chosen. To incorporate forbidden nodes into the VI-cut algorithm, we first ran the standard VI-cut algorithm. If the clustering returned contained a forbidden node n, we moved down the tre and replaced n with its closest unforbidden descendants such that each sequence is stil placed in only one cluster. This modification forces the method to cut the tre at distances < 0.07, which helps to cluster large subtres with multiple species that may not have any known labels. ANOVA of methodology components In order to isolate the individual impact of each component in an OTU methodology, we 55 examined the 200 methodologies resulting in the lowest VI distance from the true species clustering, and performed a multi-way analysis of variance (ANOVA) considering four factors: multiple sequence alignment, evolutionary distance correction, clustering algorithm, and distance threshold. Using a linear model with no interactions, we found that the distance threshold alone explains 56% of the total variance in VI (Table 8). This impact was followed by the MSA, the clustering algorithm, and finaly the distance correction, which explained 33%, 7%, <0.01% of the total variance, respectively. This model explains 97% of the total variance, indicating that component interactions are negligible for our purposes. An F test did not detect any statisticaly significant diferences betwen distance corrections (F = 0.002, P = 0.998). We extended this comparison to include the Olsen distance correction in ARB [75], which we found produced OTUs virtualy identical to those created using the F84 correction (Figure 6a). Parameter Sum of Squares Degres of fredom Mean Sq. F Prob > F Distance threshold 0.4411 11 0.0401 23.0160 < 0.0001 MSA 0.0480 2 0.0240 13.7843 < 0.0001 Clustering 0.0099 2 0.0050 2.8503 0.0604 Distance correction < 0.0001 2 < 0.0001 0.0020 0.9980 Eror 0.3171 182 0.0017 Total 0.7910 199 0.0708 Table 8. Multi-way ANOVA table asesing components used in OTU methodologies. The factor with the largest efect on the quality of the OTUs was the distance threshold, followed by the MSA, and then the clustering algorithm. The distance correction explained < 0.01% of the variance and no statisticaly significant diference could be detected betwen the corrections (P = 0.998). 56 Chapter 4: Statistical methods for detecting differentialy abundant features in clinical metagenomic samples Background Broad sequencing of bacterial populations alows us a first glimpse at the many microbes that cannot be analyzed through traditional means (only ~1% of al bacteria can be isolated and independently cultured with current methods [77]). Studies of environmental samples initialy focused on targeted sequencing of individual genes, in particular the 16S subunit of ribosomal RNA [67, 78-80], though more recent studies take advantage of high-throughput shotgun sequencing methods to ases not only the taxonomic composition, but also the functional capacity of a microbial community [18, 30, 81]. Several software tools have been developed in recent years for comparing diferent environments on the basis of sequence data. DOTUR [62], Libshuff [82], ?- libshuff [83], SONs [84], MEGAN [85], UniFrac [86], and TreClimber [87] al focus on diferent aspects of such an analysis. DOTUR clusters sequences into operational taxonomic units (OTUs) and provides estimates of the diversity of a microbial population thereby providing a coarse measure for comparing diferent communities. SONs extends DOTUR with a statistic for estimating the similarity betwen two environments, specificaly, the fraction of OTUs shared betwen two communities. Libshuff and ?- libshuff provide a hypothesis test (Cramer von Mises statistics) for deciding whether two communities are diferent, and TreClimber and UniFrac frame this question in a phylogenetic context. Note that these methods aim to ases whether, rather than how two communities difer. The later question is particularly important as we begin to 57 analyze the contribution of the microbiome to human health. Metagenomic analysis in clinical trials wil require information at individual taxonomic levels to guide future experiments and treatments. For example, we would like to identify bacteria whose presence or absence contributes to human disease and develop antibiotic or probiotic treatments. This question was first addresed by Rodriguez-Brito et al. [88], who use bootstrapping to estimate the p-value asociated with diferences betwen the abundance of biological subsytems. More recently, the software MEGAN of Huson et al. [85] provides a graphical interface that alows users to compare the taxonomic composition of diferent environments. Note that MEGAN is the only one among the programs mentioned above that can be applied to data other than that obtained from 16S rRNA surveys. These tools share one common limitation ? they are al designed for comparing exactly two samples ? therefore have limited applicability in a clinical seting where the goal is to compare two (or more) treatment populations each comprising multiple samples. In this paper, we describe a rigorous statistical approach for detecting diferentialy abundant features (taxa, pathways, subsystems, etc.) betwen clinical metagenomic datasets. This method is applicable to both high-throughput metagenomic data and to 16S rRNA surveys. Our approach extends statistical methods originaly developed for microaray analysis. Specificaly, we adapt these methods to discrete count data and correct for sparse counts. Our research was motivated by the increasing focus of metagenomic projects on clinical applications (e.g. Human Microbiome Project [36]). Note that a similar problem has been addresed in the context of digital gene expresion studies (e.g. SAGE [89]). Lu et al. [90] employ an overdispersed log-linear 58 model and Robinson and Smyth [91] use a negative binomial distribution in the analysis of multiple SAGE libraries. Both approaches can be applied to metagenomic datasets. We compare our tool to these prior methodologies through comprehensive simulations, and demonstrate the performance of our approach by analyzing publicly available datasets, including 16S surveys of human gut microbiota and random sequencing-based functional surveys of infant and mature gut microbiomes and microbial and viral metagenomes. The methods described in this paper have been implemented as a web server and are also available as fre source-code (in R) from http:/metastats.cbcb.umd.edu. Materials & Methods Our approach relies on the following asumptions: (i) we are given data corresponding to two treatment populations (e.g. sick and healthy human gut communities, or individuals exposed to diferent treatments) each consisting of multiple individuals (or samples); (i) for each sample we are provided with count data representing the relative abundance of specific features within each sample, e.g. number of 16S rRNA clones asigned to a specific taxon, or number of shotgun reads mapped to a specific biological pathway or subsystem (se below how such information can be generated using currently available software packages). Our goal is to identify individual features in such datasets that distinguish betwen the two populations, i.e. features whose abundance in the two populations is diferent. Furthermore, we develop a statistical measure of confidence in the observed diferences. The input to our method can be represented as a Feature Abundance Matrix whose rows correspond to specific features, and whose columns correspond to individual 59 metagenomic samples. The cel in the i th row and j th column is the total number of observations of feature i in sample j (Figure 11). Every distinct observation is represented only once in the matrix, i.e. overlapping features are not alowed (the rows correspond to a partition of the set of sequences). Figure 11. Format of the feature abundance matrix. Each row represents a specific taxon, while each column represents a subject or replicate. The frequency of the i th feature in the j th subject (c(i,j) is recorded in the corresponding cel of the matrix. If there are g subjects in the first population, they are represented by the first g columns of the matrix, while the remaining columns represent subjects from the second population. Data normalization To acount for diferent levels of sampling acros multiple individuals, we convert the raw abundance measure to a fraction representing the relative contribution of each feature to each of the individuals. This results in a normalized version of the matrix described 60 above, where the cel in the i th row and the j th column (which we shal denote f ij ) is the proportion of taxon i observed in individual j. We chose this simple normalization procedure because it provides a natural representation of the count data as a relative abundance measure, however other normalization approaches can be used to ensure observed counts are comparable across samples, and we are currently evaluating several such approaches. Analysis of diferential abundance For each feature i, we compare its abundance across the two treatment populations by computing a two-sample t statistic. Specificaly, we calculate the mean proportion it x, and variance 2 it of each treatment t from which n t subjects (columns in the matrix) were sampled: ! x it = 1 n t f ij j" treatment t # s it 2 = 1 n t $1 f ij $x it ( ) 2 j" treatment t We then compute the two-sample t statistic: ! t i = x i1 "x i2 s i1 2 n 1 + s i2 n 2 . Features whose t statistics exceds a specified threshold can be infered to be diferentialy abundant across the two treatments (two-sided t-test). 61 Asesing significance The threshold for the t statistic is chosen such as to minimize the number of false positives (features incorrectly determined to be diferentialy abundant). Specificaly, we try to control the p-value?the likelihood of observing a given t statistic by chance. Traditional analyses compute the p-value using the t distribution with an appropriate number of degres of fredom. However, an implicit asumption of this procedure is that the underlying distribution is normal. We do not make this asumption, but rather estimate the null distribution of t i non-parametricaly using a permutation method as described in Storey and Tibshirani [92]. This procedure, also known as the nonparametric t-test has been shown to provide acurate estimates of significance when the underlying distributions are non-normal [93, 94]. Specificaly, we randomly permute the treatment labels of the columns of the abundance matrix and recalculate the t statistics. Note that the permutation maintains that there are n 1 replicates for treatment 1 and n 2 replicates for treatment 2. Repeating this procedure for B trials, we obtain B sets of t statistics: t 1 0b , ?, t M 0b , b = 1, ?, B, where M is the number of rows in the matrix. For each row (feature), the p-value asociated with the observed t statistic is calculated as the fraction of permuted tests with a t statistic greater than or equal to the observed t i : ! p i = #t i 0b "t i ,b=1,...,B { } B . This approach is inadequate for smal sample sizes in which there are a limited number of possible permutations of al columns. As a heuristic, if les than 8 subjects are used in 62 either treatment, we pool al permuted t statistics together into one null distribution and estimate p-values as: ! p i = 1 BM #j:t j 0b "t i ,j=1,...,M { } b=1 B . Note that the choice of 8 for the cutoff is simply heuristic based on experiments during the implementation of our method. Our approach is specificaly targeted at datasets comprising multiple subjects ? for smal data-sets approaches such as that proposed by Rodriguez-Brito et. al. [88] might be more appropriate. Unles explicitly stated, al experiments described below used 1000 permutations. In general, the number of permutations should be chosen as a function of the significance threshold used in the experiment. Specificaly, a permutation test with B permutations can only estimate p-values as low as 1/B (in our case 10 -3 ). In datasets containing many features, larger numbers of permutations are necesary to acount for multiple hypothesis testing isues (further corrections for this case are discussed below). Precision of the p- value calculations is obviously improved by increasing the number of permutations used to approximate the null distribution, at a cost, however, of increased computational time. For certain distributions, smal p-values can be eficiently estimated using a technique caled importance sampling. Specificaly, the permutation test is targeted to the tail of the distribution being estimated, leading to a reduction in the number of permutations necesary of up to 95% [95, 96]. We intend to implement such an approach in future versions of our software. 63 Multiple hypothesis testing correction For complex environments (many features/taxa/subsystems), the direct application of the t statistic as described can lead to large numbers of false positives. For example, choosing a p-value threshold of 0.05 would result in 50 false positives in a dataset comprising 1000 organisms. An intuitive correction involves decreasing the p-value cutoff proportional to the number of tests performed (a Bonferoni correction), thereby reducing the number of false positives. This approach, however, can be too conservative when a large number of tests are performed [21]. An alternative approach aims to control the false discovery rate (FDR), which is defined as the proportion of false positives within the set of predictions [97], in contrast to the false positive rate defined as the proportion of false positives within the entire set of tests. In this context, the significance of a test is measured by a q-value, an individual measure of the FDR for each test. We compute the q-values using the following algorithm, based on Storey and Tibshirani [92]. This method asumes that the p-values of truly null tests are uniformly distributed, asumption that holds for the methods used in Metastats. Given an ordered list of p-values, p (1) ? p (2) ? ? ? p (m) , (where m is the total number of features), and a range of values ? = 0, 0.01, 0.02, ?, 0.90, we compute ! ? " 0 # () = #{p j >#} m1$# () . Next, we fit ! ? " 0 # () with a cubic spline with 3 degres of fredom, which we denote ! ? f , and let ! ? " 0 = ? f (1). Finaly, we estimate the q-value corresponding to each ordered p- value. First, ! ? q p (m) ( =minp (m) " ? # 0 , 1 ( ) . Then for i = m-1, m-2, ?, 1, 64 ! ? q p (i) ( =min ? " 0 #m#p (i) i , ? q p (i+1) ( $ % & ? ( . Thus, the hypothesis test with p-value ! p (i) has a corresponding q-value of ! ? q p (i) ( . Note that this method yields conservative estimates of the true q-values, i.e. ! ? q p (i) ( "qp (i) ( . Our software provides users with the option to use either p-value or q-value thresholds, irespective of the complexity of the data. Handling sparse counts For low frequency features, e.g. low abundance taxa, the nonparametric t?test described above is not acurate [98]. We performed several simulations (data not shown) to determine the limitations of the nonparametric t-test for sparsely-sampled features. Correspondingly, our software only applies the test if the total number of observations of a feature in either population is greater than the total number of subjects in the population (i.e. the average across subjects of the number of observations for a given feature is greater than one). We compare the diferential abundance of sparsely-sampled (rare) features using Fisher?s exact test. Fisher?s exact test models the sampling proces acording to a hypergeometric distribution (sampling without replacement). The frequencies of sparse features within the abundance matrix are pooled to create a 2x2 contingency table (Figure 12), which acts as input for a two-tailed test. Using the notation from Figure 12, the null hypergeometric probability of observing a 2x2 contingency table is: 65 ! p= R 1 f 11 " # $ % & ? R 2 f 21 " # $ % & ? n C 1 " # $ % & ? , where ! R 1 =f 11 +f 12 , 1 =f 21 +f 22 , C 1 =f 11 +f 21 , n=f 11 +f 12 +f 21 +f 22 . By calculating this probability for a given table, and al tables more extreme than that observed, one can calculate the exact probability of obtaining the original table by chance asuming that the null hypothesis (i.e. no diferential abundance) is true [98]. Note that an alternative approach to handling sparse features is proposed in microaray literature. The Significance Analysis of Microarays (SAM) method [99] addreses low levels of expresion using a modified t statistic. We chose to use Fisher?s exact test due to the discrete nature of our data, and because prior studies performed in the context of digital gene expresion indicate Fisher?s test to be efective for detection of diferential abundance [100]. Figure 12. Detecting diferential abundance for sparse features. A 2x2 contingency table is used in Fisher?s exact test for diferential abundance betwen rare features. f 1 is the number of observations of feature i in al individuals from treatment 1. f 21 is the number of observations that are not feature i in al individuals from treatment 1. f 12 and f 2 are similarly defined for treatment 2. 66 Creating the Feature Abundance Matrix The input to our method, the Feature Abundance Matrix, can be easily constructed from both 16S rRNA and random shotgun data using available software packages. Specificaly for 16S taxonomic analysis, tools such as the RDP Bayesian clasifier [52] and Grengenes SimRank [23] output easily-parseable information regarding the abundance of each taxonomic unit present in a sample. As a complementary, unsupervised approach, 16S sequences can be clustered with DOTUR [62] into operational taxonomic units (OTUs). Abundance data can be easily extracted from the ?*.list? file detailing which sequences are members of the same OTU. Shotgun data can be functionaly or taxonomicaly clasified using MEGAN [85], CARMA [101], or MG-RAST [24]. MEGAN and CARMA are both capable of outputting lists of sequences asigned to a taxonomy or functional group. MG-RAST provides similar information for metabolic subsystems that can be downloaded as a tab-delimited file. Al data-types described above can be easily converted into a Feature Abundance Matrix suitable as input to our method. In the future we also plan to provide converters for data generated by commonly-used analysis tools. Data used in this paper Human gut 16S rRNA sequences were prepared as described in Eckburg et al. and Ley et al. (2006) and are available in GenBank, acesion numbers: DQ793220-DQ802819, DQ803048, DQ803139-DQ810181, DQ823640-DQ825343, AY974810-AY986384. In our experiments we asigned al 16S sequences to taxa using a na?ve Bayesian clasifier curently employed by the Ribosomal Database Project I (RDP) [52]. COG profiles of 67 13 human gut microbiomes were obtained from the supplementary material of Kurokawa et al. [102]. We acquired metabolic functional profiles of 85 metagenomes from the online supplementary materials of Dinsdale et al. (2008) (http:/ww.thesed.org/ DinsdaleSupplementalMaterial/). Results Comparison with other statistical methods As outlined in the introduction, statistical packages developed for the analysis of SAGE data are also applicable to metagenomic datasets. In order to validate our method, we first designed simulations and compared the results of Metastats to Student?s t-test (with pooled variances) and two methods used for SAGE data: a log-linear model (Log-t) by Lu et al. [90], and a negative binomial (NB) model developed by Robinson and Smyth [91]. We designed a metagenomic simulation study in which ten subjects are drawn from two groups - the sampling depth of each subject was determined by random sampling from a uniform distribution betwen 200 and 1000 (these depths are reasonable for metagenomic studies). Given a population mean proportion p and a dispersion value ?, we sample sequences from a beta-binomial distribution ?(?,?), where ? = p(1/? -1) and ? = (1-p)(1/? -1). Note that data from this sampling procedure fits the asumptions for Lu et al. as wel as Robinson and Smyth and therefore we expect them to do wel under these conditions. Lu et al. designed a similar study for SAGE data, however, for each simulation, a fixed dispersion was used for both populations and the dispersion 68 estimates were remarkably smal (? = 0, 8e-06, 2e-05, 4.3e-05). Though these values may be reasonable for SAGE data, we found that they do not acurately model metagenomic data. Figure 13 displays estimated dispersions within each population for al features of the metagenomic datasets examined below. Dispersion estimates range from 1e-07 to 0.17, and rarely do the two populations share a common dispersion. Thus we designed our simulation so that ? is chosen for each population randomly from a uniform distribution betwen 1e-08 and 0.05, alowing for potential significant diferences betwen population distributions. For each set of parameters, we simulated 1000 feature counts, 500 of which are generated under p 1 = p 2 , the remainder are diferentialy abundant where a*p 1 = p 2 , and compared the performance of each method using receiver-operating-characteristic (ROC) curves. Figure 14 displays the ROC results for a range of values for p and a. For each set of parameters, Metastats was run using 5000 permutations to compute p-values. Metastats performs as wel as other methods, and in some cases is preferable. We also found that in most cases our method was more sensitive than the negative binomial model, which performed poorly for high abundance features. 69 Figure 13. Dispersion estimates (?) for thre metagenomic datasets used in this study. These plots compare dispersion values betwen (A) obese and lean human gut taxonomic data, (B) infant and mature human gut COG asignments, and (C) microbial and viral subsystem annotations. We find a wide range of possible dispersions in this data and significant diferences in dispersions betwen two populations. 70 71 Figure 14. ROC curves comparing statistical methods in a simulation study. Sequences were selected from a beta-binomial distribution with variable dispersions and group mean proportions p 1 and p 2 . For each set of parameters, we simulated 1000 trials, 500 of which are generated under the null hypothesis (p 1 = p 2 ), and the remainder are diferentialy abundant where a*p 1 = p 2 . For example, p=0.2 and a=2 indicates features comprising 20% of the population that difer two-fold in abundance betwen two populations of interest. Parameter values for p 1 and a are shown above each plot. 72 Our next simulation sought to examine the acuracy of each method under extreme sparse sampling. As shown in the datasets below, it is often the case that a feature may not have any observations in one population, and so it is esential to employ a statistical method that can addres this frequent characteristic of metagenomic data. Under the same asumptions as the simulation above, we tested a = 0 and 0.01, thereby significantly reducing observations of a feature in one of the populations. The ROC curves presented in Figure 15 reveal that Metastats outperforms other statistical methods in the face of extreme sparsenes. Holding the false positive rate (x-axis) constant, Metastats shows increased sensitivity over al other methods. The poor performance of Log-t is noteworthy given it is designed for SAGE data that is also potentialy sparse. Further investigation revealed that the Log-t method results in a highly inflated dispersion value if there are no observations in one population, thereby reducing the estimated significance of the test. Finaly, we selected a subset of the Dinsdale et al. [18] metagenomic subsystem data (described below), and randomly asigned each subject to one of two populations (20 subjects per population). Al subjects were actualy from the same population (microbial metagenomes), thus the null hypothesis is true for each feature tested (no feature is diferentialy abundant). We ran each methodology on this data, recording computed p-values for each feature. Repeating this procedure 200 times, we simulated tests of 5200 null features. Table 9 displays the number of false positives incurred by each methodology given diferent p-value thresholds. The results indicate that the negative binomial model results in an exceptionaly high number of false positives 73 relative to the other methodologies. Student?s t-test and Metastats perform equaly wel in estimating the significance of these null features, while Log-t performs slightly beter. These studies show that Metastats consistently performs as wel as al other applicable methodologies for deeply-sampled features, and outperforms these methodologies on sparse data. Below we further evaluate the performance of Metastats on several real metagenomic datasets. Figure 15. ROC curves comparing statistical methods in a simulation study for extreme sparse sampling. Sequences were selected from a beta-binomial distribution with variable dispersions and group mean proportions p 1 and p 2 . For each set of parameters, we simulated 1000 trials, 500 of which are generated under the null hypothesis (p 1 = p 2 ), and the remainder are diferentialy abundant where a*p 1 = p 2 . For example, p=0.2 and a=2 74 indicates features comprising 20% of the population that difer two-fold in abundance betwen two populations of interest. Parameter values for p 1 and a are shown above each plot. Number of False Positives P ? Metastats Student-t Log-t NB 0.001 7 4 4 109 0.005 25 25 24 121 0.01 51 52 43 133 Table 9. Comparison of false positives found by diferent methodologies. Using real metagenomic data, we simulated features with no diferential abundance by randomly dividing subjects from a single population into two subpopulations. We found that for a stringent p-value threshold of 0.001, the negative binomial model (NB) resulted in a false positive rate 20 times higher than the other methodologies. The Log-t of Lu et al. resulted in the lowest false positive rate among the methods tested while Student?s test and Metastats performed equaly wel. 75 Taxa associated with human obesity In a recent study, Ley et al. [20] identified gut microbes asociated with obesity in humans and concluded that obesity has a microbial element, specificaly that Firmicutes and Bacteroidetes are bacterial divisions diferentialy abundant betwen lean and obese humans. Obese subjects had a significantly higher relative abundance of Firmicutes and a lower relative abundance of Bacteriodetes than the lean subjects. Furthermore, obese subjects were placed on a calorie-restricted diet for one year, after which the subjects? gut microbiota more closely resembled that of the lean individuals. We obtained the 20,609 16S rRNA genes sequenced in Ley et al. and asigned them to taxa at diferent levels of resolution (note that 2,261 of the 16S sequences came from a previous study [19]). We initialy sought to re-establish the primary result from this paper using our methodology. Table 10 ilustrates that our method agred with the results of the original study: Firmicutes are significantly more abundant in obese subjects (P = 0.003) and Bacteroidetes are significantly more abundant in the lean population (P < 0.001). Furthermore, our method also detected Actinobacteria to be diferentialy abundant, a result not reported by the original study. Approximately 5% of the sample was composed of Actinobacteria in obese subjects and was significantly les frequent in lean subjects (P = 0.004). Collinsela and Eggerthela were the most prevalent Actinobacterial genera observed, both of which were overabundant in obese subjects. These organisms are known to ferment sugars into various faty acids [103], further strengthening a possible connection to obesity. Note that the original study used Students t-test, leading to a p-value for the observed diference within Actinobacteria of 0.037, 9 76 times larger than our calculation. This highlights the sensitivity of our method and explains why this diference was not originaly detected. To explore whether we could refine the broad conclusions of the initial study, we re-analyzed the data at more detailed taxonomic levels. We identified thre clases of organisms that were diferentialy abundant: Clostridia (P = 0.005), Bacteroidetes (P < 0.001), and Actinobacteria (P = 0.003). These thre were the dominant members of the corresponding phyla (Firmicutes, Bacteroides, Actinobacteria, respectively) and followed the same distribution as observed at a coarser level. Metastats also detected nine diferentialy abundant genera acounting for more than 25% of the 16S sequences sampled in both populations (P ? 0.01). Syntrophococcus, Ruminococcus, and Collinsela were al enriched in obese subjects, while Bacteroides on average were eight times more abundant in lean subjects. For taxa with several observations in each subject, we found good concordance betwen our results (p-value estimates) and those obtained with most of the other methods (Table 10). Surprisingly, we found that the negative binomial model of Robinson and Smyth failed to detect several strongly diferentialy abundant features in these datasets (e.g the hypothesis test for Firmicutes results in a p-value of 0.87). This may be due in part to dificulties in estimating the parameters of their model for our datasets and further strengthens the case for the design of methods specificaly tuned to the characteristics of metagenomic data. For cases where a particular taxon had no observations in one population (e.g. Terasakiela), the methods proposed for SAGE data sem to perform poorly. 77 P values Taxon Obese Lean Metastats Student-t Log-t NB Phyla Bacteroidetes 2.902 ?1.067 25.652?4.576 0.002 0.00 0.004 0.00 Firmicutes 89.318?2.23 72.83?4.812 0.028 0.025 0.030 0.8701 Actinobacteria 4.490?1.345 0.47?0.179 0.037 0.0371 0.004 0.073 Clases Bacteroidetes (clas) 2.72?1.065 25.652?4.576 0.001 0.00 0.005 0.001 Actinobacteria (clas) 4.490?1.345 0.47?0.179 0.024 0.0371 0.004 0.1858 Clostridia 84.63?2.38 6.907?5.79 0.036 0.042 0.052 0.9797 Genera Syntrophococus 2.380?0.383 0.66?0.37 0.014 0.07 0.067 0.4860 Terasakiela 0.00?0 0.15?0.15 0.016 0.1986 0.963 0.016 Ruminococus 26.276?4.454 10.707?2.094 0.023 0.0207 0.039 0.639 Marinilabilia 0.010?0.010 0.138?0.138 0.024 0.2353 0.0467 0.01 Colinsela 3.565?1.187 0.154?0.154 0.052 0.0451 0.046 0.6545 Bacteroides 1.841?0.963 14.623?4.44 0.056 0.023 0.0105 0.012 Paludibacter 0.00?0 0.093?0.069 0.059 0.0896 0.963 0.00 Bryantela 0.461?0.051 0.151?0.102 0.065 0.072 0.0304 0.0487 Desulfovibrio 0.031?0.031 0.145?0.145 0.073 0.390 0.2315 0.0156 Table 10. Diferentialy abundant taxa betwen lean and obese human gut microflora. For the phylum, clas, and genus levels (mean percentage ? s.e., p-value ? 0.01) we succesfully re-established the major result of Ley et al., and uncovered a new diference within Actinobacteria. Both Firmicutes and Actinobacteria have significantly higher relative abundances in obese people, while Bacteroidetes make up a higher proportion of the gut microbiota in the lean population. Results reveal Clostridia as the primary component of the diferential abundance observed within Firmicutes, and Bacteroidetes and Actinobacteria clases explain the diferential abundance observed within the 78 corresponding phyla. Using this p-value threshold, we expect les than one false positive among these results. The last four columns display the computed p-values for diferent statistical methods, including Metastats and the overdispersion methods of Lu et al. (Log- t) and Robinson and Smyth (NB). These results reveal NB and Student?s t-test to be overly-conservative. Diferentially abundant COGs betwen mature and infant human gut microbiomes Targeted sequencing of the 16S rRNA can only provide an overview of the diversity within a microbial community but cannot provide any information about the functional roles of members of this community. Random shotgun sequencing of environments can provide a glimpse at the functional complexity encoded in the genes of organisms within the environment. One method for defining the functional capacity of an environment is to map shotgun sequences to homologous sequences with known function. This strategy was used by Kurokawa et al. [102] to identify clusters of orthologous groups (COGs) in the gut microbiomes of 13 individuals, including four unweaned infants. We examined the COGs determined by this study across al subjects and used Metastats to discover diferentialy abundant COGs betwen infants and mature (> 1 year old) gut microbiomes. This is the first direct comparison of these two populations as the original study only compared each population to a reference database to find enriched gene sets. Due to the high number of features (3868 COGs) tested for this dataset and the limited number of infant subjects available, our method used the pooling option to compute p- values (we chose 100 permutations), and subsequently computed q-values for each feature. Using a threshold of Q ? 0.05 (controlling the false discovery rate to 5%), we 79 detected 192 COGs that were diferentialy abundant betwen these two populations (Table A1). Se Table 11 for most abundant detected COGs and others discussed below. The most abundant enriched COGs in mature subjects included signal transduction histidine kinase (COG0642), outer membrane receptor proteins, such as Fe transport (COG1629), and Beta-galactosidase/beta-glucuronidase (COG3250). These COGs were also quite abundant in infants, but depleted relative to mature subjects. Infants maintained enriched COGs related to sugar transport systems (COG1129) and transcriptional regulation (COG1475). This over-abundance of sugar transport functions was also found in the original study, strengthening the hypothesis that the unweaned infant gut microbiome is specificaly designed for the digestion of simple sugars found in breast milk. Similarly, the depletion of Fe transport proteins in infants may be asociated with the low concentration of iron in breast milk relative to cow?s milk [104]. Despite this low concentration, infant absorption of iron from breast milk is remarkably high, and becomes poorer when infants are weaned, indicating an alternative mechanism for uptake of this mineral. The potential for a diferent mechanism is supported by the detection of a Feredoxin-like protein (COG2440) that was 11 times more abundant in infants than in mature subjects, while Feredoxin (COG1145) was significantly enriched in mature subjects. 80 Mature Infants COG id Description mean stderr mean stderr Metastat qvalue COG0205 6- phosphofructokinase 0.017 0.001 0.006 0.002 0.0313 COG0358 DNA primase (bacterial type) 0.024 0.001 0.008 0.001 0.072 COG0507 ATP-dependent exoDNAse (exonuclease V), alpha subunit - helicase superfamily I member 0.016 0.001 0.008 0.001 0.0349 COG0526 Thiol-disulfide isomerase and thioredoxins 0.028 0.002 0.014 0.002 0.0371 COG0621 2-methylthioadenine synthetase 0.017 0.001 0.008 0.002 0.045 COG0642 Signal transduction histidine kinase 0.0132 0.009 0.07 0.004 0.027 COG067 Predicted oxidoreductases (related to aryl- alcohol dehydrogenases) 0.012 0.001 0.021 0.001 0.0282 COG0739 Membrane proteins related to metalloendopeptidases 0.024 0.001 0.006 0.001 0.072 COG0745 Response regulators consisting of a CheY- like receiver domain and a winged-helix DNA-binding domain 0.076 0.003 0.051 0.004 0.0352 COG0747 ABC-type dipeptide transport system, periplasmic component 0.01 0.001 0.027 0.003 0.0352 81 COG113 Gamma- aminobutyrate permease and related permeases 0.002 0.001 0.018 0.003 0.0349 COG129 ABC-type sugar transport system, ATPase component 0.013 0.001 0.028 0.003 0.0492 COG145 Ferredoxin 0.017 0.001 0.005 0.002 0.0217 COG196 Chromosome segregation ATPases 0.017 0.001 0.007 0.001 0.0108 COG1249 Pyruvate/2- oxoglutarate dehydrogenase complex, dihydrolipoamide dehydrogenase (E3) component, and related enzymes 0.006 0.001 0.01 0.001 0.0349 COG1263 Phosphotransferase system IIC components, glucose/maltose/N- acetylglucosamine- specific 0.012 0.001 0.031 0.003 0.0313 COG1475 Predicted transcriptional regulators 0.025 0.002 0.014 0.002 0.0435 COG1595 DNA-directed RNA polymerase specialized sigma subunit, sigma24 homolog 0.053 0.004 0.013 0.003 0.0206 COG1609 Transcriptional regulators 0.03 0.002 0.092 0.013 0.0424 COG1629 Outer membrane receptor proteins, mostly Fe transport 0.012 0.016 0.013 0.007 0.0313 COG1762 Phosphotransferase system mannitol/fructose- specific IIA domain (Ntr-type) 0.004 0.001 0.017 0.002 0.0293 82 COG1961 Site-specific recombinases, DNA invertase Pin homologs 0.059 0.004 0.018 0.006 0.0345 COG204 Response regulator containing CheY-like receiver, AA-type ATPase, and DNA- binding domains 0.019 0.002 0.005 0.002 0.0421 COG24 Membrane protein involved in the export of O-antigen and teichoic acid 0.019 0.001 0.009 0.001 0.029 COG2376 Dihydroxyacetone kinase 0.002 0 0.009 0.001 0.0278 COG240 Ferredoxin-like protein 0 0 0.002 0 0.0394 COG2893 Phosphotransferase system, mannose/fructose- specific component IIA 0.003 0.001 0.01 0.001 0.0352 COG3250 Beta- galactosidase/beta- glucuronidase 0.056 0.004 0.023 0.006 0.0435 COG3451 Type IV secretory pathway, VirB4 components 0.03 0.001 0.009 0.003 0.0157 COG3505 Type IV secretory pathway, VirD4 components 0.029 0.001 0.01 0.003 0.0278 COG3525 N-acetyl-beta- hexosaminidase 0.016 0.002 0.004 0.001 0.0352 COG3537 Putative alpha-1,2- mannosidase 0.02 0.003 0.002 0.002 0.0352 COG371 Transcriptional antiterminator 0.004 0.001 0.02 0.003 0.039 COG3712 Fe2+-dicitrate sensor, membrane component 0.023 0.003 0 0 0.028 COG4206 Outer membrane cobalamin receptor protein 0.021 0.003 0.003 0.001 0.0313 83 COG471 Outer membrane receptor for ferrienterochelin and colicins 0.039 0.005 0.006 0.003 0.036 Table 11. Diferentialy abundant COGs betwen infant and mature human gut microbiomes using a q-value threshold of 0.05. Of the 192 diferentialy abundant COGs detected, this table displays the most abundant 25 COGs in either mature or infant gut microbiomes. Using this threshold we expect les than 10 false positives in this dataset. 84 Diferentially abundant metabolic subsystems in microbial and viral metagenomes A recent study by Dinsdale et al. profiled 87 diferent metagenomic shotgun samples (~15 milion sequences) using the SED platform (http:/ww.thesed.org) [18] to se if biogeochemical conditions correlate with metagenome characteristics. We obtained functional profiles from 45 microbial and 40 viral metagenomes analyzed in this study. Within the 26 subsystems (abstract functional roles) analyzed in the Dinsdale et al. study, we found 13 to be significantly diferent (P ? 0.05) betwen the microbial and viral samples (Table 12). Subsystems for RNA and DNA metabolism were significantly more abundant in viral metagenomes, while nitrogen metabolism, membrane transport, and carbohydrates were al enriched in microbial communities. The high levels of RNA and DNA metabolism in viral metagenomes ilustrate their need for a self-sufficient source of nucleotides. Though the diferences described by the original study did not include estimates of significance, our results largely agred with the authors? qualitative conclusions. However, due to the continuously updated annotations in the SED database since the initial publication, we found several diferences betwen our results and those originaly reported. In particular we found virulence subsystems to be les abundant overal than previously reported, and could not find any significant diferences in their abundance betwen the microbial and viral metagenomes. 85 Subsystem microbial viral Metastats p value Carbohydrates 17.01 ? 0.7 12.87 ? 0.82 0.01 Amino Acids and Derivatives 9.29 ? 0.46 7.58 ? 0.5 0.019 Respiration 8.24 ? 1.34 3.89 ? 0.46 0.01 Photosynthesis 7.13 ? 2.38 1.16 ? 0.36 0.017 Cofactors, Vitamins, and Pigments 5.54 ? 0.27 6.4 ? 0.26 0.02 Experimental Subsystems 4.8 ? 0.31 5.80 ? 0.36 0.050 DNA Metabolism 3.9 ? 0.24 9.18 ? 1.06 0.01 Cel Wal and Capsule 3.73 ? 0.27 5.64 ? 0.71 0.09 RNA Metabolism 3.65 ? 0.21 5.23 ? 0.71 0.03 Nucleosides and Nucleotides 3.38 ? 0.18 7.72 ? 0.74 0.01 Membrane Transport 2.04 ? 0.1 1.30 ? 0.15 0.01 Nitrogen Metabolism 1.47 ? 0.08 0.93 ? 0.10 0.01 Faty Acids and Lipids 1.46 ? 0.07 1.05 ? 0.1 0.04 Table 12. Diferentialy abundant metabolic subsystems betwen microbial and viral metagenomes (mean percentage ? s.e., p-values ? 0.05). Using this threshold we expect les than one false positive in the dataset. We find that viral metagenomes are significantly enriched for nucleotides and nucleosides and DNA metabolism, consistent with the viruses? need for self-sufficiency. Proceses for respiration, photosynthesis, and carbohydrates are overepresented in microbial metagenomes. Discussion We have presented a statistical method for handling frequency data to detect diferentialy abundant features betwen two populations. This method can be applied to the analysis of any count data generated through molecular methods, including random shotgun sequencing of environmental samples, targeted sequencing of specific genes in a metagenomic sample, digital gene expresion surveys (e.g. SAGE [100]), or even whole- genome shotgun data (e.g. comparing the depth of sequencing coverage across asembled 86 genes). Comparisons on both simulated and real dataset indicate that the performance of our software is comparable to other statistical approaches when applied to wel-sampled datasets, and outperforms these methods on sparse data. Our method can also be generalized to experiments with more than two populations by substituting the t-test with a one-way ANOVA test. Furthermore, if only a single sample from each treatment is available, a chi-squared test can be used instead of the t-test. [98]. In the coming years metagenomic studies wil increasingly be applied in a clinical seting, requiring new algorithms and software tols to be developed that can exploit data from hundreds to thousands of patients. The methods described above represent an initial step in this direction by providing a robust and rigorous statistical method for identifying organisms and other features whose diferential abundance correlates with disease. These methods, asociated source code, and a web interface to our tools are frely available at http:/metastats.cbcb.umd.edu. 87 Chapter 5: Inferring microbial interaction webs from time-series metagenomic data Note in this chapter, my contributions include designing the interaction interference methodology and performing all computational experiments. Peter Turnbaugh and Jef Gordon performed the actual humanized mouse experiments in a previously published study. I apply my methodology to their data as described below. Background In the newly established field of metagenomics, high-throughput DNA sequencing technologies enable researchers to examine the taxonomic composition and functional capabilities of complex microbial environments. Most recent metagenomic studies have focused on samples from a single time-point, however evidence is mounting that microbial communities are often not at equilibrium, rather are constantly shifting state and even oscilating [105-108]. Consequently, there is an imediate need for studies examining the temporal variation in microbial populations [26, 109]. Only a limited number of metagenomics studies have investigated the spatial and temporal dynamics of microbial communities. Eckburg et al. performed 16S rRNA analyses on mucosal samples along the human endogenous intestinal tract (as wel as fecal samples), revealing not only extensive bacterial diversity, but also remarkable variation throughout the major sections of the colon [19]. Ley et al. analyzed the temporal changes in obese human gut microbiota over the course of a diet [20]. This study found that in obese subjects placed on a diet, the gut microflora shifted towards a state similar to that of their lean counterparts. Using 16S-based oligonucleotide arays to characterize taxonomic diversity, Palmer and colleagues followed the development of gut microbiota in infants from birth through the first year of life, and found that although the 88 same groups of microbes dominate gut microflora, the entire community is highly variable during the first year of life for each individual newborn [80]. Additional metagenomic studies have explored the dynamic change of microbial communities within ocean water and sediment at diferent depths [110, 111], on apple surfaces during crop cycles [15], and inside the human gut throughout the course of antibiotic treatments [48]. Longitudinal studies wil not only describe a new dimension of bacterial populations for scientists, they wil also aid in modeling these systems. Computational models, supplemented by longitudinal data, wil provide an opportunity to realisticaly model community dynamics and validate predictions. In this context, mathematical models can be used to study the underlying interactions betwen microbes, evaluate the efects of environmental factors, and ultimately, forecast the reaction of a microbial population to perturbation. In this study, we employ the generalized Lotka-Voltera (gLV) model to predict microbe-microbe interactions from time-series metagenomic data. This model has been widely applied in studies of microbial and macro-scale (i.e. animal) ecology to characterize trophic and non-trophic interactions betwen organisms [112-114]. A variety of dynamic regimes can be captured using the gLV model including equilibrium convergence, periodicity, and chaos (i.e. unpredictable behavior beyond some time window). Furthermore, in contrast to other modeling approaches (such as generalized additive models [115, 116]), the gLV model formulation alows an intuitive interpretation of its parameters as natural ecological measures and interactions betwen members of a community. For the purposes of this work, we focus on (i) the prediction of interaction 89 direction (e.g. taxon i inhibits/enhances the growth rate of taxon j) and (i) the ancilary estimation of interaction strengths in the approximated web. We present a reliable prediction methodology that computes confidence scores for interaction direction based on the distributions of estimated parameters in the gLV model. We further validate our approach on several simulated microbial populations. Applying our method to a metagenomic dataset following ?humanized? mouse gut microbiota over a period of eight weks, we identify several compeling interactions betwen dominant members of the intestinal tract. Materials & Methods Modeling microbial communities Recent studies atempting to model interacting microbial populations have typicaly used one of two approaches: generalized additive models (GAMs) and generalized Lotka- Voltera (gLV) models, which we now compare below. GAMs are a general statistical regresion approach that incorporate smoothing splines to describe nonlinear relationships betwen the predictor and response variables [117]. In the context of organismal communities, the change in abundance (or logarithmic abundance) of the i th organism, ! "N i , is modeled as a sum of nonparametric smooth functions: ! "N i =b i +f ij N j () j # +$ i . (1) Here, N j is the abundance (or log abundance) of taxon j, b i s an intercept term, ! f ij N j () is a smooth function specifying the efects of taxon j on taxon i, and ! " i is a noise term. 90 Trosvik and colleagues have used this approach to model both artificialy constructed and natural bacterial communities [116, 118]. In contrast, generalized Lotka-Voltera models are deterministic models developed specificaly for analyzing communities comprised of interacting individuals (originaly in the context of predator-prey relationships). These models have been employed for decades in ecological studies of natural interacting populations at macro- and micro-scales [112-114]. A discrete version of the gLV model is formulated as the following system of first-order diference equations: ! N i t+1 () =N i t () expr i 1+ 1 K i " ij N j t () j # $ % & ? ( * + , - , . / , 0 , (2) where ! N i t () is the density of taxon i at time t, ! r i is the reproductive rate of taxon i, and K i is its carying capacity within the environment (i.e. the theoretical equilibrium of taxon i in the absence of al other taxa). Each coeficient ! " ij is a measure of the overal efect of taxon j on taxon i. The general interaction web can berepresented as a matrix: ! " 11 " 12 L" 1n 21222n MMOM " n1 " n2 L" nn # $ % % & ? ( ( , in which any symmetric pair ! " ij , ji ( ) , defines the relationship betwen taxa i and j (se Table 13). Organisms within the same taxon are asumed to be competing, thus ? i = -1 for al i. This model represents the discrete version of the generalized Lotka-Voltera ODE model: 91 ! ? N i =r i N i + r i K i " ij N ij j=1 n # , (3) where the rate of change of the i th taxon ! ? N i , depends on its current abundance ! N i as wel as the abundances of al other taxa ! N j . An excelent description of the Lotka-Voltera model, as wel as other ecological models, is found in [112]. There are several notable diferences betwen the GAM and gLV approaches. GLV models presume the widely acepted law of mass action, that is, two populations interact at a rate proportional to the product of their abundances [119, 120], while in a GAM seting such efects are dificult to model (the influence of taxon j on taxon i in equation (1) does not depend on the abundance of taxon i). Both models require estimation of (m 2 +m) parameters corresponding to m 2 interactions betwen organisms and m additional species-specific parameters (intercept terms in GAM, and carying capacities and growth rates in gLV). In gLV, the interactions are defined by constant parameters, while in GAM each interaction is a smooth function whose parameters (number of knots and their positions) must also be estimated. Thus, GAMs require a larger number of parameters to be estimated, making it dificult to acurately learn these models from the limited data available. In conclusion, the GAM and gLV approaches are complementary ? GAMs provide more flexibility in modeling the density dependence of the interaction betwen organisms, while gLVs beter approximate the physical proces of interacting populations and are easier to interpret. Here we provide a first application of the gLV ecological models to metagenomic time-series data. An atractive property of the gLV model is the ability to generalize over aggregates of organisms, e.g. by modeling an environment at a higher taxonomic level. 92 Suppose we have a community with thre interacting taxa, then the equation describing the abundance of taxon 1, is: ! N 1 t+1 () =N 1 t () expr 1 + "N 1 t () K 1 + # 12 N 2 t () K 1 + # 13 N 3 t () K 1 $ % & ? ( * + , - . / , which reduces to ! N 1 t+1 () =N 1 t () exp" 0 +" 1 N 1 t () +" 2 N 2 t () +" 3 N 3 t (){ }. Asuming that taxa 2 and 3 are members of the same higher-level taxon, which we denote as 2?, they can be grouped: ! N " 2 t () #N 2 t () +N 3 t () . Therefore, ! N 1 t+1 () =N 1 t () exp" 0 +" 1 N 1 t () +" # 2 N # 2 t (){ } where " # 2 = " 2 N 2 t () +" 3 N 3 t () # 2 t () . Now, we se that the coeficient ? 2? is dependent on the abundances of its asociated members, implying it is changing over time. We must then require that the relative abundances of taxa 2 and 3 do not change with respect to each other, i.e. for some positive rational value p, ! N 2 t () =p"N 3 t () . Under this criterion, ! " # 2 = " 2 p+" 3 (p+1) and is constant. Thus the efect of the higher-level taxon (2?) is a linear aggregate of its members if the relative abundances of the members do notchange over time. Additionaly, we must also show how this aggregation modifies the equations defining taxa 2 and 3 in order to formulate an equation for the higher-level taxon (2?). We need to reconcile the following equations: 93 ! N 2 t+1 () =N 2 t () exp" 0 +" 1 N 1 t () +" 2 N 2 t () +" 3 N 3 t (){ } N 3 t+1 () =N 3 t () exp# 0 +# 1 N 1 t () +# 2 N 2 t () +# 3 N 3 t (){ } N $ 2 t+1 () =N $ 2 t () exp% 0 +% 1 N 1 t () +% 2 N $ 2 t (){ } where ??s, ??s, and ??s are al real numbers. Using the same requirement relative abundance criterion as above, ! N 2 t () =p"N 3 t () , we have: ! N 2 t+1 () =N 2 t () exp" 0 +" 1 N 1 t () +" # 2 N # 2 t (){ } N 3 t+1 () =N 3 t () exp$ 0 +$ 1 N 1 t () +$ # 2 N # 2 t (){ } , and summed, ! N 2 t+1 () +N 3 t+1 () =N 2 t () exp" 0 +" 1 N 1 t () +" # 2 N # 2 t (){ } +N 3 t () exp$ 0 +$ 1 N 1 t () +$ # 2 N # 2 t (){ } . (1) Because ! N " 2 t () #N 2 t () +N 3 t () , ! N " 2 t+1 () =N 2 t () +N 3 t ()( exp# 0 +# 1 N 1 t () +# 2 N " 2 t (){ } =N 3 t () p+1 () exp# 0 +# 1 N 1 t () +# 2 N " 2 t (){ } . (2) From equations (1) and (2), we se that if we require the efects of each taxon on taxa 2 and 3 to be equivalent, (i.e. ! " 0 =# 0 , " 1 =# 1 , " $ 2 =# $ 2 ), then we can reconcile the equations: ! N 2 t+1 () +N 3 t+1 () =N 2 t () +N 3 t ()( exp" 0 +" 1 N 1 t () +" # 2 N # 2 t (){ } =N # 2 t+1 () . Therefore, if we asume that the relative abundances of a set of taxa do not change with respect to each other, and further, that other members of the population have identical individual efects on these taxa, then the set may be aggregated into one higher taxon in 94 the gLV model. This aggregate property generalizes to any number of members in a higher-level taxon. 95 ? ij - 0 + - competition amensalism parasitism 0 amensalism neutrality commensalism ? ji + predation commensalism mutualism Table 13. Signs of interaction coeficients asociated with major population interactions. ? ij and ? ji are symmetric components in the interaction matrix. In this notation, the predation and parasitism relationships imply that taxon i is the prey and the parasite relative to taxon j. If ? ij is zero, it implies that taxon j has no efect on the growth rate of taxon i. In our model formulation, we asume members of the same taxon are competing, hence, ? i = -1 for al i. Learning a model from the data Usualy the parameters of the gLV model are determined empiricaly through controlled laboratory experiments. In the context of metagenomic studies, such experiments are impractical or even impossible. Thus, we explored whether these parameters can be directly learned from time-series data. Specificaly, we atempt to find a set of parameters that minimize the diference (usualy expresed as a least-squares criterion) betwen the observed data and the model predictions. We evaluated several regresion methods: dynamic regresion ? an analytical method traditionaly used in the context of Lotka Voltera models [112-114]; nonlinear least squares ? a gradient-descent approach; and 96 two gradient-fre methods: Nelder-Mead and patern search. These methods are briefly described below: Dynamic regresion converts the Lotka-Voltera model into a linear formulation acording to the equation: () () 0 1 ln i i iij iij Nt r Nt K ! " #$+ = %& ? alowing model parameters to be estimated through linear regresion. The nonlinear least squares method [121] employs a gradient descent algorithm to estimate model parameters. The basic asumption of this approach is that surface of the objective function being minimized (in our case the least squares diference betwen model predictions and time-series data) is smooth. The Nelder-Mead [122] and patern search [123] methods do not make this asumption, rather they explore the search space in a systematic fashion while atempting to minimize the objective function. Nelder-Mead explores the space through a series of operations performed on simplices within the search space, while patern search follows the direction determined by the values of the objective function along the vectors of a positive basis of the search space. These methods were found to be more robust when optimizing potentialy non-smooth functions or in high-dimensional spaces where gradient computations are expensive. Nonlinear least squares, Nelder-Mead, and patern search optimizations were performed using implemented MATLAB routines. Al thre methods employed the same constraints on the number of steps alowed (?MaxIter? = 1000) and the maximum number of function evaluations (?MaxFunEvals? = 1e6), and stopping criteria. 97 To avoid geting trapped in local minima, we restart the NLS, NM, and PS searches from a collection of random starting points. Specificaly, when evaluating the performance of diferent regresion methods we relied on simulated communities comprising 2,3, and 4 taxa (as described below), and sampled 10,000 random points within the search space, selected the best 10 in terms of fit betwen model and data, and used these as the starting point for the optimization procedure. Al thre methods were run on the same set of initial starting points. For the analysis of the actual metagenomic data we selected the best 10,000 starting points for the NLS procedure from among 500,000 random samples of the search space. Al computational analyses were performed in MATLAB v. R2009a (The Mathworks Inc. Natick, MA). An inference methodology with confidence values Recal that given a time-series metagenomic dataset our goals are (i) to predict the qualitative interaction directions (+/-) for each pair of taxa, and (i) to acurately estimate the parameters of the full gLV model. The inference methods described in the previous section answer these goals however do not provide any measure of confidence in the predicted parameters given the fact that the underlying data are noisy. In other words, if smal changes in the underlying data or the inference algorithm lead to large diferences in estimated parameters (either in sign or magnitude) the resulting model cannot be trusted. To evaluate the stability of the fiting procedure we developed a stochastic extension of the NLS technique. As described before, NLS minimizes a least-squares objective function O(x) starting from some initial point in the parameter space x 0 . The resulting minimizer, x 0 *, is a set of parameters in the gLV model that minimizes the 98 diference betwen the observed longitudinal data and the model predictions. However, x 0 * is likely only a local minimizer, that is, there exists a diferent set of parameters x* such that O(x*) < O(x 0 *). We cannot exhaustively search the parameter space for a global minimum, so instead we randomly select a set of initial points x 1 ,?, x m and perform NLS starting from each one, resulting in minimizers x 1 *,?, x m *, respectively. We then examine the distribution of each parameter?s estimates across these minimizers to compute a confidence value for each predicted interaction. Given a set of minimizers x 1 *,?, x m *, we first sort these in order of goodnes fit (that is, by corresponding values of O(x 1 *),?,O(x m *) from smalest to largest). To reduce the efect of outliers we focus on just subset of the estimated models by excluding the bottom ? fraction of the models (with respect to goodnes of fit). The sign of each interaction coeficient is set to the majority vote within the remaining models. For a particular interaction coeficient, ? ij , the confidence in our prediction is the proportion of selected models that agre with the majority-vote interaction (+ or -) for that coeficient. For example, if ? = 0.05, we compute confidences after discarding the bottom 5% of models. Similarly, the magnitude of each parameter in the gLV model is found by computing its average (and standard eror) over al selected models. Small interaction network simulation design We simulated 11 five-taxon communities by randomly selecting model parameters acording to the following equations: 99 ! K i ~Unif500,10 5 ( ) r i ~Unif0.8r * ,1.2r * ( ) , where r * =0.5 " ij ~s#Unif0.5,4 () , where s= 1 with probability 0.25 -1 otherwise $ % & ? ( ) N i 0 () ~K i +K i #G0,0.25 ( ) where G(0,0.25) is a Gaussian distribution with mean 0 and standard deviation 0.25. Al datasets represented fully connected networks, i.e. every taxon influences every other taxon in some way. From these initial abundances, we simulated 20 consecutive time points, and required al taxon abundances to remain within the range of [10, 1e6] (prior to introducing eror), thus preventing extinction or explosion of any taxon. Once a satisfactory model was generated, we added noise to the data acording to an eror parameter ! " such that ! N i t () =N i t () +"#? i #G0,1 () , where ? i is the mean abundance of taxon i throughout the time-series. The validation set of 11 five-taxa communities used ? = 0.03. Humanized gnotobiotic mouse gut dataset In brief, purified adult human fecal microbiota were first transplanted via gavage into germ-fre C57BL/6J mice. After initial colonization, mice remained on a low-fat mouse chow diet for four weks. Subsequently, half of the mice were switched to a model Western diet high in fat and sugar, and folowed over the course of two months. Wekly fecal samples were collected from each mouse and prepared for deep 16S rRNA 454 FLX pyrosequencing. Details of experimental protocols including mouse humanization, gut 100 microbial community DNA preparation, diet treatments, and 16S rRNA environmental pyrosequencing (and asignment) are described in Turnbaugh et al. [124]. Relative abundance measurements in each sample were calculated from corresponding 16S sequence taxonomic asignments. Our analysis screned out rarely observed taxonomic clases (< 1% of the population on average) due to poor measurement of relative abundance. We normalized relative abundances for each sample by multiplying by 10 5 (to approximate 16S copies/nl). This roughly corresponds to the ~10 1 cels/ml observed in mouse and human faecal samples [125]. To compensate for the fact that only wekly time points available per subject, we fit the average abundances of each clas using a shape-preserving piecewise cubic Hermite interpolation. Daily abundance numbers were extracted from the interpolated curve in order to ensure a smooth model fit. 10,000 stochastic NLS iterations were run for each individual mouse time-series dataset. Constraints on model parameters during the fiting procedure required: (1) interaction coeficients to remain within (-10, 10), (2) the universal growth rate betwen 0 and 2, and (3) the carying capacity of each taxon to remain within its minimum observed value and 10 times its maximum observed value (maximum observed values ranged from 2.4e3 to 7.2e5 16S/nl across al taxa). Constraints were required to fit the model to realistic parameters in reasonable computational time, and fited parameter sets typicaly did not approach the limits of the constraints. 101 Results Prediction of small interaction webs We first designed a simulation study to evaluate the quality of diferent methods for predicting microbial interaction networks in environments with few taxa. Using the discrete-time gLV model, we generated time-series datasets describing the dynamics of systems with up to four interacting taxa, and then atempted to re-discover the structure of the interaction network, as wel as evaluate the quality of the fited parameters. The interaction network in each simulation is fully connected (i.e. no interaction coeficients are 0), and alows for mutualistic (+,+), competitive (-,-) and antagonistic (+,- ) relationships. We asesed several techniques for data fiting: dynamic regresion (DyR), nonlinear least squares (NLS), Nelder-Mead (NM), and patern search (PS) (se Methods for details). The diferent procedures are compared through the false interpretation rate (FIR), defined as the proportion of interspecific interaction coeficients with incorrect asignments (i.e. the sign of the estimated coeficient is wrong). Table 14?displays the acuracy of the predictions found in our simulations. In simulations involving two taxa, al methods performed wel, frequently resulting in FIRs <1%. In general, the dynamic regresion approach handled data with no eror very wel (FIRs < 4%), but had decreased performance for datasets with high eror-rates. Model fits from NLS, NM, and PS methods typicaly outperformed dynamic regresion (Figure 16). On average, the NLS method produced beter results than the other methods for realistic data (eror rates betwen 3% and 5%). 102 False Interpretation Rate Dynamic regresion NLS NM PS 2 taxa Eror rate (?) 0% 0 0 0.015 0 3% 0.020 0.001 0.005 0 5% 0.035 0.005 0.005 0.01 3 taxa Eror rate (?) 0% 0.018 0.071 0.168 0.175 3% 0.125 0.061 0.180 0.170 5% 0.175 0.070 0.155 0.155 4 taxa Eror rate (?) 0% 0.0367 0.0783 0.2508 0.2742 3% 0.2158 0.0767 0.2525 0.2575 5% 0.2825 0.0992 0.2442 0.2442 Table 14. Structure accuracy results for smal networks. For each eror rate, we simulated the dynamics of 100 microbial environments with known interaction webs. Inference methods were run on the same time-series datasets in each trial. False interpretation rates are defined as the proportion of incorrectly infered interactions across each corresponding set of 100 simulated datasets. Se Methods for details of simulated communities. 103 Figure 16. Model acuracy (a. mean ? s.e.m, b. median) in 400 simulated two-taxa systems. One hundred time-series datasets (each with a unique gLV model parameters) were simulated for each eror rate (0, 1, 3, and 5%, shown in legend). The NLS, NM, and PS methods resulted in more acurate model fits than dynamic regresion in 96%, 99%, and 86% of the trials, respectively. We found in these simulations that dynamic regresion often resulted in very poor model fits, preventing further predictive modeling and simulation of microbial systems. In trials where the residual eror of the DyR method was beyond floating-point representation (25 of the 400 trials), we reasigned the eror to the largest computed residual eror found across al DyR trials. Residual eror is defined 104 as the sum-of-squared diferences betwen the observed data and model predictions (se Methods). We also tested the use of simple Pearson (i.e. linear) correlations of growth rates (and absolute abundance) betwen taxa as a way to detect interactions. However, strong linear correlations cannot translate to interaction direction betwen organisms, so one could not infer a competitive, mutualistic, or antagonistic relationship, but simply an interaction of some type. Simulating 1000 4-taxa communities (with an eror rate of 3%), we examined what proportion of interactions would be mised if we required growth rates (or absolute abundances) to have a correlation coeficient of at least 0.15 to indicate an interaction. In this case, linear correlations of growth rates and abundances failed to detect 30% and 45% of the true interactions, respectively. Due to their poor detection ability and vague interpretation, we advise against usage of linear correlations to approximate microbial interactions in this context. It is clear from these trials that estimating interactions even for smal food webs can be a formidable task. Validation of regresion approach The methods employed above are limited in that they do not provide a measure of the significance of each infered interaction coeficient. The chalenge in resolving an interaction web requires a more reliable approach where the confidence in each interaction prediction is measured, and predictions can therefore be ranked in terms of 105 confidence. We propose a stochastic fiting methodology that generates a consensus interaction matrix, in which the sign of each interaction coeficient is given a confidence level based on the distribution of optimized fits. Briefly, our method first runs the NLS fiting procedure for a predetermined number of iterations. We then sort the resulting parameters sets in order of goodnes of fit, and discard a proportion of infered models (designated by a parameter ? betwen 0 and 1) before generating consensus confidence values. For example, if ? = 0.05, we compute confidences using the top 95% of optimized fits (i.e. we ignore model fits in the bottom 5%). Given a particular interaction coeficient, ? ij , the confidence value is the proportion of selected fits that agre with the majority-vote interaction (+ or -) for that coeficient. (Se Methods for details.) We validated our methodology using 11 five-taxa simulated datasets, each with a unique set of model parameters. These five gLV models correspond to 220 interspecific interaction coeficients, which form our test set. We ran the stochastic NLS procedure for 10,000 iterations per dataset, and subsequently generated consensus confidence scores using a series of values for ?. Figure 17a displays the computed ROC curves (sensitivity vs. FIR) for the test set. We discovered that culling a large proportion of optimal fits (e.g. ? = 0.95 or 0.99) produced higher FIRs than trials utilizing smaler values of ?. Additionaly, we found that applying a stringent confidence threshold alowed for reliable prediction of the majority of interactions with a negligible FIR. As an example, using ? = 0.5 with a confidence cutoff of 0.98 (i.e. 98% among the models ranked in the top half, acording to goodnes of fit to the simulated time-series data), our method correctly predicts 69.5% of the interactions, and achieves an FIR of les than 1%. 106 The ecology research community has made extensive eforts to describe the ?strength? of interactions betwen members of a food web [126]. Diferent goals and driving questions among researchers have led to conflicting definitions of interaction strength, but the purposes of this study, we define strength as the quantitative value of the interaction coeficients in the gLV model; these values provide a normalized measure of the per capita efect of members of the population on each taxon. In line with aim (i), we sought to ases the acuracy of the estimated interaction coeficients in our methodology, considering the same range of values for ? and confidence cutoffs used to generate the ROC curves in Figure 17a. For each confidence cutoff and ? value, we computed the average eror rate of the predicted interaction coeficients (Figure 17b), and observed that larger values of ? resulted in more acurate approximations of interaction strength. Indeed, using a confidence threshold of 0.95, a ? value of 0.5 resulted in a 36% decrease in the average eror rate over a more stringent ? = 0.99. For each particular value of ?, a decrease in the confidence threshold tended to increase the average eror rate. For ? values ? 0.5 and confidence thresholds ? 0.75, the average eror rate remained below 0.8. In our simulations, the magnitude of an interaction coeficient was on average 2.25. This high relative eror rate suggests that despite our succes in predicting the general interaction (a sign of + or -), there is considerable room for improvement in estimating interaction strength. Considering the empirical performance of the parameters in our validation study, we let ? = 0.25 and employ a confidence threshold of 0.85 in al analyses described below. 107 ? Figure 17. Sensitivity analysis on validation data. (a) Sensitivity vs. FIR for 11 simulated five-taxa communities. Here we define sensitivity as the proportion of interactions that are correctly predicted. The legend displays values of ? used for each ROC curve. By considering a large number of putative model fits, we can infer the majority true interactions betwen taxa with a reasonably low FIR. Observing the ?0.99? curve (in which we only use the top 1% of fits), we se worse performance in predicting interactions betwen taxa. (b) Corresponding mean eror rates of estimated interaction coeficients. As ? decreases, the mean eror rate of the interaction coeficients decreases significantly, indicating that considering more putative model fits results in beter overal estimation of interaction strength. 108 Microbial dynamics of mice on a Western diet We applied our methodology to data from a metagenomic study investigating the dynamics of humanized mouse gut microflora. Twelve gnotobiotic mice were augmented with human gut microbiota and fed a mouse chow diet for four weks. Subsequently, six of the mice continued on a mouse chow diet, while the remaining six mice were switched to a representative Western diet high in fat and sugar. For each mouse, deep pyrosequencing of the 16S rRNA V2 hypervariable region was performed on bacterial communities isolated from fecal samples over the course of eight weks. Sequences were asigned to a taxonomy creating a taxonomic profile for each sample. Though the average gut microflora of the chow-fed mice remained relatively stable, the microbe populations in mice on the Western diet shifted dramaticaly throughout the study (Figure 18a). To ases the potential microbial interactions in these models of the human gut, each time-series profile of mice switched to the Western diet was evaluated using our methodology. We asume that the microbial interaction web betwen any two mice in the study is the same, so our goal was to se if predicted microbial interactions were conserved across the diferent mice. Model consistency. We separately learned the parameters of the gLV model for each individual mouse and found a high level of concordance betwen the individual models. Despite the fact that some time-series profiles exhibited remarkably diferent dynamics over the course of the study (Figure 18a), computed confidence values for each interaction coeficient were strongly correlated across al mice (Table 15). Furthermore, 109 estimated carying capacities were highly correlated (mean pairwise Pearson?s r 2 value = 0.948). This lends support to the robustnes of our methods to variable temporal paterns. Infered growth rates. Our implementation of the gLV model employs a universal net growth parameter for al taxa. Averaged across al humanized mice, the infered growth rate was approximately 0.44 (with a standard deviation of 0.01), implying the bacterial population in the distal gut has a very slow turnover rate (~1.6 days). This stagnant state has also been observed previously in humans [20, 127] and has been atributed to host- asociated factors including the imune response and the neutral pH levels of the colon [127, 128]. 110 Figure 18. Humanized mouse gut microbiota analysis. (a) Time-series 16S profiles of six humanized mice fed a high fat Western diet. Each plot represents a diferent mouse. The y-axes represent normalized 16S gene copies per nanoliter of fecal material (se Methods for normalization details). Taxa are shown in the corresponding colors: Bacteroidetes (orange), Bacili (blue), Clostridia (purple), Erysipelotrichi (red), and other Firmicutes (black). (b) Predicted interactions with high confidence betwen bacterial members of the humanized mouse gut community. Indicated interactions maintained confidence values greater than 0.85 for al studied mice. The remaining 13 possible interactions had relatively low levels of confidence. Displayed with each arow is the general efect in parentheses (+ or -) along with the average confidence value across al mice. In this case, al arows suggest an overal inhibitory efect (-) of one organism on another. No taxon was found to significantly enhance the growth of any other organism. 111 WM1 WM2 WM3 WM4 WM5 WM6 WM1 1 0.941 0.930 0.923 0.977 0.920 2 1 0.819 0.829 0.887 0.875 WM3 1 0.981 0.972 0.972 4 1 0.975 0.876 WM5 1 0.911 6 1 Table 15. Pairwise correlation coeficients of confidence values for predicted interactions. Each cel displays the Pearson?s r 2 value of confidence scores betwen humanized germ-fre mice. We observe that computed confidence scores are highly correlated across each time-series dataset, indicating the microbial interactions with the greatest confidence are conserved across mice. Note grey cels are redundant, i.e. the r 2 value of (WM1,WM2) is equal to the r 2 of (WM2,WM1). Bacteroid. Bacili Clost. Erysip. Other Firm. Bacteroidetes 0.78 (0.04) 1.15 (0.05) 0.96 (0.05) 1.39 (0.12) Bacili 0.03 (0.06) 0.32 (0.02) 0.36 (0.02) 0.12 (0.06) Clostridia 1.3 (0.12) 1.18 (0.10) 1.28 (0.10) 0.94 (0.13) Erysipelotrichi 0.19 (0.01) 0.61 (0.02) 0.45 (0.01) 0.20 (0.06) Other Firmicutes 0.43 (0.02) 0.43 (0.03) 0.51 (0.02) 0.51 (0.01) Table 16. Average (std. er) of interaction coeficient magnitude estimates across six Western-fed humanized mice. 112 Predicted interactions. A full diagram of the interactions predicted by our model is shown in Figure 18b. Interaction coeficients with the greatest confidence typicaly involved the Bacteroidetes or Clostridia populations. Our model predicts that Bacili, Clostridia, Erysipelotrichi, and the subpopulation of remaining Firmicutes al inhibit the growth of Bacteroidetes with confidence values greater than 0.85 (for al individual mice). Similarly, Bacteroidetes, Bacili, and Erysipelotrichi al inhibit the growth of Clostridia with corresponding confidences values greater than 0.90. No taxa were predicted to enhance the growth of any other group in our results. Table 16 displays the range of estimated magnitudes of al interaction coeficients. Several of the interactions infered from the data are supported by prior studies. For example, our model implies that Clostridia and Bacteroidetes are strongly competitive, a result also found in microaray-based studies of infant gut microflora [116]. We have observed this same predicted interaction from preliminary modeling analysis following the gut microbiota of obese humans on a low-calorie fat-restricted or carbohydrate-restricted diet for 1 year (data not shown). Furthermore, a recent genomic study reported on transcription profile modification in members of these two clases when introduced simultaneously in the guts of gnotobiotic mice. When co-colonized with Eubacterium rectale, Bacteroidetes thetaiotaomicron adapts by up-regulating a subset of genes for degrading gylcans that E. rectale is unable to metabolize. In turn, E. rectale down-regulates a large number of genes encoding for gylcoside hydrolases and specializes in the breakdown of simple sugars such as celobiose and lactose when co- colonized with B. thetaiotaomicron [129]. The alteration of each species toward difering metabolisms and the limited efects of co-colonization on gene transcription rates for cel 113 replication supports the notion of a general competitive interaction betwen Clostridia and Bacteroidetes as predicted by our methods. Carrying capacities. Our methodology consistently predicted that the Clostridia population had the highest carying capacity in the environment, followed by Bacteroidetes; Erysipelotrichi and Bacili had relatively similar carying capacities. It is crucial to understand these carying capacities are measured in 16S gene copies per nanoliter, rather than cels/nl, due to the multicopy nature of the 16S rRNA gene (se Discussion below). Examining predicted interactions that influence the growth rate of Bacteroidetes, the efects of Clostridia and Other Firmicutes were significantly greater than that of Bacili and Erysipelotrichi (paired T-test, P < 0.008 for al concomitant tests). There was no significant diference betwen the interaction strengths of Clostridia and Other Firmicutes on Bacteroidetes. Additionaly, the interaction strength of Erysipelotrichi on Clostridia was significantly greater than that of Bacili (paired T-test, P < 0.003). Discussion We have presented a systematic methodology for predicting microbial interactions in time-series metagenomic datasets. Our methods were validated using simulated temporal dynamics of interacting communities and we further applied this framework to a series of metagenomic datasets describing mouse gut microbial dynamics during a prototypic Western diet. The key to our approach is a measurement of confidence for each predicted interaction coeficient based on the distribution of parameter estimates in the gLV model. 114 Our method is linked to the asumptions of the generalized Lotka-Voltera model, which may be violated in some studies of microbial communities. For example, the model also asumes that its parameters (carying capacities, growth rates, and the interaction coeficients) do not depend on the abundance of the individual members of the community. These asumptions are clearly an over-simplification (e.g. quorum sensing mechanisms are density-dependent), and future work is necesary to evaluate how these simplifications afect the overal results of our analysis. In the application of our methods to the humanized mouse gut data, our observations are based on the abundances of taxonomic clases, each of which is a combination of multiple species. A related property of the gLV model is the ability to generalize over aggregates of organisms, e.g. by modeling an environment at a higher taxonomic level. However, to reasonably merge a set of taxa S into a higher taxon, strong asumptions are required (e.g. the relative abundances of the taxa within S do not change over time). Se Methods for a mathematical discussion of these asumptions. Note strict asumptions based-on taxonomic aggregation play a role in other approaches (e.g. generalized additive models [116]), and this isue remains an open problem. Several inherent limitations exist when studying spatiotemporal dynamics using metagenomic sequence data. Though the 16S rRNA gene is an excelent candidate for amplification with universal PCR primers, the number of known copies per genome ranges from one to 15 (e.g. Clostridium paradoxum) [130], suggesting that taxonomic abundance profiles of microbial communities are highly skewed. Additionaly, phylogenetic marker studies (and environmental genome shotgun projects in general) often lack the important measure of cel density - defined as the number of cels per unit 115 volume (e.g. cels/mL) (note however for the Western diet mouse gut study described above, previous studies have found stable microbial densities in mouse cecal samples [131]). Several experimental approaches hold significant promise in mitigating these uncertainties, including: microarays, qPCR [111], flow cytometry [132] and other microfluidics devices [133]. The methods we present in this paper wil remain applicable even as metagenomic data improves in acuracy. While our methods predict qualitative interactions quite wel, we have also shown that measurement of interaction strength is significantly more dificult. Because no model can capture al aspects of these communities, methodologies for predictive modeling wil require comprehensive datasets for training as wel as rigorous experimental evaluation. Nonetheles, in the spirit of optimism and hope, we conclude this discussion with a speculative application in which we use gLV models to study the dynamics of gut microbiota in dieting obese human subjects. Our goal is simple to state but incredibly chalenging to reach: determine which taxonomic groups require manipulation in order to shift the obese gut microbiome structure to a lean-like state. Although the following results should be taken with a grain of salt, the approach we take ilustrates how we could hypotheticaly utilize mathematical modeling to forecast the efects of an environmental perturbation to achieve a desired alteration. We have selected this application because of the available available and obesity?s poignant impact on our lives ? everyone in the United States knows someone who is obese ? and this is why you shal humor the pages below. Recent studies have revealed that human obesity is correlated with a detectable shift in the phylum-level microbiota of the distal gut [20, 21, 134]. Specificaly, 116 Firmicutes and Actinobacteria are enriched in obese individuals, while lean subjects maintain relatively higher abundances of Bacteroidetes. Ley et al. further report that obese subjects following a one-year low-calorie fat-restricted or carbohydrate-restricted diet showed significant changes in their overal gut microbial populations. As the subjects lost weight, their gut microbe levels more closely resembled that of their lean counterparts. Similar studies of germ-fre mouse models have re-enforced the correlation of host adiposity not only with taxonomic composition, but also the microbial capacity for energy harvest [30, 135]. We analyzed data from ten obese subjects placed on a low-calorie diet over the course of one year [20]. Each subject provided faecal samples throughout the year; the taxonomic composition of each sample was approximated using targeted 16S rRNA gene sequencing. On average, the relative abundance of Bacteroidetes increased during dieting, while Actinobacteria and Firmicutes populations were depleted. These community shifts are inclined towards a host ecology similar to the lean human population [20, 21]. After fiting our gLV model to the data, we discovered several parameter sets resulted in both a quantitatively and qualitatively sufficient fit. Examining the top 100 model fits, we found parameter sets largely agred on several phylum interactions (se Figure 19). There exists strong evidence for thre competitive interactions: Bacteroidetes- Proteobacteria, Bacteroidetes-Firmicutes, and Firmicutes-Verucomicrobia. Actinobacteria appear to enhance the growth of Bacteroidetes, and Bacteroidetes in turn promote Verucomicrobia. In contrast, the overal efect of Bacteroidetes on 117 Actinobacteria and Verucomicrobia on Bacteroidetes could not be established due to conflicting interaction signs in the parameter sets. Despite the consensus on interactions indicated by estimated parameters, the gLV model is very sensitive to minor changes and is capable of exhibiting nonlinear behavior including chaos. We manualy manipulated one of the best fiting parameter sets and discovered that altering a single variable can dramaticaly afect the overal dynamics of the community. Figure 20 displays a bifurcation diagram created by varying only one interaction coeficient, (the efect of Firmicutes on Bacteroidetes). Depending on this parameter, the community may converge to a single equilibrium, a periodic oscilation, or transition to chaotic behavior. 118 Figure 19 Phylum-level interaction matrix. Analyzing parameter estimates of the top 100 model fits, we find strong evidence for several protagonistic and antagonistic relationships. Each cel in the matrix displays the sign of the estimate, along with a level of confidence based on the top best 100 fits. Red cels indicate competitive interactions betwen species, while blue cels indicate an asymetric antagonistic relationship. Gray cels are interactions disregarded by preliminary analysis of the data. 119 The past decade has sen several advancements in prebiotic and probiotic therapies for maintaining a healthy gut microbial population and treating population imbalances that result in diseases such as ulcerative colitis, inflamatory bowel disease, Crohn?s disease, and pouchitis. Prebiotics are non-digestible food elements (e.g. inulin, fructo-oligosacharides, or galacto-oligosacharides) aimed to stimulate growth of specific bacterial groups, whereas probiotics contain live bacterial cultures to be ingested in moderation to benefit the host. Regardles of the approach, these treatments sek to encourage a healthy ecosystem through microbial manipulation. Though knowledge of the specific mechanisms of action for these treatments is not known, their efects on the general health of the intestinal tract are wel studied [136, 137]. For the average obese subject, our best-fiting model predicts that only the Bacteroidetes and Firmicutes converge to within 5% of their equilibrium in the first year of dieting. The remaining thre dominant phyla require at least an additional six months to reach comparable levels. To investigate whether we could increase the overal convergence rate, we simulated the microbial responses to probiotic/prebiotic therapies by altering the initial abundances of each phylum. As most treatments are only known to afect a subset of the microbial population in the gut, we limited our experimentation to manipulating each phylum individualy. Each treatment has an asociated microbial load efect ? i.e. the proportional diference betwen a phylum and its equilibrium after treatment. For example, a load efect of 0.7 means the abundance of the phylum is 30% les than the equilibrium; 1.2 means that abundance is 20% greater than equilibrium. We examined a range of load efects (0.7 ? 2.0), as practical treatments wil not produce identical results in al patients. 120 We discovered that boosting the Bacteroidetes levels toward the lean equilibrium increases the convergence rate of al bacterial phyla. Table 17 displays the time each phylum takes to converge given a particular therapy designed to increase the abundance of Bacteroidetes. If a treatment with a load efect of 2.0 exists, our model predicts that al phyla would converge (within 5% of equilibrium) in six months or les; this cuts the overal convergence time roughly in half. If we could achieve such a convergence rate, then the altered microbiota would hypotheticaly extract les energy from the patient?s nutrients and potentialy acelerate the fat-depletion and weight loss efects of dieting. Manipulation of other populations did not produce a succesful convergence rate increase for al phyla. This result suggests that the Bacteroidetes population may be an ideal target for obesity-related probiotic/prebiotic therapies, which is intriguing given that most probiotic therapies for the gastrointestinal tract utilize members of the Firmicutes or Actinobacteria [128, 138, 139]. To step back for perspective, we have gone from 16S longitudinal data in a clinical study ? to modeling the dynamics of the obese gut microbiome ? to forecasting which taxonomic groups are targets for shifting this community type toward a lean-like state. The Bacteroidetes hypothesis is most likely incorrect (I hope to eventualy check it), but the general approach demonstrates how we can go from data to a dynamic hypothesis with a clear follow-up experiment. It is a beautiful example of the circular relationship of scientific experiments and quantitative analysis. There is tremendous promise for the field of metagenomics, particularly in its translation to biotechnology and medicine. However, crucial prerequisites for this translation include not only comprehensive temporal datasets, but also novel quantitative 121 approaches including mathematical modeling to forecast how these complex systems react to treatments and environmental changes. The results we present here are just a first step in this direction. 122 Figure 20. Microbial community transition to chaos. This bifurcation diagram displays the long term dynamics of the Bacteroidetes population versus the value of a single parameter in our model. By varying ?(BF), (the interaction coeficient of Firmicutes and Bacteroidetes) we observe dramatic shifts in the stability of the community. For example, when ?(BF) equals -0.06, the Bacteroidetes population achieves a long-term equilibrium. However, if we set ?(BF) to -0.09, the Bacteroidetes population converges to a period- two steady state oscilation. Further decreasing of this parameter leads to a period doubling cascade and eventual transition into chaos, where the long-term dynamics of the population are highly sensitive to this one parameter. 123 Probiotic/prebiotic load efect No treatment 0.70 0.90 1.00 1.50 1.80 2.00 Actinobacteria 79 29 22 21 15 13 12 Proteobacteria 166 117 102 93 18 6 5 Bacteroidetes 91 42 2 3 4 4 4 Firmicutes 28 21 20 20 19 19 19 T i m e t o c onve r ge nc e ( w e ks ) Verucomicrobia 39 27 26 26 24 24 24 Table 17. Potential population impacts of probiotic/prebiotic therapies on Bacteroidetes. We define the treatment ?load efect? as the relative abundance of Bacteroidetes after treatment compared to its equilibrium abundance predicted by our model (e.g. 0.7 implies the treatment increased the level of Bacteroidetes to within 30% of the true equilibrium, 1.0 implies the treatment increased the Bacteroidetes abundance to the exact equilibrium value.) Time to convergence represents the number of weks needed to maintain within 5% of the predicted equilibrium. Without treatment, 4 out of 5 phyla converge within two years. Our model predicts that stimulating Bacteroidetes population growth decreases the time to convergence for al observed phyla. We find that overloading the Bacteroidetes levels to twice the equilibrium dramaticaly increases convergence rates such that al phyla converge within six months. Similar manipulation of other phyla did not produce the same level of succes, suggesting the Bacteroidetes should be the focus of future probiotic/prebiotic obesity therapies. 124 Chapter 6: Conclusions and further study The primary goal of my graduate research has been the development of improved methods for metagenomic analysis in order to advance our understanding of the human microbiome and other microbial populations. The ideas presented here represent novel contributions spanning elements of preprocesing, procesing, and post-procesing of metagenomic sequence data. Figaro, a novel vector-triming algorithm, can rapidly detect and remove vector sequence from multiple metagenomic sequence libraries without prior knowledge of the vector sequences themselves, thereby asisting researchers in many aspects of metagenomics including asembly, gene-finding and annotation. Since its publication in 2008, this open-source software has over 950 downloads at SourceForge.net. In the direct procesing of environmental 16S rRNA sequences, we performed a comprehensive analysis of OTU clustering methodologies that have been employed to estimate the diversity of microbial communities in landmark studies for the last decade. We have found that the choice of parameters in these methodologies is extremely important for acurate clusters, and that most studies have used parameters that are too stringent, resulting in inflated estimates of microbial diversity. While this observation has been slow to catch on, many leaders of the HMP are now aware and wil hopefully require further validation of OTU-based analysis. As most HMP studies now utilize 454 pyrosequencing technology (and potentialy Ilumina in the future), there is an imediate need for rigorous evaluation of OTUs created from reads much shorter than the Sanger- based sequences used in our study. Pyrosequencing reads currently cannot span multiple hypervariable regions of the 16S gene, and thus there is les phylogenetic information to 125 clasify and compare sequences. Additionaly, 454 technology has been reported to produce unique artifacts in metagenomic data such as perfect and near-perfect replicates, which can severely skew relative abundance estimates [140]. In the context of 16S rRNA surveys, laboratory preparation, PCR primer bias, and chimeric sequences can also dramaticaly afect results. To understand the extent of each of these efects, a validation study must be performed in which a bacterial community of known composition (e.g. identifiable species, relative abundance information) is sampled and surveyed using standard techniques. Thus, a 16S taxonomic profile could be compared to an approximate truth, and the sequence dataset may be analyzed for sequencing erors, chimeras, unobserved species, and biased relative abundance measurements. This approach would give the microbial ecology community important insight into how wel these protocols describe the true microbial population. For post-procesing annotated metagenomic and 16S rRNA datasets, we presented Metastats, a statistical methodology for detecting diferentialy abundant metagenomic features betwen two populations in large-scale clinical studies. Implemented as a fully automated websever, to date Metastats has received over 700 jobs by 80 unique users. In future work, this methodology could be extended to include comparisons of thre or more populations using nonparametric ANOVA with the F- statistic, or perhaps multiway-ANOVA to find interactions betwen multiple factors. Often software packages risk feature overload, thereby alienating the user; Metastats has been designed to be rigorous but streamlined, and additional extensions wil need to conform to this standard. 126 Finaly, moving from post-procesing into modeling, we described a methodology for infering microbial interaction webs from time-series 16S datasets. While there are many technical and experimental isues that require further validation, this project represents a step toward the holy grail of metagenomics: to model the dynamics of a microbial community and acurately forecast how a perturbation could atain a desired result. This achievement would dramaticaly impact many fields of science including medical microbiology, environmental sustainability, bioenergy generation, waste disposal, and industrial crop management. As we continue to move toward this dream, we already know many of our chalenges ? specific technological and experimental innovations must be realized including precise estimation of microbial cel density, improvements in DNA sequencing technology, and unbiased taxonomic profiling protocols. These innovations wil happen, and they wil take us ever closer toward our ultimate goal. The future of metagenomics is not in the hands of microbiologists alone. There is room for many areas of expertise including mathematics, computer science, enginering, medicine, chemistry, geology, and oceanography ? and al are vital. 127 Appendices Appendix 1: Diferentially abundant COGs in comparison of infant and adult gut microbiomes COG id Description mature mean infant mean Metastat qvalue COG0249 Mismatch repair ATPase (MutS family) 0.01601 0.00527 0.072 COG0358 DNA primase (bacterial type) 0.02438 0.0076 0.072 COG0427 Acetyl-CoA hydrolase 0.00542 0.00131 0.072 COG0482 Predicted tRNA(5- methylaminomethyl-2- thiouridylate) methyltransferase, contains the P-lop ATPase domain 0.00917 0.00270 0.072 COG0574 Phosphoenolpyruvate synthase/pyruvate phosphate dikinase 0.01271 0.00407 0.072 COG0739 Membrane proteins related to metalloendopeptidases 0.0241 0.00608 0.072 COG0793 Periplasmic protease 0.01465 0.0033 0.072 COG1808 Predicted membrane protein 0.00168 0.000 0.072 COG3152 Predicted membrane protein 0.0086 0.00424 0.072 COG3956 Protein containing tetrapyrrole methyltransferase domain and MazG-like (predicted pyrophosphatase) domain 0.00301 0.002 0.072 128 COG427 Predicted DNA-binding protein with the Helix- hairpin-helix motif 0.00319 0.0032 0.072 COG500 Signal transduction histidine kinase involved in nitrogen fixation and metabolism regulation 0.00239 0.0003 0.072 COG545 Predicted P-lop ATPase and inactivated derivatives 0.0142 0.00185 0.072 COG0543 2-polyprenylphenol hydroxylase and related flavodoxin oxidoreductases 0.01034 0.00581 0.0894 COG037 Predicted ATPase of the PP-lop superfamily implicated in cell cycle control 0.01276 0.00497 0.01084 COG032 3-oxoacyl-[acyl-carrier- protein] synthase III 0.00951 0.00243 0.01084 COG0612 Predicted Zn-dependent peptidases 0.01507 0.00451 0.01084 COG1013 Pyruvate:ferredoxin oxidoreductase and related 2- oxoacid:ferredoxin oxidoreductases, beta subunit 0.00794 0.00279 0.01084 COG1014 Pyruvate:ferredoxin oxidoreductase and related 2- oxoacid:ferredoxin oxidoreductases, gamma subunit 0.01017 0.00207 0.01084 COG1074 ATP-dependent exoDNAse (exonuclease V) beta subunit (contains helicase and exonuclease domains) 0.0093 0.00398 0.01084 129 COG112 Superfamily I DNA and RNA helicases and helicase subunits 0.00872 0.0092 0.01084 COG196 Chromosome segregation ATPases 0.01676 0.00651 0.01084 COG149 Alpha-amylase/alpha- mannosidase 0.00181 0.000 0.01084 COG1636 Uncharacterized protein conserved in bacteria 0.0035 0.0034 0.01084 COG2385 Sporulation protein and related proteins 0.00646 0.0012 0.01084 COG480 Secreted protein containing C-terminal beta-propeller domain distantly related to WD- 40 repeats 0.0069 0.000 0.01084 COG174 Uncharacterized homolog of PSP1 0.00523 0.0097 0.01275 COG0208 Ribonucleotide reductase, beta subunit 0.00215 0.00675 0.01565 COG045 NAD/FAD-utilizing enzyme apparently involved in cell division 0.01017 0.00295 0.01565 COG1086 Predicted nucleoside- diphosphate sugar epimerases 0.00869 0.00198 0.01565 COG3451 Type IV secretory pathway, VirB4 components 0.03289 0.00942 0.01565 COG375 Phosphotransferase system, galactitol- specific IIC component 0.00182 0.00498 0.01565 COG348 Predicted thiol oxidoreductase 0.00121 0.000 0.01708 COG1797 Cobyrinic acid a,c- diamide synthase 0.00454 0.0092 0.01934 COG1595 DNA-directed RNA polymerase specialized sigma subunit, sigma24 homolog 0.05279 0.01297 0.02057 130 COG0192 S-adenosylmethionine synthetase 0.0079 0.00479 0.02167 COG0465 ATP-dependent Zn proteases 0.01276 0.00704 0.02167 COG145 Ferredoxin 0.01656 0.00496 0.02167 COG2059 Chromate transport protein ChrA 0.00818 0.00186 0.02167 COG0514 Superfamily II DNA helicase 0.0192 0.00560 0.0270 COG24 Membrane protein involved in the export of O-antigen and teichoic acid 0.01940 0.00897 0.0291 COG046 ATP-dependent Lon protease, bacterial type 0.00871 0.00320 0.0230 COG384 Acyl-ACP thioesterase 0.0030 0.000 0.0230 COG181 Phospholipid-binding protein 0.0052 0.00359 0.02348 COG0674 Pyruvate:ferredoxin oxidoreductase and related 2- oxoacid:ferredoxin oxidoreductases, alpha subunit 0.00832 0.00256 0.02435 COG1748 Saccharopine dehydrogenase and related proteins 0.00493 0.0063 0.02518 COG0323 DNA mismatch repair enzyme (predicted ATPase) 0.0073 0.0036 0.02701 COG0642 Signal transduction histidine kinase 0.013205 0.07023 0.02701 COG0653 Preprotein translocase subunit SecA (ATPase, RNA helicase) 0.01037 0.00452 0.02701 COG2878 Predicted NADH:ubiquinone oxidoreductase, subunit RnfB 0.00526 0.00121 0.02701 COG4864 Uncharacterized protein conserved in bacteria 0.00190 0.0017 0.02701 COG162 Predicted GTPases 0.00691 0.00241 0.02783 COG2376 Dihydroxyacetone kinase 0.00187 0.00919 0.02783 131 COG3505 Type IV secretory pathway, VirD4 components 0.0287 0.0095 0.02783 COG1493 Serine kinase of the HPr protein, regulates carbohydrate metabolism 0.00463 0.0010 0.02795 COG3712 Fe2+-dicitrate sensor, membrane component 0.0259 0.0031 0.02795 COG067 Predicted oxidoreductases (related to aryl-alcohol dehydrogenases) 0.0171 0.02134 0.02818 COG1409 Predicted phosphohydrolases 0.01283 0.00361 0.02818 COG3294 Uncharacterized conserved protein 0.0016 0.000 0.02818 COG5368 Uncharacterized protein conserved in bacteria 0.0026 0.000 0.02818 COG1864 DNA/RNA endonuclease G, NUC1 0.00215 0.0020 0.02823 COG1762 Phosphotransferase system mannitol/fructose- specific IIA domain (Ntr-type) 0.00436 0.01694 0.02929 COG487 Uncharacterized protein conserved in bacteria 0.00182 0.000 0.02932 COG1083 CMP-N- acetylneuraminic acid synthetase 0.00191 0.0025 0.03131 COG1629 Outer membrane receptor proteins, mostly Fe transport 0.0197 0.01276 0.03131 COG234 AT-rich DNA-binding protein 0.00424 0.0061 0.03131 COG385 FOG: Transposase and inactivated derivatives 0.00276 0.003 0.03131 COG0205 6-phosphofructokinase 0.01712 0.0058 0.03131 132 COG0602 Organic radical activating enzymes 0.00839 0.00437 0.03131 COG0731 Fe-S oxidoreductases 0.00191 0.000 0.03131 COG1035 Coenzyme F420- reducing hydrogenase, beta subunit 0.00351 0.0063 0.03131 COG1072 Panthothenate kinase 0.0007 0.00230 0.03131 COG1263 Phosphotransferase system IIC components, glucose/maltose/N- acetylglucosamine- specific 0.015 0.03075 0.03131 COG1350 Predicted alternative tryptophan synthase beta-subunit (paralog of TrpB) 0.00213 0.0002 0.03131 COG1351 Predicted alternative thymidylate synthase 0.0025 0.000 0.03131 COG1541 Coenzyme F390 synthetase 0.00647 0.00180 0.03131 COG1757 Na+/H+ antiporter 0.01080 0.00395 0.03131 COG2152 Predicted glycosylase 0.00518 0.0046 0.03131 COG3426 Butyrate kinase 0.00291 0.0054 0.03131 COG3635 Predicted phosphoglycerate mutase, AP superfamily 0.00230 0.000 0.03131 COG3943 Virulence protein 0.0090 0.00197 0.03131 COG4206 Outer membrane cobalamin receptor protein 0.02057 0.00252 0.03131 COG4658 Predicted NADH:ubiquinone oxidoreductase, subunit RnfD 0.0052 0.00203 0.03131 COG3414 Phosphotransferase system, galactitol- specific IIB component 0.0062 0.0056 0.038 COG371 Transcriptional antiterminator 0.00408 0.01954 0.038 133 COG1961 Site-specific recombinases, DNA invertase Pin homologs 0.05869 0.01827 0.03454 COG2365 Protein tyrosine/serine phosphatase 0.00250 0.0040 0.03454 COG03 Phosphoglucomutase 0.0050 0.00381 0.03492 COG0507 ATP-dependent exoDNAse (exonuclease V), alpha subunit - helicase superfamily I member 0.01603 0.00850 0.03492 COG0708 Exonuclease III 0.0079 0.00381 0.03492 COG0790 FOG: TPR repeat, SEL1 subfamily 0.00726 0.00169 0.03492 COG113 Gamma-aminobutyrate permease and related permeases 0.00178 0.01834 0.03492 COG160 Predicted GTPases 0.0089 0.0049 0.03492 COG18 Ribosome-associated heat shock protein implicated in the recycling of the 50S subunit (S4 paralog) 0.0038 0.0015 0.03492 COG1249 Pyruvate/2-oxoglutarate dehydrogenase complex, dihydrolipoamide dehydrogenase (E3) component, and related enzymes 0.00635 0.0138 0.03492 COG140 Phosphotransferase system cellobiose- specific component IIB 0.0010 0.00672 0.03492 COG2195 Di- and tripeptidases 0.01438 0.00580 0.03492 COG2509 Uncharacterized FAD- dependent dehydrogenases 0.00871 0.00205 0.03492 COG287 RecB family exonuclease 0.00219 0.0035 0.03492 134 COG3487 Uncharacterized iron- regulated protein 0.0084 0.000 0.03492 COG3560 Predicted oxidoreductase related to nitroreductase 0.00182 0.0015 0.03492 COG3935 Putative primosome component and related proteins 0.00472 0.00189 0.03492 COG3968 Uncharacterized protein related to glutamine synthetase 0.0102 0.00193 0.03492 COG4912 Predicted DNA alkylation repair enzyme 0.0032 0.0068 0.03492 COG037 3-dehydroquinate synthetase 0.0058 0.00290 0.03515 COG0367 Asparagine synthase (glutamine- hydrolyzing) 0.0076 0.00280 0.03515 COG0549 Carbamate kinase 0.00230 0.00763 0.03515 COG0686 Alanine dehydrogenase 0.00259 0.0039 0.03515 COG0724 RNA-binding proteins (RRM domain) 0.00296 0.000 0.03515 COG0745 Response regulators consisting of a CheY- like receiver domain and a winged-helix DNA-binding domain 0.07579 0.0510 0.03515 COG0747 ABC-type dipeptide transport system, periplasmic component 0.01052 0.02682 0.03515 COG1592 Rubrerythrin 0.00845 0.00235 0.03515 COG1875 Predicted ATPase related to phosphate starvation-inducible protein PhoH 0.0020 0.0003 0.03515 COG239 Mg/Co/Ni transporter MgtE (contains CBS domain) 0.00604 0.00184 0.03515 COG2374 Predicted extracellular nuclease 0.00185 0.0012 0.03515 135 COG2893 Phosphotransferase system, mannose/fructose- specific component IA 0.0025 0.01083 0.03515 COG3525 N-acetyl-beta- hexosaminidase 0.01608 0.00385 0.03515 COG3537 Putative alpha-1,2- mannosidase 0.01956 0.0021 0.03515 COG3950 Predicted ATP-binding protein involved in virulence 0.00174 0.0026 0.03515 COG42 Bacteriophage protein gp37 0.00472 0.0085 0.03515 COG4657 Predicted NADH:ubiquinone oxidoreductase, subunit RnfA 0.00468 0.0013 0.03515 COG5495 Uncharacterized conserved protein 0.00248 0.0037 0.03515 COG125 ABC-type proline/glycine betaine transport systems, ATPase components 0.0075 0.00381 0.03560 COG3292 Predicted periplasmic ligand-binding sensor domain 0.00649 0.000 0.03560 COG202 Regulators of stationary/sporulation gene expression 0.00478 0.0078 0.03627 COG374 Mannosyltransferase OCH1 and related enzymes 0.00245 0.0026 0.03627 COG019 Diaminopimelate decarboxylase 0.0153 0.00618 0.0364 COG0137 Argininosuccinate synthase 0.00502 0.0020 0.0364 COG0164 Ribonuclease HII 0.00532 0.00271 0.0364 COG0540 Aspartate carbamoyltransferase, catalytic chain 0.00610 0.0039 0.0364 COG0618 Exopolyphosphatase- related proteins 0.0054 0.00108 0.0364 COG083 Amino acid transporters 0.007 0.00493 0.0364 136 COG1390 Archaeal/vacuolar-type H+-ATPase subunit E 0.00176 0.0040 0.0364 COG382 Predicted enzyme involved in methoxymalonyl-ACP biosynthesis 0.0080 0.000 0.0364 COG471 Outer membrane receptor for ferrienterochelin and colicins 0.03898 0.0051 0.0364 COG0526 Thiol-disulfide isomerase and thioredoxins 0.0281 0.0141 0.03706 COG0572 Uridine kinase 0.0065 0.0021 0.0370 COG0459 Chaperonin GroEL (HSP60 family) 0.0065 0.0043 0.03826 COG0532 Translation initiation factor 2 (IF-2; GTPase) 0.00875 0.00453 0.03826 COG0632 Holiday junction resolvasome, DNA- binding subunit 0.0047 0.0023 0.03826 COG0646 Methionine synthase I (cobalamin-dependent), methyltransferase domain 0.00453 0.0015 0.03826 COG0785 Cytochrome c biogenesis protein 0.0040 0.0018 0.03826 COG107 Actin-like ATPase involved in cell morphogenesis 0.00730 0.00281 0.03826 COG1089 GDP-D-mannose dehydratase 0.00324 0.0051 0.03826 COG1262 Uncharacterized conserved protein 0.0017 0.0014 0.03826 COG1362 Aspartyl aminopeptidase 0.00764 0.00195 0.03826 COG1579 Zn-ribon protein, posibly nucleic acid- binding 0.00139 0.000 0.03826 COG234 Predicted aminopeptidases 0.00437 0.0096 0.03826 COG264 Ribosomal protein L1 methylase 0.00436 0.00190 0.03826 COG3174 Predicted membrane protein 0.0092 0.000 0.03826 137 COG3643 Glutamate formiminotransferase 0.00194 0.0014 0.03826 COG4360 ATP adenylyltransferase (5',5''-P-1,P-4- tetraphosphate phosphorylase II) 0.0049 0.000 0.03826 COG5015 Uncharacterized conserved protein 0.00163 0.0012 0.03826 COG0124 Histidyl-tRNA synthetase 0.00618 0.00350 0.03908 COG2859 Uncharacterized protein conserved in bacteria 0.0013 0.000 0.03908 COG3176 Putative hemolysin 0.0035 0.0053 0.03908 COG483 Predicted glycosyl hydrolase 0.00248 0.0017 0.03908 COG3206 Uncharacterized protein involved in exopolysaccharide biosynthesis 0.0053 0.0073 0.03923 COG240 Ferredoxin-like protein 0.0016 0.00182 0.03938 COG273 Beta-glucanase/Beta- glucan synthetase 0.0031 0.0036 0.0391 COG4804 Uncharacterized conserved protein 0.00768 0.00139 0.04142 COG0326 Molecular chaperone, HSP90 family 0.00763 0.0023 0.04210 COG0536 Predicted GTPase 0.0068 0.00357 0.04210 COG0676 Uncharacterized enzymes related to aldose 1-epimerase 0.0058 0.00253 0.04210 COG0781 Transcription termination factor 0.00416 0.00252 0.04210 COG1589 Cell division septal protein 0.0081 0.0034 0.04210 COG1643 HrpA-like helicases 0.00149 0.00649 0.04210 COG1696 Predicted membrane protein involved in D- alanine export 0.00768 0.00253 0.04210 COG1819 Glycosyl transferases, related to UDP- glucuronosyltransferase 0.00174 0.0049 0.04210 138 COG2081 Predicted flavoproteins 0.00734 0.00238 0.04210 COG204 Response regulator containing CheY-like receiver, AA-type ATPase, and DNA- binding domains 0.01946 0.00471 0.04210 COG275 Lysophospholipase L1 and related esterases 0.0142 0.00516 0.04210 COG3408 Glycogen debranching enzyme 0.00492 0.0092 0.04210 COG4856 Uncharacterized protein conserved in bacteria 0.00217 0.0080 0.04210 COG1609 Transcriptional regulators 0.0302 0.09242 0.04240 COG1052 Lactate dehydrogenase and related dehydrogenases 0.00930 0.00367 0.04298 COG4123 Predicted O- methyltransferase 0.00501 0.00181 0.04298 COG115 Na+/alanine symporter 0.016 0.0040 0.04313 COG2768 Uncharacterized Fe-S center protein 0.00548 0.0012 0.04313 COG475 Outer membrane protein/protective antigen OMA87 0.0102 0.0017 0.04313 COG4092 Predicted glycosyltransferase involved in capsule biosynthesis 0.0040 0.000 0.04341 COG083 Homoserine kinase 0.00105 0.00369 0.04346 COG198 Primosomal protein N' (replication factor Y) - superfamily II helicase 0.00829 0.00412 0.04346 COG1475 Predicted transcriptional regulators 0.02502 0.01376 0.04346 COG1649 Uncharacterized protein conserved in bacteria 0.00652 0.00107 0.04346 COG296 Uncharacterized conserved protein 0.00510 0.00305 0.04346 139 COG3250 Beta-galactosidase/beta- glucuronidase 0.05604 0.0297 0.04346 COG4231 Indolepyruvate ferredoxin oxidoreductase, alpha and beta subunits 0.00375 0.0064 0.04346 COG0621 2-methylthioadenine synthetase 0.01714 0.00781 0.0498 COG0707 UDP-N- acetylglucosamine:LPS N-acetylglucosamine transferase 0.00795 0.00474 0.0498 COG409 Predicted peptidase 0.00171 0.0015 0.04589 COG129 ABC-type sugar transport system, ATPase component 0.01257 0.02756 0.04924 Table 18 Diferentially abundant COGs in comparison of infant and adult gut microbiomes. 140 Bibliography 1. Snyder, L. and W. Champnes, Molecular genetics of bacteria. 3rd ed. 2007, Washington, D.C.: ASM Pres. xvii, 735 p. 2. heelis, M., Principles of Modern Microbiology 2008, Sudbury, MA: Jones and Bartlete Publishers. 11-25. 3. American Society for Microbiology., D.H. Bergey, and R.S. Bred, Bergey's manual of determinative bacteriology : a key for the identification of organisms of the class Schizomycetes. 1923, Baltimore: Wiliams & Wilkins company. xi,442p. 4. Razumov, A.S., The direct method of calculation of bacteria in water. Comparison with the Koch method. Mikrobiologiya, 1932. 1(2): p. 131-146. 5. Staley, J.T. and A. Konopka, Measurement of in situ activities of nonphotosynthetic microorganisms in aquatic and terestrial habitats. Annu Rev Microbiol, 1985. 39: p. 321-46. 6. Torsvik, V., J. Goksoyr, and F.L. Dae, High diversity in DNA of soil bacteria. Appl Environ Microbiol, 1990. 56(3): p. 782-7. 7. Woese, C.R., J. aniloff, and L.B. Zablen, Phylogenetic analysis of the mycoplasmas. Proc Natl Acad Sci U S A, 1980. 77(1): p. 494-8. 8. Woese, C.R., et al., The phylogenetic relationships of thre sulfur dependent archaebacteria. Syst Appl Microbiol, 1984. 5: p. 97-105. 9. Woese, C., et al., A comparison of the 16S ribosomal RNAs from mesophilic and thermophilic bacili: some modifications in the Sanger method for RNA sequencing. J Mol Evol, 1976. 7(3): p. 197-213. 10. Fox, G.E., et al., Classification of methanogenic bacteria by 16S ribosomal RNA characterization. Proc Natl Acad Sci U S A, 1977. 74(10): p. 4537-4541. 11. Lane, D.J., et al., Rapid determination of 16S ribosomal RNA sequences for phylogenetic analyses. Proc Natl Acad Sci U S A, 1985. 82(20): p. 6955-9. 12. Pace, N.R., A molecular view of microbial diversity and the biosphere. Science, 1997. 276(5313): p. 734-40. 13. Rusch, D.B., et al., The Sorcerer I Global Ocean Sampling expedition: northwest Atlantic through eastern tropical Pacific. PLoS Biol, 2007. 5(3): p. e77. 14. Schloss, P.D. and J. Handelsman, Toward a census of bacteria in soil. PLoS Comput Biol, 2006. 2(7): p. e92. 15. Otesen, A.R., et al., Impact of organic and conventional management on the phylosphere microbial ecology of an apple crop. J Food Prot, 2009. 72(11): p. 2321-5. 16. Delbes, C., L. Ali-Mandje, and M.C. Montel, Monitoring bacterial communities in raw milk and cheese by culture-dependent and -independent 16S rRNA gene- based analyses. Appl Environ Microbiol, 2007. 73(6): p. 1882-91. 17. Yamane, K., et al., Diversity and similarity of microbial communities in petroleum crude oils produced in Asia. Biosci Biotechnol Biochem, 2008. 72(11): p. 2831-9. 18. Dinsdale, E.A., et al., Functional metagenomic profiling of nine biomes. Nature, 2008. 452(7187): p. 629-32. 141 19. Eckburg, P.B., et al., Diversity of the human intestinal microbial flora. Science, 2005. 308(5728): p. 1635-8. 20. Ley, R.E., et al., Microbial ecology: human gut microbes associated with obesity. Nature, 2006. 444(7122): p. 1022-3. 21. Turnbaugh, P.J., et al., A core gut microbiome in obese and lean twins. Nature, 2009. 457(7228): p. 480-4. 22. Cole, J.R., et al., The Ribosomal Database Project (RDP-I): sequences and tools for high-throughput rRNA analysis. Nucleic Acids Res, 2005. 33(Database isue): p. D294-6. 23. DeSantis, T.Z., et al., Grengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol, 2006. 72(7): p. 5069-72. 24. Meyer, F., et al., The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics, 2008. 9(1): p. 386. 25. Covaci, A., et al., From microbial genomics to meta-genomics. Drug Development Research, 1997. 41(3-4): p. 180-192. 26. The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet, N.R.C.o.t.N.A. Commite on Metagenomics: Chalenges and Functional Applications, Editor. 2007, The National Academies Pres. 27. Tyson, G.W., et al., Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature, 2004. 428(6978): p. 37-43. 28. Venter, J.C., et al., Environmental genome shotgun sequencing of the Sargasso Sea. Science, 2004. 304(5667): p. 66-74. 29. DeLong, E.F., Microbial community genomics in the ocean. Nat Rev Microbiol, 2005. 3(6): p. 459-69. 30. Turnbaugh, P.J., et al., An obesity-associated gut microbiome with increased capacity for energy harvest. Nature, 2006. 444(7122): p. 1027-31. 31. Venter, J.C., et al., The sequence of the human genome. Science, 2001. 291(5507): p. 1304-51. 32. Lander, E.S., et al., Initial sequencing and analysis of the human genome. Nature, 2001. 409(6822): p. 860-921. 33. Waterston, R.H., et al., Initial sequencing and comparative analysis of the mouse genome. Nature, 2002. 420(6915): p. 520-62. 34. Adams, M.D., et al., The genome sequence of Drosophila melanogaster. Science, 2000. 287(5461): p. 2185-95. 35. Human Microbiome Project - Overview, P. Division of Program Coordination, and Strategic Initiatives, Editor. 2009. 36. Turnbaugh, P.J., et al., The human microbiome project. Nature, 2007. 449(7164): p. 804-10. 37. Margulies, M., et al., Genome sequencing in microfabricated high-density picolitre reactors. Nature, 2005. 437(7057): p. 376-80. 38. Sanger, F., S. Nicklen, and A.R. Coulson, DNA sequencing with chain- terminating inhibitors. Proc Natl Acad Sci U S A, 1977. 74(12): p. 5463-7. 39. Anderson, B., et al., A "double adaptor" method for improved shotgun library construction. Anal Biochem, 1996. 236(1): p. 107-13. 142 40. Chou, H.. and M.H. Holmes, DNA sequence quality triming and vector removal. Bioinformatics, 2001. 17(12): p. 1093-104. 41. Read, T.D., et al., Genome sequence of Chlamydophila caviae (Chlamydia psitaci GPIC): examining the role of niche-specific genes in the evolution of the Chlamydiaceae. Nucleic Acids Res, 2003. 31(8): p. 2134-47. 42. Richards, S., et al., Comparative genome sequencing of Drosophila pseudoobscura: chromosomal, gene, and cis-element evolution. Genome Res, 2005. 15(1): p. 1-18. 43. Delcher, A.L., et al., Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res, 2002. 30(11): p. 2478-83. 44. Kurtz, S., et al., Versatile and open software for comparing large genomes. Genome Biol, 2004. 5(2): p. R12. 45. Rabinowicz, P.D. and J.L. Bennetzen, The maize genome as a model for eficient sequence analysis of large plant genomes. Curr Opin Plant Biol, 2006. 9(2): p. 149-56. 46. Myers, E.W., et al., A whole-genome assembly of Drosophila. Science, 2000. 287(5461): p. 2196-204. 47. Seshadri, R., et al., Complete genome sequence of the Q-fever pathogen Coxiela burneti. Proc Natl Acad Sci U S A, 2003. 100(9): p. 5455-60. 48. Dethlefsen, L., et al., The Pervasive Efects of an Antibiotic on the Human Gut Microbiota, as Revealed by Dep 16S rRNA Sequencing. PLoS Biol, 2008. 6(11): p. e280. 49. Grice, E.A., et al., A diversity profile of the human skin microbiota. Genome Res, 2008. 18(7): p. 1043-50. 50. Huse, S.M., et al., Exploring Microbial Diversity and Taxonomy Using SSU rRNA Hypervariable Tag Sequencing. PLoS Genet, 2008. 4(11): p. e1000255. 51. Chen, K. and L. Pachter, Bioinformatics for whole-genome shotgun sequencing of microbial communities. PLoS Comput Biol, 2005. 1(2): p. 106-12. 52. Wang, Q., et al., Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol, 2007. 73(16): p. 5261-7. 53. Felsenstein, J., PHYLIP - phylogeny inference package (Version 3.2). 1989, Cladistics 5. 54. Hugenholtz, P., B.M. Goebel, and N.R. Pace, Impact of culture-independent studies on the emerging phylogenetic view of bacterial diversity. J Bacteriol, 1998. 180(18): p. 4765-74. 55. Sait, M., P. Hugenholtz, and P.H. Janssen, Cultivation of globally distributed soil bacteria from phylogenetic lineages previously only detected in cultivation- independent surveys. Environ Microbiol, 2002. 4(11): p. 654-66. 56. Meila, M., Comparing clusterings - an information based distance. Journal of ultivariate Analysis, 2007. 98(5): p. 873-895. 57. Lambais, M.R., et al., Bacterial diversity in tre canopies of the Atlantic forest. Science, 2006. 312(5782): p. 1917. 58. Mavromatis, K., et al., Use of simulated data sets to evaluate the fidelity of metagenomic procesing methods. Nat Methods, 2007. 4(6): p. 495-500. 143 59. DeSantis, T.Z., Jr., et al., NAST: a multiple sequence alignment server for comparative analysis of 16S rRNA genes. Nucleic Acids Res, 2006. 34(Web Server isue): p. W394-9. 60. Edgar, R.C., MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, 2004. 5: p. 113. 61. Thompson, J.D., D.G. Higgins, and T.J. Gibson, CLUSTAL W: improving the sensitivity of progresive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 1994. 22(22): p. 4673-80. 62. Schloss, P.D. and J. Handelsman, Introducing DOTUR, a computer program for defining operational taxonomic units and estimating species richnes. Appl Environ Microbiol, 2005. 71(3): p. 1501-6. 63. Warnecke, F., et al., Metagenomic and functional analysis of hindgut microbiota of a wood-feding higher termite. Nature, 2007. 450(7169): p. 560-5. 64. Corby-Haris, V., et al., Geographical distribution and diversity of bacteria associated with natural populations of Drosophila melanogaster. Appl Environ Microbiol, 2007. 73(11): p. 3470-9. 65. Kennedy, J., et al., Diversity of microbes associated with the marine sponge, Haliclona simulans, isolated from Irish waters and identification of polyketide synthase genes from the sponge metagenome. Environ Microbiol, 2008. 10(7): p. 1888-902. 66. Huber, J.A., et al., Microbial population structures in the deep marine biosphere. Science, 2007. 318(5847): p. 97-100. 67. Sogin, M.L., et al., Microbial diversity in the deep sea and the underexplored "rare biosphere". Proc Natl Acad Sci U S A, 2006. 103(32): p. 12115-20. 68. Chao, A., Non-parametric estimation of the number of classes in a population. Scand. J. Stat., 1984. 11: p. 265-270. 69. Chao, A. and S.M. Le, Estimating the Number of Classes Via Sample Coverage. Journal of the American Statistical Asociation, 1992. 87(417): p. 210-217. 70. Shannon, C.E., A Mathematical Theory of Communication. Bel System Technical Journal, 1948. 27(4): p. 623-656. 71. Hugenholtz, P., Exploring prokaryotic diversity in the genomic era. Genome Biol, 2002. 3(2): p. REVIEWS0003. 72. Lane, D.J., 16S/23S rRNA sequencing, in Nucleic Acid Techniques in Bacterial Systematics. 1991, Wiley: New York. p. 115-175. 73. Navlakha, S., et al., Finding Biologically Acurate Clusterings in Hierarchical Decompositions Using the Variation of Information. 2008. 74. Altschul, S.F., et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 1997. 25(17): p. 3389-402. 75. Ludwig, W., et al., ARB: a software environment for sequence data. Nucleic Acids Res, 2004. 32(4): p. 1363-71. 76. Rand, W.M., Objective Criteria for Evaluation of Clustering Methods. Journal of the American Statistical Asociation, 1971. 66(336): p. 846-&. 77. Schloss, P.D. and J. Handelsman, Metagenomics for studying unculturable microorganisms: cutting the Gordian knot. Genome Biol, 2005. 6(8): p. 229. 144 78. Bik, E.M., et al., Molecular analysis of the bacterial microbiota in the human stomach. Proc Natl Acad Sci U S A, 2006. 103(3): p. 732-7. 79. Batzoglou, S., et al., ARACHNE: a whole-genome shotgun assembler. Genome Res, 2002. 12(1): p. 177-89. 80. Palmer, C., et al., Development of the Human Infant Intestinal Microbiota. PLoS Biol, 2007. 5(7): p. e177. 81. Gil, S.R., et al., Metagenomic analysis of the human distal gut microbiome. Science, 2006. 312(5778): p. 1355-9. 82. Singleton, D.R., et al., Quantitative comparisons of 16S rRNA gene sequence libraries from environmental samples. Appl Environ Microbiol, 2001. 67(9): p. 4374-6. 83. Schloss, P.D., B.R. Larget, and J. Handelsman, Integration of microbial ecology and statistics: a test to compare gene libraries. Appl Environ Microbiol, 2004. 70(9): p. 5485-92. 84. Schloss, P.D. and J. Handelsman, Introducing SONS, a tool for operational taxonomic unit-based comparisons of microbial community memberships and structures. Appl Environ Microbiol, 2006. 72(10): p. 6773-9. 85. Huson, D.H., et al., MEGAN analysis of metagenomic data. Genome Res, 2007. 17(3): p. 377-86. 86. Lozupone, C. and R. Knight, UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol, 2005. 71(12): p. 8228-35. 87. Schloss, P.D. and J. Handelsman, Introducing TreClimber, a test to compare microbial community structures. Appl Environ Microbiol, 2006. 72(4): p. 2379- 84. 88. Rodriguez-Brito, B., F. Rohwer, and R.A. Edwards, An application of statistics to comparative metagenomics. BMC Bioinformatics, 2006. 7: p. 162. 89. Velculescu, V.E., et al., Serial analysis of gene expresion. Science, 1995. 270(5235): p. 484-7. 90. Lu, J., J.K. Tomfohr, and T.B. Kepler, Identifying diferential expresion in multiple SAGE libraries: an overdispersed log-linear model approach. BMC Bioinformatics, 2005. 6: p. 165. 91. Robinson, M.D. and G.K. Smyth, Moderated statistical tests for assesing diferences in tag abundance. Bioinformatics, 2007. 23(21): p. 2881-7. 92. Storey, J.D. and R. Tibshirani, Statistical significance for genomewide studies. Proc Natl Acad Sci U S A, 2003. 100(16): p. 9440-5. 93. Troyanskaya, O.G., et al., Nonparametric methods for identifying diferentially expresed genes in microarray data. Bioinformatics, 2002. 18(11): p. 1454-61. 94. Efron, B. and R. Tibshirani, An introduction to the bootstrap. 1993, New York: Chapman & Hal. xvi, 436. 95. Hesterberg, T., Control variates and importance sampling for eficient bootstrap simulations. Statistics and Computing, 1996. 6(2): p. 147-157. 96. Johns, M.V., Importance Sampling for Bootstrap Confidence-Intervals. Journal of the American Statistical Asociation, 1988. 83(403): p. 709-714. 97. Benjamini, Y. and Y. Hochberg, Controlling the False Discovery Rate - a Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B-Methodological, 1995. 57(1): p. 289-300. 145 98. Zar, J.H., Biostatistical analysis. 4th ed. 1999, Upper Saddle River, N.J.: Prentice Hal. 1 v. (various pagings). 99. Tusher, V.G., R. Tibshirani, and G. Chu, Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A, 2001. 98(9): p. 5116-21. 100. Ruijter, J.M., A.H. Van Kampen, and F. Bas, Statistical evaluation of SAGE libraries: consequences for experimental design. Physiol Genomics, 2002. 11(2): p. 37-44. 101. Krause, L., et al., Phylogenetic classification of short environmental DNA fragments. Nucleic Acids Res, 2008. 36(7): p. 2230-9. 102. Kurokawa, K., et al., Comparative metagenomics revealed commonly enriched gene sets in human gut microbiomes. DNA Res, 2007. 14(4): p. 169-81. 103. Tuomanen, E., Microbial inhabitants of humans - Their ecology and role in health and disease. Science, 2005. 308(5722): p. 635-635. 104. Piciano, M.F. and R.H. Dering, The influence of feding regimens on iron status during infancy. Am J Clin Nutr, 1980. 33(4): p. 746-53. 105. van der Gast, C.J., A.S. Whiteley, and I.P. Thompson, Temporal dynamics and degradation activity of an bacterial inoculum for treating waste metal-working fluid. Environ Microbiol, 2004. 6(3): p. 254-63. 106. Becks, L., et al., Experimental demonstration of chaos in a microbial food web. Nature, 2005. 435(7046): p. 1226-1229. 107. Pringault, O., et al., Temporal variations of microbial activity and diversity in marine tropical sediments (New Caledonia lagoon). Microb Ecol, 2008. 55(2): p. 247-58. 108. Cook, K.L., et al., Spatial and temporal changes in the microbial community in an anaerobic swine waste treatment lagoon. Anaerobe, 2009. 109. Raes, J. and P. Bork, Molecular eco-systems biology: towards an understanding of community function. Nat Rev Microbiol, 2008. 6(9): p. 693-9. 110. DeLong, E.F., et al., Community genomics among stratified microbial assemblages in the ocean's interior. Science, 2006. 311(5760): p. 496-503. 111. Biddle, J.F., et al., Metagenomic signatures of the Peru Margin subseafloor biosphere show a genetically distinct environment. Proc Natl Acad Sci U S A, 2008. 105(30): p. 10583-8. 112. McCalum, H., Population parameters : estimation for ecological models. ethods in ecology. 2000, Oxford ; Malden, Mas.: Blackwel Science. x, 348 p. 113. Mounier, J., et al., Microbial interactions within a cheese microbial community. Applied and Environmental Microbiology, 2008. 74(1): p. 172-181. 114. Pfister, C.A., Estimation Competition Coeficients from Census Data: A Test with Field Manipulations of Tidepool Fishes. The American Naturalist, 1995. 146(2): p. 271-291. 115. Hastie, T. and R. Tibshirani, Generalized additive models. 1st ed. Monographs on statistics and applied probability ; 43. 1990, London ; New York: Chapman and Hal. xv, 335 p. 116. Trosvik, P., N.C. Stenseth, and K. Rudi, Convergent temporal dynamics of the human infant gut microbiota. Isme J, 2009. 146 117. Wood, S.N., Generalized additive models : an introduction with R. Texts in statistical science. 2006, Boca Raton, FL: Chapman & Hal/CRC. xvii, 391 p. 118. Trosvik, P., et al., Characterizing mixed microbial population dynamics using time-series analysis. Isme J, 2008. 2(7): p. 707-15. 119. Eskola, H.T.M. and S.A.H. Geritz, On the mechanistic derivation of various discrete-time population models. Bulletin of Mathematical Biology, 2007. 69(1): p. 329-346. 120. Beryman, A.., The Orgins and Evolution of Predator-Prey Theory. Ecology, 1992. 73(5): p. 1530-1535. 121. Coleman, T.F. and Y.. Li, An interior trust region approach for nonlinear minimization subject to bounds. Siam Journal on Optimization, 1996. 6(2): p. 418-445. 122. Lagarias, J.C., et al., Convergence properties of the Nelder-Mead simplex method in low dimensions. Siam Journal on Optimization, 1998. 9(1): p. 112-147. 123. Lewis, R.M. and V. Torczon, Pattern search methods or linearly constrained minimization. Siam Journal on Optimization, 2000. 10(3): p. 917-941. 124. Turnbaugh, P., et al., The Efect of Diet on the Human Gut Microbiome: A Metagenomic Analysis in Humanized Gnotobiotic Mice. Sci Transl Med, 2009. 1(6): p. 6ra14. 125. Ley, R.E., D.A. Peterson, and J.I. Gordon, Ecological and evolutionary forces shaping microbial diversity in the human intestine. Cel, 2006. 124(4): p. 837-48. 126. Berlow, E.L., et al., Interaction strengths in food webs: isues and opportunities. Journal of Animal Ecology, 2004. 73: p. 585-598. 127. Macfarlane, G.T., S. Macfarlane, and G.R. Gibson, Validation of a Thre-Stage Compound Continuous Culture System for Investigating the Efect of Retention Time on the Ecology and Metabolism of Bacteria in the Human Colon. Microb Ecol, 1998. 35(2): p. 180-7. 128. Guarner, F. and J.R. Malagelada, Gut flora in health and disease. Lancet, 2003. 361(9356): p. 512-9. 129. Mahowald, M.A., et al., Characterizing a model human gut microbiota composed of members of its two dominant bacterial phyla. Proc Natl Acad Sci U S A, 2009. 106(14): p. 5859-64. 130. Klappenbach, J.A., et al., rndb: the Ribosomal RNA Operon Copy Number Database. Nucleic Acids Res, 2001. 29(1): p. 181-4. 131. Sonnenburg, J.L., C.T. Chen, and J.I. Gordon, Genomic and metabolic studies of the impact of probiotics on a model gut symbiont and host. PLoS Biol, 2006. 4(12): p. e413. 132. Faulweter, J.L., et al., Microbial proceses influencing performance of treatment wetlands: A review. Ecological Engineering, 2009. 35(6): p. 987-1004. 133. Moon, S., et al. Escherichia coli counting using lens-fre imaging for sepsis diagnosis. in Proc. SPIE. 2009. Berlin, Germany. 134. White, J.R., N. Nagarajan, and M. Pop, Statistical methods for detecting diferentially abundant features in clinical metagenomic samples. PLoS Comput Biol, 2009. 135. Ley, R.E., et al., Obesity alters gut microbial ecology. Proc Natl Acad Sci U S A, 2005. 102(31): p. 11070-5. 147 136. Macfarlane, S., G.T. Macfarlane, and J.H. Cummings, Review article: prebiotics in the gastrointestinal tract. Aliment Pharmacol Ther, 2006. 24(5): p. 701-14. 137. Rastal, R.A., et al., Modulation of the microbial ecology of the human colon by probiotics, prebiotics and synbiotics to enhance human health: an overview of enabling science and potential applications. FEMS Microbiol Ecol, 2005. 52(2): p. 145-52. 138. Camileri, M., Probiotics and iritable bowel syndrome: rationale, mechanisms, and eficacy. J Clin Gastroenterol, 2008. 42 Supl 3 Pt 1: p. S123-5. 139. O'Sullivan, G.C., et al., Probiotics: an emerging therapy. Curr Pharm Des, 2005. 11(1): p. 3-10. 140. Gomez-Alvarez, V., T.K. Teal, and T.M. Schmidt, Systematic artifacts in metagenomes from complex microbial communities. Isme J, 2009. 3(11): p. 1314- 7.