ABSTRACT Title of Dissertation: REFERENCE-GUIDED ASSEMBLY OF METAGENOMES Victoria Paz Cepeda-Espinoza, Doctor of Philosophy, 2020 Dissertation Directed By: Professor Mihai Pop, Department of Computer Science Microorganisms play an important role in all of the Earth's ecosystems, and are critical for the health of humans [1], plants, and animals. Most microbes are not easily cultured [2]; yet, Metagenomics, the analysis of organismal DNA sequences obtained directly from an environmental sample, enables the study of these microorganisms. Metagenomic assembly is a computational process aimed at reconstructing genes and genomes from metagenomic mixtures. The two main paradigms for this method are de novo assembly (i.e., reconstructing genomes directly from the read data), and reference-guided assembly (i.e., reconstructing genomes using closely related organisms). Because the latter paradigm has a high computational cost?due to the mapping of tens of millions of reads to thousands of full genome sequences? Metagenomic studies have primarily relied on the former paradigm. However, the increased availability of high-throughput sequencing technologies has generated thousands of bacterial genomes, making reference-guided assembly a valuable resource regardless of its computational cost. Thus, this study describes a novel metagenome assembly approach, called MetaCompass, that combines reference- guided assembly and de novo assembly, and it is organized in the following stages: (i) selecting reference genomes from a database using a metagenomic taxonomy classification software that combines gene and genome comparison methods, achieving species and strain level resolution; (ii) performing reference-guided assembly in a new manner, which uses the minimum set cover principle to remove redundancy in a metagenome read mapping while performing consensus calling; and (iii) performing de novo assembly using the reads that have not been mapped to any reference genomes. We show that MetaCompass improves the most common metrics used to evaluate assembly quality?contiguity, consistency, and reference-bases metrics?for both synthetic and real datasets such as the ones gathered in the Human Microbiome Project (HMP) [3], and it also facilitates the assembly of low abundance microorganisms retrieved with the reference-guided approach. Lastly, we used our HMP assembly results to characterize the relative advantages and limitations of de novo and reference- guided assembly approaches, thereby providing guidance on analytical strategies for characterizing the human-associated microbiota. REFERENCE-GUIDED ASSEMBLY OF METAGENOMES by Victoria Paz Cepeda Espinoza Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2020 Advisory Committee: Professor Mihai Pop, Chair Professor H?ctor Corrada-Bravo Professor Abhinav Bhatele Professor Robert Patro Professor Stephanie Yarwood, Dean?s Representative ? Copyright by Victoria Paz Cepeda Espinoza 2020 Preface The algorithms, software, and results in this dissertation have either been published in peer-reviewed journals and conferences or are currently under preparation for submis- sion and/or available as a preprint. At the time of this writing, parts of Chapters 1, 2 already been published and are reformatted here. Chapter 3 is under preparation for publication. Chapter 4 and 6 are available in a preprint and are under preparation for submission to a peer-reviewed journal. ? Chapter 1: Introduction. Ghurye, J. S., Cepeda-Espinoza, V., & Pop, M. (2016). Focus: Microbiome: Metagenomic Assembly: Overview, Challenges and Applications. The Yale journal of biology and medicine, 89(3), 353. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5045144 My contributions to these works include (1) surveying the literature for recent and seminal results, and (2) writing the manuscript. ? Chapter 2: Related work. Olson, N. D., Treangen, T. J., Hill, C. M., Cepeda-Espinoza, V., Ghurye, J., Koren, S., & Pop, M. (2017). Metagenomic assembly through the lens of validation: recent advances in assessing and improving the quality of genomes assembled from metagenomes. Briefings in Bioinformatics, bbx098. https://doi.org/10.1093/bib/bbx098 ii My contributions to these works include (1) surveying the literature for recent and seminal results, and (2) writing the manuscript. ? Chapter 3: Selecting references genome for metagenomic reference-guided assembly. Cepeda-Espinoza & Pop, M. ?Reference selection of Metagenomes?. Under preparation. My contributions to this work include (1) design and implementation of the method, (2) doing software evaluation, and (3) writing the manuscript. ? Chapters 4 and 5: o Chapter 4: Reference-guided metagenomic assembly. o Chapter 5: Hybrid reference-guided and de novo assembly of metagenomes. Cepeda-Espinoza, V., Liu, B., Almeida, M., Hill, C. M., Koren, S., Treangen, T. J., & Pop, M. ?MetaCompass: Reference-guided Assembly of Metagenomes?. bioRxiv (2017): 212506. https://www.biorxiv.org/content/early/2017/11/01/212506 My contributions to this work include (1) design and implementation of the algorithm, (2) doing software evaluation, and (3) writing the manuscript. Here is the list of other publications in which I have been involved, as a contributing author. ? Meisel, J. S., Nasko, D. J., Brubach, B., Cepeda-Espinoza, V., Chopyk, J., Corrada-Bravo, H., ... & Shah, N. (2018). Current progress and future iii opportunities in applications of bioinformatics for biodefense and pathogen detection: report from the Winter Mid-Atlantic Microbiome Meet-up, College Park, MD, January 10, 2018. My contributions to this work include (1) Participating in the Winter Mid- Atlantic Microbiome Meet-up as a speaker and discussion sessions, and (2) reviewing the manuscript before publication. ? Mihai Pop, Jacquelyn Meisel, Victoria Cepeda Espinoza, Kiran Javkar and Dylan Taylor. " Introducing genome assembly to the general public through interactive word games" (2020). Education COSI: Computational Biology Education. ISMB 2020. My contributions to this work include (1) Designing word games to simulate genome assembly and participating in the Maryland Day. iv Dedication To my family and friends for their unconditional love and support. v Acknowledgements First, I wish to express my sincere appreciation to my supervisor, Professor Mihai Pop, who gave the opportunity to join his lab to work on challenging problems. He convincingly guided and encouraged me to be professional and do the right thing even when the road got tough. Without his persistent help, this dissertation would not have been possible. I would also like to thank all my other committee members ?H?ctor Corrada- Bravo, Rob Patro, Abhinav Bhatele and Stephanie Yarwood?for being interested in my research and for giving me insightful feedback to complete my dissertation. I wish to express my deepest gratitude to all my colleagues and collaborators, which are all current and former members of the Pop Lab: Brian Brubach, Brook Stacy, Christopher Hill, Dan Nasko, Domenick Braccia, Dylan Taylor, Jay Ghurye, Hari Muralidharan, Irina Astrovskaya, Jacquelyn Meisel, Jeremy Selengut, Joseph Paulson, Kiran Javkar, Lee Mendelowitz, Marcus Fedarko, Mathieu Almeida, Nate Olson, Nidhi Shah, Saul Sarria, Senthil Muthiah, Seth Commichaux, Sergey Koren, Todd Treangen and Tu Luan. Your academic and moral support throughout my entire PhD. Process was a blessing. Finally, I would like to thank every person who made my graduate student life happier, including students, faculty, and staff from CBCB, UMIACS, the CS department and UMD. vi Table of Contents Preface .......................................................................................................................... ii Dedication .................................................................................................................... v Acknowledgements .................................................................................................... vi Table of Contents ...................................................................................................... vii List of Tables .............................................................................................................. ix List of Figures .............................................................................................................. x List of Abbreviations ................................................................................................. xi Chapter 1: Introduction ......................................................................................... 1 1.1 DNA basics ................................................................................................... 1 1.2 DNA sequencing ........................................................................................... 2 1.3 Genome assembly ......................................................................................... 4 1.3.1 De novo assembly ................................................................................. 5 1.3.2 Reference-guided assembly ................................................................ 10 1.4 Metagenomics ............................................................................................. 11 Chapter 2: Related work...................................................................................... 12 2.1 Discovering microbes present in a metagenomic sample ........................... 12 2.1.1 Taxonomic classification of metagenomes ......................................... 12 2.1.2 Current methods for taxonomic classification of metagenomes ......... 15 2.1.3 Metrics to evaluate taxonomy classification of metagenomes ........... 20 2.2 Metagenomic assembly ............................................................................... 22 2.2.1 Current methods for metagenomic assembly ...................................... 24 2.2.2 Metagenome assembly validation ....................................................... 26 2.3 Conclusion .................................................................................................. 30 Chapter 3: Selecting references genome for metagenomic reference-guided assembly 32 3.1 Introduction ................................................................................................. 32 3.2 Methods....................................................................................................... 33 3.2.1 Marker gene alignment ....................................................................... 33 3.2.2 Complete genome k-mer matching ..................................................... 34 3.2.3 Database construction ......................................................................... 35 3.2.4 Implementation details ........................................................................ 36 3.3 Results ......................................................................................................... 37 3.3.1 Evaluation of performance on Salmonella enterica simulated genome 37 3.3.2 Evaluation of performance on synthetic metagenomic dataset........... 39 3.3.3 Evaluation of computational performance on simulated and synthetic metagenomic datasets ......................................................................................... 42 3.3.4 Evaluation of performance on CAMI medium dataset ....................... 43 3.4 Conclusion and discussion .......................................................................... 46 Chapter 4: Reference-guided metagenomic assembly ...................................... 47 4.1 Introduction ................................................................................................. 47 vii 4.2 Related work ............................................................................................... 48 4.2.1 Read mapping ..................................................................................... 48 4.2.2 Consensus calling................................................................................ 50 4.2.3 Assembly polishing ............................................................................. 50 4.3 Methods....................................................................................................... 52 4.3.1 Read mapping ..................................................................................... 52 4.3.2 Selecting a minimal reference set for consensus calling .................... 52 4.3.3 Building contigs (consensus calling) .................................................. 54 4.4 Results ......................................................................................................... 56 4.5 Conclusion and future directions ................................................................ 59 Chapter 5: Hybrid reference-guided and de novo assembly of metagenomes 61 5.1 Introduction ................................................................................................. 61 5.2 Method ........................................................................................................ 61 5.2.1 Datasets used to evaluate metagenomic assemblies ........................... 63 5.2.2 Parameters used for metagenome assembly and metagenome assembly validation64 5.3 Results ......................................................................................................... 64 5.3.1 Evaluation of performance on synthetic metagenomic dataset........... 65 5.3.2 Evaluation of performance on Human Microbiome Project (HMP2) 74 5.4 Conclusion and discussion .......................................................................... 81 Chapter 6: Conclusion ......................................................................................... 84 Appendices ................................................................................................................. 87 Bibliography .............................................................................................................. 89 viii List of Tables Table 1.1. Overview of sequencing technologies. ........................................................ 2 Table 3.1. Most abundant species in MetaCompassRS database. .............................. 37 Table 3.2. Taxonomy classifier predictions at different taxonomy levels for Salmonella enterica serovar Typhimurium LT2 simulated dataset............................. 38 Table 3.3. Comparison between Salmonella enterica serovar Typhimurium LT2 (NC_003197.2) and MetaCompassRS false positive (FP) strains. ............................. 39 Table 3.4. Running time for taxonomy classifiers on simulated and synthetic datasets. ..................................................................................................................................... 43 Table 4.1. Evaluation of performance on synthetic dataset without using reference genomes. ..................................................................................................................... 58 Table 4.2. Evaluation of performance on synthetic dataset using reference genomes. ..................................................................................................................................... 58 Table 5.1. Evaluation of performance on synthetic dataset. ....................................... 67 Table 5.2. Evaluation of performance on down-sampled synthetic dataset. .............. 72 Table 5.3. Running time for assemblers on Shakya et al. sample. ............................. 73 Table 5.4. Re-assembly of 2,294 samples generated in the Human Microbiome Project. ........................................................................................................................ 76 ix List of Figures Figure 1.1. Depth and Breadth of coverage. ................................................................. 4 Figure 1.2. The challenge of repeats in metagenomes. ................................................. 5 Figure 1.3. Overview of different de novo assembly paradigms. ................................. 6 Figure 2.1. Metagenome assembly error signatures. .................................................. 29 Figure 3.1. Overview of MetaCompassRS. ................................................................ 34 Figure 3.2. Species and strain level classification results on Shakya et al. dataset. ... 40 Figure 3.3. Species and strain level classification results on down-sampled Shakya et al. dataset..................................................................................................................... 41 Figure 3.4. Scatter plot of Precision (x-axis) versus Recall (y-axis) across all datasets and taxa. ...................................................................................................................... 42 Figure 3.5. Species and strain level classification results on a medium complexity CAMI dataset. ............................................................................................................. 45 Figure 4.1. Aligning read to reference genomes. ........................................................ 53 Figure 4.2. Creating contigs from reads that are mapped to reference genome using the majority rule. ......................................................................................................... 55 Figure 5.1. Overview of the MetaCompass pipeline. ................................................. 62 Figure 5.2. Genome recovery percentages in synthetic metagenome (MetaCompass versus de novo assembly)............................................................................................ 68 Figure 5.3. Error profile on synthetic dataset. ............................................................ 70 Figure 5.4. MetaCompass performance on low coverage dataset. ............................. 74 Figure 5.5. Comparative assembly of 2,294 metagenomic samples from the HMP2 Project. ........................................................................................................................ 77 Figure 5.6. Icarus view of metagenomic assembly of the stool sample SRS143342 from the HMP2 Project. .............................................................................................. 80 Figure 5.7. Longest contig from Bacteroides cellulosilyticus strain WH2 chromosome genome assembly (accession:NZ_CP012801.1, length: 7084828 bp). ....................... 80 Figure 5.8. IGV visualization of read and contig mapped against a segment of the Bacteroides cellulosilyticus WH2 genome (accession:NZ_CP012801.1, length: 7084828 bp). ............................................................................................................... 81 x List of Abbreviations Bp Base pair DNA Deoxyribonucleic acid Gbp 1 Giga base pair equal to 109 base pairs Kbp 1 Kilo base pairs equal to 1000 base pairs Mpb 1 Mega base pairs equal to 106 base pairs ORF Open reading frame RNA Ribonucleic acid xi Chapter 1: Introduction 1.1 DNA basics DNA is a long chain of molecules that holds all the genetic information needed for the organisms to function and reproduce. The DNA alphabet is composed of four letters: A (Adenine), T (Thymidine), G (Guanine), and C (Cytosine), known as nucleotides. A DNA sequence can be represented as a string consisting of series of these four letters, and it is composed of two strands of nucleotides that ?match? each other following a constraint: i.e., A always matches T and C always matches G. Because nucleotides pair up, each letter in the DNA sequence is called base pair (bp). Here we will only refer to the sequence of one strand. An organism?s genome sequence refers to an organism?s complete set of DNA (e.g., the human genome has 3.2 billion bp). A genome consists of a set of genes? which are contiguous intervals of DNA that contain information needed to code for proteins or RNAs?that are paramount for answering a variety of biological questions, such as inheritance, ancestry, health and disease of an organism. 1 Table 1.1. Overview of sequencing technologies. Technology Read Accuracy Time per Bases per Length run run Single Molecular Real 10 kbp to 87% 30 min. to 5-10 Gbp Third- Time Sequencing (Pacific 15 kbp (Low) 4 hrs. generation Biosciences) sequencing technologies Oxford Nanopore MinION 5 kbp to 70% to 1 to 2 days 500 Mbp Sequencing 10 kbp 90% (Low) Ion Semiconductor (Ion Up to 98% 2 hrs. 10 Gbp Torrent sequencing) 400 bp (Medium) Second- Sequencing by synthesis 50-300bp 99.9% 1 to 11 300Gbp generation (Illumina) (High) days sequencing technologies Sequencing by Ligation 75 bp 99.9% 1 to 2 3 Gbp (SOLiD sequencing) (High) weeks Pyrosequencing (454) 700 bp 99.9% 24 hrs. 400 Mbp (High) First- Chain termination (Sanger 400 to 99.9% N/A 50-100 generation sequencing) 900 bp (High) Kbp sequencing technologies 1.2 DNA sequencing The process of determining the complete base pair sequence order of a string of DNA sequencing is called DNA sequencing or simply sequencing. The technique called shotgun sequencing randomly ?breaks up? DNA into a collection of small fragments, and each fragment is individually sequenced into a read. In many cases, reads are pair 2 ended or mate-paired, which means that pairs of reads are sequenced from the same DNA fragment. The distance between the reads in each pair, and their relative orientation are approximately known. The various sequencing technologies developed over the past 40 years can be broadly classified into three generations based on key technological innovations (Table 1.1). First-generation sequencing, usually referred to as Sanger sequencing, relied on DNA cloning, and therefore had limited throughput. The second-generation sequencing, often called short-read sequencing due to the much shorter length of the sequences compared to the Sanger technology, massively increased throughput by parallelizing many reactions on chips. Third-generation sequencing, also called single molecule or long-read sequencing, allows sequencing of single DNA molecules in contrast with prior technologies that required each molecule to be amplified [4]. Single molecule technologies can read substantially longer sequences than prior technologies. Sequencing technologies randomly oversample the genome and produce many overlapping reads such that the total amount of reads is greater than the amount needed to ?cover? each base pair of the genome. The theoretical or expected coverage is the average number of times that each nucleotide is expected to be sequenced given a certain number of reads of a given length and the assumption that reads are randomly distributed across an idealized genome[5]. Empirically, the term ?depth of coverage? (usually referred to as coverage) is defined as the average number of times a base of a genome is sequenced or ?covered? by a read. The term ?breadth of coverage?, on the other hand, is defined as the percentage of bases of a genome that are sequenced, or the percentage of the genome ?covered? by reads (Figure 1.1). 3 Figure 1.1. Depth and Breadth of coverage. In this example, the depth of coverage of the reference genome (10 Mbp) is 4X (there is average of 4 reads per base pair). The breadth of coverage is 90% of the reference genomes, or, in other words, 1 Mbp of the genome is not covered by any read. 1.3 Genome assembly Genome assembly is the reconstruction of a genome from short overlapping reads, and it is a complex computational tasks due to DNA segments repeated within a same organism, also known as ?intragenomic repeats? (Figure 1.2 A) [6]. Repeats present a challenge, because different genomic regions that share repeats can be indistinguishable if the repeats are longer than the reads. Therefore, if a genome region has a repeat, the repeat will introduce several possible sequences or paths in the assembly graph. It has been shown that assembly complexity is directly tied to the ratio between the sequencing read length and the length of repeats [7]. Algorithms and computational tools called ?genome assemblers? are able to reconstruct near-complete genome sequences from reads and they fall into two paradigms: the de novo assembly strategy where reads are used to reconstruct the genome without prior knowledge of the source of DNA, and a reference-guided assembly (also known as comparative assembly), in which reads are aligned to a reference genome. Although the first genome assembler was developed almost three 4 decades ago and now there are numerous computational tools to tackle genome assembly, genome assembly remains a challenging computational problem. In most cases, genome assemblers cannot fully reconstruct an organism?s genome and the output consists of a set of continuous fragments called contigs. Figure 1.2. The challenge of repeats in metagenomes. Three genomes are used to depict intragenomic (A) and intergenomic (B) repeats. The dark blue and light blue genomes represent two closely related strains and the green genome an unrelated strain. Within the genomes the red, orange, and tan blocks represent inparalogs. The yellow blocks represent a horizontal gene transfer event between the light blue and green genomes. In traditional assembly, any reads longer than the inparalog blocks (red, orange) would be sufficient to fully resolve the genome. In metagenomic assembly, reads longer than the full syntenic block (gray) would be necessary. 1.3.1 De novo assembly Currently, most state of the art de novo genome assemblers use a graph-based approach, where the problem can be formulated as a Hamiltonian or Eulerian path problem [8], depending on how the reads and overlaps are defined. Ideally, the genome assembly problem has one solution, but the graph formulations can have many solutions and finding the correct solution (genome assembly) is NP-hard [9]. Due to the computational intractability, three heuristic based methods have been developed 5 throughout time to perform de novo assembly: Greedy assembly, Overlap-layout consensus assembly, and de Bruijn graph assembly?currently the most widely used technique (Figure 1.3). Figure 1.3. Overview of different de novo assembly paradigms. Schematic representation of the three main paradigms for genome assembly ? Greedy, Overlap- Layout-Consensus, and de Bruijn. In Greedy assembler, reads with maximum overlaps are iteratively merged into contigs. In Overlap-Layout-Consensus approach, a graph is constructed by finding overlaps between all pairs of reads. This graph is further simplified and contigs are constructed by finding branch-less paths in the graph and taking the consensus sequence of the overlapping reads implied by the corresponding paths. Contigs are further organized and extended using mate pair information. In de Bruijn graph assemblers, reads are cut into short overlapping segments (k-mers) organized in a de Bruijn graph structure based on their co- occurrence across reads. The graph is simplified to remove artifacts due to sequencing errors, and branch-less paths are reported as contigs. 6 1.3.1.1 Greedy assembly This approach first identifies overlaps between pairs of reads and merges reads with the best overlaps. The overlap and merging steps continue in an iterative way until all reads and overlaps are merged. The main advantages of the greedy assembly method is the simplicity of its algorithm, which makes it easy to implement, and its effectivity when the genome contains only short or no repeats. Its disadvantage, on the other hand, is that the choices made during the merging steps are locally optimal and do not consider global relationships between reads. As a result, this approach can produce incorrect assemblies within repetitive sequences. 1.3.1.2 Overlap-layout consensus (OLC) This method was developed in 1995[9] and was used to assemble the first bacterial genome, Haemophilus influenzae [10][11], and the first human genome [12][13]. The OLC approach has three steps. In the first overlap step, it computes all pairwise overlaps with a dynamic programming-based alignment algorithm. The complexity of this computational step is quadratic in terms of the number of reads. Then an overlap graph is generated from both the reads and pairwise overlaps, using reads as vertices and overlaps as edges. In the second layout step, the overlap graph is simplified to identify a path (or ?layout?) of the reads along the genome that corresponds to its sequence. And, finally, in the consensus step the layout is used to construct a multiple alignment of the reads to infer the sequence of the genome. This approach is effective at high error rates, but its efficiency is reduced with high depth of coverage due to the complexity of the overlap computation step. Therefore, this approach is more suitable for relatively long reads such as those 7 generated by first and third-generation technologies, and it is also particularly beneficial in assembling reads with high error rates, such as those generated by third generation technologies. 1.3.1.3 De Bruijn Graph (DBG) The DBG approach became popular with the appearance of second-generation sequencing technologies, which increased the throughput to hundreds of millions of reads, as opposed to the Greedy and OLC assembly approaches that were designed for first-generation sequencing, and which did not scale well. The reads are used to construct a DBG as follows: each read is decomposed into overlapping segments of equal length k, called k-mers. The k-mers become the nodes of the graph, and the edges connect nodes with k-1 matching bases. In this approach reads are not explicitly aligned to each other, rather their overlaps can be inferred from the fact that they share k-mers. The DBG is a multigraph due to repeats. Repeats create additional edges in the graph, increasing the number of possible traversals. Given a collection of all k-mers in a genome sequence, the assembly problem reduces to finding an Eulerian path?a path through the graph that visits each edge once. The de Bruijn formulation above assumes perfect data and makes the assumption that for a given read length k, we are given all length-k substrings of a genome as well as the number of times they occur. In practice, not all substrings are obtained, and several factors impact the performance of de Bruijn graph assemblers: (i) sequencing errors; (ii) repeats; and (iii) the depth of sequencing coverage. The interplay 8 between these factors drives the choice of optimal k-mer size for a specific application as well as the ultimate performance of an assembler. Unlike the Overlap-Layout-Consensus approach, the DBG paradigm is affected by read errors. Sequencing errors create incorrect k-mers thereby increasing the complexity of the graph and making it more difficult to identify an unambiguous reconstruction of a sequence. These errors must be eliminated prior to identifying an Eulerian path in the graph. Every error impacts at most k different k-mers, thus the impact of sequencing errors increases with the size of k. As a result, assemblers often include a correction step or assume pre-corrected data as input. Initial de Bruijn assemblers used spectral correction [14], which attempts to make a minimum number of changes in a sequence to make it consistent with correct or ?solid? k-mers [15,16]. Repeats create ambiguity in the reconstruction of the genome and therefore a larger possible space of solutions must be explored [7]. Without further information, an assembler can randomly choose one of the branches, possibly leading to assembly errors, or decide to break the assembly, leading to fragmented results. Large values of k reduce the complexity of the graph and impact of repeats, but using such values requires longer sequences (longer than the k-mer size) as well as a higher depth of coverage, leading to an increased impact of sequencing errors (each error impacts k different k-mers). Conversely, shorter k-mers mitigate the impact of sequencing errors but lead to a higher impact of repeats on assembly effectiveness. Assuming uniform error and random sequencing, it is possible to compute the expected surviving coverage for a given k-mer size and input coverage [17]. These trade-offs represent a key 9 component of the algorithmic choices made by the assembly software and also guide the empirical choices made by users of assembly tools. Finally, the depth of coverage impacts the connectivity of the DBG graph. A path stretching from one read to another across an entire genome can only be found if adjacent reads share k-mers. At low depths of coverage, the adjacent reads are only expected to overlap by a small extent, and as a result the assembly is only possible for small values of k. 1.3.2 Reference-guided assembly The reference-guided assembly approach consists of two steps: first, all the reads are aligned against the reference genome; then a consensus sequence is generated by calling a base at each position where reads have mapped along the reference genome. This approach is more effective than de novo assembly in resolving repeats and is thus able to get better results than de novo approaches especially at low depths of coverage. Long repeats are still a challenge as they lead to an ambiguous alignment of reads against the genome, though the use of mate-pair information can partly mitigate this issue and can help to identify the correct placement of reads. At the same time, the effectiveness of the comparative assembly approach depends on the availability of a closely related reference sequence. Differences between the genome being assembled and the reference can lead to either errors in reconstruction or to a fragmented assembly. The AMOScmp [18,19] comparative assembler attempts to identify such polymorphisms and rearrangements between genomes, and breaks the assembly at these locations in order to avoid mis-assemblies. 10 Several tools were developed to help augment or improve de novo assemblies with the help of reference genomes. OSLay [20], Projector 2 [21], ABACAS [22] and r2cat [23] simply use a reference sequences to identify the correct order and orientation of contigs from a de novo assembly. An extension of this approach was proposed by Husemann et al. [24] that leverages information from multiple related genomes, weighted by their evolutionary distance from the sequence being assembled. Scaffold_builder [25] also provides functionality to join together contigs that were left unassembled by the de novo approach, thereby helping improve the assembly through the use of a reference sequence. Finally, E-RGA [26] performs de novo and reference guided assembly independently first and then merges two assemblies later using a novel data structure called merge graph to avoid mis-assemblies and ambiguous overlaps. 1.4 Metagenomics Metagenomics studies microbial samples directly taken from the environment. This technology allows research in human microbiome, soil, air, bodies of water, surfaces, and virtually any place where there is a community of interest. The advantage of metagenomic sequencing over single genome sequencing is the possibility of studying all the archaea, bacteria, plasmids, and viruses present at a given time in a sample. This same feature is also a challenge, for the microbial composition of the sample is unknown. In the following chapter we further outline several approaches developed to address both the composition and the assembly of a metagenomic sample. 11 Chapter 2: Related work 2.1 Discovering microbes present in a metagenomic sample Finding and quantifying the composition of a microbial community is a fundamental part of metagenomics. This process is both biologically and computationally challenging. In this chapter, we outline key biological concepts to understand how microorganisms can be computationally classified and quantified. Then, we describe current computational approaches used to classify and determine the composition of a metagenomic sample. 2.1.1 Taxonomic classification of metagenomes Taxonomy classifies living organisms into eight ranks: domain, kingdom, phylum, class, order, family, genus, and species. Each taxonomic rank is called taxon (plural taxa). Bacteria and archaea?the organism of interest in a metagenomic sample? belong to the prokaryotic domain in the tree of life and can be further classified into a taxonomy rank below species called strain?which are genetic variants (or subtypes) within a species. Taxonomic profiling is the computational process of inferring which taxonomic ranks are present in a microbial community (taxonomy classification) and estimating their relative abundances [27]. In metagenomics, marker genes are genes that are conserved across species, and thus are suitable to discriminate between taxonomic 12 ranks. The most widely used approaches for metagenomics taxonomy profiling use two type of marker genes universally present in prokaryotes: single genetic markers called 16S rRNA genes, and protein coding single-copy orthologous genes. In bacteria and archaea, the genes coding for the 16S ribosomal RNA (16S rRNA), part of the 30S small ribosomal unit, are referred to as 16S rRNA genes. These genes sequences consist of nine conserved regions separated by nine hypervariable regions (V1-V9). Highly conserved regions can be used as PCR primer bonding sites to amplify and sequence one or more hypervariable regions of the 16S rRNA gene, which, in turn, are used to identify the phylogeny of microorganism [28]. Such characteristics have made 16S rRNA sequencing one of the most widely used approaches to characterize the taxonomic diversity of a metagenomic sample. After sequencing, 16S rRNA analysis pipelines [29?31] start by clustering reads based on sequence similarity into Operational Taxonomic Units (OTUs). Then, a representative sequence from each OTU is compared against curated 16S rRNA reference databases [32?35] to assign taxonomic labels. The taxonomic resolution of 16S pipelines is usually limited to genus level. The 16S approach has several known shortcomings: (i) different organisms contain different copy numbers of 16S rRNA genes, introducing abundance estimation biases [36]; (ii) the amplification process introduces biases [37,38]; (iii) targeting different sub-regions of the 16S rRNA gene can influence the taxonomic assignment [39]; and (iv) problems differentiating species in an accurate and consistent manner [40]. 13 One of the possible explanations of the 16S rRNA gene problems to delineate taxa is its extremely slow rate of evolution. Thus, organisms from closely related but different taxa (e.g. different species from the same genera) might not have evolved fast enough to diverge in their 16S rRNA gene sequences [41,42]. Single-copy protein- coding orthologous genes, usually called single-copy marker genes, evolve faster than 16S rRNA and have been shown to have more power at resolving the relationships of closely related species [43]. Moreover, single-copy marker genes overcome many of the shortcomings of 16S rRNA genes. First, single-copy marker genes can be retrieved by sequencing the whole metagenomic dataset instead of targeting one gene, avoiding amplification biases. Second, they are not biased by copy number variation, allowing a more accurate abundance estimation of metagenomes. Lastly, single-copy marker genes can provide microbial species boundaries at higher resolution than 16S rRNA genes [44,45]. Several studies (Ciccarelli et al. [46] , Sorek et al. [44]) identified 40 universal single copy marker gene families that are present in all bacteria and archaea and can be used to reconstruct a phylogenetic tree [43,44]. These marker genes families are available in the Clusters of Orthologous Groups of proteins (COGs) public database [47,48]. Single copy marker genes can be extracted from microbial genomes using Hidden Markov Models (HMMs) trained on protein alignments [45]. Single-copy protein-coding orthologous genes are currently the most used marker genes in metagenomic studies and are part of some of the methods described below. 14 2.1.2 Current methods for taxonomic classification of metagenomes The taxonomic classification of metagenomes is computationally challenging for two reasons. First, high-throughput sequences technologies generate millions of reads that need to be analyzed. Second, there are hundreds of thousands microbial genomes available in public databases and the number is constantly increasing. Thus, the number of comparisons that need to be performed to analyze a metagenomic sequencing dataset is considerably large. In this section, we highlight published methods for metagenomic taxonomy classification. Most methods gather genomic, genetic, and taxonomy information from the public NCBI Reference Sequence Database (RefSeq) [49?52]. Several methods performed well in a recent review [53]. 2.1.2.1 Whole genome alignment-based methods The most intuitive approach for predicting the composition of a metagenomic sample is comparing each read to a database of reference genomes. This task can be accomplished by traditional methods based on local sequence alignment [54,55], which are highly accurate. Yet, while effective, aligning each read individually to a database of whole genomes can become prohibitively slow. Here we described the most popular alignment-based methods. BLAST (basic local alignment search tool) [56,57] is the most widely used software suite for sequence alignment (at nucleotide and protein level) and has been shown to align reads with greater accuracy than other sequence alignment methods [58]. BLAST was designed to find local similarities between one or more ?query? sequences and one 15 or more ?subject? sequences within a database. The intuition behind BLAST is that if the query and subject sequences are highly similar, they will contain exactly matching k-mers or ?seeds?. Before running a BLAST search, a database is created from subject sequences by decomposing them into seeds (k-mers of length 7 to 11) and then storing them in a hash table. During the search, BLAST performs local sequence alignment using a seed-and-extend algorithm. In the seed step, the query is decomposed into seeds and then looked up into the hash table to locate seed matches between the query and subject. In the extend step, matching seeds are joined and extend using the Smith- Waterman alignment algorithm [59]. In a metagenomic sequencing experiment, a massive set of reads correspond to the query and the hundreds of thousands microbial reference genomes correspond to the subject. Although BLAST was not designed for metagenomics and it is computationally intensive, it has been incorporated in several metagenomic classification tools described below as a pre-filter for read classification due to his accuracy. MegaBLAST [60], which is part of the BLAST software suite, was designed to compare highly similar sequences. MegaBLAST uses a greedy algorithm to perform gapped alignments between nucleotide sequences, and longer seeds (length 28) to reduce the number of alignments and accelerate the search. This is the only tool from the BLAST+ software suit that can compare metagenomic reads to a reference genome database in a feasible amount of time. MegaBLAST can also serve as a pre-filter for read classification. 16 DIAMOND (double index alignment of next-generation sequencing data) [61] is a protein-based method similar to BLASTx?BLAST module that translates the query into its six reading frames and compares it to a protein database? to align read queries to a protein database. Similar to MegaBLAST, it uses a longer seed (length from 15 to 24) to speed up the search of its BLASTx-like approach, and its main novelty is the use of double-indexing to determine the list of all seeds and their locations in both the query and subject. Double indexing improves cache locality, thus reducing memory usage. Although not designed for metagenomics classification, a DIAMOND search is usually followed by MEGAN [55], which post-processes sequence alignments and assigns each read to taxa using the lowest common ancestor (LCA) algorithm. Kaiju [62] is a protein-based classifier that finds maximum inexact matches on the protein-level using the Burrows?Wheeler transform (BWT) [63] and FM-index [64]. The use of the FM-index to store the reference genomes reduces memory requirements compared to both BLAST and DIAMOND. Kaiju first translates each read into six- reading frames and then searches for MEMs (Maximal exact matches) in the FM-index. Then taxonomic assignment is done by assigning reads to the longest MEM, or to the LCA taxon if a read matches multiple taxa. Kaiju also has a greedy search mode which allows some mismatches by searching backwards in the BWT. 2.1.2.2 Marker gene alignment-based methods As described in the previous section, marker genes are ideal for taxonomic profiling of metagenomic samples. While traditional sequence alignment methods can become prohibitively slow, by using marker genes instead is much faster due to the reduced 17 size of the subject database [62,65?67]. Thus, traditional sequence alignment methods, such as BLAST, can quickly align reads to marker genes while maintaining high accuracy. The following methods are based on marker gene sequence alignment. MetaPhyler [65] is a taxonomic classifier that relies on 30 marker genes as a taxonomic reference. First, MetaPhyler aligns a metagenomic sample against a marker gene database using BLAST. It then classifies each read individually based on its best blast-hit alignments to the database, and it uses different thresholds (automatically learned from the reference database) for each combination of taxa, reference gene, and sequence length. MetaPhyler achieves genus and species level taxonomic resolution. MetaPhlAn (Metagenomic Phylogenetic Analysis) is a taxonomic classifier that relies on a clade-specific marker gene database. Reads are aligned to the marker gene database using Bowtie2 [68]. The total number of reads in each clade is normalized by marker gene length to then provide a relative abundance of each taxon. MetaPhlAn achieves genus and species level taxonomic resolution. 2.1.2.3 K-mer-based methods K-mer based methods provide a fast identification of a metagenomic sample by relying on exact-match database queries instead of alignment. To achieve this, these methods pre-compute all k-mers contained in a database of complete microbial genomes [69? 71]. 18 Although k-mer approaches were created to achieve maximal speed, their main drawback is its substantial memory requirement. More recently, methods using a subset of k-mers to reduce the dimension of the problem have been developed [72?74], drastically reducing memory requirements. Regardless of the efficiency of k-mer-based approaches, a main shortcoming is their lower accuracy compared to alignment-based sequencing methods. Kraken [71,75], the first k-mer based taxonomic classifier, uses a hash-based index to store a genome?s k-mers along with its taxonomic label. If a k-mer is shared across multiple taxa, the k-mer is stored along with the LCA of those taxa. During the classification, Kraken decomposes each read into its constituent k-mers and then maps each k-mer to the database with an inexpensive table lookup. Because Kraken assigns reads to the LCA of taxa, many reads don?t get specific labels assigned. To tackle this problem, BRACKEN (Bayesian Reestimation of Abundance after Classification with Kraken) [76] was designed to re-estimate taxonomic abundance from Kraken results, and it estimates abundance by redistributing read assignments in the taxonomic tree using Bayesian probabilities. Bracken achieves genus and species level taxonomic resolution. CLARK [70], similar to Kraken, builds a database of genome?s k-mers. However, CLARK reduces the size of the k-mer by storing only species or genus-level specific k-mers and removing nonunique k-mers and rare k-mers, which also reduces noise 19 during the classification. CLARK-S [77] improves CLARK?s sensitivity by replacing fixed-length k-mers with target-specific or discriminative spaced k-mers. Mash Screen [78] is an extension of Mash [73], a tool that uses MinHash dimensionality reduction techniques to quickly calculate the approximated distance between two genomes via Jaccard index. Mash Screen introduces the concept of ?screen?, in which a genome database is tested for their containment within a set of metagenomic reads. For each reference genome, Mash Screen computes a containment score that measures the similarity of the reference genome to a metagenomic dataset. Similar to BLAST and DIAMOND, Mash Screen was not designed for taxonomy classification but can serve as a pre-filtering step. 2.1.3 Metrics to evaluate taxonomy classification of metagenomes The metrics selected to benchmark metagenomic classifiers greatly influence their relative rankings and performance. Different metrics have been applied for evaluating the binary classification task of predicting taxa presence or absence. The most commonly metrics for presence or absence in metagenomic classification used across benchmarking studies [79?81] are precision, recall, F1 score, and the Jaccard index. Precision, also known as positive predictive value, refers to the proportion of true positive classifications, out of the total number of classifications attempted: ?? ????????? = ?? + ?? 20 In the context of metagenome classification, precision can be calculated by taxon as the proportion of correct classification in the sample divided by the number of total classifications identified by the method. Recall, also known as sensitivity or true positive rate, is defined as the proportion of true positive classifications out of the total true positives plus false negatives being tested: ?? ?????? = ?? + ?? In the context of metagenome classification, recall can be calculated by taxon as the proportion of correct classifications divided by the number of distinct elements in the sample. There is a fundamental trade-off between precision and recall. Depending on the downstream analysis being performed after taxonomy classification, achieving either a higher precision or a higher recall can be preferred. Precision can represent a measure of exactness or quality, while recall a measure of completeness or quantity. In reference-based methods, such as reference-guided metagenome assembly and pangenome-based analysis, it is desired to retrieve all-known taxa present in the classification (higher recall) without sacrificing precision. The F1 Score measures the balance between precision and recall. The F1 score is the harmonic mean of recall and precision, weighting them equally in a single metric: 2 ? ????????? ? ?????? ?? ????? = ????????? + ?????? 21 Finally, the Jaccard index refers to the number of true positives (intersect between predicted and real communities) divided by the true positives plus the false positives and negatives: ?? ??????? ????? = ?? + ?? + ?? To provide a realistic estimate of precision, recall, F1 score, and Jaccard index for benchmarking, all taxonomy classifiers should be tested using the same abundance threshold. 2.2 Metagenomic assembly The goal of metagenomic assembly is to reconstruct the genomes of all organisms in a given microbial community. Metagenomic datasets are commonly sequenced using short second-generation reads. Thus, similarly to the de novo genome assembly problem, the short-metagenome assembly problem can be formulated as a DBG problem. In a metagenomic context, the task of a DBG metagenome assembler is not just to reconstruct one path through a graph, but a multitude of paths that come together and split apart at different places. As described in Chapter 1, the main factors affecting the performance of single genome DBG assemblers are sequencing errors, repeats?the presence of repetitive DNA segments within an organism's genome (intragenomic repeats)?, and the depth of sequencing coverage. Furthermore, the problem of reconstructing a metagenome is complicated by: (i) the presence of strain variants; (ii) the combination of both intragenomic and intergenomic repeats (DNA segments shared between distinct organisms, Figure 1B); and (iii) the uneven sequencing coverage within a metagenome. 22 Strain variants create a challenge similar to sequencing errors and in highly polymorphic samples the assembly result will likely be fragmented [82]. Furthermore, distinguishing true biological differences from sequencing errors becomes nearly impossible in a metagenomic setting. Intragenomic repeats are generally small (usually smaller than ~10,000 bp in bacteria [83,84]) but intergenomic repeats can be nearly the entire chromosomes for closely related strains. Multiple bacteria from the same species in a community (strain variants) may differ in just one gene, in which case almost the entire genomes are inter- genomic repeats. The decision of whether such differences can be ignored when reconstructing the corresponding genome, or whether it is proper to reconstruct individual-specific genomes, is not only computationally difficult but also ill-defined from a biological point of view. Due to uneven sequencing coverage within a metagenome, coverage heuristics employed for single genome assembly can no longer be used to detect repeats [85]. Organisms sequenced at high depth of coverage (often exceeding 1000-fold) lead to high computational costs. In a DBG, the higher depth of coverage amplifies the effect of errors on the assembly graph and may even confuse error correction algorithms (simply by chance multiple random errors can confirm each other). Organisms sequenced at low depth of coverage (less than 10-fold) can be assembled using shorter k-mers, but this strategy can lead to a higher impact of repeats on the assembly. Due to these complications, despite initial attempts, algorithms developed for a single genome assembly cannot be applied directly to metagenomics data. Instead, 23 several approaches, mostly under the DBG paradigm, have been developed that explicitly consider the specific characteristics of metagenomic datasets. 2.2.1 Current methods for metagenomic assembly Various de novo assemblers [3,4,8,10] have been developed and applied to the assembly of metagenomes from massive amounts of short reads. In this section, we highlight three published algorithms developed specifically for de novo metagenomic assembly that perform well in recent reviews [86,87]. IDBA-UD [88] is part of the IDBA (Iterative De Bruijn Graph De Novo Assembler) [89] suite of assemblers. A key algorithmic component of IDBA assemblers is the use of multiple k-mer sizes to address the trade-offs of different choices of k. To improve the DBG, IDBA-UD iterates through a range of k-mer values in a stepwise fashion. Sequencing errors are corrected at each iteration, reducing its impact in the assembly. The assembly graph becomes more resolved with increasing k-mer size in each iteration step, resulting in a more contiguous assembly. MEGAHIT [90] relies on the same multiple k-mer strategy as the [90] IDBA assemblers [89]. MEGAHIT is currently the most efficient de novo assembler largely due to its use of efficient data-structures for storing the de Bruijn graph[87]. Memory requirements are reduced by using a new data structure, called succinctly de Bruijn graph [91]. Memory is further reduced by eliminating k-mers below a defined frequency threshold from the graph. This step also minimizes the negative impact of 24 sequencing errors on the assembly. To retain k-mers from low abundance organisms, distinguishing them from errors, MEGAHIT reconsiders discarded k-mers in low- coverage regions of the assembly graph. MetaSPAdes [92] is a metagenomic-specific version of the SPAdes assembler [93], and it was originally designed to address two major issues of single cell sequencing data: the uneven read coverage and chimeric sequences?issues that are also relevant to metagenomic assembly. The main innovation in these family of assemblers is the use of paired-end information during the assembly process rather than afterwards [94]. This information is incorporated in the DBG by using pairs of k-mers separated by an estimated distance. Similar to IDBA-UD and MEGAHIT, SPAdes follows an iterative multiple k-mer approach, and, moreover, it uses the complete read information together with the preassembled contigs at every step. In addition, metaSPAdes was extended to handle strain variation; micro-variations between highly similar ?strain-contigs? are combined to form high quality consensus sequences, aiming at the best possible representation of each species instead of every strain variant. MetaSPAdes is slower than IDBA-UD and MEGAHIT and it is not scalable to large datasets[87]. Despite advances in metagenomic assembly algorithms [88,95?98], the assembly problem remains computationally challenging. As mentioned in Chapter 1, reference-guided assembly is more effective than the de novo assembly when sufficiently closely related sequences are available, yet, it has not been applied to 25 metagenomics. In this regard, a metagenomic reference-guided approach will be discussed in Chapter 4. 2.2.2 Metagenome assembly validation The validation of genome assemblies has been an active area of interest since the development of the first genome assemblers in the late 1970s [99]. The most commonly used metrics to evaluate the quality of metagenomics assembly can be classified into (i) contiguity-based, (ii) completeness-based, (iii) reference-based, and (iv) consistency-based. Contiguity, completeness and consistency-based metrics rely on features of the assembled data, seeking to identify internal inconsistencies indicative of potential assembly errors. Reference-based metrics need to know the ?ground truth? genomes used for generating the metagenomic reads. 2.2.2.1 Contiguity-based metrics Contiguity-based metrics are the most intuitive. These metrics evaluate how fragmented the final assembly is. The most common metric used to compare assemblies, the number of contigs (total number of assembled contigs reported by each assembler), attempts to assess how far the assembly is from the ideal goal of one contig per chromosome. Since most assemblies consist of many small contigs, usually due to sequencing errors or other artifacts, this metric can be misleading. More robust measures are the Contig Number at 1Mbp (the number of contigs required to exceed 1Mbp) and the Assembly size at 1MBp (the size of the largest contig C such that the sum of all contigs larger than C exceeds 1Mbp). The choice of 1Mbp is driven by bacterial and archaeal genome sizes: while the average length of such organisms is 26 usually longer than 3Mbp, many of them encompass multiple replicons of shorter length (~0.3-1.3Mbp) [100]. Since the real average length of replicons in a metagenome is arbitrary, using 2-5Mbps is also reasonable. The contiguity metrics already described do not take any correctness information into account and can be misled by accepting errors?a single long contig can be constructed by concatenating all the reads in an incorrect order. 2.2.2.2 Completeness-based metrics The most intuitive completeness-based metric is the total assembly size, which is the total number of bases in the assembly. The gene information contained in a metagenome assembly can also be used not only to evaluate completeness but also to measure how useful an assembly may be to downstream analyses. As genes are used to address biological questions, a greater number or density of genes results in more information available for testing biological hypotheses. Single-copy marker genes, here referred simply as marker genes, can be assumed to exist in all newly assembled bacterial and archaeal sequence. Thus, an assembly where some of these genes are missing can be assumed to be incomplete. Additionally, complete genes and marker genes metrics can be used as a measure of correctness, as assembly errors would disrupt ORFs (open reading frames). The completeness-based metrics used to represent a metagenome?s gene content are complete genes and complete marker genes, which are the median number of fully reconstructed complete genes and the marker genes, respectively. 27 2.2.2.3 Reference-based metrics To calculate reference-based validation metrics, the assembly is compared to a database containing previously assembled genes or genomes [101,102]. The most common reference-based metrics are: (i) Genome Recovery (%), which is the median percentage of each truth genome that is recovered; (ii) Total Aligned Length, which is the sum of the length of contigs aligned to the truth genomes; (iii) Total Unaligned Length, which is the sum of the length of unaligned contigs; and (iv) NGAx, which is the length of the contig that covers at least half the reference genome. Reference-based metrics are particularly effective in benchmarking experiments that try to reconstruct communities with known composition. However, these metrics can have limited effectiveness in real datasets. For example, metagenomic segments originating from a genome for which no reference sequence is available cannot be verified through a reference-based approach. It is also difficult to determine whether differences between an assembled contig and the reference genome are true differences o errors. 2.2.2.4 Consistency-based metrics It is often important to determine where exactly errors were introduced in the assembly, either to correct these mistakes, or to ensure that the errors do not influence the results of downstream analyses. The major types of assembly errors are: repeat collapse, insertions, deletions, and inversions (Figure 2.1). Consistency-based metrics evaluate assembly errors by aligning sequencing reads to the assembly and finding regions where the mappings are inconsistent. 28 Common consistency-based metrics include depth of coverage, consensus, split read mapping and insert size consistency. The depth of coverage metric is a statistical comparison of global vs local coverage, as signature of compressed or expanded repeats. Increases in coverage show collapsed repeats, while drops in coverage or coverage gaps can show breakpoints due to insertions, deletions, and inversions. Consensus refers to the concordance of the consensus to the read pileup. Split-read mapping measures single reads with partial alignments to separate locations of a genome. Lastly, insert size consistency evaluates the concordance of the insert size (distance between read pairs); increase in insert size shows expanded repeats and decrease size shows collapsed repeats. Figure 2.1. Metagenome assembly error signatures. There are four primary types of assembly errors, repeat collapse, insertions, deletions, and inversions. These assembly errors can be identified by mapping reads to the assembly and evaluating the coverage (blue curve), distance between read pairs (green reads), and split read mapping data (green reads). Increase in coverage indicates repeat collapse whereas drops in 29 coverage indicate breakpoints for insertions, deletions, and inversions. Shorter than expected distance between read pairs indicates potential repeat collapse or deletion, whereas increase in distance between read pairs indicates a potential insertion. Inconsistency in read pair direction can indicate an inversion. Finally, split-read mapping data, obtained by independently aligning the first and last third of a read can be used in a similar manner to read pair information to identify assembly errors [103]. 2.2.2.5 Current Methods for Metagenome assembly validation The software packages CheckM [104] and BUSCO [105] are only based on completeness-based metrics. CheckM relies on single-copy marker genes that are specific to a genome-based lineage within a reference tree, while also supplying information to correct the assemblies. BUSCO (Benchmarking Universal Single-Copy Orthologs) evaluates assemblies by measuring single copy ortholog marker genes and it estimates contamination from the recovered genes. The most used tool for metagenome assembly validation is the tool called MetaQUAST [106], which, unlike the previous ones that only use completeness metrics, incorporates contiguity, consistency, and reference-based metrics. MetaQUAST is a modification of QUAST [107], an isolate genome assembly validation tool that computes alignments of assembled contigs to a single reference genome. Similar to QUAST, MetaQUAST identifies mis-assemblies relative to a set of reference genomes. Additionally, metaQUAST applies a structural variant finding algorithm to distinguish between structural variants and true assembly errors. 2.3 Conclusion In this chapter, we first introduced the concept of taxonomy classification?finding the organisms present in a metagenomic sample?and then described several methods? whole genome alignment, marker gene alignment, and k-mer-based?to accomplish 30 this task. Secondly, we reviewed contemporary advances and challenges in de novo metagenomic assembly and outlined the main challenges faced by de novo metagenomic assemblers. Lastly, we described current methods and strategies for metagenome assembly validation and error characterization based on contiguity, completeness, consistency and references. In the following two chapters, we present our method for both taxonomy classification and metagenome assembly. 31 Chapter 3: Selecting references genome for metagenomic reference-guided assembly 3.1 Introduction As mentioned in Chapter 1, reference-guided assembly is an effective approach when sufficiently closely related sequences are available, yet, it has not been applied to metagenomics. Differences between the genomes being assembled and the references can greatly affect the final assembly by either leading to errors in reconstruction or to a fragmented assembly. Therefore, selecting closely related reference genomes is a crucial step before reference-guiding the assembly of a microbial community. The most popular metagenomic classification methods, Kraken and MetaPhlAn, are designed to achieve, at most, species level taxonomic resolution. Kraken was the first k-mer-based approach for metagenomic classification, and it is best-suited to rapidly match metagenomic sequences to large databases of complete genomes. The drawback of using k-mer approaches is that they are not as accurate as older sequence alignment-based methods. MetaPhlAn, on the other hand, maps metagenomic reads to a database of clade-specific marker genes to perform taxonomy classification. Although marker genes are a well-known resource to select biologically relevant genomes and can led to good precision and recall at species level, such genes only account for a small part of the complete microbial genome, excluding additional genomic information that can be relevant for strain resolution. 32 To tackle the above-mentioned problems, we designed MetaCompassRS (MetaCompass Reference Selection), a metagenome classification approach that achieves strain-level resolution by combining a marker gene sequence alignment approach with a whole genome k-mer matching approach. MetaCompassRS can be used both as a standalone software and as part of the MetaCompass pipeline described in Chapter 5. 3.2 Methods MetaCompassRS (Figure 3.1) follows a two-stage strategy: a marker gene alignment stage and a complete genome k-mer matching stage. 3.2.1 Marker gene alignment Each genome is assumed to have a defined marker gene set, thus, if a genome is present in a metagenomic sample, it should have a sufficient portion of its marker gene set covered by reads. We define this concept as ?marker gene set containment? to estimate how well a set of reference genomes is contained in metagenomic sample. Although aligning millions of reads to a marker gene database is relatively fast, we further speed up this process by pre-filtering the reads. We use kmer-mask [108] to extract k-mers from both metagenomic reads and a marker genes database, and then filter out reads without exact k-mer matches. Next, we use Blastn to align the complete sequence of the pre-filtered reads against the complete marker gene sequences. Lastly, we estimate a marker gene set containment score for each reference genome and only keep references above a certain percentage threshold. 33 3.2.2 Complete genome k-mer matching In this stage, we use the complete sequence of the pre-selected reference genomes from the first stage and the complete set of reads to re-estimate read containment. We use Mash Screen to identify which reference genomes are sufficiently contained within the reads. We then select the strains with higher containment score, which are more likely to be present in the metagenomic sample. By combining these two stages, we take advantage of the high accuracy of marker gene alignment methods and the efficiency of k-mer based approaches to compare complete genomes. Figure 3.1. Overview of MetaCompassRS. A) Reads are prefiltered using k-mer-mask. B) Pre-filtered reads are aligned to marker genes using blastn. Then Marker gene set containment is estimated from Blast results and complete genomes C) Preselected genomes are screened for containment using Mash screen. 34 3.2.3 Database construction To create our reference database, we retrieve high-quality genome assemblies from the NCBI Refseq database, including complete genome assemblies and chromosome level assemblies (which include chromosomes, scaffolds and contigs). We also retrieve taxa, genes, and protein sequences associated to each genome. After retrieving all the necessary information from RefSeq, we used several tools to gather marker gene information from each genome. First, use the tool FetchMG [45,109] to predict the 40 universal single-copy marker genes present in each genome. Due to the high redundancy of organisms currently available at RefSeq, many genomes share almost identical marker gene sequences. To further speed up the marker gene alignment stage (described below), we cluster almost identical marker genes with CDHIT [110]. Lastly, we use kmer-mask, part of the [111] k-mer counter package Meryl, to create a marker gene k-mer database. Finally, we process each genome retrieved from RefSeq to gather the k-mer information used in the second stage of MetaCompassRS. We pre-compute k-mer sketches for each genome using Mash sketch and estimate pairwise average nucleotide identity (ANI) between genomes using Mash dist. The former information is used by Mash screen to estimate containment. The latter is used to filter out almost identical genomes from the final metagenomic classification. 35 3.2.4 Implementation details 3.2.4.1 Marker gene alignment stage Pre-filtering reads before alignment: We pre-filter the reads to speed up the Blastn alignment. Given a k-mer database and a set of reads, kmer-mask computes the fraction of the reads which are covered by k-mers in the database. We use a seed size of 28. Aligning reads to marker genes: We run blastn with a word size or seed of 28, to retrieve only highly similar alignments. Since closely related genomes can share the same marker genes, each read can be aligned to multiple marker genes. Estimating marker gene set containment: After aligning reads to marker genes, we process the alignment results to estimate the ?marker gene set containment? per each genome. Intuitively, this metric estimates how well a genome is contained in the reads, using a set of marker genes as representation of such genome. Given a genome ?G? with ?n? marker genes, we define the marker gene set containment score as the total number of marker genes bases covered by the reads divided by the sum of marker gene lengths: ? ????? ?????? ?? ????? ?????? ???? ??? ??????????? ??? ?????? ? = ? ?????? ???? ?????? ? The number of bases covering a marker gene is calculated from the blast alignment results by extracting all the alignment intervals for that marker gene, then merging overlapping intervals, and finally adding the merged interval lengths. We exclude from further consideration all the genomes with an estimated marker gene set containment below a certain threshold (one third of the complete marker gene set, by default). 36 3.2.4.2 Whole genome comparison stage. To re-estimate read containment, we use Mash Screen with a minimum identity threshold of 0.95 percent. This step removes many reference genomes that had a sufficient marker gene containment score but were poorly contained in the reads when considering the complete genome. In case of finding multiple strains that satisfied all the criteria mentioned above, we only select the top ten strains per species. 3.3 Results 3.3.1 Evaluation of performance on Salmonella enterica simulated genome To test the efficacy of our approach in finding the correct organism among multiple strains, we simulated reads from a strain of one of the most abundant species in our database (Table 3.1), Salmonella enterica?well-known food pathogen that causes gastroenteritis in humans. We chose the strain Salmonella enterica serovar Typhimurium LT2 (available at RefSeq with accession NC_003197.2), which has close homologue genes with eight genomes from the Enterobacteriaceae family: Salmonella enterica serovar Typhi CT18, Salmonella enterica serovar Paratyphi A, Salmonella enterica serovar Paratyphi B, Salmonella enterica arizonae, Salmonella bongori, E. coli K12, E. coli O157:H7, and Klebsiella pneumoniae [112]. Table 3.1. Most abundant species in MetaCompassRS database. Species ID Number of strains Description 562 3217 Escherichia coli 573 2022 Klebsiella pneumoniae 28901 1779 Salmonella enterica 47466 1634 Borrelia miyamotoi 1280 987 Staphylococcus aureus 37 We compared the performance of our approach with the most widely used marker gene and k-mer based taxonomic classifiers, MetaPhlAn, Kraken2, respectively. We also used Kraken2 with its abundance estimation companion method Bracken. We evaluated the genus and strain level classification results in terms of precision, recall and F1 score. At the genus and species level, MetaCompassRS and MetaPhlAn made a perfect prediction (Table 3.2). In Contrast, Kraken2 and Kraken2+Bracken reported several low abundance false positives. At the strain level, MetaCompass outperformed Kraken2 and Metaphlan2 by reporting fewer false positives. Among all strains reported by the two versions of Kraken2, Salmonella enterica serovar Typhimurium LT2 was not assigned any read. Conversely, MetaCompassRS reported the correct strain with a marker gene containment score of 99. 8% and a genome containment of 99.99%. Compared to the true Salmonella enterica strain, the nine false positive strains reported by MetaCompassRS (Table 3.3) shared more than 99% of identity and covered 97-99% of the true genome. This experiment highlights the effectiveness of our approach in selecting highly closed reference genomes. Table 3.2. Taxonomy classifier predictions at different taxonomy levels for Salmonella enterica serovar Typhimurium LT2 simulated dataset. Note that MetaPhlAn does not provide strain level resolution. Kraken2 Kraken2+Bracken MetaPhlAn2 MetaCompass Genus 14 35 1 1 Species 9 59 1 1 Strains 161 240 NA 10 38 Table 3.3. Comparison between Salmonella enterica serovar Typhimurium LT2 (NC_003197.2) and MetaCompassRS false positive (FP) strains. Query cover describe how much the query genome, NC_003197.2, is covered by each FP and percentage of identity describes how similar the query is to each FP. Accession ID Description Query Percentage cover of Identity NC_021151.1 Salmonella enterica subsp. enterica serovar 99% 99.99% Typhimurium str. U288 NZ_CP007523.1 Salmonella enterica subsp. enterica serovar 98% 99.99% Typhimurium str. CDC 2011K-0870 NZ_CP014051.2 Salmonella enterica strain LT2 100% 100% NZ_CP014971.2 Salmonella enterica subsp. enterica serovar 98% 99.97% Typhimurium str. USDA-ARS-USMARC-1898 NZ_CP021909.1 Salmonella enterica subsp. enterica strain 99% 100% ST1120 NZ_CP028199.1 Salmonella enterica subsp. enterica serovar 100% 100% Typhimurium strain CFSAN018746 NZ_CP025736.1 Salmonella enterica strain FORC_079 99% 99.98% NZ_CP041005.1 Salmonella enterica strain FDAARGOS_768 100% 100% NZ_CP032494.1 Salmonella enterica subsp. enterica serovar 97% 99.99% Typhimurium strain SO21 3.3.2 Evaluation of performance on synthetic metagenomic dataset We evaluated our method on synthetic microbial community published by Shakya et al. [113]. The synthetic sample was downloaded from the NCBI Short Read Archive (SRA) database, (SRR606249) and contains 54 bacteria and 10 archaea. Among these organisms, 55 had complete genome sequences in the NCBI RefSeq database (the database used by default by MetaCompassRS), and 9 are available only as a high- quality draft assembly. The sample contains 61 species. At the species level, MetaCompassRS outperformed Kraken2 and MetaPhlan2 in recall, correctly predicting 59 out of the 61 species. Although MetaPhlan2 achieved the highest precision, thus, predicting fewer false positives, it only predicted 54 out of the 61 species. Both MetaCompassRS and MetaPhlan2 had almost identical F1 score (Figure 3.2A). 39 At the strain level, MetaCompassRS outperformed Kraken2 in all metrics, correctly predicting 59 out of the 64 strains. The 5 strains that MetaCompassRS did not predict are draft assemblies without complete reference genomes available to date. (Figure 3.2B). A B Figure 3.2. Species and strain level classification results on Shakya et al. dataset. (A) Heatmap showing Precision, Recall, and F1 score at species level. (B) Heatmap showing Precision, Recall, and F1 score at strain level. To evaluate the ability of MetaCompassRS to classify low-coverage genomes, we downsampled the synthetic dataset to 10% of its original size. The results (Figure 40 3.3) highlight that MetaCompassRS is also highly effective at low coverage and outperformed the other tools in several metrics. A B Figure 3.3. Species and strain level classification results on down-sampled Shakya et al. dataset. (A) Heatmap showing Precision, Recall, and F1 score at species level. (B) Heatmap showing Precision, Recall, and F1 score at strain level. We further evaluated the classification averaging across all datasets on the species and strain level (Figure 3.4). Only MetaCompass achieved more than 90% recall for both taxa. Similarly, only MetaPhlAn achieved more than 50% of precision. 41 In terms running time, Kraken2 was the fastest across all datasets. MetaCompassRS and MetaPhlAn, which are both based on read alignment, took a very similar amount of time (Table 3.4). Figure 3.4. Scatter plot of Precision (x-axis) versus Recall (y-axis) across all datasets and taxa. 3.3.3 Evaluation of computational performance on simulated and synthetic metagenomic datasets We evaluated the running time performance of MetaCompassRS on a Linux 12-core server node with 16 GB of memory using the Salmonella enterica simulated dataset, the Shakya et al. synthetic dataset, and the downsampled Shakya et al. synthetic dataset. The wall clock running time on this synthetic dataset for MetaCompassRS was slightly higher than Kraken2 and considerably lower than MetaPhlAn2 (Table 3.4). MetaPhlAn2 had the lowest memory usage among all taxonomic classifiers, followed by Kraken2. Note that since Kraken2 loads the database into memory, its memory 42 usage is determined by the size of the Kraken2 database. We used the reduced size Kraken2 database (MiniKraken2 database, 8GB) because the default Kraken2 database (29 GB) required more memory than the limit used in our experiments. MetaCompassRS was able to process a 100 million read dataset using the complete MetaCompassRS database in less than 16 minutes without prohibitive memory requirements (13.06GB), highlighting the scalability of this method to large datasets. Table 3.4. Running time for taxonomy classifiers on simulated and synthetic datasets. We evaluated the running time performance of MetaCompassRS and three taxonomic classifiers for the simulated Salmonella enterica sample, the full Shakya et al. sample (100 million paired-end reads), and a 10% of the original Shakya et al. sample (5 million paired-end reads). We used the default MetaCompassRS and MetaPhlSn2 databases and the reduced size Kraken2 database (MiniKraken2 database). All dataset were run using 16 GB of memory and 12 CPUs. Downsampled Salmonella enterica Shakya et al. Shakya et al. Classifier Time Memory Time Memory Time Memory (mm:ss) (Gb) (mm:ss) (Gb) (mm:ss) (Gb) MetaCompassRS 10:21 7.42 15:32 13.06 5:38 12.40 MetaPhlAn 62:14 3.27 177:44 3.05 151:00 2.66 Kraken2 1:03 8.67 1:33 8.22 1:12 8.21 Kraken2+Bracken 1:04 8.67 1:35 8.22 1:13 8.21 3.3.4 Evaluation of performance on CAMI medium dataset To provide a better idea of how MetaCompassRS would perform in a worst-case scenario (the closest genomes contained in the metagenomic sample are not present in the database), we used a medium complexity dataset generated by the benchmarking study CAMI (Critical Assessment of Metagenome Interpretation) [81]. From the two medium complexity datasets generated by CAMI, we used the medium complexity dataset consisting of 132 newly sequenced genomes (not present in public databases) with an insert size of 270bp. We ran MetaCompassRS on the selected medium 43 complexity dataset and compared our species-level performance with the publicly- released results from CAMI. We only included methods that achieved species level taxonomy resolution (FOCUS [114], TIPP [115], MetaPhlAn2 [66] MetaPhyler [116], mOTU [45], Quikr [117], Taxy-pro2 [118], and CommonKmers [119]). Notably, the least precise profiling methods (TIPP, MetaPhyler, and Quikr) had the highest recall, introducing a high false positive rate in their prediction (Figure 3.5 A). MetaPhlAn2 and MetaCompassRS achieved not only the highest precision (fewer false positives) but also the highest F1 score (balance between precision and recall) (Figure 3.5 B). The similar results obtained by MetaPhlAn2 and MetaCompassRS are expected as both tools use marker genes, which are known for being precise at higher taxonomic ranks up to species level. We did not evaluate strain level predictions as CAMI didn?t not include such results in their study. 44 Figure 3.5. Species and strain level classification results on a medium complexity CAMI dataset. (A) Heatmap showing Precision, Recall, and F1 score at species level. (B) Scatter plot of Precision (x-axis) versus Recall (y-axis) across all taxa. 45 3.4 Conclusion and discussion We presented MetaCompassRS, a taxonomic classification method that outperforms previous methods in both species and strain level recall, while maintaining a strong balance between precision and recall. Achieving a high recall at the strain level is ideal if the end goal of the classification is to capture all relevant genomes from a database. MetaCompassRS achieves such results by combining alignment-based and k-mer based approach with a highly comprehensive reference database. MetaCompassRS maintains a competitive running time and memory usage due to its marker gene clustering and k-mer pre-filtering steps. Clustering almost identical marker genes reduces the marker gene database size, and pre-filtering the reads dramatically reduces the query size. Furthermore, the use of Mash Screen adds genomic information beyond marker genes while keeping a low running time and memory usage. Any of the methods presented, including ours, was capable to achieve full strain resolution. In fact, full strain resolution might not be possible by only analyzing short reads due to their short genomic context. Assembling the reads after the metagenomic classification can provide a more complete picture of the microbial community and further improve strain resolution [120?123]. In the following chapters, we describe how to use a set of reference genomes to perform reference-guided metagenomic assembly (Chapter 4 and 5). 46 Chapter 4: Reference-guided metagenomic assembly 4.1 Introduction In the previous chapter, we described a method to infer the microbial genomes present in a metagenomic sample and select closely related genomes. In this chapter, we describe a metagenomic reference-guided assembly approach that uses a set of microbial genomes to reconstruct a metagenome. Several de novo assembly methods have been applied to metagenomic data sets, but very little progress has been made on reference-guided assembly for metagenomics. Reference-guided assembly approaches are commonly used to assist the assembly of short reads when a closely related reference genome is available [19,23]. This process overcomes, in part, the challenge posed by repeats as the entire read provides information about its location in the genome. Currently, thousands of bacterial genomes have been sequenced and finished [50,124]. These genomes are a great resource for performing comparative assembly of metagenomic sequences. However, to date, they have not been used for assembly, primarily due to the tremendous computational cost of aligning metagenomic reads to the entire reference collection of bacterial genomes. In this chapter, we describe our approach for reference-guided assembly of metagenomes. 47 4.2 Related work The reference-guided metagenomic assembly process has two steps: (i) reads are aligned to a set of closely related reference genomes (read mapping step); and (ii) contigs are built from the relative locations of the reads in the reference genomes (consensus calling step). In this section we describe the read mapping step, consensus calling step and also present methods suitable for polishing metagenome assemblies. 4.2.1 Read mapping A fundamental part of reference-guided assembly and many other bioinformatics analyses is the mapping of millions of short reads to reference genomes. A variety of algorithms and tools have been developed for read alignment[125,126]. Currently, the most widely used methods for read mapping can be divided into hash table based algorithms and Burrows-Wheeler Transform (BWT) [63] based algorithms. Hash table methods can index either the genomes or the reads. Some methods for indexing genomes include GSNAP [127], Novoalign [128], mrFAST [129] , mrsFAST [130], and FANGS [131]. Methods for indexing reads include MAQ [132] and RMAP [133]. The most popular read mapping tools rely on the Burrows-Wheeler transform to reduce memory requirements [68,134?136]. Some BWT based read mapping tools are Bowtie [134], Bowtie2 [68], BWA [135], and SOAP2 [136]. Among them, the most widely used tool is Bowtie2, and improved version of Bowtie. Bowtie and Bowtie2 index the reference genome using a FM-index a [64] to maintain a small memory footprint. Bowtie was designed to find ungapped alignments, reporting end-to-end read 48 alignments. Bowtie2 was extended to support local alignment? i.e. doesn?t require end-to-end read alignments. For a single reference sequence, the read mapping problem has mostly been solved by indexing the reference into a data structure that supports efficient pattern search queries [125]. The read mappers described above [68,130,132,134?136] provide different trade-offs between speed and quality of the mapping [126,137]. Read mappers for single genomes are not suited for classification of metagenomic sequences, because they usually use a semi-global alignment model and assume near-identity of read sequences and reference genomes. Some metagenomic- specific mappers have been developed by adding filtration and normalization techniques to previously described single-genome mapping approaches [138,139]. Despite these efforts, metagenomic read mapping remain an open area of research. The standard output format of read mappers is the Sequence Alignment/Map (SAM) format [140]. A SAM file has the information for each individual read mappings, including the read and reference genome identifiers, leftmost mapping position, mapping quality, and the CIGAR (Concise Idiosyncratic Gapped Alignment Report) string. A CIGAR string is a compress representation of an alignment that shows how the reads align to a reference genome. The CIGAR string has key information that can be used for consensus calling, such as matches, mismatches, gaps, deletion and insertion positions. 49 4.2.2 Consensus calling The process of getting a consensus from the bases aligned to a genome is called consensus calling. In metagenomics, we need to find consensus sequences?equivalent to a de novo contigs?for each individual reference genome. The most common approach for single genome consensus calling is using the mpileup and BCFtools utilities from the SAMtools package[141]. First, Mpileup summarize the base call information at each position in the reference genome into a ?pileup of reads?. The pileups of reads are generated by calculating the likelihoods of a base at each genomic position based on depth of coverage. Then, BCFtools call performs variant calling on each pileup of reads. The variant calling process involves identifying difference between the reads and the reference genome?such as single base changes, such as SNPs and indels, or larger scale structural variants. Finally, BCFtools consensus generates consensus calls from pileup of reads using the variant information. There are multiple variant calling methods described in literature that could be used instead of BCFtools call, however, a broader discussion of such methods is beyond the scope of this dissertation. 4.2.3 Assembly polishing Both de novo and reference-guided assemblies may have considerable base errors. Compared to de novo assembly, reference guided assembly has less space for misassembles. However, even small errors can degrade the performance of the reference-guided assembly process. 50 Assembly polishing tools can be used to correct base errors in draft assemblies. State of the art assembly polishing tools are GATK [142], Pilon [143], Racon [144], POLCA [145], and ntEdit [146]. Pilon and GATK are the most well-established polishers and can fix single bases changes, small and large indels, local misassemblies and can also fill gaps in the assembly. RACON and POLCA, part of the MaSuRca assembler [147], are more recent tools aimed to correct assemblies from long reads. Pilon, GATK and RACON work by mapping all reads against the assembly and then re-doing the consensus calling. This read mapping step, although accurate, makes the running time prohibitive to samples with high depth of coverage. POLCA is a little bit faster than Pilon and GATK by calling variants first and then only correcting the variants found in the assembly, thus avoiding remapping the reads. A more recent tool, ntEDIT [148] is a bloom filter k-mer based approach that reduces time significatively compared to the previous described tools. First, ntEdit runs the tool ntHits [149], which removes erroneous k-mers and build a canonical representation of ?coverage-thresholded k-mers? using a bloom filter. Then, ntEdit process contigs by interrogating the bloom filter for presence/absence. If a k-mer presence is confirmed, consecutive k-mers are skipped to avoid repetitive computation. If a k-mer or a part of a k-mer is absent from the reads, that part of the assembly is reported as a misassembly and the contig is polished. Warren et al. reported that ntEdit its faster and makes fewer mistakes than Pilon, its closest competitor. 51 4.3 Methods 4.3.1 Read mapping The reference-guided metagenomic assembly approach involves mapping metagenomics reads to a set of genomes and then using their relative placement within each genome to guide the assembly of each reference. To achieve this task, we use Bowtie2 (parameters: --sam-nohead --sam-nosq --end-to-end --quiet --all -p 12). The output is filtered to keep alignments with the lowest edit distance for each read, allowing a read to be aligned in multiple locations (similar to the best-strata option of Bowtie 1). 4.3.2 Selecting a minimal reference set for consensus calling In metagenomics, the relative placement of the reads within a mixture of genomes is more complex than in a single genome. This process is complicated by the fact that individual reads may map to multiple reference genomes, some of which are highly similar. Adequately dealing with this ambiguity is critical for effective assembly. If all read mappings are kept, allowing a read to be associated with multiple reference genomes, the resulting assembly will be redundant, reconstructing multiple copies of homologous genomic regions (Figure 4.1a). If for each read a random placement is selected from among the multiple equivalent matches, none of the related genomes may recruit enough reads to allow assembly, thereby leading to a fragmented reconstruction (Figure 4.1b). Assigning reads to genomes according to their estimated representation in the sample (determined, e.g., based on the depth of coverage), may bias the 52 reconstruction towards the more divergent reference genomes, which may lead to an overall poorer reconstruction of the genomic regions shared across related genomes (Figure 4.1c). Here we propose a parsimony-driven approach: finding the minimal set of reference genomes that explains all read alignments (Figure 4.1d). Figure 4.1. Aligning read to reference genomes. Shorter bars represent shotgun reads; longer bars represent reference genomes (4 genomes in this figure). Regions with the same color in the reference genomes represent homologous sequences. (a) All read mapping records. A read may be mapped to several reference genomes equally well, e.g., 5 yellow reads are mapped to both of the first two genomes. (b) For each read, if it is mapped to more than one reference genome, we randomly pick one. (c) A read is assigned to a reference with highest depth of coverage. (d) We pick the minimum number of reference genomes, to which all reads can be mapped. 4.3.2.1 Minimum set cover problem. This parsimony-driven approach can be outlined as the set cover problem, an NP-hard optimization problem [151]. An instance (X, F) of the set-covering problem consists of a finite set X and a family F of subsets of X, such that every element of X belongs to at least one subset S in F: ? = ? ? ??? The problem is to find a minimum-size subset ? ?F whose members cover all of X: 53 ? = ? ? ??? We use a greedy approximation algorithm (see Algorithm 1), which iteratively picks the set of genomes using the greatest number of remaining unused reads. The algorithm works as follows. The set U has, at each stage, the set of remaining uncovered elements (uncovered reads). The set C has the cover being constructed (reference genomes that are picked). In the greedy decision-making step (line 4) a subset S of genomes is chosen that covers as many uncovered reads as possible with ties broken randomly. After S is selected, its elements are removed from U, and S is placed into C. When the algorithm ends, the set C has a subfamily of F that covers X with the greatest number of reads. It can be shown that this greedy algorithm is the best-possible polynomial time approximation algorithm for the set cover problem, under plausible complexity assumptions [150]. Algorithm 1: Greedy approximation for minimum set covering problem. Input: a finite set X; a family F of subsets of X. Output: a minimum-size subset ? ?F whose members cover all of X. 1: ? ? ? 2: ? ? ? 3: while ? ? ? do 4: select an ? ? ? that maximizes |? ? ?| 5: ? ? ? ? ? 6: ? ? ? ? {?} 7: return C 4.3.3 Building contigs (consensus calling) In order to apply the minimum set cover concept to metagenome assembly, we developed a consensus caller called Buildcontig. Buildcontig starts assembling the genome with the highest breadth of coverage first. Buildcontig evaluates the bases from the reads that are mapped to each position in the reference genomes and reports the 54 genome with the highest depth of coverage as the consensus. Buildcontig can introduce indels up to a threshold. To introduce an indel, its depth of coverage should be higher than half of that of its neighbor nucleotides (Figure 4.2). Nucleotides from a reference sequence that don?t match any base from the reads are discarded from the consensus sequence. This guarantee that the consensus sequence is not overly biased against the reference. Figure 4.2. Creating contigs from reads that are mapped to reference genome using the majority rule. Nucleotides that differ from the reference sequences are highlighted in red. Buildcontig received two inputs: a SAM file with the read alignments, and a file with reference genomes. The minimum depth of coverage and minimum length for creating contigs can be specified through the program command-line options. Finally, to remove reference-bias, we employ ntEdit (v1.18) to modify the consensus sequence to better represent the input data rather than the reference genome. In this step, contigs can be broken if the metagenomic sequence diverges from the reference sequence. 55 4.4 Results We evaluated the performance of our reference-guided approach by using reads from a synthetic microbial community, which consists of a set of metagenomic reads from ground truth genomes [113]. After aligning the synthetic reads to the reference genomes, we generated consensus sequences (or assemblies) with Buildcontig under two settings. We first assembled the synthetic metagenomic skipping the minimum set cover algorithm and using all read mappings to guide the assembly (see Buildcontig_all results, Table 4.1). The aim of this experiment is to show that the performance of Buildcontig can be undermined by multi-mapped reads. Secondly, we ran Buildcontig including the minimum set cover algorithm (see Buildcontig results, Table 4.1). For both experiments, we set the minimum depth of coverage at 1-fold. We also assembled the reads using Samtools. For all experiments, we performed error correction with ntEdit (see ?+ntEdit? results, Table 4.1). When analyzing assembly statistics without reference genomes (Table 4.1), we observed that Buildcontig_all performed better than Buildcontig and Samtools in terms of contiguity (maximum contig size and size to 1 Mbp, Table 4.1) and completeness (total assembly size(bp) and # genes, Table 4.1). However, the higher contiguity and completeness of Buildcontig_all were hampered by the highest duplication and error rates (Table 4.2). Buildcontig produced the lowest duplication ratio (1.0), which indicates that Buildcontig was the only tool without a redundant assembly. Buildcontig_all had the highest duplication ratio (2.1), which indicates that Buildcontig_all generated a highly redundant assembly. In terms of assembly errors, Buildcontig produced the fewest misassemblies, mismatches, and indels. 56 For all assemblers, we observed a decrease in the total number of contigs shorter than 500bp after running ntEdit (see #contigs(<=0bp) and #contigs(<=500bp), Table 4.1), indicating that short erroneous contigs were effectively removed from the assemblies. 57 Table 4.1. Evaluation of performance on synthetic dataset without using reference genomes. Tool indicates the consensus calling method: Buildcontig, Buildcontig_all and Samtools. Buildcontig indicates the default settings the minimum set coverage setting was used, and Buildcontig_all indicates that all read mapping were used (no minimum set coverage setting).?+ntEdit? indicates that ntEdit was run over the . # ctgs is the total number of assembled contigs reported by each assembler, Total assembly size is the total assembled length per assembler, Max ctg is the maximum contig length (broken at errors) for all assembled contigs, Size to 1 Mbp is the size of the largest contig C such that the sum of all contigs larger than C exceeds 1Mbp, #Genes is the number of fully reconstructed genes. Total Max # # assembly contig Size to Contigs Contigs size size 1Mbp Tool (>=0bp) (>=500bp) (bp) (bp) (Kbp) # Genes Buildcontig 54,207 13,727 187,980,023 7,057,101 7,057.10 179,428 Buildcontig+ntEdit 52,954 13,727 187,980,311 7,057,103 7,057.10 179,376 Buildcontig_all 821,792 139,413 383,249,716 7,145,578 7,145.58 281,753 Buildcontig_all+ntEdit 806,658 139,412 383,252,412 7,145,577 7,145.58 280,262 Samtools 815,862 63,570 377,636,570 7,057,100 7,057.10 242,306 Samtools+ntEdit 793,884 63,572 375,724,066 7,057,099 7,057.10 241,692 Table 4.2. Evaluation of performance on synthetic dataset using reference genomes. Total aligned Length is the sum of the length of contigs aligned to the reference genomes, Total unaligned Length is the sum of the length of unaligned contigs, Genome fraction(%) is the total number of aligned bases in the references divided by genome size, and Duplication ratio(%) is the total number of aligned bases in the assembly divided by the total number of aligned bases in the reference. The last five statistics are reference-based errors reported by MetaQUAST. # # # Local Total Fully Genome # Indels Mis- Misassm # aligned unaligned fraction Duplication Mismatches (/100 assemblies (<1 Total Tool length length (%) ratio (%) (/100 kbp) kbp) (>1 Mbp) Mbp) Misassm Buildcontig 185,773,672 1,364,126 89.366 1.00 137.74 3.53 242 155 357 Buildcontig+ntEdit 185,774,130 1,363,100 89.367 1.00 136.91 3.55 242 156 358 Buildcontig_all 377,259,399 3,834,531 92.211 2.10 395.55 15.36 1581 780 2361 Buildcontig_all+ntEdit 377,267,056 3,833,765 92.210 2.10 391.16 13.29 1586 781 2367 Samtools 250,165,735 3,197,557 89.548 1.45 138.33 5.21 512 301 813 Samtools+ntEdit 250,161,490 3,200,782 89.547 1.45 141.25 4.90 512 301 813 58 4.5 Conclusion and future directions In this chapter, we first described concepts relevant to the reference-guided metagenomic assembly problem. We introduced the concept of read mapping in the context of both single genome and metagenomes, highlighting the most widely used indexing data structures for read mapping?Hash Tables and the Burrows-Wheeler Transform (BWT). Next, we briefly described the consensus calling process for single genomes. Lastly, we explained how assembly polishing can boost the correctness of the final assembly. Secondly, we presented our reference-guided metagenomic assembly strategy. Our strategy starts by aligning a set of metagenomic reads to reference genomes using Bowtie2. Then, our consensus caller Buildcontig applies the minimum set cover algorithm to select a minimal reference set. After calling the consensus, we use the error correction tool ntEdit to polish the assembly and remove reference-bias. We showed that our reference-guided metagenome assembly strategy outperforms previous methods in terms of reference-free and reference-based assembly statistics. Finally, we showed that our assembly strategy generates non-redundant assemblies (low duplication ratio) while maintaining a high genome recovery. Our reference guided assembly method could be further improved by adopting different read mapping and consensus calling strategies. As previously mentioned, Bowtie2 was not designed for metagenomics. As the number of available bacterial genomes increases, mapping reads with Bowie2 will get increasingly difficult. A more suitable mapping strategy for metagenomics would be a graph-based approach. In 59 particular, the use of de Bruijn graphs for pan-genome analysis is well-suited for the tasks of compressing genomes and mapping reads, as described in previous work [152? 155]. The most recent tools designed to align reads to de Bruijn graphs are Puffaligner, part of Pufferfish [156], and an extension of the pangenomic suit PanTools [157]. Our current assembly algorithm uses the minimum set cover algorithm, a winner-take-all strategy to minimize redundancy. When multiple closely related species co-exist, one will be well-assembled and the other species assemblies will be shattered into small contigs. We want to explore strategies for re-distributing multi- mapped reads across all aligned locations, resulting in a ?resolve strains? mode. One strategy could be probabilistic assignment to pick the best strains given a species. This problem is similar to estimating differential abundance of transcript isoforms in RNA- sequencing data. Several methods for estimating differential abundance analysis employ the EM algorithm, which has also been successfully applied to metagenomics datasets [74]. 60 Chapter 5: Hybrid reference-guided and de novo assembly of metagenomes 5.1 Introduction In this chapter, we present MetaCompass, a metagenomic assembly approach that combines reference-guided and de novo assembly. MetaCompass selects reference genomes using MetaCompassRS (Chapter 3), and then follows the reference-guided assembly method described in Chapters 4 to reconstruct a metagenomic sample. Finally, to reconstruct genomes missing from our database, MetaCompass incorporates a de novo assembly step. 5.2 Method MetaCompass is divided into five steps (Figure 5.1): (i) selecting reference genomes, (ii) reference guided assembly, (iii) removing reference bias, (iv) de novo assembly, and (v) combining reference-guided and de novo assembly. First, we use the taxonomic classifier method MetaCompassRS to find the reference genomes most closely related to the input metagenomic sample. In the reference-guided assembly process reads are mapped to the selected genomes using Bowtie2, and then the consensus calling is performed with Buildcontig. After consensus calling, we rely on ntEdit to correct the contigs and avoid biasing the reconstruction towards the reference sequences. Finally, the reads that were not included in the reference-guided process outlined above are de novo assembled using 61 MEGAHIT [90] (v1.0.6). We chose MEGAHIT because it is the fastest and lowest- memory metagenomic assembler available, and it was shown to perform excellent in recovering the genomes of closely related strains [87]. Finally, we combined reference- guided contigs and assembly contigs. This hybrid approach allows the final assembly to capture microbes with closest reference genomes available and microbes that are missing from our reference database (such as novel variants). Figure 5.1. Overview of the MetaCompass pipeline. Short colored lines represent reads and long lines genomes. Each color represents a different genome from a metagenomic sample. 1a-1c are part of the taxonomy classifier MetaCompassRS. 2a and 2b are part of the reference-guided assembly step. 62 5.2.1 Datasets used to evaluate metagenomic assemblies 5.2.1.1 Synthetic dataset As described in Chapter 3, the synthetic microbial community published by Shakya et al. [113] contains 64 genomes. The set of known genomes for the synthetic dataset is available in the Supplementary Table 2 from Shakya et al [113]. The synthetic sample was downloaded from the NCBI Short Read Archive (SRA) database, (SRR606249) and has 54 bacterial and 10 archaeal strains from, representing a total of 61 species. Among these organisms, 55 had complete genome sequences in the NCBI RefSeq database (the database used by default by MetaCompass), and 9 were available only as a high-quality draft assembly. 5.2.1.2 HMP2 dataset The Human Microbiome Project (HMP) is a collection of organisms living in association with the human body. The HMP has more than two thousand samples from different body sites sequenced and assembled. A list of all available HMP samples was obtained by from the HMP Data Analysis and Coordination Center (DACC) (www.hmpdacc.org). Some samples were excluded from the downloaded set because they were corrupt or extracted to a duplicate SRS identifier. Additional samples had no references recruited and were excluded from further analysis. A total of 2,294 samples had both an HMP2 assembly and a MetaCompass assembly and were used for the analysis. 63 5.2.2 Parameters used for metagenome assembly and metagenome assembly validation 5.2.2.1 Metagenomic assembly parameters We compared MetaCompass with the de novo assemblers IDBA-UD (July 2016) [88], MEGAHIT (v1.0.6) [90], and MetaSPAdes (v3.9.0) [9]. IDBA-UD requires a single fasta file that was generated using the IDBA ?fq2fa --merge --filter? command. MEGAHIT was run using the options ?--presets meta-sensitive --min-count 3 --min- contig-len 300 -t 12?. MetaSPAdes was run using the options ?--meta -t 12?, then all contigs shorter than 300nt and with less than 3X coverage were removed. IDBA-UD was run using the options ?--min_count 3 --min_contig 100 --mink 20 --maxk 100 -- num_threads 12?. MetaCompass was run using the options -m [1,2,3] -g 100 -t 16? on the synthetic dataset and ?-m 3 -g 100 -t 16? on the HMP2 samples. 5.2.2.2 Metagenomic assembly validation parameters We used MetaQUAST, a reference-based metagenomic assembly validation method that finds misassemblies and structural variants in an assembly relative to reference genomes. The command used to run MetaQUAST on the Shakya et al. synthetic dataset was: ?metaquast.py -R ./shakya_references --fragmented --gene-finding?. 5.3 Results Although real metagenomic reads are the most proper test of performance, it is not possible to assess accuracy from such data because true species in metagenomic datasets are unknown. We first evaluated the performance of MetaCompass using 64 synthetic datasets. Since the true genome sequences are known, these data are ideal as they allow us to fully quantify the quality of the genomic reconstruction. Additionally, we generated improved assemblies of almost the entire dataset generated by the Human Microbiome Project (2,294 distinct samples in total), and use the results to characterize the relative advantages and limitations of de novo and reference-guided assembly approaches, thereby providing guidance on analytical strategies for characterizing the human-associated microbiota. 5.3.1 Evaluation of performance on synthetic metagenomic dataset We valuated MetaCompass by assembling a synthetic microbial community [113]. We assembled this synthetic metagenomic with MetaCompass under two settings. We first assembled the synthetic metagenome skipping the reference selection step and using the exact genomes present in the sample as a reference to guide the assembly. The aim of this experiment is to show that the performance of MetaCompass can be excellent if the reference collection has genomes highly similar to those in the metagenomic sample being assembled. Secondly, we ran the complete MetaCompass pipeline including both the reference selection step and reference-guided assembly. We set the minimum depth of coverage in MetaCompass at 1-fold and 2-fold for both experiments. The assembly results of our fist experiment (Table 5.1, see MetaCompass 1X and 2X) can be considered an approximate upper bound on the performance of any assembly tool, as in this case almost all of the genomes recruited (90%) were exactly those from which the metagenomic reads were obtained. We compared the performance of the two rans of MetaCompass with that of three widely used de novo 65 assemblers: IDBA-UD, MEGAHIT, and metaSPAdes. Compared with these assemblers, MetaCompass achieved higher genome recovery (Table 5.1, Figure 5.2) and produced significantly larger and more accurate contigs (Table 5.1). When we decreased the MetaCompass minimum coverage threshold from 2-fold to 1-fold, we observed gains in maximum contig size and total aligned length, while retaining a similar error profile. 66 Table 5.1. Evaluation of performance on synthetic dataset. MetaCompass (X) indicates the minimum coverage setting (1X or 2X), and MetaCompass.nr indicates all 64 reference genomes comprising the Shakya et al. dataset were removed from the database. # ctgs is the total number of assembled contigs reported by each assembler, Max ctg is the maximum contig length for all assembled contigs, Gen. Rec. (%) is the median percentage of each of the synthetic genomes that is recovered, Complete Marker Genes (median) is the median number of fully reconstructed marker genes, Total aligned length is the sum of the length of contigs aligned to the reference genomes, Total unaligned length is the sum of the length of unaligned contigs. Mismatches, Indels, and Misassemblies (Misassm) are error statistics generated with MetaQUAST. Complete Total Gen. Total Total Indels Misassm Misassm marker Mismatches Misassm Assembler #contigs Max Ctg Rec. aligned unaligned (/100 (>1 (<1 genes (/100 kbp) (<1 (%) length length kbp) Mbp) Mbp) (median) Mbp) MetaCompass (1X) 18,766 7,057,109 100 40 198,113,036 6,340,278 61.9 1.9 0.8 1.1 1.9 MetaCompass (2X) 23,648 5,841,107 100 40 195,836,655 6,198,040 63.1 1.8 0.9 1.1 2.0 MetaCompass.nr (2X) 42,852 1,151,857 98 40 195,225,556 6,338,183 89.9 3.6 3.3 1.6 4.9 IDBA-UD 22,355 991,792 98 39 186,777,879 6,186,424 98.6 3.5 5.3 1.0 6.3 MEGAHIT 35,351 1,151,857 99 40 195,334,581 6,263,018 66.5 2.8 1.5 1.0 2.5 metaSPAdes 21,424 1,438,235 99 40 192,795,050 6,208,276 97.1 3.7 1.3 1.0 2.3 67 Figure 5.2. Genome recovery percentages in synthetic metagenome (MetaCompass versus de novo assembly). Box plots represent distribution of genome recovery percentages (for the 64 genomes present in the synthetic metagenome). x-axis indicates the assembly method, either IDBA-UD, metaSPAdes, MEGAHIT, or MetaCompass. MetaCompass was run both with the reference genomes present in the database (recruited as described in the methods) and without the truth reference genomes in the database (they were individually removed). y-axis indicates the genome recovery percentage, 0% indicates the genome was unassembled, whereas 100% indicates the genome was fully assembled. 5.3.1.1 References removed from database To provide a better idea of how MetaCompass would perform in a worst-case scenario, we removed from the database the genomes represented in the synthetic community (Appendix A), thereby forcing MetaCompass to recruit near-neighbor reference genomes, when available. (see ?MetaCompass.nr? row, Table 5.1). In this case, we found that MetaCompass still performed almost as well as de novo assemblers while making far fewer errors than if it simply mimicked the reference genome. Median genome recovery for MetaCompass is just 1% less than that of de novo assemblers. The 68 accuracy of the reconstruction, as measured by mismatch and indel rates, is lower than that of IDBA-UD and metaSPAdes (Table 5.1, MetaCompass.nr (2x)), while moderately higher than MEGAHIT. The number of misassemblies and local misassemblies per 1 Mbp of assembled sequence (as reported by MetaQUAST [158]) increased from 2.0 to 4.9 when reducing the coverage threshold to 1. To put this increase into context, we measured the total number of possible errors by evaluating the "accuracy" of the near-neighbor reference genomes recruited by MetaCompass with respect to the correct reference sequence (Figure 4 see hashed blue bar). This allows us to capture the real differences between the recruited reference genomes and the actual genome represented in the synthetic dataset [113], providing an upper bound on the number of errors MetaCompass could make if it simply recapitulated the sequence of the selected reference genomes. As seen in Figure 5.3, the MetaCompass assembly is much closer to the correct genome than the reference sequence. 69 Figure 5.3. Error profile on synthetic dataset. The hashed blue bar represents the difference between the second-best reference genome (recruited by MetaCompass) and the true genome represented in the sample. This bar can be viewed as an upper bound on the errors metacompass.nr could make if it simply reconstructed the reference genome. Mismatches are the number of bases in a contig that differ from the reference genome. Misassemblies include large-scale (left flanking region aligns >1 kbp away from right flanking region) relocations, interspecies relocations, translocations, and inversions. Local misassemblies include small-scale (left flanking region aligns <=1 kbp away from right flanking region) translocations and inversions. All errors are normalized to represent rates per 1 Mbp. 5.3.1.2 Evaluation of performance on down sampled synthetic metagenomic dataset To evaluate the ability of MetaCompass to assemble low-coverage genomes, we downsampled the synthetic dataset to just 5 million paired-end reads, or 10% of the original data set. After downsampling, the average coverage was reduced to approximately 3-fold. The results (Table 5.2, Figure 5.4) highlight that MetaCompass can recover a median of 90% of each of the 64 genomes in the sample. While metaSPAdes comes in second place and is able to recover 80% (median recovery), it does so at the cost of four times higher misassembly rate (Table 5.2). The two remaining methods, MEGAHIT and IDBA-UD leave a quarter to a half of the genomes unassembled and also produce higher misassembly rates 70 assemblers: IDBA-UD, MEGAHIT, and metaSPAdes. Compared with these assemblers, MetaCompass achieved higher genome recovery (Table 5.2, Figure 5.2) and produced significantly larger and more accurate contigs (Table 5.2). When we decreased the MetaCompass minimum coverage threshold from 2-fold to 1-fold, we observed gains in maximum contig size and total aligned length, while retaining a similar error profile. 71 Table 5.2. Evaluation of performance on down-sampled synthetic dataset. The synthetic dataset was down-sampled to only contain 10% of the total reads. # ctgs is the total number of assembled contigs reported by each assembler, Max ctg is the maximum contig length for all assembled contigs, Median Genome Recovery (%) is the median percentage of each of the synthetic genomes that is recovered, Complete Marker Genes (median) is the median number of fully reconstructed marker genes, Total aligned Length is the sum of the length of contigs aligned to the reference genomes, Total unaligned Length is the sum of the length of unaligned contigs. Mismatches, Indels, and Misassemblies (Misassm) are error statistics generated with MetaQUAST. Median Complete Total Total Total Max Genome Marker Mismatches Indels Misassm Misassm Assembler #contigs aligned unaligned Misassm (<1 Ctg Recovery Genes (/100kbp) (/100kbp) (>1 kbp) (<1 kbp) length length kbp) (%) (median) MetaCompass 71457 962,929 90% 22 134,008,055 3,009,931 117.6 1.9 112 33 145 IDBA-UD 43973 120159 45% 6 75,970,693 1,564,008 175.0 5.3 3447 93 3540 MEGAHIT 62842 209,706 76% 15 105,665,678 2,774,432 128.0 4.1 772 122 894 metaSPAdes 67138 287,554 80% 16 111,636,826 3,154,199 133.0 4.3 470 115 585 72 5.3.1.3 Computational performance When dealing with large-scale data sets, the total required memory and running time are important factors in determining the applicability of a computational tool. We first evaluated the running time performance of MetaCompass on a Linux 12-core server node with 80 GB of memory using the Shakya et al. synthetic dataset. The wall clock running time on this synthetic dataset for MetaCompass is comparable to the evaluated de novo assemblers and sometimes lower (Table 5.4). MetaCompass and Megahit were the only approaches that required less than 16GB of RAM on a 100 million read dataset, highlighting the scalability of this methods to large datasets. Table 5.3. Running time for assemblers on Shakya et al. sample. We evaluated the running time performance of MetaCompass and three de novo assemblers for the full Shakya et al. sample (100 million paired-end reads) and a 10% of the original data set (5 million paired-end reads). The full dataset was run using 80 GB of memory and 12 CPUs and the down-sampled dataset using 36GB of memory and 4 CPUs. Shakya et al. Downsampled Shakya et al. Assembler Time (hh:mm) Memory (Gb) Time (mm:ss) Memory (Gb) MetaCompass 3:53 19.82 3:35 10.34 IDBA-UD 3:53 16.78 2:42 7.39 MEGAHIT 2:26 8.61 2:03 2.35 metaSPAdes 6:02 28.07 8:25 19.63 73 Figure 5.4. MetaCompass performance on low coverage dataset. Results obtained by down-sampling the Shakya et al. synthetic genome to just 10% of the original set of reads. The 64 genomes present in the sample are ordered per assembler by percent recovery, from lowest to highest. The y-axis indicates how much of the n-th reference was covered by correctly assembled contigs (can range from 0% to 100%). The colored dashed lines indicate the median percent recovery for each assembler. 5.3.2 Evaluation of performance on Human Microbiome Project (HMP2) 5.3.2.1 Reassembly of the data generated by the Human Microbiome Project (HMP2) To further explore the benefits and limits of comparative approaches for metagenomic assembly, we re-analyzed with MetaCompass 2,294 metagenomic samples from the HMP Project. These samples cover 15 body sites from four broad regions of the human body: oral, skin, stool, and vaginal. We compared the assemblies produced by MetaCompass with the assemblies reported by the HMP project [159]. Across all samples, on average, MetaCompass outperforms the HMP2 de novo approach, leading to an overall better assembly of the original data (Table 5.4, Figure 10). 74 The relative performance of the MetaCompass and HMP2 assemblies varied across body sites due to the specific characteristics of the microbial communities being reconstructed. While MetaCompass generates more assembled sequence and complete marker genes across all body sites, the maximum contig size and size at 1 Mbp metrics vary per body site. In oral and stool samples (Figure 5.5), MetaCompass outperforms de novo assembly for all metrics. In skin and vaginal samples (Figure 5.5), the de novo (HMP2) assemblies have better contiguity statistics but MetaCompass assembles more complete marker genes. To gain further insight into these results we calculated the average nucleotide identity of the de novo assembled contigs to the recruited reference genomes for each body site. In all body sites, except for oral, the assembled contigs had 99% average nucleotide identity to the reference genomes. In the oral samples, the most distant reference genomes had only 97% identity to the assembled contigs. To further explore the drop-in contiguity in skin and vaginal samples, we focused on just the contigs that mapped to bacterial genomes contained in the reference database, allowing for a direct comparison between MetaCompass and de novo contigs. The results in Table 5.4 show that for this set of contigs, MetaCompass outperforms the de novo approach for the vaginal samples. However, the de novo HMP2 assembly of the skin sample is still better in terms of complete genes recovered, but equivalent to MetaCompass with respect to complete marker genes recovered (a measure of assembly completeness). 75 Table 5.4. Re-assembly of 2,294 samples generated in the Human Microbiome Project. The results are aggregated by body site. # indicates the total reads per sample, Avg cvg per sample (X) is the mean estimate read coverage calculated based on the de novo assembly of each sample and body site, Shannon Entropy (median) is the Shannon diversity value per body site as reported in Li et al. [160]. The rows labeled MC contain results obtained with MetaCompass. The rows labeled HMP2 show the statistics for contigs from the production HMP2 assembly. Total Size (Mbp) is the total assembly size for each method, Max ctg size (kbp) is the size of the largest contig, Median Size 1Mbp (kbp) represents the median size of the largest contig C such that the sum of all contigs larger than C exceeds 1Mbp. Median Complete Genes represents the median number of complete genes per sample. Median Marker Genes indicates the median number of complete marker genes per sample. Median HMP2 Avg Shanon Total Max Median Median Num of size 1 body cvg per Entropy Asm size ctg size complete marker samples Mbp site sample (median) (Mbp) (kbp) genes genes (kbp) Oral 1259 20.0 2.4 HMP2 106,693 546.4 70.8 54,1 762 ?8.1 MC 135,586 892.3 95.8 63,144 915 Skin 291 17.4 1.5 HMP2 2,944 890.7 36.5 4,654 78 ?4.7 MC 3,782 2,159.3 15.1 5,01 79 Stool 524 18.4 2.6 HMP2 56,573 592.8 109.1 84,193 847 ?4.9 MC 66,838 3,301.0 230.9 94,297 1,043 Vagina 220 7.8 0.2 HMP2 1,179 465.8 28.7 2,539 45 ?4.5 MC 1,458 558.0 16.1 2,934 60 A ll 2294 18.2 1.9 HMP2 184,518 890.7 79.0 48,836 633 ? 5.6 MC 232,161 3,301.0 114.6 57,639 764 76 Figure 5.5. Comparative assembly of 2,294 metagenomic samples from the HMP2 Project. The bean plots represent the distribution of assembly contiguity and completeness statistics across all samples within the data. The x axis organizes the data by assembly and body site. The y-axis indicates the statistic used to evaluate the assembly contiguity or completeness. The top panel shows total assembly size, the second panel shows maximum contig size, the third panel shows the size of the contig at 1 Mbp, and the bottom panel shows the complete marker genes assembled per sample. 77 5.3.2.2 Comparing reference-guided to de novo assembly on low-coverage HMP2 samples To assess the ability of MetaCompass to assemble low-abundance organisms, we focused on all skin HMP2 samples. The skin samples had the second lowest average number of reads while still containing reasonable diversity and richness, as reported in Table 5.4. We removed the contigs assembled via de novo assembly from the MetaCompass output, collected the reference genomes that were used, mapped the HMP2 contigs to these reference genomes, and then evaluated the number of complete genes and complete marker genes. Compared to the HMP2 assembly, reference-guided assembly of these low coverage samples is able to reconstruct approximately 10% more marker genes (4,423 versus 3,915) than the de novo approach, roughly equating to 10 additional complete bacterial genomes. We next searched for microbes that were present in the skin samples at relatively low coverage and explored the differences between the reconstructions generated by the HMP2 project and MetaCompass. Specifically, we identified the low coverage assembly of a Propionibacterium acnes genome reconstructed by both MetaCompass and the HMP in sample SRS057083. The HMP2 assembly covers less than 40% of the closest reference genome (NC_016516.1, Propionibacterium acnes TypeIA2 P.acn33), while the MetaCompass assembly covers more than 90% of the same genome. 78 5.3.2.3 Comparing reference-guided to de novo assembly on high coverage HMP2 samples To assess the ability of MetaCompass to assemble high-abundance organisms, we focused on all stool HMP2 samples. The stool assemblies had the longest maximum contig and median size to 1Mbp, as reported in Table 5.4. We searched for microbes with the best assembly among all stool samples (NZ_CP012801, Bacteroides cellulosilyticus WH2, HMP2 sample SRS143342), and explored the differences between the reconstructions generated by de novo assemblers and MetaCompass. We next collected the reference genomes that were used by MetaCompass and mapped both de novo and reference-guided assemblies to these reference genomes. The Bacteroides cellulosilyticus WH2 genome was recovered by all assemblies with more than 70% of genome recovery. As show in Figure 5.6, all tools reconstructed a fragmented assembly towards the beginning of the genome, were more sequencing errors are usually found. Overall, after the initial fragmented contigs, MetaCompass assembled ten long contigs with length ranging from 0.5 to 2.28MBp. The longest MetaCompass contig covers 0.32% of the Bacteroides cellulosilyticus WH2 genome (Figure 5.7) and aligned almost perfectly to the reference genome (2 mismatches). In contrast, MetaSPAdes, Megahit and IDBA-UD reconstructed an extremely fragmented assembly with many misassembled contigs. To further investigate how the reads were distributed across both the reference genome and contigs, we mapped both reads and contigs to the genome with Bowtie2 and Minimap2 [161], respectively (Figure 5.8). Although the read mapping visualization shows a relatively even depth of coverage, de novo assemblers were unable to reconstruct a contiguous assembly. Conversely, MetaCompass reconstructed the full 79 segment of the genome. MEGAHIT was the second-best assembler, almost reconstructing the full segment. Figure 5.6. Icarus view of metagenomic assembly of the stool sample SRS143342 from the HMP2 Project. The contigs largest than 1000bp from MetaCompass, MetaSPAdes, Megahit, and IDBA were aligned to the Bacteroides cellulosilyticus WH2 genome (NZ_CP012801, 7084828 bp). Colors indicate how well the contigs aligned to the reference. Green represent correct contigs, red misassembled contigs, purple ambiguously mapped contigs, and gray unaligned contigs. Figure 5.7. Longest contig from Bacteroides cellulosilyticus strain WH2 chromosome genome assembly (accession:NZ_CP012801.1, length: 7084828 bp). The length of contig NCP012801.1_102 is almost 2.28 Mbp, covering 0.32% of the complete genome. 80 Figure 5.8. IGV visualization of read and contig mapped against a segment of the Bacteroides cellulosilyticus WH2 genome (accession:NZ_CP012801.1, length: 7084828 bp). MetaCompass reconstructed the full segment of the genome. MEGAHIT almost reconstructed the full segment. MetaSPAdes and IDBA-UD had the biggest assembly gap. 5.4 Conclusion and discussion We have described MetaCompass, a comparative metagenome assembly method that relies on an indexing strategy to construct sample-specific reference collections. We show that comparative and de novo assemblies provide complementary strengths, and that combining both approaches effectively improves the overall assembly, providing a consistent increase in the quality of the assembly. Even when distant reference genomes are recruited, we remain competitive with de novo genome assembly methods. We accomplish this via two critical steps. First, we avoid reference bias by constructing 81 the consensus sequence from the reads within the sample, using the reference genome as just a guide, and we break the assembly where the reads indicate a structural disagreement with the reference. Second, we use unmapped reads in a de novo assembly process to reconstruct the sections of the metagenomic sample that are not similar to known reference genomes. We have shown MetaCompass to be particularly effective in the assembly of low coverage or rare microbes, a setting in which de novo assembly approaches simply cannot be used with good results. Improved assembly of low-abundance, rare microbes from existing datasets has the potential to provide additional resolution in complex microbial communities or clinical samples where the host DNA comprises a large fraction of the data. Finally, we have shown that in high- abundance genomes, MetaCompass is more effective that de novo in generating complete and contiguous assemblies. The benefit of comparative assembly is highly dependent on the reference genomes available in the database provided to MetaCompass. While MetaCompass can effectively use reference genomes that are distantly related to the genomes being assembled, the quality of the reconstruction is lower than can be achieved with closely related reference sequences. Many bacteria found in the human microbiota are difficult to culture (e.g., the many anaerobes inhabiting the human intestinal tract) and are, therefore, under-represented in public databases. Despite this fact, MetaCompass was able to improve, often significantly, upon the assembly of the data generated by the Human Microbiome Project 2. However, the contiguity of MetaCompass on skin samples was not improved upon the assemblies generated by HMP2. This could be due to the structural genome dynamics of bacterial defense systems commonly found in the 82 skin microbe [162?164]. Future work will focus on elucidating the effect of each of these factors via assembly graph-based approaches. In addition, as the number of genomes in public databases is increasing, comparative approaches such as ours will be increasingly valuable for reconstructing near-complete genome sequences from metagenomic data. 83 Chapter 6: Conclusion Metagenomic assembly, the process of reconstructing large genomic segments from metagenomic reads, is a formidable computational challenge. Even for single organisms, the assembly of genome sequences from next-generation sequencing (NGS) reads is a complex task, primarily due to ambiguities in the reconstruction that are caused by genomic repeats. In addition, metagenomic assemblers must be tolerant of non-uniform representation of genomes in a sample as well as of the genomic variants between the sequences of closely related organisms. Despite advances in metagenomic assembly algorithms over the past years, the computational difficulty of the assembly process remains high and the quality of the resulting assemblies requires improvement. The reference-guided assembly paradigm has been shown to outperform the de novo assembly paradigm under certain settings, yet, the former has not been extensively explored. In this dissertation, we designed methods to address the reference-guided metagenomic assembly problem. This problem consists of two subproblems: selecting closely related genomes to guide the assembly and reconstructing each genome individually. To address the first subproblem, we developed MetaCompassRS, a taxonomy classification approach that is able to retrieve the closest reference genomes available in a database that are contained in a metagenomic sample. We showed that MetaCompassRS achieves higher recall than state of the art taxonomy classification tools, while maintaining a competitive running time. 84 The second subproblem is further subdivided into read mapping and consensus calling. We used Bowtie2?the most widely used short read mapper?for the former task and developed an approached inspired on the minimum set cover problem for the latter task. We implemented the minimum set cover algorithm in our tool Buildcontig and showed its efficiency and effectivity in reducing the redundancy of metagenome assemblies. Finally, we developed MetaCompass, a metagenomic assembly pipeline that encompass MetaCompassRS, Buildcontig and de novo assembly to reconstruct a metagenomic sample. When combined with de novo assembly approaches, we showed that reference-guided assembly is able to generate more complete assemblies than the ones obtained by the de novo assembly alone. We also showed that MetaCompass performs better than the state of the art methods in real world datasets?such as the ones gather by the HMP. We believe that reference-guided metagenomic assembly approaches, and with MetaCompass being one of the first ones reported in the literature, will increasingly replace the more computationally expensive and error-prone de novo assembly approaches as the collection of available reference genome sequences increases. Furthermore, reference-guided assembly provides new opportunities for the development of both clinical and computational applications. Clinical applications are a particularly relevant application domain for reference-guided approaches because the vast majority of publicly available genome sequences comprises human pathogens. Computational methods capable of handling a large amount of metagenomic sequencing data are an active area of research. One of the most promising strategies to 85 handle metagenomics reference collections is using pangenome graphs, which we plan to further explore in the future. 86 Appendices Appendix A. References removed from database used by MetaCompass. Species name Acidobacterium capsulatum Aciduliprofundum boonei Akkermansia muciniphila Archaeoglobus fulgidus Bacteroides thetaiotaomicron Bacteroides vulgatus Bordetella bronchiseptica Burkholderia xenovorans LB400 Caldicellulosiruptor bescii Caldicellulosiruptor saccharolyticus Chlorobium limicola Chlorobium phaeobacteroides Chlorobium phaeovibrioides Chlorobium tepidum Chloroflexus aurantiacus J-10-fl Clostridium thermocellum Deinococcus radiodurans R1 Desulfovibrio piger Desulfovibrio vulgaris DP4 Dictyoglomus turgidum Enterococcus faecalis Fusobacterium nucleatum nucleatum Gemmatimonas aurantiaca Geobacter sulfurreducens PCA Haloferax volcanii Herpetosiphon aurantiacus Hydrogenobaculum sp. Y04AAS1 Ignicoccus hospitalis Leptothrix cholodnii Methanocaldococcus jannaschii Methanococcus maripaludis C5 Methanococcus maripaludis S2 Methanopyrus kandleri 87 Methanosarcina acetivorans C2A Nanoarchaeum equitans Nitrosomonas europaea Nostoc sp. PCC 7120 Pelodictyon phaeoclathratiforme Persephonella marina EX-H1 Porphyromonas gingivalis Pyrobaculum aerophilum IM2 Pyrobaculum arsenaticum Pyrobaculum calidifontis Pyrococcus furiosus Pyrococcus horikoshii Rhodopirellula baltica Ruegeria pomeroyi Salinispora arenicola Salinispora tropica Shewanella baltica OS185 Shewanella baltica OS223 Sulfitobacter sp. EE-36 Sulfitobacter sp. NAS-14.1 Sulfolobus tokodaii Sulfurihydrogenibium sp. YO3AOP1 Sulfurihydrogenibium yellowstonense SS-5 Thermoanaerobacter pseudethanolicus Thermotoga neapolitana DSM 4359 Thermotoga petrophila RKU-1 Thermotoga sp. RQ2 Thermus thermophilus HB8 Treponema denticola Wolinella succinogenes Zymomonas mobilis 88 Bibliography 1. Hooper L V. Commensal Host-Bacterial Relationships in the Gut. Science (80-. ). [Internet]. 2001 [cited 2017 Mar 23];292:1115?8. Available from: http://www.sciencemag.org/cgi/doi/10.1126/science.1058709 2. Tringe SG, Rubin EM. Metagenomics: DNA sequencing of environmental samples. Nat. Rev. Genet. [Internet]. Nature Publishing Group; 2005 [cited 2017 Mar 23];6:805?14. Available from: http://www.nature.com/doifinder/10.1038/nrg1709 3. Meth? BA, Nelson KE, Pop M, Creasy HH, Giglio MG, Huttenhower C, et al. A framework for human microbiome research. Nature [Internet]. Nature Research; 2012 [cited 2017 Mar 23];486:215?21. Available from: http://www.nature.com/doifinder/10.1038/nature11209 4. Heather JM, Chain B. The sequence of sequencers: The history of sequencing DNA. Genomics. Academic Press Inc.; 2016. p. 1?8. 5. Sims D, Sudbery I, Ilott NE, Heger A, Ponting CP. Sequencing depth and coverage: key considerations in genomic analyses. Nat. Rev. Genet. [Internet]. Nature Research; 2014 [cited 2017 Aug 5];15:121?32. Available from: http://www.nature.com/doifinder/10.1038/nrg3642 6. Kingsford C, Schatz MC, Pop M. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics [Internet]. 2010 [cited 2017 Apr 28];11:21. Available from: http://www.biomedcentral.com/1471-2105/11/21 7. Nagarajan N, Pop M. Parametric Complexity of Sequence Assembly: Theory and Applications to Next Generation Sequencing. J. Comput. Biol. [Internet]. 2009 [cited 2017 Apr 28];16:897?908. Available from: http://www.ncbi.nlm.nih.gov/pubmed/19580519 8. Compeau PEC, Pevzner PA, Tesler G. How to apply de Bruijn graphs to genome assembly. Nat. Biotechnol. [Internet]. 2011 [cited 2017 Mar 23];29:987?91. Available from: http://www.nature.com/nbt/journal/v29/n11/pdf/nbt.2023.pdf 9. Kececioglu JD, Myers EW. Combinatorial algorithms for DNA sequence assembly. Algorithmica [Internet]. Springer-Verlag; 1995 [cited 2017 Jun 28];13:7? 51. Available from: http://link.springer.com/10.1007/BF01188580 10. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science (80-. ). [Internet]. American Association for the Advancement of Science; 1995 [cited 2020 Jul 7];269:496?512. Available from: https://science.sciencemag.org/content/269/5223/496 11. Ann Liebert M, Sutton GG, White O, Adams MD, Kerlavage AR. TIGR Assembler: A New Tool for Assembling Large Shotgun Sequencing Projects [Internet]. GENOME Sci. Technol. 1995. Available from: www.liebertpub.com 12. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial sequencing and analysis of the human genome. Nature [Internet]. Nature; 2001 [cited 2020 Jul 7];409:860?921. Available from: https://pubmed.ncbi.nlm.nih.gov/11237011/ 89 13. Craig Venter J, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, et al. The sequence of the human genome. Science (80-. ). [Internet]. Science; 2001 [cited 2020 Jul 7];291:1304?51. Available from: https://pubmed.ncbi.nlm.nih.gov/11181995/ 14. Pevzner PA, Tang H, Waterman MS. An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. U. S. A. [Internet]. 2001 [cited 2017 Apr 28];98:9748?53. Available from: http://www.pnas.org/cgi/doi/10.1073/pnas.171285098 15. Nagarajan N, Pop M. Sequence assembly demystified. Nat. Rev. Genet. [Internet]. 2013 [cited 2017 Jun 15];14:157?67. Available from: http://www.nature.com/doifinder/10.1038/nrg3367 16. Miller JR, Koren S, Sutton G. Assembly algorithms for next-generation sequencing data. Genomics [Internet]. 2010 [cited 2017 Jun 15];95:315?27. Available from: http://linkinghub.elsevier.com/retrieve/pii/S0888754310000492 17. Salmela L, Walve R, Rivals E, Ukkonen E. Accurate self-correction of errors in long reads using de Bruijn graphs. [cited 2020 Aug 4]; Available from: http://www.cs.helsinki.fi/u/lmsalmel/LoRMA/. 18. Schatz MC, Phillippy AM, Sommer DD, Delcher AL, Puiu D, Narzisi G, et al. Hawkeye and AMOS: visualizing and assessing the quality of genome assemblies. [cited 2020 Jun 30]; Available from: http://amos.sourceforge.net. 19. Pop M, Phillippy A, Delcher AL, Salzberg SL. Comparative genome assembly. Brief. Bioinform. [Internet]. Oxford University Press; 2004 [cited 2017 Mar 23];5:237?48. Available from: https://academic.oup.com/bib/article- lookup/doi/10.1093/bib/5.3.237 20. Richter DC, Schuster SC, Huson DH. OSLay: optimal syntenic layout of unfinished assemblies. 2007 [cited 2020 Jun 30];23:1573?9. Available from: http://www-ab.informatik.unituebingen.de/software/oslay 21. Van Hijum SAFT, Zomer AL, Kuipers OP, Kok J. Projector 2: contig mapping for efficient gap-closure of prokaryotic genome sequence assemblies. [cited 2020 Jun 30]; Available from: http://genome.nhgri.nih.gov/blastall/ 22. Assefa S, Keane TM, Otto TD, Newbold C, Berriman M. ABACAS: algorithm- based automatic contiguation of assembled sequences. Bioinforma. Appl. NOTE [Internet]. 2009 [cited 2020 Jun 30];25:1968?9. Available from: http://abacas.sourceforge. 23. Husemann P, Stoye J. r2cat: synteny plots and comparative assembly. Bioinformatics [Internet]. 2010 [cited 2015 Feb 12];26:570?1. Available from: http://bioinformatics.oxfordjournals.org/content/26/4/570.full 24. Husemann P, Stoye J. Phylogenetic comparative assembly. Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics) [Internet]. Springer, Berlin, Heidelberg; 2009 [cited 2020 Jun 30]. p. 145?56. Available from: https://link.springer.com/chapter/10.1007/978-3-642-04241-6_13 25. Silva GG, Dutilh BE, David Matthews T, Elkins K, Schmieder R, Dinsdale EA, et al. Combining de novo and reference-guided assembly with scaffold_builder [Internet]. 2013. Available from: http://www.scfbm.org/content/8/1/23 26. Vezzi F, Cattonaro F, Policriti A. e-RGA: enhanced Reference Guided Assembly of Complex Genomes. EMBnet.journal [Internet]. 2011 [cited 2020 Jun 30];17:46? 54. Available from: 90 http://journal.embnet.org/index.php/embnetjournal/article/view/208/484 27. Scholz M, Tett A, Segata N. Computational Tools for Taxonomic Microbiome Profiling of Shotgun Metagenomes. Metagenomics Microbiol. Elsevier Inc.; 2015. p. 67?80. 28. Weisburg WG, Barns SM, Pelletier DA, Lane DJ. 16S ribosomal DNA amplification for phylogenetic study. J. Bacteriol. [Internet]. American Society for Microbiology (ASM); 1991 [cited 2020 Jul 7];173:697?703. Available from: /pmc/articles/PMC207061/?report=abstract 29. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, et al. Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol. 2009;75:7537?41. 30. Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP. DADA2: High-resolution sample inference from Illumina amplicon data. Nat. Methods [Internet]. Nature Publishing Group; 2016 [cited 2020 Jul 12];13:581?3. Available from: https://github.com/benjjneb/dada2 31. Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, et al. QIIME allows analysis of high-throughput community sequencing data [Internet]. Nat. Methods. Nature Publishing Group; 2010 [cited 2020 Jul 12]. p. 335? 6. Available from: http://qiime.sourceforge. 32. Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, et al. The SILVA ribosomal RNA gene database project: Improved data processing and web-based tools. Nucleic Acids Res. 2013;41. 33. DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol. 2006;72:5069?72. 34. Cole JR, Wang Q, Chai B, Tiedje JM. The Ribosomal Database Project: Sequences and Software for High-Throughput rRNA Analysis. Handb. Mol. Microb. Ecol. I [Internet]. Hoboken, NJ, USA: John Wiley & Sons, Inc.; 2011 [cited 2020 Jun 30]. p. 313?24. Available from: http://doi.wiley.com/10.1002/9781118010518.ch36 35. Nilsson RH, Larsson K-H, Taylor AFS, Bengtsson-Palme J, Jeppesen TS, Schigel D, et al. The UNITE database for molecular identification of fungi: handling dark taxa and parallel taxonomic classifications. Nucleic Acids Res. [Internet]. 2018 [cited 2020 Jun 30];47:259?64. Available from: https://www.postgresql.org/ 36. Case RJ, Boucher Y, Dahll?f I, Holmstr?m C, Doolittle WF, Kjelleberg S. Use of 16S rRNA and rpoB genes as molecular markers for microbial ecology studies. Appl. Environ. Microbiol. [Internet]. American Society for Microbiology (ASM); 2007 [cited 2020 Jul 12];73:278?88. Available from: /pmc/articles/PMC1797146/?report=abstract 37. Engelbrektson A, Kunin V, Wrighton KC, Zvenigorodsky N, Chen F, Ochman H, et al. Experimental factors affecting PCR-based estimates of microbial species richness and evenness. ISME J. [Internet]. Nature Publishing Group; 2010 [cited 2020 Jul 12];4:642?7. Available from: www.nature.com/ismej 38. Kennedy K, Hall MW, Lynch MDJ, Moreno-Hagelsieb G, Neufeld JD. Evaluating bias of Illumina-based bacterial 16S rRNA gene profiles. Appl. Environ. Microbiol. [Internet]. American Society for Microbiology; 2014 [cited 2020 Jul 91 14];80:5717?22. Available from: http://dx.doi.org/10.1128 39. Comeau AM, Douglas GM, Langille MGI. Microbiome Helper: a Custom and Streamlined Workflow for Microbiome Research. mSystems [Internet]. American Society for Microbiology; 2017 [cited 2020 Jul 12];2. Available from: /pmc/articles/PMC5209531/?report=abstract 40. Gevers D, Cohan FM, Lawrence JG, Spratt BG, Coenye T, Feil EJ, et al. Re- evaluating prokaryotic species [Internet]. Nat. Rev. Microbiol. Nature Publishing Group; 2005 [cited 2020 Jul 12]. p. 733?9. Available from: www.nature.com/reviews/micro 41. Ochman H, Wilson AC. Evolution in bacteria: Evidence for a universal substitution rate in cellular genomes [Internet]. J. Mol. Evol. Springer-Verlag; 1987 [cited 2020 Jul 13]. p. 377. Available from: https://link.springer.com/article/10.1007/BF02101157 42. Yamamoto S, Harayama S. Phylogenetic relationships of Pseudomonas putida strains deduced from the nucleotide sequences of gyrB, rpoD and 16S rRNA genes. Int. J. Syst. Bacteriol. [Internet]. Microbiology Society; 1998 [cited 2020 Jul 13];48:813?9. Available from: https://www.microbiologyresearch.org/content/journal/ijsem/10.1099/00207713-48- 3-813 43. Palys T, Berger E, Mitrica I, Nakamura LK, Cohan FM. Protein-coding genes as molecular markers for ecologically distinct populations: The case of two Bacillus species. Int. J. Syst. Evol. Microbiol. [Internet]. Society for General Microbiology; 2000 [cited 2020 Jul 13];50:1021?8. Available from: https://pubmed.ncbi.nlm.nih.gov/10843041/ 44. Mende DR, Sunagawa S, Zeller G, Bork P. Accurate and universal delineation of prokaryotic species. Nat. Methods [Internet]. Nature Publishing Group; 2013 [cited 2020 Jul 13];10:881?4. Available from: http://www.bork.embl.de/software/speci/ 45. Sunagawa S, Mende DR, Zeller G, Izquierdo-Carrasco F, Berger SA, Kultima JR, et al. Metagenomic species profiling using universal phylogenetic marker genes. Nat. Methods [Internet]. Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.; 2013 [cited 2014 Jul 18];10:1196?9. Available from: http://dx.doi.org/10.1038/nmeth.2693 46. Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P. Toward automatic reconstruction of a highly resolved tree of life. Science [Internet]. 2006 [cited 2014 Jul 10];311:1283?7. Available from: http://www.ncbi.nlm.nih.gov/pubmed/16513982 47. Tatusov RL, Galperin MY, Natale DA, Koonin E V. The COG database: a tool for genome-scale analysis of protein functions and evolution [Internet]. Nucleic Acids Res. 2000. Available from: http://www.ncbi.nlm.nih.gov/COG 48. Galperin MY, Makarova KS, Wolf YI, Koonin E V. Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Res. [Internet]. 2014 [cited 2020 Jul 14];43:261?9. Available from: http://www.ncbi.nlm.nih.gov/COG/ 49. Pruitt K, Brown G, Tatusova T, Maglott D. The Reference Sequence (RefSeq) Database [Internet]. National Center for Biotechnology Information (US); 2012 [cited 2015 Sep 17]. Available from: http://www.ncbi.nlm.nih.gov/books/NBK21091/ 92 50. O?Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. [Internet]. Oxford University Press; 2016 [cited 2017 Mar 27];44:D733-45. Available from: http://www.ncbi.nlm.nih.gov/pubmed/26553804 51. RefSeq: NCBI Reference Sequence Database [Internet]. [cited 2020 Jul 2]. Available from: https://www.ncbi.nlm.nih.gov/refseq/ 52. Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. [cited 2020 Jul 2]; Available from: http://www.ncbi.nlm.nih.gov/RefSeq/ 53. Ye SH, Siddle KJ, Park DJ, Sabeti PC. Benchmarking Metagenomics Tools for Taxonomic Classification [Internet]. Cell. Cell Press; 2019 [cited 2020 Jul 2]. p. 779? 94. Available from: https://doi.org/10.1016/j.cell.2019.07.010 54. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ, Yu Y, et al. Basic local alignment search tool. J. Mol. Biol. [Internet]. BioMed Central; 1990 [cited 2017 Mar 23];215:403?10. Available from: http://linkinghub.elsevier.com/retrieve/pii/S0022283605803602 55. Huson DH, Beier S, Flade I, G?rska A, El-Hadidi M, Mitra S, et al. MEGAN Community Edition - Interactive Exploration and Analysis of Large-Scale Microbiome Sequencing Data. Poisot T, editor. PLOS Comput. Biol. [Internet]. Public Library of Science; 2016 [cited 2016 Jul 19];12:e1004957. Available from: http://dx.plos.org/10.1371/journal.pcbi.1004957 56. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. [Internet]. J Mol Biol; 1990 [cited 2020 Jul 1];215:403?10. Available from: https://pubmed.ncbi.nlm.nih.gov/2231712/ 57. Altschul SF, Madden TL, Sch?ffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs [Internet]. Nucleic Acids Res. Oxford University Press; 1997. Available from: https://academic.oup.com/nar/article-abstract/25/17/3389/1061651 58. McCall C, Xagoraraki I. Comparative study of sequence aligners for detecting antibiotic resistance in bacterial metagenomes. Lett. Appl. Microbiol. [Internet]. Blackwell Publishing Ltd; 2018 [cited 2020 Jul 2];66:162?8. Available from: http://doi.wiley.com/10.1111/lam.12842 59. Smith TF, Waterman MS. Identification of common molecular subsequences. J. Mol. Biol. [Internet]. 1981 [cited 2014 Dec 11];147:195?7. Available from: http://www.ncbi.nlm.nih.gov/pubmed/7265238 60. Zhang Z, Schwartz S, Wagner L, Miller W. A Greedy Algorithm for Aligning DNA Sequences [Internet]. J. Comput. Biol. Mary Ann Liebert, Inc. Pp; 2000. Available from: www.liebertpub.com 61. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND [Internet]. Nat. Methods. Nature Publishing Group; 2014 [cited 2020 Jul 16]. p. 59?60. Available from: http://ab.inf.uni-tuebingen.de/software/diamond 62. Menzel P, Ng KL, Krogh A, Marth G, Lipman D. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. [Internet]. Nature Publishing Group; 2016 [cited 2017 Mar 23];7:11257. Available from: http://www.nature.com/doifinder/10.1038/ncomms11257 93 63. Burrows M, Burrows M, Wheeler DJ. A block-sorting lossless data compression algorithm. 1994 [cited 2017 Sep 25];16. Available from: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.3.8069 64. Ferragina P, Manzini G. Opportunistic data structures with applications. Annu. Symp. Found. Comput. Sci. - Proc. IEEE; 2000. p. 390?8. 65. Liu B, Gibbons T, Ghodsi M, Treangen T, Pop M. Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences. BMC Genomics [Internet]. 2011 [cited 2017 Mar 23];12:S4. Available from: http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-12-S2-S4 66. Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat. Methods [Internet]. Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.; 2012 [cited 2015 Mar 6];9:811?4. Available from: http://dx.doi.org/10.1038/nmeth.2066 67. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND [Internet]. Nat. Methods. Nature Publishing Group; 2014 [cited 2020 Jun 30]. p. 59?60. Available from: https://www.nature.com/articles/nmeth.3176 68. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat. Methods [Internet]. Nature Publishing Group, a division of Macmillan Publishers Limited. All Rights Reserved.; 2012 [cited 2014 Jul 10];9:357?9. Available from: http://dx.doi.org/10.1038/nmeth.1923 69. Ames SK, Hysom DA, Gardner SN, Lloyd GS, Gokhale MB, Allen JE. Scalable metagenomic taxonomy classification using a reference genome database. Bioinformatics [Internet]. Ottawa, Canada; 2013 [cited 2017 Jul 5];29:2253?60. Available from: https://academic.oup.com/bioinformatics/article- lookup/doi/10.1093/bioinformatics/btt389 70. Ounit R, Wanamaker S, Close TJ, Lonardi S. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics [Internet]. 2015 [cited 2017 Jul 5];16:236. Available from: http://www.biomedcentral.com/1471-2164/16/236 71. Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. [Internet]. 2014 [cited 2017 Mar 23];15:R46. Available from: http://genomebiology.biomedcentral.com/articles/10.1186/gb-2014- 15-3-r46 72. Ulyantsev VI, Kazakov S V., Dubinkina VB, Tyakht A V., Alexeev DG. MetaFast: fast reference-free graph-based comparison of shotgun metagenomic data. Bioinformatics [Internet]. Springer, Berlin/Heidelberg; 2016 [cited 2017 Jul 5];32:2760?7. Available from: https://academic.oup.com/bioinformatics/article- lookup/doi/10.1093/bioinformatics/btw312 73. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. [Internet]. 2016 [cited 2017 Jun 30];17:132. Available from: http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x 74. Kim D, Song L, Breitwieser FP, Salzberg SL. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. [Internet]. Cold Spring Harbor Laboratory Press; 2016 [cited 2017 Jul 9];26:1721?9. Available from: 94 http://www.ncbi.nlm.nih.gov/pubmed/27852649 75. Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol. [Internet]. BioMed Central Ltd.; 2019 [cited 2020 Jul 18];20:257. Available from: https://genomebiology.biomedcentral.com/articles/10.1186/s13059- 019-1891-0 76. Lu J, Breitwieser FP, Thielen P, Salzberg SL. Bracken: estimating species abundance in metagenomics data. PeerJ Comput. Sci. [Internet]. PeerJ Inc.; 2017 [cited 2017 Jul 9];3:e104. Available from: https://peerj.com/articles/cs-104 77. Ounit R, Lonardi S. Higher classification sensitivity of short metagenomic reads with CLARK-S. [cited 2020 Jul 2]; Available from: http://clark.cs.ucr.edu/ 78. Ondov BD, Starrett GJ, Sappington A, Kostic A, Koren S, Buck CB, et al. Mash Screen: High-throughput sequence containment estimation for genome discovery. Genome Biol. [Internet]. BioMed Central Ltd.; 2019 [cited 2020 Jul 2];20:232. Available from: https://genomebiology.biomedcentral.com/articles/10.1186/s13059- 019-1841-x 79. McIntyre ABR, Ounit R, Afshinnekoo E, Prill RJ, H?naff E, Alexander N, et al. Comprehensive benchmarking and ensemble approaches for metagenomic classifiers. Genome Biol. [Internet]. BioMed Central Ltd.; 2017 [cited 2020 Jul 3];18:182. Available from: https://genomebiology.biomedcentral.com/articles/10.1186/s13059- 017-1299-7 80. Meyer F, Bremges A, Belmann P, Janssen S, McHardy AC, Koslicki D. Assessing taxonomic metagenome profilers with OPAL. Genome Biol. [Internet]. BioMed Central Ltd.; 2019 [cited 2020 Jul 3];20:51. Available from: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1646-y 81. Sczyrba A, Hofmann P, Belmann P, Koslicki D, Janssen S, Dr?ge J, et al. Critical Assessment of Metagenome Interpretation - A benchmark of metagenomics software. Nat. Methods [Internet]. Nature Publishing Group; 2017 [cited 2020 Jul 3];14:1063? 71. Available from: https://www.nature.com/articles/nmeth.4458 82. Morowitz MJ, Denef VJ, Costello EK, Thomas BC, Poroyko V, Relman DA, et al. Strain-resolved community genomic analysis of gut microbial colonization in a premature infant. Proc. Natl. Acad. Sci. [Internet]. 2011 [cited 2017 May 1];108:1128?33. Available from: http://www.ncbi.nlm.nih.gov/pubmed/21191099 83. Treangen TJ, Abraham A-L, Touchon M, Rocha EPC. Genesis, effects and fates of repeats in prokaryotic genomes. FEMS Microbiol. Rev. [Internet]. 2009 [cited 2017 May 1];33:539?71. Available from: http://www.ncbi.nlm.nih.gov/pubmed/19396957 84. Koren S, Harhay GP, Smith TP, Bono JL, Harhay DM, Mcvey SD, et al. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biol. [Internet]. 2013 [cited 2017 Apr 26];14:R101. Available from: http://www.ncbi.nlm.nih.gov/pubmed/24034426 85. Namiki T, Hachiya T, Tanaka H, Sakakibara Y. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. [Internet]. 2012 [cited 2014 Jul 10];40:e155. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=3488206&tool=pmcentre z&rendertype=abstract 86. Greenwald WW, Klitgord N, Seguritan V, Yooseph S, Venter JC, Garner C, et al. 95 Utilization of defined microbial communities enables effective evaluation of meta- genomic assemblies. BMC Genomics [Internet]. 2017 [cited 2017 May 1];18:296. Available from: http://bmcgenomics.biomedcentral.com/articles/10.1186/s12864- 017-3679-5 87. Awad S, Irber L, Brown CT. Evaluating Metagenome Assembly on a Simple Defined Community with Many Strain Variants. bioRxiv [Internet]. 2017 [cited 2017 Jun 29]; Available from: http://www.biorxiv.org/content/early/2017/06/25/155358 88. Peng Y, Leung HCM, Yiu SM, Chin FYL. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics [Internet]. Oxford University Press; 2012 [cited 2017 Mar 23];28:1420?8. Available from: https://academic.oup.com/bioinformatics/article- lookup/doi/10.1093/bioinformatics/bts174 89. Peng Y, Leung HCM, Yiu SM, Chin FYL. IDBA ? A Practical Iterative de Bruijn Graph De Novo Assembler. Springer, Berlin, Heidelberg; 2010 [cited 2017 Apr 28]. p. 426?40. Available from: http://link.springer.com/10.1007/978-3-642-12683-3_28 90. Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. MEGAHIT: an ultra-fast single- node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics [Internet]. Oxford University Press; 2015 [cited 2017 Mar 23];31:1674?6. Available from: https://academic.oup.com/bioinformatics/article- lookup/doi/10.1093/bioinformatics/btv033 91. Bowe A, Onodera T, Sadakane K, Shibuya T. Succinct de Bruijn Graphs. Springer, Berlin, Heidelberg; 2012 [cited 2017 Apr 27]. p. 225?35. Available from: http://link.springer.com/10.1007/978-3-642-33122-0_18 92. Nurk S, Meleshko D, Korobeynikov A, Pevzner PA. metaSPAdes: a new versatile metagenomic assembler. Genome Res. [Internet]. Cold Spring Harbor Laboratory Press; 2017 [cited 2017 Mar 29];gr.213959.116. Available from: http://www.ncbi.nlm.nih.gov/pubmed/28298430 93. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. [Internet]. Mary Ann Liebert, Inc.; 2012 [cited 2017 Apr 27];19:455?77. Available from: http://www.ncbi.nlm.nih.gov/pubmed/22506599 94. Prjibelski AD, Vasilinetc I, Bankevich A, Gurevich A, Krivosheeva T, Nurk S, et al. ExSPAnder: a universal repeat resolver for DNA fragment assembly. Bioinformatics [Internet]. 2014 [cited 2017 May 1];30:i293?301. Available from: http://www.ncbi.nlm.nih.gov/pubmed/24931996 95. Laserson J, Jojic V, Koller D. Genovo: De Novo Assembly for Metagenomes. J. Comput. Biol. [Internet]. Mary Ann Liebert, Inc. 140 Huguenot Street, 3rd Floor New Rochelle, NY 10801 USA ; 2011 [cited 2017 Mar 23];18:429?43. Available from: http://www.liebertonline.com/doi/abs/10.1089/cmb.2010.0244 96. Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. [Internet]. Cold Spring Harbor Laboratory Press; 2008 [cited 2017 Mar 23];18:821?9. Available from: http://www.ncbi.nlm.nih.gov/pubmed/18349386 97. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. [Internet]. 2009 [cited 2014 Jul 11];19:1117?23. Available from: 96 http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2694472&tool=pmcentre z&rendertype=abstract 98. Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. [Internet]. 2010 [cited 2014 Jul 22];20:265?72. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2813482&tool=pmcentre z&rendertype=abstract 99. Staden R. A strategy of DNA sequencing employing computer programs. Nucleic Acids Res. [Internet]. Oxford University Press; 1979 [cited 2017 May 1];6:2601?10. Available from: http://www.ncbi.nlm.nih.gov/pubmed/461197 100. diCenzo GC, Finan TM. The Divided Bacterial Genome: Structure, Function, and Evolution. Microbiol. Mol. Biol. Rev. [Internet]. American Society for Microbiology; 2017 [cited 2020 Jul 16];81. Available from: https://doi.org/10.1128/ 101. Mikheenko A, Saveliev V, Gurevich A. MetaQUAST: evaluation of metagenome assemblies. Bioinformatics [Internet]. 2016 [cited 2017 Apr 26];32:1088?90. Available from: http://www.ncbi.nlm.nih.gov/pubmed/26614127 102. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. [Internet]. 2015 [cited 2015 May 20];gr.186072.114-. Available from: http://genome.cshlp.org/content/early/2015/05/14/gr.186072.114.abstract 103. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One [Internet]. Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America. Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America; VIB Department of Plant Systems Biology, Ghent University, Ghent, Belgium. Broa; 2014;9:e112963. Available from: http://dx.doi.org/10.1371/journal.pone.0112963 104. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. [Internet]. Cold Spring Harbor Laboratory Press; 2015 [cited 2017 Apr 26];25:1043?55. Available from: http://www.ncbi.nlm.nih.gov/pubmed/25977477 105. Sim? FA, Waterhouse RM, Ioannidis P, Kriventseva E V, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. [cited 2020 Jul 17]; Available from: https://academic.oup.com/bioinformatics/article-abstract/31/19/3210/211866 106. Mikheenko A, Saveliev V, Gurevich A. MetaQUAST: evaluation of metagenome assemblies. Bioinformatics [Internet]. 2016 [cited 2017 Mar 27];32:1088?90. Available from: http://www.ncbi.nlm.nih.gov/pubmed/26614127 107. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics [Internet]. 2013 [cited 2017 May 1];29:1072? 5. Available from: http://www.ncbi.nlm.nih.gov/pubmed/23422339 108. kmer-mask [Internet]. [cited 2020 Jul 2]. Available from: http://kmer.sourceforge.net/wiki/index.php?Main_Page) 109. Kultima JR, Sunagawa S, Li J, Chen W, Chen H, Mende DR, et al. MOCAT: a 97 metagenomics assembly and gene prediction toolkit. Gilbert JA, editor. PLoS One [Internet]. 2012 [cited 2017 Mar 28];7:e47656. Available from: http://dx.plos.org/10.1371/journal.pone.0047656 110. Li ? W, Godzik A. BIOINFORMATICS APPLICATIONS NOTE Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. 2006 [cited 2020 Jul 3];22:1658?9. Available from: http://cd-hit.org 111. Getting Started with Meryl - kmer [Internet]. [cited 2020 Jul 2]. Available from: http://kmer.sourceforge.net/wiki/index.php/Getting_Started_with_Meryl 112. McClelland M, Sanderson KE, Spieth J, Clifton SW, Latreille P, Courtney L, et al. Complete genome sequence of Salmonella enterica serovar Typhimurium LT2. Nature [Internet]. ASM Press; 2001 [cited 2020 Jul 3];413:852?6. Available from: http://www.sanger.ac.uk/Software/ 113. Shakya M, Quince C, Campbell JH, Yang ZK, Schadt CW, Podar M. Comparative metagenomic and rRNA microbial diversity characterization using archaeal and bacterial synthetic communities. Environ. Microbiol. [Internet]. 2013 [cited 2017 Mar 23];15:1882?99. Available from: http://doi.wiley.com/10.1111/1462- 2920.12086 114. Silva GGZ, Cuevas DA, Dutilh BE, Edwards RA. FOCUS: An alignment-free model to identify organisms in metagenomes using non-negative least squares. PeerJ [Internet]. PeerJ Inc.; 2014 [cited 2020 Aug 15];2014. Available from: /pmc/articles/PMC4060023/?report=abstract 115. Nguyen NP, Mirarab S, Liu B, Pop M, Warnow T. TIPP: Taxonomic identification and phylogenetic profiling. Bioinformatics [Internet]. Oxford University Press; 2014 [cited 2020 Aug 14];30:3548?55. Available from: https://pubmed.ncbi.nlm.nih.gov/25359891/ 116. Liu B, Gibbons T, Ghodsi M, Treangen T, Pop M. Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences. BMC Genomics [Internet]. 2011 [cited 2015 Feb 25];12 Suppl 2:S4. Available from: http://www.biomedcentral.com/1471-2164/12/S2/S4 117. Koslicki D, Foucart S, Rosen G. Quikr: A method for rapid reconstruction of bacterial communities via compressive sensing. Bioinformatics [Internet]. Bioinformatics; 2013 [cited 2020 Aug 14];29:2096?102. Available from: https://pubmed.ncbi.nlm.nih.gov/23786768/ 118. Klingenberg H, A?hauer KP, Lingner T, Meinicke P. Protein signature-based estimation of metagenomic abundances including all domains of life and viruses. Bioinformatics [Internet]. Bioinformatics; 2013 [cited 2020 Aug 15];29:973?80. Available from: https://pubmed.ncbi.nlm.nih.gov/23418187/ 119. Koslicki D, Falush D. MetaPalette: a k-mer Painting Approach for Metagenomic Taxonomic Profiling and Quantification of Novel Strain Variation. mSystems [Internet]. American Society for Microbiology; 2016 [cited 2020 Aug 15];1. Available from: /pmc/articles/PMC5069763/?report=abstract 120. Podell S, Ugalde JA, Narasingarao P, Banfield JF, Heidelberg KB, Allen EE. Assembly-driven community genomics of a hypersaline microbial ecosystem. Mormile MR, editor. PLoS One [Internet]. 2013 [cited 2017 Apr 28];8:e61692. Available from: http://dx.plos.org/10.1371/journal.pone.0061692 121. Narasingarao P, Podell S, Ugalde JA, Brochier-Armanet C, Emerson JB, Brocks 98 JJ, et al. De novo metagenomic assembly reveals abundant novel major lineage of Archaea in hypersaline microbial communities. ISME J. [Internet]. 2012 [cited 2017 Apr 28];6:81?93. Available from: http://www.nature.com/doifinder/10.1038/ismej.2011.78 122. Ji P, Zhang Y, Wang J, Zhao F. MetaSort untangles metagenome assembly by reducing microbial community complexity. Nat. Commun. [Internet]. 2017 [cited 2017 Apr 28];8:14306. Available from: http://www.nature.com/doifinder/10.1038/ncomms14306 123. Sangwan N, Xia F, Gilbert JA. Recovering complete and draft population genomes from metagenome datasets. Microbiome [Internet]. 2016 [cited 2017 Apr 28];4:8. Available from: http://www.microbiomejournal.com/content/4/1/8 124. Mukherjee S, Seshadri R, Varghese NJ, Eloe-Fadrosh EA, Meier-Kolthoff JP, G?ker M, et al. 1,003 reference genomes of bacterial and archaeal isolates expand coverage of the tree of life. Nat. Biotechnol. [Internet]. Nature Research; 2017 [cited 2017 Jul 20];35:676?83. Available from: http://www.nature.com/doifinder/10.1038/nbt.3886 125. Li H, Homer N. A survey of sequence alignment algorithms for next-generation sequencing. [cited 2020 Jul 1]; Available from: http://novocraft.com 126. Hatem A, Bozda? D, Toland AE, ?ataly?rek ? V. Benchmarking short sequence mapping tools. BMC Bioinformatics [Internet]. 2013 [cited 2017 Jul 5];14:184. Available from: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471- 2105-14-184 127. Wu Thomas D, Nacu Serban. Fast and SNP-tolerant Detection of Complex Variants and Splicing in Short Reads - PubMed. Bioinformatics [Internet]. 2010 [cited 2020 Jul 1];26:873?81. Available from: https://pubmed.ncbi.nlm.nih.gov/20147302/ 128. Novocraft [Internet]. [cited 2020 Jul 1]. Available from: http://www.novocraft.com/ 129. Alkan C, Kidd JM, Marques-Bonet T, Aksay G, Antonacci F, Hormozdiari F, et al. Personalized copy number and segmental duplication maps using next-generation sequencing. Nat. Genet. [Internet]. Nat Genet; 2009 [cited 2020 Jul 1];41:1061?7. Available from: https://pubmed.ncbi.nlm.nih.gov/19718026/ 130. Hach F, Hormozdiari F, Alkan C, Hormozdiari F, Birol I, Eichler EE, et al. mrsFAST: a cache-oblivious algorithm for short-read mapping. Nat. Methods [Internet]. Nature Research; 2010 [cited 2017 Jun 30];7:576?7. Available from: http://www.nature.com/doifinder/10.1038/nmeth0810-576 131. Misra S, Narayanan R, Lin S, Choudhary A. FANGS: High speed sequence mapping for next generation sequencers. Proc. ACM Symp. Appl. Comput. [Internet]. 2010 [cited 2020 Jul 1]. p. 1539?46. Available from: https://www.scholars.northwestern.edu/en/publications/fangs-high-speed-sequence- mapping-for-next-generation-sequencers 132. Li H, Ruan J, Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. [Internet]. Cold Spring Harbor Laboratory Press; 2008 [cited 2017 Jul 6];18:1851?8. Available from: http://www.ncbi.nlm.nih.gov/pubmed/18714091 133. Smith AD, Xuan Z, Zhang MQ. Using quality scores and longer reads improves 99 accuracy of Solexa read mapping. BMC Bioinformatics [Internet]. BMC Bioinformatics; 2008 [cited 2020 Jul 1];9. Available from: https://pubmed.ncbi.nlm.nih.gov/18307793/ 134. Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. [Internet]. 2009 [cited 2014 Jul 9];10:R25. Available from: http://genomebiology.com/2009/10/3/R25 135. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics [Internet]. 2009 [cited 2014 Jul 9];25:1754?60. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2705234&tool=pmcentre z&rendertype=abstract 136. Li R, Yu C, Li Y, Lam T-W, Yiu S-M, Kristiansen K, et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics [Internet]. Digital Equipment Corporation, CA; 2009 [cited 2017 Jul 6];25:1966?7. Available from: https://academic.oup.com/bioinformatics/article- lookup/doi/10.1093/bioinformatics/btp336 137. Ruffalo M, LaFramboise T, Koyuturk M. Comparative analysis of algorithms for next-generation sequencing read alignment. Bioinformatics [Internet]. Springer, Berlin/Heidelberg; 2011 [cited 2017 Jul 6];27:2790?6. Available from: https://academic.oup.com/bioinformatics/article- lookup/doi/10.1093/bioinformatics/btr477 138. Chouvarine P, Wiehlmann L, Moran Losada P, DeLuca DS, T?mmler B, Speed T. Filtration and Normalization of Sequencing Read Data in Whole-Metagenome Shotgun Samples. Dalby AR, editor. PLoS One [Internet]. Public Library of Science; 2016 [cited 2017 Jul 6];11:e0165015. Available from: http://dx.plos.org/10.1371/journal.pone.0165015 139. Petersen TN, Lukjancenko O, Thomsen MCF, Maddalena Sperotto M, Lund O, M?ller Aarestrup F, et al. MGmapper: Reference based mapping and taxonomy annotation of metagenomics sequence reads. An L, editor. PLoS One [Internet]. Public Library of Science; 2017 [cited 2017 Jul 6];12:e0176469. Available from: http://dx.plos.org/10.1371/journal.pone.0176469 140. Sequence Alignment/Map Format Specification [Internet]. 2020. Available from: https://github.com/samtools/hts-specs. 141. Li H, Barrett J. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. 2011 [cited 2020 Jul 3];27:2987?93. Available from: http://samtools.sourceforge.net 142. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. [Internet]. Cold Spring Harbor Laboratory Press; 2010 [cited 2020 Jul 3];20:1297?303. Available from: http://www.genome.org/cgi/doi/10.1101/gr.107524.110. 143. Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, et al. Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement. Wang J, editor. PLoS One [Internet]. Public Library of 100 Science; 2014 [cited 2017 Mar 23];9:e112963. Available from: http://dx.plos.org/10.1371/journal.pone.0112963 144. Vaser R, Sovi? I, Nagarajan N, ?iki? M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. [Internet]. Cold Spring Harbor Laboratory Press; 2017 [cited 2020 Jul 3];27:737?46. Available from: /pmc/articles/PMC5411768/?report=abstract 145. Zimin A V, Salzberg SL. The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies. [cited 2020 Jul 3]; Available from: https://doi.org/10.1101/2019.12.17.864991 146. Warren RL, Coombe L, Mohamadi H, Zhang J, Jaquish B, Isabel N, et al. ntEdit: scalable genome sequence polishing. [cited 2020 Jul 3]; Available from: https://academic.oup.com/bioinformatics/article-abstract/35/21/4430/5490204 147. Zimin A V, Mar? Ais G, Puiu D, Roberts M, Salzberg SL, Yorke JA. Genome analysis The MaSuRCA genome assembler. 2013 [cited 2020 Jul 3];29:2669?77. Available from: https://academic.oup.com/bioinformatics/article- abstract/29/21/2669/195975 148. Solomon B, Kingsford C. Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees. doi.org [Internet]. Cold Spring Harbor Laboratory; 2015 [cited 2017 Sep 25];017087. Available from: https://www.biorxiv.org/content/early/2015/03/26/017087 149. bcgsc/ntHits: Identifying repeats in high-throughput sequencing data [Internet]. [cited 2020 Jul 4]. Available from: https://github.com/bcgsc/nthits 150. Feige U. A threshold of ln n for approximating set cover. J. ACM [Internet]. ACM; 1998 [cited 2015 May 7];45:634?52. Available from: http://dl.acm.org/citation.cfm?id=285055.285059 151. Liu B. Title of dissertation: COMPUTATIONAL METAGENOMICS: NETWORK, CLASSIFICATION AND ASSEMBLY [Internet]. 2012. Available from: http://drum.lib.umd.edu/handle/1903/13278 152. Beller T, Ohlebusch E. A representation of a compressed de Bruijn graph for pan-genome analysis that enables search. Algorithms Mol. Biol. [Internet]. 2016 [cited 2017 Jul 9];11:20. Available from: http://almob.biomedcentral.com/articles/10.1186/s13015-016-0083-7 153. Baier U, Beller T, Ohlebusch E. Graphical pan-genome analysis with compressed suffix trees and the Burrows?Wheeler transform. Bioinformatics [Internet]. Oldenbusch Verlag, Bremen, Germany,; 2016 [cited 2017 Jul 9];32:497? 504. Available from: https://academic.oup.com/bioinformatics/article- lookup/doi/10.1093/bioinformatics/btv603 154. Marcus S, Lee H, Schatz MC. SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics [Internet]. 2014 [cited 2017 Jul 9];30:3476? 83. Available from: http://www.ncbi.nlm.nih.gov/pubmed/25398610 155. Holley G, Wittler R, Stoye J. Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. Algorithms Mol. Biol. [Internet]. 2016 [cited 2017 Jul 9];11:3. Available from: http://almob.biomedcentral.com/articles/10.1186/s13015-016-0066-8 156. Almodaresi F, Sarkar H, Srivastava A, Patro R. A space and time-efficient index for the compacted colored de Bruijn graph. [cited 2020 Jul 5]; Available from: 101 https://academic.oup.com/bioinformatics/article-abstract/34/13/i169/5045749 157. Anari SS, Ridder D de, Schranz ME, Smit S. Pangenomic read mapping. bioRxiv [Internet]. Cold Spring Harbor Laboratory; 2019 [cited 2020 Jul 5];813634. Available from: https://doi.org/10.1101/813634 158. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics [Internet]. 2013 [cited 2017 Mar 28];29:1072?5. Available from: http://www.ncbi.nlm.nih.gov/pubmed/23422339 159. HMP2 assembly details [Internet]. Available from: http://gembox.cbcb.umd.edu/metacompass 160. Li K, Bihan M, Yooseph S, Meth? BA, Ludwig W. Analyses of the Microbial Diversity across the Human Microbiome. Xu P, editor. PLoS One [Internet]. Public Library of Science; 2012 [cited 2017 Mar 23];7:e32118. Available from: http://dx.plos.org/10.1371/journal.pone.0032118 161. Li H. Minimap2: pairwise alignment for nucleotide sequences. [cited 2020 Jul 5]; Available from: https://github.com/ruanjue/smartdenovo; 162. Puigb? P, Makarova KS, Kristensen DM, Wolf YI, Koonin E V. Reconstruction of the evolution of microbial defense systems. BMC Evol. Biol. [Internet]. 2017 [cited 2017 May 6];17:94. Available from: http://www.ncbi.nlm.nih.gov/pubmed/28376755 163. Belkaid Y, Segre JA. Dialogue between skin microbiota and immunity. Science (80-. ). [Internet]. 2014 [cited 2017 May 6];346:954?9. Available from: http://www.ncbi.nlm.nih.gov/pubmed/25414304 164. Ambur OH, Davidsen T, Frye SA, Balasingham S V, Lagesen K, Rognes T, et al. Genome dynamics in major bacterial pathogens. FEMS Microbiol. Rev. [Internet]. 2009 [cited 2017 May 6];33:453?70. Available from: http://www.ncbi.nlm.nih.gov/pubmed/19396949 102