ABSTRACT Title of Dissertation: STUDYING TUMOR EVOLUTION WITH EMERGING SEQUENCING TECHNOLOGIES Yuelin Liu Doctor of Philosophy, 2025 Dissertation Directed by: Professor Mihai Pop Department of Computer Science Doctor S. Cenk Sahinalp National Institutes of Health The path along which heterogeneous subpopulations within a tumor arise can be modeled as an evolutionary process, and a better understanding of such process may improve cancer diagnostics, prognostics, and treatment outcome. How to construct more accurate tumor phylogenies and how to leverage the inferred phylogenetic relations among tumor subclones in understanding the disease are both key in the study of tumor evolution. The advances in novel sequencing technologies present op- portunities to examine crucial aspects of the evolution of a tumor, beyond the types of information short-read genomic and transcriptomic sequencing technologies offer. As such, this work leverages three such technologies - single-cell bisulfite sequencing, long-read sequencing, and spatial transcrip- tomics - in answering key questions in tumor evolution. In the first part of the thesis we present the first distance-based method to jointly construct a tumor phylogeny and identify lineage-informative CpG sites from single-cell methylation sequencing data. The tool provides a fast and accurate way to examine tumor evolution through the methylome at a single-cell resolution and allows the incorporation of matched RNA sequencing information whenever applicable. Next, we present a novel framework to study both the genomic and epigenomic changes during tumor evolution using Nanopore long-read sequencing data of single-cell derived tumor sublines. Sin- gle nucleotide variants, structural variants, copy number alterations, as well as CpG methylation were profiled. The clonal resolution this data set uniquely affords allowed the inference of active mutational processes, confident identification of parallel CNAs, and detection of differentially methylated regions, all at a subclonal level. Finally, we present a framework to examine tumor subclones at spatial resolution. Melanoma tumors from mice in both control and anti-CTLA-4 treatment group were subjected to both Smart-seq2 single-cell full length transcriptomics sequencing and 10x Visium spatial transcriptomics sequencing. Six tumor subclones were identified using mutations profiled from the Smart-seq2 data, and their proportion were estimated from the spatial transcriptomics slides via cell type deconvolution. The associations between tumor subclones with various stromal and immune cell types were accessed with established spatial statistics. By contrasting subclone-immune activities between responding and non- responding tumors, we show that there exist subclonal differences in their response to the immune microenvironment. STUDYING TUMOR EVOLUTION WITH EMERGING SEQUENCING TECHNOLOGIES by Yuelin Liu Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2025 Advisory Committee: Professor Mihai Pop, Chair/Advisor Dr. S. Cenk Sahinalp, Co-Advisor Professor Rob Patro Professor David M. Mount Professor Stephen M. Mount © Copyright by Yuelin Liu 2025 Acknowledgments The PhD journey has been full of challenges in ways I could not even have previously under- stood, and there’s a long list of people I would like to acknowledge who have helped me get to where I am today. First, I want to acknowledge my collaborators and members of the Cancer Data Science Labo- ratory at the National Cancer Institute. In particular, I would like to thank members of the Sahinalp lab, with whom I have had countless productive discussions even when we are not working on the same project. Especially Chih Hao, John, and Jake, who have consistently driven me to and from the Bethesda Campus of the National Institutes of Health. I for sure owe them dinner, and probably a lot of gas money. I would like to thank past and present students and faculty of the Center for Bioinformatics and Computational Biology at the University of Maryland, who made me immediately at home when I first set foot in IRB-3112. Specifically, I would like to thank Zinat and Noor, who were in the same year as me and have supported each other through losing our first advisors, and Hirak, who imparted with me so much wisdom in my first year when he left for a postdoc, and is doing that once again now as he settles in a faculty position at Vanderbilt. I would like to thank the past and present members of CS GradCo at the University of Maryland, and all the friends I have met while organizing for it. It is amazing to be among such community- minded people. From GradCo Day, to fall picnic, and to the very brief time we tried to start a jazz band, I cannot believe what we were able to pull off. I would like to thank my past and present housemates at 48th Avenue who have witnessed ourselves at our best as much as at our worst, all in between Costco runs and Biryani City orders. ii I would like to thank the DC swing dance community for bringing me so much joy through our collective love for the art form. I would like to thank the very dear friends in my life for inspiring me to be the best version of myself, Yang, Tori, Yujia, Daniel, Ann, and especially Jwi, Shramay, and Gibson for making sure I did not forget to eat in the weeks leading up to my defense. There are three mentors who have played instrumental roles in this process. I would like to thank Dr. Mihai Pop, who has been tremendously supportive during Max’s departure from UMD and over the years. He has been a kind mentor and have set an example for all of us students in CBCB. I would like to thank Dr. Lenore Cowen, my undergrad advisor, who changed my life in the literal sense by introducing me to computational biology and being the first to say “I think you can be good at this”. And, of course, I would like to thank Dr. S. Cenk Sahinalp, who supervised all the work in this thesis, who has not only taught me a lot about science, but also a whole lot about being a scientist. His guidance, both spoken and unspoken, has been truly invaluable in this journey. Last but not least, I would like to thank my parents, Fei Dai and Zhuangcheng Liu, who have sacrificed the unimaginable in making me who I am today. Without their unconditional love and support, none of this would ever have been possible. iii Table of Contents Acknowledgements ii Table of Contents iv List of Tables viii List of Figures ix List of Abbreviations xii Chapter 1: Introduction 1 1.1 Cancer as a heterogeneous disease . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Modeling how tumor heterogeneity arises as an evolutionary process . . . . . . . . . . 1 1.3 Advancements in sequencing technologies further understanding of tumor evolution . . 3 1.4 Phylogeny inference using multi-sampled bulk data . . . . . . . . . . . . . . . . . . . 5 1.5 Phylogeny inference using single-cell data . . . . . . . . . . . . . . . . . . . . . . . . 6 1.6 DNA methylation as phylogenetic marker . . . . . . . . . . . . . . . . . . . . . . . . 7 1.7 Long-read sequencing technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.8 Sequencing-based spatial transcriptomics . . . . . . . . . . . . . . . . . . . . . . . . 10 1.9 Outline of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 Chapter 2: Reconstructing tumor phylogenies with single-cell CpG methylation data using Sgootr 14 2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.1 Biclustering of cells and sites for reducing sparsity-induced noise . . . . . . . 17 2.2.2 Likelihood-based sequencing error correction accounting for copy number es- timates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.2.3 Expected distance calculation between cell pairs for tree construction . . . . . 21 2.2.4 Pruning of CpG sites according to a tree-based methylation status persistence measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.2.5 Inference of migration history from lesion-labeled tumor lineage tree leverag- ing prior knowledge on migration patterns . . . . . . . . . . . . . . . . . . . . 28 2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.3.1 Sgootr obtains a simple tumor migration history with scBS-seq and scRNA- seq data from patient CRC01. . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.3.2 Sgootr infers migration histories simpler than alternative methods for the Bian CRC cohort. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.3.3 Sgootr-identified lineage-informative CpG sites construct lineages suggest- ing simpler migration histories than those identified via FG. . . . . . . . . . . 38 2.3.4 Sgootr is orders-of-magnitude faster than IQ-TREE. . . . . . . . . . . . . . 38 iv 2.3.5 Sgootr-identified lineage-informative CpG sites are enriched in inter-CGI regions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.3.6 Genes associated with Sgootr-identified lineage-informative CpG sites are enriched for pan-cancer and CRC-related terms. . . . . . . . . . . . . . . . . . 39 2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 Chapter 3: Studying melanoma evolution with Nanopore long-read sequencing data 42 3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.2.1 Long-read sequencing of single-cell derived subclones of B2905 mouse melanoma model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.2.2 Constructed phylogeny reveals phenotypically-distinct major clades . . . . . . 46 3.2.3 Phylogeny-guided harmonization with TreeHarmonizer provides evolu- tionary timing of genomic variants . . . . . . . . . . . . . . . . . . . . . . . . 48 3.2.4 Evolutionary dynamics of somatic SNVs and mutational processes . . . . . . . 49 3.2.5 Somatic SVs dynamics suggest early drivers and late genomic instability . . . . 51 3.2.6 Recurrent CNAs in independent lineages suggest parallel adaptation process . . 53 3.2.7 Methylation changes contribute to differentiation of an aggressive clonal lineage 56 3.2.8 Systematic analysis of multiple variation types provides insights into B2905 lineages differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 3.2.9 Curated set of subclonal variant calls enables benchmarking of long-read tools . 62 3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 Chapter 4: Studying melanoma subclones at spatial resolution 66 4.1 Spatial transcriptomics profiling of the control and anti-CTLA-4 treated tumors. . . . . 66 4.2 Identifying tumor subclones from Smart-seq2 full-length scRNA-seq data of the con- trol and anti-CTLA-4 treated tumors. . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4.3 Estimating tumor subclone proportions on the ST slides via deconvolution using SpaCET 68 4.4 Validation of SpaCET-deconvolved tumor subclone proportions . . . . . . . . . . . . . 69 4.5 Analysis of SpaCET-deconvolved tumor subclone proportions . . . . . . . . . . . . . 73 4.5.1 SpaCET-deconvolved proportions reveal differences in subclone growth dy- namics between treatment responders and non-responders . . . . . . . . . . . . 73 Chapter 5: Concluding remarks 76 Appendix A: Supplemental Material for Chapter 2 80 A.1 Biclustering: benchmarking on simulated data and a case study on CRC01 . . . . . . . 80 A.1.1 Running time and memory benchmark on simulated data . . . . . . . . . . . . 81 A.1.2 Biclustering case study on CRC01 from the Bian cohort . . . . . . . . . . . . . 85 A.2 Benchmarking Sgootr on simulated data . . . . . . . . . . . . . . . . . . . . . . . . 89 A.2.1 Expected distance calculation on simulated data . . . . . . . . . . . . . . . . . 89 A.2.2 Iterative pruning on simulated data . . . . . . . . . . . . . . . . . . . . . . . . 94 A.3 Application of Sgootr on single cells from U87MG ex vivo tree from the RETrace study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 v A.4 Application of Sgootr on the complete set of 9 metastatic CRC patients made avail- able by Bian and colleagues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 A.5 Application of Sgootr on the cohort, replacing NJ with FastME 2.0 . . . . . . . . . . 111 A.6 Patient CRC01 tumor lineage tree constructed using mutation . . . . . . . . . . . . . . 115 A.7 Application of Sgootr to multi-regional GBM patient MGH105 MscRRBS data by Chaligne and collegues. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 A.8 How to choose p, P(1c), P(0c), and P(mixed) for Sgootr . . . . . . . . . . . . . . . . 118 A.9 How to choose κ for Sgootr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 A.10 Experiment details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 A.10.1 Sgootr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 A.10.2 FastME 2.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 A.10.3 IQ-TREE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 A.10.4 FG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 Appendix B: Supplemental Material for Chapter 3 124 B.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 B.1.1 Constructing tumor phylogeny using long read sequencing data from 23 sublines124 B.1.2 Determining the phenotypes of the 4 major clades . . . . . . . . . . . . . . . . 125 B.1.3 Assessing the effectiveness of TreeHarmonizer in determining the place- ment of variants impacted by filtering during tree construction . . . . . . . . . 128 B.1.4 Mutational signature analysis along the tumor phylogeny . . . . . . . . . . . . 129 B.1.5 Computing dN/dS values along the tumor phylogeny . . . . . . . . . . . . . . 130 B.1.6 Pathway enrichment analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 B.1.7 Determining genes impacted by SVs . . . . . . . . . . . . . . . . . . . . . . . 131 B.1.8 Determining the set of cancer-relevant mouse genes-of-interest according to the COSMIC Gene Census . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 B.1.9 Timing analysis of chr4 copy-neutral LOH . . . . . . . . . . . . . . . . . . . . 134 B.1.10 Subclone-specific differential methylation analysis . . . . . . . . . . . . . . . 137 B.1.11 Permutation test on the number of DMRs obtained . . . . . . . . . . . . . . . 137 B.1.12 Correlation between the number of DMRs and cumulative SNVs at a branch in the phylogeny . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 B.1.13 Differential gene expression analysis of subline scRNA-seq data . . . . . . . . 138 B.1.14 Discovering key regions where changes in methylation is monotonic along the evolutionary trajectory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 B.1.15 Permutation test on the number of genes associated with monotonic methyla- tion changes along the evolutionary trajectory . . . . . . . . . . . . . . . . . . 140 B.1.16 Calling copy number from bWES and Smart-Seq2 full-length scRNA-seq data on the 23 sublines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 B.2 Technology comparison: long-read, bWES, and scRNA-seq . . . . . . . . . . . . . . . 141 B.3 Additional Supplementary Figures and Tables . . . . . . . . . . . . . . . . . . . . . . 145 Appendix C: Supplemental Material for Chapter 4 161 C.1 SpaCET deconvolution results on 17 ST slides . . . . . . . . . . . . . . . . . . . . . . 161 C.1.1 Control STs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 C.1.2 Treated STs (Non-responders) . . . . . . . . . . . . . . . . . . . . . . . . . . 165 vi C.1.3 Treated STs (Responders) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168 C.2 Validation of SpaCET-deconvolved tumor subclone proportions on 17 ST slides . . . . 171 C.2.1 Control STs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 C.2.2 Treated STs (Non-responders) . . . . . . . . . . . . . . . . . . . . . . . . . . 174 C.2.3 Treated STs (Responders) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 Bibliography 180 vii List of Tables A.1 Feature sizes of the biclustering formulation model as a function of the number of cells, v, and the number of CpG sites, u in the input. . . . . . . . . . . . . . . . . . . . . . . 81 A.2 Densities of the original binary matrix and the subsampled matrices. . . . . . . . . . . 87 A.3 Data summary of the Bian metastatic CRC cohort. . . . . . . . . . . . . . . . . . . . . 102 A.4 Overlap coefficients of GO terms associated with different genomic locations (CpG island and inter-CGI) discovered by GOrilla and DAVID. . . . . . . . . . . . . . . . . 110 A.5 Spearman correlation of CpG site distance from closest CpG island and homogeneity. . 111 B.1 Phenotypes of the 4 major clades. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 B.2 DMRs associated with the red clade that overlap putative promoter or putative proxi- mal enhancers of known genes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157 viii List of Figures 1.1 Illustration of the Nowell’s theory of tumor evolution. . . . . . . . . . . . . . . . . . . 3 1.2 Illustration of tumor phylogeny reconstruction from multi-sampled bulk data. . . . . . 5 1.3 Illustration of tumor phylogeny reconstruction from single-cell data. . . . . . . . . . . 7 2.1 Overview of Sgootr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2 Application of Sgootr to patient CRC01 scTrio-seq data by Bian and colleagues. . . 33 2.3 Overview of results from the Bian metastatic CRC cohort. . . . . . . . . . . . . . . . . 35 3.1 An overview of the long-read approach and analysis of 23 subclones of a mouse melanoma tumor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.2 Phylogenetic tree and mutational landscape of 23 single-cell derived sublines. . . . . . 47 3.3 Patterns and dynamics of somatic SVs. . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.4 Genomic regions recurrently affected by CNAs suggest parallel evolution. . . . . . . . 54 3.5 DMRs detected at subclonal resolution. . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.6 A model of B2905 melanoma evolution informed by analyses of long-read sequencing data on the 23 sublines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 3.7 Subline-derived pseudobulk samples enables benchmarking the performances of long- read variant callers at various clonality. . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.1 Growth dynamics of the 17 tumors from which the ST slides were derived. . . . . . . . 67 4.2 Mutations profiled from Smart-seq2 full-length scRNA-seq data of control and treated tumors reveal 6 major tumor subclones. . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.3 Validation of tumor subclone proportions estimated by SpaCET using subclone-specific mutations detected on ST slides. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.4 Deconvolved subclone proportions inform differences in tumor growth dynamics. . . . 74 A.1 Patient CRC01 cell and site coverate level distribution . . . . . . . . . . . . . . . . . . 80 A.2 Running time and memory behavior on 200 simulated square binary matrices. . . . . . 84 A.3 Mean and standard deviation of wall clock time-to-convergence in hours, optimality gap at 20 hours, and resulting submatrix density across the 10 subsamples, over a range of (α, β) value pairs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 A.4 Probability mass functions of A. ZTPois(λ), where λ = {1, 2, 3, 4, 5, 6} and B. the empirical per-site read coverage distribution from patient CRC01. . . . . . . . . . . . 91 A.5 Baseline distance measure and expected distance measure on 13, 500 sets of read count vectors, simulated from cell pairs with known normalized ground truth distance π = {.1, .2, .3, .4, .5, .6, .7, .8, .9}, with level of read coverage depth modeled by ZTPois(λ), where λ = {1, 2, 3, 4, 5, 6}. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 A.6 Baseline distance measure and expected distance measure on 2, 250 sets of read count vectors, simulated from cell pairs with known normalized ground truth distance π = {.1, .2, .3, .4, .5, .6, .7, .8, .9}, with level of read coverage depth modeled by the empir- ical read coverage distribution gathered from the CRC01 sample. . . . . . . . . . . . . 94 A.7 Topologies of the simulated clone trees. . . . . . . . . . . . . . . . . . . . . . . . . . 95 A.8 Results on 10 simulated clone trees from Sgootr and baseline method. . . . . . . . . 98 ix A.9 Application of Sgootr to scRRBS data from U87MG ex vivo tree by the RETrace study. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 A.10 Sgootr-constructed (κ = .1) tumor lineage trees (t∗s) and Sgootr-inferred tumor migration histories for the 9 patients in the Bian cohort . . . . . . . . . . . . . . . . . 104 A.11 RF distances across iterations for each patient in the metastatic CRC cohort applying Sgootr with different κ values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 A.12 GOrilla -identified GO terms with significant (<.05) q-values in enrichment analysis of non-pseudo, protein coding genes whose gene bodies and 1000bp upstream pro- moter regions contain the Sgootr-identified lineage-informative CpG sites, stratified by whether they locate in CpG island or inter-CGI regions. . . . . . . . . . . . . . . . 107 A.13 DAVID-identified GO terms with significant (<.05) q-values in enrichment analysis of non-pseudo, protein coding genes whose gene bodies and 1000bp upstream pro- moter regions contain the Sgootr-identified lineage-informative CpG sites, stratified by whether they locate in CpG island or inter-CGI regions. . . . . . . . . . . . . . . . 109 A.14 Median number of bp to the closest CGI from CpG sites used to construct the methy- lation lineage tree at each iteration for all CRC patients and GBM patient MGH105. . . 110 A.15 Sgootr-constructed (κ = .1) tumor lineage trees (t∗s) and Sgootr-inferred tumor migration histories for the 9 patients in the Bian cohort, replacing NJ with FastME 2.0. 113 A.16 Number of migration events and BAMDPO of migration histories obtained via apply- ing Sgootr with FastME 2.0 in comparison to NJ. . . . . . . . . . . . . . . . . . . . 113 A.17 RF distances across iterations for each patient in the Bian metastatic CRC cohort ap- plying Sgootr - substituting NJ with FastME 2.0 - with different κ values. . . . . . . 114 A.18 Tumor lineage tree based on SNVs/INDELs obtained from scRNA-seq data from pa- tient CRC01. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 A.19 Application of Sgootr to patient MGH105 MscRRBS data by Chaligne and colleagues.117 A.20 FG frequencies (averaged across 5 runs) for the Bian cohort. . . . . . . . . . . . . . . 122 A.21 Pearson product-moment correlation coefficients between FG frequencies across sites between downsampled runs for the Bian cohort. . . . . . . . . . . . . . . . . . . . . . 123 B.1 DGI of sublines in C57BL/6 and nude mice. . . . . . . . . . . . . . . . . . . . . . . . 127 B.2 Mutational signature activities along the tumor phylogeny. . . . . . . . . . . . . . . . 130 B.3 SNVs along the tumor phylogeny. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 B.4 SNV evidence for the timing clonal CNA events. . . . . . . . . . . . . . . . . . . . . 135 B.5 Comparison among long-read, bWES, and scRNA-seq data of the 23 sublines. . . . . . 142 B.6 SNVs and SVs placed by TreeHarmonizer. . . . . . . . . . . . . . . . . . . . . . 145 B.7 SNV density and SV events present in individual sublines in the green clade. . . . . . . 146 B.8 SNV density and SV events present in individual sublines in the blue clade. . . . . . . 147 B.9 SNV density and SV events present in individual sublines in the red clade. . . . . . . . 148 B.10 SV-disrupted genes and gene fusion annotated by Padfoot. . . . . . . . . . . . . . . 149 B.11 Mouse homologs of COSMIC genes impacted by (1) non-synonymous SNV, (2) SV breakpoint, (3) CNA, or (4) (hypo/hyper-) DMRs in their (putative) enhancers or pro- moters in 23 sublines (1/7). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 B.12 Mouse homologs of COSMIC genes impacted by (1) non-synonymous SNV, (2) SV breakpoint, (3) CNA, or (4) (hypo/hyper-) DMRs in their (putative) enhancers or pro- moters in 23 sublines (3/7). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 x B.13 Mouse homologs of COSMIC genes impacted by (1) non-synonymous SNV, (2) SV breakpoint, (3) CNA, or (4) (hypo/hyper-) DMRs in their (putative) enhancers or pro- moters in 23 sublines (4/7). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 B.14 Mouse homologs of COSMIC genes impacted by (1) non-synonymous SNV, (2) SV breakpoint, (3) CNA, or (4) (hypo/hyper-) DMRs in their (putative) enhancers or pro- moters in 23 sublines (5/7). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 B.15 Mouse homologs of COSMIC genes impacted by (1) non-synonymous SNV, (2) SV breakpoint, (3) CNA, or (4) (hypo/hyper-) DMRs in their (putative) enhancers or pro- moters in 23 sublines (6/7). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 B.16 Mouse homologs of COSMIC genes impacted by (1) non-synonymous SNV, (2) SV breakpoint, (3) CNA, or (4) (hypo/hyper-) DMRs in their (putative) enhancers or pro- moters in 23 sublines (7/7). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 B.17 Permutation tests on the number of DMRs obtained for all subtree sizes of the phy- logeny in Figure 3.2a. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 B.18 Associating truncating Dnah6 mutation with (haplotype-specific) methylation of Dnah6 promoter in major clades. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 B.19 CNA profiles of 23 sublines obtained using long-read, bWES, and scRNA-seq data. . . 160 xi xii List of Abbreviations AID Activation-induced cytidine deaminase ANOVA Analysis of variance ASM Allele-specific methylation BAMDPO Binary adjacency matrix distance from partial order bp Base pair(s) bWES Bulk whole-exome sequencing cDNA complementary DNA CGC Cancer Gene Census CGI CpG island chr Chromosome CLL Chronic lymphocitic leukemia COSMIC Catalogue Of Somatic Mutations In Cancer CpG Cytosine-guanine CNA Copy number aberration CRC Colorectal cancer dbSNP The Single Nucleotide Polymorphism Database DGI Delay of growth initiation DMR Differentially methylated region DNA Deoxyribonucleic acid FG Four gametes GATK Genomic Analysis Toolkit GBM Glioblastoma GENCODE Encyclopedia of genes and gene variants GO Gene Ontology HSD Honestly Significant Difference ICB Immune checkpoint blockade INDEL Insertion-deletion kb kilo-base pairs LN Lymph node metastasis LOH Loss of heterozygosity xiii ML Liver metastasis MP Post-treatment liver metastasis mRNA Messenger RNA MRCA Most-recent common ancestor MscRRBS Multiplexed single-cell reduced representation bisulfite sequencing NC Normal colon tissue ND Not designated NJ Neighbor joining ONC (Proto-)Oncogene ONT Oxford Nanopore Technologies PacBio Pacific Biosciences PT Primary tumor RF Robinson-Foulds RNA Ribonucleic acid ROS Reactive oxygen species SBS Single base substitution scBS-seq Single-cell bisulfite sequencing scRNA-seq Single-cell RNA sequencing scRRBS Single-cell reduced representation bisulfite sequencing scTrio-seq Single-cell triple omics sequencing SNP Single nucleotide polymorphism SNV Single nucleotide variant ST Spatial transcriptomics SV Structural variant TCGA The Cancer Genome Atlas Program TSG Tumor Suppressor Gene T2T Telomere-to-Telomere UV Ultraviolet (radiation) WGS Whole-genome sequencing 5mC 5-methylcytosine xiv Chapter 1: Introduction 1.1 Cancer as a heterogeneous disease Cancer is a group of diseases characterized by the abnormal growth of cells with the potential to spread to other parts of the body. It is highly heterogeneous. Different tumor types are characterized by the primary tissue or cell type of origin. Treatment decisions for different tumor types are frequently specific to the primary site, since tumors in different tissues likely are caused by different endogenous or exogenous processes. Within each tumor type, there could be further molecular subclassifications based on histological or genomic profiling. Distinguishing among different tumor subtypes is im- portant, as they may differ in invasiveness, metastatic potential, proneness to relapse, prognosis, and response to specific treatment. These types of differences we call inter-tumor heterogeneity. Looking within a single tumor, there is extensive heterogeneity as well. There may be cellular subpopulations that differ by genetic or non-genetic characteristics. We call each group of these cells subclones. From a more clinically relevant perspective, how well a cancer patient responds to a type of treatment may be explained by the tumor’s subclone composition. For example, treatment resistance and relapse may be caused by the presence of a stubborn subclone unaffected by treatment. If we can better characterize such intra-tumor heterogeneity, we may achieve more accurate cancer diagnostics, prognostics, and more effective and targeted treatment. The latter, intra-tumor heterogeneity, will be the focus of this thesis. 1.2 Modeling how tumor heterogeneity arises as an evolutionary process The model with which we study how such intra-tumor heterogeneity arises has its root in species evolution. In his revolutionary work in 1859, The Origin of Species, Charles Darwin provided a 1 framework which enabled the understanding of somatic selection, diversification and extinction [1]. His famous 1837 drawing of an evolutionary tree diagram was the first to describe the phylogenetic relations among multiple species. Incidentally, exactly a hundred years after Darwin’s publication of The Origin of Species, in 1959, a predoctoral fellow by the name of David Hungerford was working in a genetics lab in Philadel- phia, when he observed under the microscope that the white blood cells from patients with chronic myelogenous leukemia had an abnormally short chromosome 22. Together with a pathologist at Uni- veristy of Pennsylvania named Peter Nowell, they published this observation in 1960 [2], and named the abnormality the Philadelphia chromosome after where the discovery was made. It may be an indisputable fact at the time of the writing of this thesis, but back then, cancer was not believed to be a genetic disease, and no consistent theory of how cancer arises was recognized. The discovery of Philadelphia chromosome provided the first piece of evidence that, alterations in one’s DNA may lead to cancer. More than a decade later, using a novel staining technology, Dr. Janet Rowley at the University of Chicago observed that the part missing from the end of chromosome 22 seems to be relocated to the end of one copy of chormosome 9, causing a translocation event that resulted in a gene fusion [3]. This discovery eventually led to the development of the targeted therapy Imatinib in the early two thousands, which is a landmark drug shown to be very effective in maintaining remission in CML patients. With the increasingly well established idea that cancer is caused by aberrations in one’s DNA, Nowell, who co-discovered the Philadelphia chromosome, went on to formalize the foundation of the theory of tumor evolution in 1976 [4]. Much like species evolution where heritable traits were passed down from ancestor to offsprings, somatic alterations that are heritable from parent to daughter cells 2 Figure 1.1: Illustration of the Nowell’s theory of tumor evolution. Clonal alterations occur in the cancer founder cell, which initiates the abnormal cell growth. As the clonal population expands, additional subclonal alterations also occur and are inherited by daugh- ter cells during cell replication. The accumulation of such somatic alterations over time results in a heterogeneous tumor. during cell replication accumulate over the course of tumor development, resulting in a heterogeneous cell population (Figure 1.1). Under the “infinite sites assumption” [5], that states that, as the length of the genome is large, the probability of the same nucleotide change at a particular locus in distinct samples happening independently is negligibly small, the observation of shared somatic mutations in distinct samples at different frequencies serve as a piece of evidence for the relative timing of when the mutation occurred along the course of the development of the tumor. 1.3 Advancements in sequencing technologies further understanding of tumor evo- lution While traditionally, tumor evolution has been conceptualized as the gradual accumulation of point mutations under selective pressures [4], more recently, distinct mutational processes that lead to the rapid accumulation of structural variants during catastrophic events — such as chromothripsis and 3 chromoplexy — have been increasingly recognized as critical drivers of tumor initiation and adapta- tion [6, 7]. Furthermore, tumor evolution can also be shaped by non-genetic, or even non-heritable, determinants—especially epigenetic alterations such as DNA methylation or histone acetylation [8]. The advancement in our understanding of tumor evolution largely owes to the advances in sequencing technologies, which have allowed us to study somatic alterations at the base pair level. Novel sequenc- ing technologies afford us with more, and new types of information about tumor samples, all the while with decreasing sequencing cost. Specifically, the three main pieces of work discussed in this thesis will respectively leverage tumor data generated with three novel sequencing technologies introduced within the past decade. First, single-cell bisulfite sequencing [9], introduced in 2014, generates CpG methylation information within individual cells in a bisulfite-treated tumor sample. This technology is further augmented by the scTrio-seq protocol in 2016 [10], that additionally profiled RNA-seq data from the same single cells. Next, Oxford Nanopore Technology long read sequencing, which was made commercially available in 2015, profiles ultra long reads at the kilo basepair to mega basepair scale while allowing simultaneous detection of genetic and epigenetic base modifications. Lastly, spatial transcriptomics, first reported in 2016 [11], provides gene expression read outs from spot with known spatial coordinates on a tumor slide. The discovery of the Philadelphia chromosome, discussed in the previous section, not only was an important event in the development of the theory of tumor evolution, the story also highlights the importance of constantly adopting and leveraging novel technologies, such as the staining method used to discover the translocation event. This thesis will discuss how we leverage new opportunities and address the challenges brought by the three novel sequencing technologies in our study of tumor evolution. 4 Figure 1.2: Illustration of tumor phylogeny reconstruction from multi-sampled bulk data. 1.4 Phylogeny inference using multi-sampled bulk data One way to study the evolutionary process that is tumor evolution is through phylogeny recon- struction. The problem of algorithmically inferring a phylogenetic tree that may be interpreted as a tumor progression history from multiple bulk samples per patient has been widely studied in the past decade. In such experiments, the tumor samples obtained from the patient are subjected to bulk se- quencing, and in each sample, for each unique point mutation, the number of sequencing reads with the reference allele, and that with the variant allele are obtained. Under some reasonable assumptions that the mutations are heterozygous from a diploid genome, the observed frequency of each mutation per sample can be computed. Then, the subclone phylogeny could be obtained by either - first explic- itly clustering the mutations by frequencies then determining their order along the phylogeny - or - incorporating some error model to find the most likely tree that resulted in the observed variant allele frequencies (Figure 1.2). Methods developed to address this problem include: TuMult [12], PhyloSub [13], PyClone [14], 5 SciClone [15], MEDICC [16,17], PhyloWGS [18], AncesTree [19], CITUP [20], Canopy [21], Treeomics [22] and others. Tumor phylogenies and the implied intra-tumor heterogeneity are occasionally used to make clinical decisions [23]. For understanding basic cancer biology, a key question is whether tumor progression is more branched (leading to a wider and shallower tree) or more linear (leading to a nar- rower and deeper tree) [24]. For metastases, results on the question of branched vs. linear evolution are mixed (e.g., [25–27]). Hong and colleagues showed that results are sensitive to the method used for phylogeny inference [28]. Complicating matters further is that most of the methods listed above use bulk sequencing data as input, and El-Kebir and colleagues showed that under reasonable and models, the optimal phylogeny from bulk sequencing data may not be unique [29]. Therefore, recent efforts have looked beyond bulk sequencing data to elucidate the progression history of a tumor. In a major step towards this goal, Quinn and colleagues introduced technologies for cell lineage tracing and applied those technologies to study single-cell data in xenograft tumor models. To our knowledge, lineage tracing is not yet possible in human cancer samples, but analysis of single-cell data of other types is possible [30]. 1.5 Phylogeny inference using single-cell data The recent rise of single-cell sequencing technology empowers more accurate tumor lineage inference by allowing the examination of intratumor heterogeneity at a cellular resolution [31–36] (Figure 1.3). However, since single-cell sequencing data are derived from a limited amount of genetic material, the signals obtained are more scarce than those from bulk sequencing [37]. Genomic events commonly used in bulk tumor lineage inference, such as SNVs and CNAs, are observed with less fre- quency or measured with less confidence in single cell datasets. Single-cell tumor lineage inferences 6 Figure 1.3: Illustration of tumor phylogeny reconstruction from single-cell data. using SNVs often suffer from false negatives in the datasets, which calls for data imputation or ag- gregation; on the other hand, tumor lineage inference using CNAs called from single-cell sequencing data often involves ambiguities in the determination of chromosomal breakpoints that may mislead the conclusion. 1.6 DNA methylation as phylogenetic marker More recently, there have been growing interests in examining heritable and non-genetic deter- minants in tumor evolution [38], as while genomic alterations such as SNVs and CNAs have been con- ventionally used to infer tumor phylogenies, other non-genomic alterations can also influence clonal phenotype and fitness. DNA methylation, the addition of a methyl group to cytosine, which results in the formation of 5mC especially in the context of CpG sites, is an epigenetic marker that has been extensively studied for its role in regulating gene expression and maintaining cellular memory [39]. As Gaiti and colleagues importantly noted, the methylation maintenance machinery has been shown to have an error rate four orders of magnitude higher than that for DNA replication [40, 41] and there- 7 fore may provide a greater amount of observable evidence for tumor evolution than alternative data modalities, which is a critical asset especially in the single-cell context [42]. Recent studies have leveraged CpG methylation in single cells sampled from CRC [43], CLL [42], GBM and IDH-mutant glioma [44] tumors to construct lineages, showing CpG methylation to be a valuable signal for lineage inference. That said, using CpG methylation to construct tumor lineage tree has its challenges. First, single cell methylation data has shallow read coverage. In each cell, most of the covered sites have only a couple of read support. The shallow read coverage also means it is very likely that the reads do not cover all alleles present. Furthermore, methylation is known to be a generally more transient signal than mutations, and in a recent work by Meir and colleagues [45], they show that, while the methylation statuses at some CpG sites are persistently inherited during cell replication, others seem to be independently resampled at each generation. For the goal of constructing tumor phylogenies using methylation as features, it is intuitive that only sites of the persistency model could harbor changes in methylation status that would be lineage-informative, while those of the mixture model add noise at best. As such, in Chapter 2 of this thesis, we aim to answer the following questions. Can we leverage single-cell methylation data to infer tumor evolutionary trajectories? Can we in the meantime identify CpG sites that are lineage-informative? And finally, can we uncover biological insights from these results? 8 1.7 Long-read sequencing technology Long read sequencing technologies, such as those from PacBio HiFi and ONT, have pushed the frontier of genomic analysis: with read lengths ranging from tens to hundreds of kilobases, with ultra- long reads reaching a maximum read length > 4 million bases, long read sequencing data has enabled more accurate characterization of genomes, especially of the highly repetitive regions and large SVs, which have been difficult to assess with short reads with lengths of the order of hundreds of bases. In- deed, long-read technology has empowered the near-complete assembly of the human genome by the T2T Consortium, elucidating centromeric satellite arrays, recent tandem duplications, and short arms of acrocentric chromosomes [46]. Similarly, it has further empowered accurate read-based haplotype phasing, which is the disambiguation of maternally- and paternally-inherited alleles using heterozy- gous SNVs. This is intuitive because long reads are likely to each span multiple polymorphic sites, leading to the construction of phase blocks of more substantial lengths, resolving allele-specific char- acteristics across a longer range of genomic coordinates. New tools have been developed specifically for phasing long reads [47–51]. The results discussed in Chapter 3 were based on data generated by ONT PromethION R10 flow cells. As a piece of genetic molecule passes through the protein nanopore, the change in electrical current caused by the sequence of nucleic acid bases is monitored and recorded. Neural network-based basecallers, such as guppy, and hidden Markov model-based caller, such as Nanopolish, have been developed to translate the recorded raw signals back into the genetic sequences that generated them. Furthermore, many basecallers have been trained to call base modifications like the methylation of the 5mC directly from the Nanopore raw signals. The advantage presented by the ability to directly call the methylation status of CpG sites from raw sequencing signals is two folds: first, it allows the 9 simultaneous assessment of many characteristics of the genetic material; secondly, the data will not have bias and artifact introduced to the genetic material by the additional round of bisulfite-treatment, which has so far been the gold-standard of methylation sequencing. The capability of ONT long-read sequencing to jointly profile different types of genomic and epigenomic variants motivates the work discussed in Chapter 3. Expanding upon what we estab- lished in Chapter 2, that epigenomic variants also play a part in tumor evolution, the main goal of this proof-of-concept study is to showcase how cohering different types of somatic variants along a tumor phylogeny may reveal a holistic model of tumor evolution that describe the interplay among them. 1.8 Sequencing-based spatial transcriptomics Spatial transcriptomics techonologies have enabled studying the spatial organizations of het- erogeneous cell populations within a tissue. In particular, the 10x Genomics Visium platform - an extension of a previous generation of barcoding-based spatial transcriptomics technology [11] - is one that offers transcriptome-scale information at the resolutions of 1 to 10 cells per spot. One of the most cited success stories of the use of the Visium technology was produced by Maynard and col- leagues [52]. In their work, Maynard and colleagues used the spatial information afforded by the Visium technology to study the six-layered human dorsolateral prefrontal cortex tissue and identified clinically relevant layer-enriched gene expression signatures, overcoming the difficulties in associating whole-cell expression with anatomical locations with single-cell, single-nucleus RNA-sequencing, or microscopy-based technologies. The spatial information adds another dimension to genomic analysis. During sequencing, the prepared tumor tissue sample is overlaid over the 6.5mm-by-6.5mm capture area, which is composed of a field of around 5,000 spots 55µm in diameter and 100µm apart. 10 Spatially barcoded oligonucleotides present on each spot then binds to the mRNA from the tissue, and the reverse transcribed barcoded cDNA are collected to be processed for downstream Illumina short-read sequencing. It is important to note that, the Visium technology uses the 3’ instead of whole transcript method; as a result, while gene expression can be reliably quantified, one cannot attempt to call expressed variants from Visium data due to the lack of full transcript information. In Chapter 4 of this thesis, we examine how the addition of spatial information further our understanding of intra-tumor heterogeneity. Specifically, recent works have established that spatial technologies can further elucidate cell-cell interactions, which previously can only be inferred from co-expression of gene markers [53–58], by confirming their physical contact evidenced by their spatial co-location [59–67]. We ask the question, with additional information from full-length scRNA-seq data, can we extend our current knowledge on tumor-immune interaction to the subclonal level? This work explores how heterogeneous subclones, characterized by various types of somatic alterations described in Chapter 3, may interact with the tumor microenvironment differently, and therefore may have differential response to immunity. 1.9 Outline of Thesis First, in Chapter 2, we establish the possibility to reconstruct tumor phylogenies using data from single-cell bisulfite-treated sequencing technology that profiles CpG methylation information on a single-cell level, highlighting the valuable phylogenetic signals in epigenetic data that have been un- derexplored. The newly developed method, Sgootr, jointly infers a tumor phylogeny and the amount of phylogenetic information in each CpG site from single-cell CpG methylation data. The effectiveness of the tool in inferring tumor phylogenies that suggest parsimonious migration histories was demon- 11 strated in simulated, U87MG glioma cell line, and multi-regionally sampled colorectal cancer data sets, respectively. The migration hisotry Sgootr inferred from the metastatic colorectal cancer patient, specifically, was more parsimonious than the one inferred from copy number changes in the original work. Next, in Chapter 3, we present a novel framework to study both the genomic and epigenomic changes during tumor evolution using Nanopore long-read sequencing data of single-cell derived tu- mor sublines. From the B2905 cell line derived from the M4 mouse melanoma model, which is known to represent RAS-mutant transitory TCGA human melanoma with varied responses to anti-CTLA- 4 treatment, 23 single cells were isolated and derived into sublines, which were then subjected to Nanopore long-read sequencing. SNVs, SVs, CNAs, as well as CpG methylation were profiled. A tumor phylogeny was constructed using a filtered set of mutations, revealing 4 major clades with dis- tinct phenotypes validated via orthogonal experiments. The evolutionary timing of the various types of somatic variants was resolved against the tree, allowing the characterization of key alterations in terms of their clonality, and of the major clades in terms of the key alterations that may have contributed to their phenotypes. Critically, the clonal resolution this data set uniquely affords allowed the inference of active mutational processes, confident identification of parallel CNAs, and detection of differentially methylated regions, all at a subclonal level. Besides the biological insights, this data set also serves as a valuable resource for the community to generate pseudobulk of various clonality for benchmarking computational methods developed for heterogeneous cancer data. In Chapter 4, we present a framework to examine tumor subclones at spatial resolution. Us- ing the B2905 mouse malanoma cell line discussed in the previous section, heterogenous mixtures of cells from the cell line culture are injected into mice that are divided into control and treatment group. Anti-CTLA-4 immune checkpoint blockade therapy was administered to the mice in the treated group, 12 and tumors are resected for sequencing after various amounts of time. For each treatment condition, some tumors were subject to scRNA-seq via the Smart-seq2 full length transcriptomics sequencing, and the rest were subject to 10x Genomics Visium spatial transcriptomics sequencing. Six tumor subclones were identified using mutations profiled from the Smart-seq2 data, and subclone-specific expression profiles were used in cell type deconvolution of the spatial data by SpaCET. We validated the estimation of subclonal proportions at each spatial spot by examining their association with the few clone-specific mutations detectable directly on the spatial slides. The associations between tumor subclones with various stromal and immune cell types were accessed with established spatial statis- tics. By contrasting subclone-immune activities between treatment responders and non-responders, we show that differences in tumor response to anti-CTLA-4 treatment may relate to subclone-specific activities in its microenvironment. Chapter 5 provides the conclusion to the thesis. 13 Chapter 2: Reconstructing tumor phylogenies with single-cell CpG methylation data using Sgootr 2.1 Motivation There are two main challenges in constructing single-cell CpG methylation-based tumor lineage trees. First, detailed examination reveals that data obtained via single-cell bisulfite sequencing (scBS- seq) [43], single-cell reduced representation bisulfite sequencing (scRRBS) [68, 69], or multiplexed scRRBS (MscRRBS) [42] protocols exhibit high level of sparsity. Cells with subpar bisulfite conver- sion rates have information at few CpG sites, and of the millions of unique CpG sites across all cells in a dataset, less than one out of a hundred are sequenced in a sufficient number of cells to be useful in lineage reconstruction. Furthermore, even when a CpG site is sequenced in a cell, it is oftentimes covered by less than a few sequencing reads, which is not likely to capture heterozygosity in aneuploid cancers. Besides sparsity, another key challenge is that not all CpG sites have their methylation statuses stably retained during tumor evolution. In particular, Meir and colleagues posited two models of methylation dynamics: mixture, where the methylation status of a CpG site is resampled (from the parental status) in each cell replication, and persistency, where that is an exact copy of the parent cell [45]. It is the CpG sites whose status follows the persistency model would be informative in tumor lineage reconstruction because the maintenance of information from the parental generation is the necessary condition for the infinite sites assumption [5,70] and Dollo parsimony [71], which form the basis for tumor lineage inference tools based on mutation profiles. However, as of now, there is no known set of CpG sites that follow either inheritance model, highlighting the necessity to perform 14 intentional CpG site selection while reconstructing tumor lineages by CpG methylation. To address these two challenges, in this work, we introduce Sgootr (Single-cell Genomic methylatiOn tumOr Tree Reconstruction, Figure 2.1), the first distance-based computational method to jointly select informative CpG sites and reconstruct tumor lineages from single-cell methylation data. 2.2 Methods Sgootr (Figure 2.1) consists of five components: (i) optional biclustering of cells and sites for reducing sparsity-induced noise; (ii) likelihood-based sequencing error correction accounting for copy number estimates; (iii) expected distance calculation between cell pairs for tree construction; (iv) pruning of CpG sites according to a tree-based methylation status persistence measure; and (v) inference of migration history from the lesion-labeled tree leveraging prior knowledge on migration patterns. Components (iii) and (iv) are iteratively applied. Code for this work is available on GitHub as a Snakemake [72] workflow. Results from this work are reproducible, and intermediate experiment outputs of Sgootr are available for further analysis. 15 Inp ut ce ll- by -s ite s in gl e- ce ll C pG m et hy lat io n re ad c ou nt m at rix Cells C pG s ite s m m m m N/ A ce ll- by -s ite c op y nu m be r es tim at es Cells C pG s ite s 5 2 1 Optional Sg oo tr (S ing le- ce ll G en om ic m et hy lat iO n tu m O r T re e Re co ns tru ct ion ) O ut pu t t* Li ne ag e tre e t* (v) T um or m ig ra tio n hi st or y in fe rre d fro m les io n- lab ele d lin ea ge tr ee (i) O pt io na l b icl us te rin g of c ell s an d Cp G s tie s fo r r ed uc in g sp ar sit y- in du ce d no ise M̂ Cells C pG s ite s m m m m m m m m m m m m m m m (ii) L ike lih oo d- ba se d se qu en cin g er ro r c or re ct io n ac co un tin g fo r co py n um be r (iii ) E xp ec te d di st an ce c alc ul at io n be tw ee n ce ll p air s fo r l in ea ge tr ee co ns tru ct io n Ev alu at io n of R ob in so n- Fo ul ds (R F) di st an ce b et w ee n tre es in co ns ec ut ive it er at io ns t i t i− 1 Af te r u se r- de fin ed nu m be r o f ite ra tio ns De te ct io n of it er at io n * (iv ) P ru ni ng o f C pG s ite s ac co rd in g to a tr ee -b as ed m et hy lat io n st at us pe rs ist en ce m ea su re Fi gu re 2. 1: O ve rv ie w of S g o o t r . S g o o t r le ve ra ge s si ng le -c el l m et hy la tio n se - qu en ci ng da ta fr om tu m or sa m pl es , in co rp or at in g co py nu m be r in fo rm at io n w he n av ai la bl e, to jo in tly in fe r a si ng le -c el lt um or lin ea ge tr ee an d id en tif y C pG si te s th at m ay ha rb or lin ea ge -i nf or m at iv e m et hy la tio n ch an ge s. 16 2.2.1 Biclustering of cells and sites for reducing sparsity-induced noise Sgootr requires a fairly accurate estimate of the differences between the methylation status of distinct CpG sites across all cell pairs. Prior to submitting the input dataset to Sgootr, it is advisable that one filters out cells for which the number of CpG sites with non-zero read coverage is low and CpG sites with non-zero coverage in only a few cells. For example, for the Bian metastatic CRC dataset [43] we will analyze further in depth in the Results section, we apply the following heuristic filters to the cells and sites: (i) among all input cells, remove low-quality ones with coverage in fewer than 4 million CpG sites; (ii) remove CpG sites covered in less than 2 3 of the remaining cells. For that particular dataset, we additionally removed CpG sites on sex chromosomes, which may confound the findings, and Chromosome 21 peri-centromeric regions (hg19, Chr21:9825000-9828000), which we discovered to be an overlooked alignment artifact. After filtering the input for reducing sparsity, we perform sequencing error correction as described in the following subsection. Since this may further introduce sparsity, in an additional post error correction step, we may remove sites covering < 2 3 of cells, and then remove cells with coverage in < 2 3 of the remaining CpG sites. Overall, our filtering approach aims to make sure that the cells remaining are well-covered, and that each cell pair has at least 1 3 of the selected CpG sites as shared dimensions along which cell-to-cell distance can be measured. Our main results on the Bian [43] dataset demonstrate the heuristic filtering scheme described above is effective in practice; however, in principle, the filtering of cells and CpG sites should ideally be coordinated in order to minimize information loss. For that purpose, we present an integer linear program (ILP)-based biclustering formulation as follows. Let the site coverage data be represented in a cell-by-site matrix Mv×u, where v is the number of cells and u is the number of CpG sites; mij = 1 if site j is covered (by at least one read) in cell i, and mij = 0 otherwise. Let α and β respectively be the 17 specified fraction of cells and sites to be kept. Given M,α, and β, we wish to compute a biclustering - namely, a selection of rows of columns - of M , M̂ , so that we maximize the number of CpG sites covered in the resulting ⌊αv⌋ × ⌊βu⌋ submatrix in which rows and columns respectively correspond to the selected cells and sites. To solve this biclustering problem, let C ∈ {0, 1}v, S ∈ {0, 1}u be (unknown) binary vectors respectively indicating whether a cell or site is kept. In addition, let A ∈ {0, 1}v×u denote (unknown) binary matrix such that aij = 1 if and only if ci = 1 and sj = 1 (i.e., if cell i and site j are both kept). The ILP formulation is: maximize: ∑v i=1 ∑u j=1 aijmij subject to: aij ∈ {0, 1} ci ∈ {0, 1} sj ∈ {0, 1} aij ≤ ci aij ≤ sj ci + sj − 1 ≤ aij∑v i=1 ci = ⌊αv⌋∑u j=1 sj = ⌊βu⌋ The output submatrix M̂ is obtained by taking cells {i|ci = 1} and sites {j|sj = 1} from M . We implement the above formulation as an optional step in our software using the Gurobi ILP solver, a licensed software which is free to use for academic purposes [73]. The performance of our implementation in terms of run time, memory, and accuracy (i.e. optimality gap) with respect to the input size and other parameters, as well as a strategy on setting α and β are explored in detail in 18 Supplemental Section A.1 (Supplemental Tables A.1,A.2, Supplemental Figures A.2,A.1,A.3). For a typical computational environment, the current implementation of the biclustering formulation may not be practical for the scale of the single-cell CpG methylation datasets analyzed in this study (see Supplemental Section A.1). However, since for given α and β values the ILP formulation produces a submatrix no sparser than that produced by the heuristic filters, it should be the preferred choice whenever it is feasible. In Supplemental Section A.1, we show that memory consumption is likely to be the bottleneck, and provide a way to estimate memory consumption of our biclustering implementation given the size of the input so that the users may determine whether performing biclustering may be appropriate for their own datasets of interest. It is intuitive that, the single-cell tumor lineage tree constructed from input data resulting from biclustering is as reliable, if not more reliable, than the tree constructed from heuristically filtered data, since the former will be constructed from at least as much information from the unfiltered data as the latter. Improving the run time and memory efficiency of our ILP implementation and devising alternative strategies to perform biclustering are among our future objectives. 2.2.2 Likelihood-based sequencing error correction accounting for copy number es- timates Errors in sequencing may provide false evidence for an underlying allele methylation status different from the truth. A common strategy employed in prior works to mitigate the effect of such sequencing errors in downstream tasks is to employ the 90% rule, which assigns a CpG site in a cell the methylation status with support from 90% of the reads available for the site in the cell, and discards the site if no status can be assigned [42–44]. This approach has two drawbacks. First, the binarization 19 of methylation status disallows heterozygosity. Second, while there are high levels of copy number gains in some cancer types, this approach does not distinguish between having reads sampled from two underlying alleles and having reads sampled from many. For a CpG site in a cell, we would like a copy number-aware way to determine, if we observe both reads with the site methylated and reads with it unmethylated, whether they are truly sampled from heterozygous alleles, or they are sampled from homozygous alleles and the methylation statuses on the disagreeing reads need to be corrected. To this end, we present a maximum log-likelihood-based approach to correct likely errors in methylation reads, incorporating site copy number estimates in each cell (Algorithm 1). For a CpG site in a cell, given n ≥ 0 reads with a methylated status, m ≥ 0 reads with an unmethylated status, its copy number c, and sequencing error probability 0 ≤ ϵ < 1 (which we set to .01 in our experiments to proxy the per-base sequencing error rate in Illumina sequencing instruments), we can enumerate the likelihood of having drawn the observed reads from all possible underlying allele statuses, letting 0 ≤ γ ≤ c be the number of alleles with the CpG site methylated. Note that in Algorithm 1, 1c and 0c respectively denote the events that all c alleles are methylated and unmethylated; 1γ0c−γ denotes the event where γ alleles are methylated and c− γ are not. Sequencing error correction takes place when we have n > 0 and m > 0, but the allele status with the highest log-likelihood is homozygous. In that case, we correct the reads with the minority methylation status to the majority one. When c = 1 the site will implicitly be identified as homozygous and the methylation status of any read that does not conform to the majority allele will be corrected. In the case where sequencing error is needed, yet n = m, which will likely happen for c = 1, the CpG site is discarded. In case copy number information is not available, one can assume an appropriate (uninformative) copy number for all sites in all cells (e.g. c = 2 for diploid). Sex chromosomes are always excluded from our analysis. We use the sequence error-corrected reads for pairwise distance 20 Algorithm 1 Sequencing error correction accounting for copy number n,m← number of methylated and unmethylated reads for a CpG site in a cell, respectively c← copy number for the CpG site in the cell ϵ← sequencing error probability function SEQUENCINGERRORCORRECTION(n,m, c, ϵ) if c = 0 then return None ▷ site is discarded in case of deletion event L[1c],L[0c]← ln[(1− ϵ)nϵm], ln[ϵn(1− ϵ)m] for 0 < γ < c do L[1γ0c−γ]← ln[γ c n(1− γ c )m] end for status← argmaxL if [status = 1c OR status = 0c] AND [n ̸= 0 AND m ̸= 0] then if n > m then n,m← n+m, 0 else if n < m then m,n← n+m, 0 else return None ▷ site is discarded if correction is impossible end if return n,m end function estimation between cells, as described in the following section. 2.2.3 Expected distance calculation between cell pairs for tree construction In this section, we present a formulation for computing the expected distance between two cells given their respective copy number and (sequencing error-corrected) reads. We consider such a for- mulation with the goal of correcting for the potential bias contributed by low coverage CpG sites. Intuitively, this distance measure serves as an estimate for the expected proportion of methylation statuses altered across all CpG sites between a cell pair. Given a cell, let the copy number at a CpG site be c, and we have 0 ≤ γ ≤ c alleles with the CpG site methylated, and c−γ alleles with the CpG site unmethylated. To allow modeling preferential allele sampling based on site methylation status as previously observed [74], we additionally introduce parameter p, probability of drawing a read from the allele with the CpG site methylated given a pair of 21 alleles heterozygous for the site. We set p to the uninformative .5 in our experiments (see Supplemental Section A.8 for when to choose alternative values). Then, we can compute the probability of drawing from an allele with the CpG site methylated by normalizing over all alleles (Equation 2.1): pc,γ = γp γp+ (c− γ)(1− p) (2.1) It follows that the probability of drawing from an allele with the CpG site unmethylated is 1− pc,γ . Consider the case where we observe n reads at a CpG site for a cell, and all n reads are methy- lated for the site. Let the copy number at the site for the cell be c, the probability of sampling all n methylated reads from c alleles with the CpG site methylated (i.e. γ = c) is: P(reads | 1c) = 1; here 1c is the event that in all c copies the CpG site is methylated, and reads is the event that all n sampled reads are methylated. The probability of drawing all n methylated reads from c alleles with the CpG site unmethylated (i.e. γ = 0) is: P(reads | 0c) = 0. To compute the probability of drawing n methy- lated reads from c alleles with mixed methylation status for the site, we need to sum the probabilities over all other possible values of γ, the number of alleles with the site methylated, assuming that all other possible values of γ are equally likely; and when there is a copy number loss (i.e. c = 1), the formulation assigns only non-zero probability to a homozygous combined allele status: P(reads | mixed) =  1 c−1 ∑c−1 γ=1 p n c,γ , c ≥ 2 0 , c = 1 Given user-defined prior probabilities P(1c), P(0c), and P(mixed), which we all set to the uninfor- mative .33 in our experiments (see Supplemental Section A.8 for when to choose alternative values), 22 let a = P(reads | 1c)P(1c)+P(reads | mixed)P(mixed)+P(reads | 0c)P(0c), then we apply Bayes’ The- orem to get the likelihood of any allele status given the observed reads (Equations 2.2): P(1c | reads) = P(reads |1c)P(1c) a = P(1c) P(1c)+P(mixed) c−1 ∑c−1 γ=1 p n c,γ P(mixed | reads) = P(reads | mixed)P(mixed) a = P(mixed) c−1 ∑c−1 γ=1 p n c,γ P(1c)+P(mixed) c−1 ∑c−1 γ=1 p n c,γ (2.2) P(0c | reads) = P(reads | 0c)P(0c) a = 0 In the case where the n reads all have the CpG site unmethylated, the three values can be com- puted similarly with P(reads | 1c) = 0 and P(reads | 0c) = 1. In the case we observe among the n reads both ones with the CpG site methylated and ones with the site unmethylated, we have P(mixed | reads)=1, and P(1c| reads) = P(0c| reads) = 0. Then, for a given CpG site s in Cell A and Cell B, the respective copy numbers cA,s, cB,s, and respective reads, we can compute the expected distance over the possible combinations of allele status between the two cells at s (Equation 2.3): dist(readsA,s, cA,s, readsB,s, cB,s) = ∑ statusA,s ∈{1cA , mixed, 0cA} ∑ statusB,s ∈{1cB , mixed, 0cB} P(statusA,s|readsA,s)P(statusB,s|readsB,s)dist(statusA,s, statusB,s) (2.3) where dist(11,11)=dist(10,10)=dist(00,00)=0, dist(11,10)=dist(10,11)=dist(00,10)=dist(10,00)=0.5, and dist(11,00)=dist(00,11)=1. The total expected distance between cells A and B can now be computed with some distance function over the vector of expected distances over all shared sites. The L1 norm normalized by the 23 number of shared sites is computed via Equation 2.4. The use of shared CpG sites between pairs of single cells to estimate their distances was established by pdclust: dist(A,B) = ∑ s∈sitesA∩sitesB dist(readsA,s, cA,s, readsB,s, cB,s) |sitesA ∩ sitesB| (2.4) A comparison of our distance formulation against a baseline distance measure commonly used in prior studies! [42,44] on simulated data can be found in Supplemental Section A.2.1 (Supplemental Figures A.4,A.5,A.6). After computing the distance between each pair of cells, one can incorporate any distance-based tree reconstruction method into Sgootr to obtain the lineage tree and arbitrarily choose a single cell from the normal tissue as the root. We use the scikit-bio implementation of the NJ algorithm [75] for our main analysis. It is possible to replace NJ with an alternative distance- based tree construction method in Sgootr. We compare Sgootr’s performance while using FastME 2.0 [76], a popular alternative, against that using NJ in Supplemental Section A.5 and show that they lead to similar results (Supplemental Figures A.15,A.16, A.17). 2.2.4 Pruning of CpG sites according to a tree-based methylation status persistence measure The main body of Sgootr consists of an iterative procedure: at each iteration, it (i) computes the pairwise distances among single cells to form the tumor lineage tree using a distance-based algo- rithm, then (ii) measures the methylation persistence score of each CpG site along the tumor lineage tree and prunes away a fraction of CpG sites that have the least scores before continuing onto the next iteration, and (iii) outputs the tumor lineage tree of the particular iteration where distance from the tumor lineage trees obtained in consecutive iterations are minimum possible. 24 We have described how to compute pairwise distance among single cells in the previous section, and we leverage the widely used RF distance [77, 78] to measure the differences among the tumor lineage trees constructed on the same set of cells in consecutive iterations. In this section, we focus on how we measure the methylation persistence of CpG sites at each iteration given the tumor lineage tree obtained for these CpG sites. In particular, we define the methylation persistence score for a CpG site given a tumor lineage tree (Equation 2.6): the higher the methylation persistence score for a CpG site, the more stably maintained its methylation status change along a tumor lineage tree. To facilitate the analysis, we assign to each CpG site covered in each cell its most likely methy- lation status: given sequencing error-corrected reads from component (ii) of Sgootr, we call a site homozygously methylated if all of its reads have the site methylated, homozygously unmethylated if all unmethylated, and heterozygous if the read status for the site is mixed. The status remains unknown if there are no reads covering the CpG site. After this step, we measure the persistence of each CpG site independently: first at each branch of the tree, then for the overall tree. 2.2.4.1 Methylation persistence score for a CpG site at a particular branch in the lineage tree Each branch in the methylation tumor lineage tree induces a bipartition of the tree and subse- quently of the leaf nodes, which represent single cells. In other words, let TN denote the full set of leaf nodes in a methylation tumor lineage tree t, cutting a branch b in the tree creates disjoint subsets of leaf nodes TNb and TNb. Let Qs,tn denote the probability distribution across three possible methylation statuses - homozygously methylated, heterozygously methylated, and homozygously unmethylated - at CpG site s across a subset of cells tn, and let Is represent the set of cells with status information at site 25 s. Then, we defineMPs,b, the methylation persistence score of CpG site s at branch b (Equation 2.5): MPs,b = √ JSD ( Qs,TNb∩Is , Qs,TNb∩Is ) . (2.5) Here JSD(X,Y ) denotes the Jensen-Shannon divergence [79, 80] for a pair of distributions X, Y , and is defined as JSD(X, Y ) = 1 2 D(X∥Z) + 1 2 D(Y ∥Z) with Z = 1 2 (X + Y ) and D(K∥L) denoting the Kullback-Leibler divergence measure for any arbitrary distributions K and L [81]. It is worth noting that the square root of the Jensen-Shannon divergence measure, which computesMPs,b (Equation 2.5), is metric [82, 83] and is commonly referred to as Jensen-Shannon distance in popular implementations. The intuition is that, if the change in methylation status of CpG site s has an observed persistent effect in tumor progression - namely, s is lineage-informative - there exists a branch b∗ in the methyla- tion tumor lineage tree such that the leaf nodes in the two subtrees induced by b∗ show very different distributions of methylation statuses. In other words, MPs,b∗ will be large for such b∗. In contrast, suppose the bipartitions contain the same distributions for methylation statuses for s - either that most cells share the same status, or that both bipartitions are similarly heterogeneous - the score will be low. 2.2.4.2 Overall methylation persistence score for a CpG site in the lineage tree We define the overall methylation persistence score of CpG site s in methylation tumor lineage tree t as the maximum methylation persistence score the site s has across all valid bipartitions (Equa- tion 2.6). We recognize that (i) an extreme difference in the number of cells between the bipartitions, or (ii) a severe lack of cells with status information could both lead to meaningless divergence mea- surements. Therefore, we only consider bipartitions induced by branch b such that (i) both partitions 26 contain no fewer than a user-defined fraction δ of total number of leaves, and (ii) there is read infor- mation in no fewer than a user-defined fraction ω of cells in both partitions. In our experiments, we set δ = .05 and ω = .5. MPs,T = max b∈{b||TNb|≥δ|TN |∧|TNb|≥δ|TN | ∧|TNb∩Is|≥ω|TNb|∧|TNb∩Is|≥ω|TNb|} MPs,b (2.6) 2.2.4.3 Iterative joint tumor lineage tree reconstruction and lineage-informative site identification In a given iteration i, Sgootr first computes the tumor lineage tree ti with the distances between pairs of cells based on the persistent sites identified in iteration i− 1 (for iteration i = 0 the entire set of sites remaining after component (i) and (ii) of Sgootr are used). Then, for each CpG site s used in computing ti, Sgootr calculates its overall methylation persistence scoreMPs,ti . Among the CpG sites with overall methylation persistence scores, Sgootr prunes away a user-defined fraction κ of the CpG sites with the lowest scores, along with those with tying scores at the threshold. It outputs the remaining CpG sites to be used in iteration i + 1. The process continues for a user-defined maximum number of iterations, and Sgootr outputs t∗ = ti where i is the last iteration with RF (ti, ti−1) equals the global minimum across iterations. Intuitively, we would like to detect an approximate point in the iterative procedure where most non-lineage-informative CpG sites have been pruned out (leading to the initial roughly decreasing trend of RF distance) but most lineage-informative CpG sites still remain (whose further elimination will lead to increasingly inaccurate distance measurements between cell pairs and therefore increas- ingly unstable tree topologies, which is reflected in a once-again increasing trend of RF distance). We 27 demonstrate empirical observations corroborate with such intuition (Figure 2.2C., Supplemental Fig- ures A.11,A.17) and provide practical recommendations for choosing κ and the maximum number of iterations in Supplemental Section A.9. 2.2.5 Inference of migration history from lesion-labeled tumor lineage tree leverag- ing prior knowledge on migration patterns Given a rooted tumor lineage tree t∗ produced by Sgootr through the procedures described above, provided that the single cells in the tree are labeled by their lesion of origin, we can infer the underlying tumor migration history, which we represent as a directed multigraph (without self loops) where each vertex represents a distinct lesion and each edge represents a distinct migration event from the source lesion to the target lesion. This intuitive representation of metastatic migration events was introduced as a “migration graph” by El-Kebir and colleagues [29]. We accomplish this by first adapting the well-known Fitch-Hartigan algorithm [84,85] with slight modification to obtain a unique, maximally parsimonious labeling of the internal nodes of t∗, then identifying the migration events. The Fitch-Hartigan algorithm was originally intended to solve the Small Parsimony Problem, namely, labeling the internal nodes of a character-based evolutionary tree in which every leaf is labeled with a single character and the objective is to minimize the total number of changes (i.e., mutations from parent node to child node) [84, 85]. In our reuse, the bottom-up phase of the Fitch-Hatigan algorithm computes in a tree the number of migration events - in other words, the change of lesion of origin labels from a parent node to a child - in the most parsimonious labeling of its internal nodes. This phase also produces a set of possible labels for each internal node in the tree. The top-down phase, then, assigns each internal node a label from its set, as described in Algorithm 2, starting with 28 Algorithm 2 Partial order-guided modification to the Fitch-Hartigan top-down node label assignment label set(node) ← maps a node to the set of labels it is given by the bottom-up phase of Fitch- Hartigan algorithm fitch,Hartigan1973 parent(node), children(node)← returns the parent and the two children of node, respectively poset minimal(labels) ← returns the minimal element in subset labels of a poset L, breaking ties arbitrarily assignment[node]← records the final label assignments of all nodes node← a node in the tree function LABELASSIGNMENT(node) candidate labels← label set(node) if node is not root AND assignment[parent(node)] is in candidate labels then assignment[node]← assignment[parent(node)] else assignment[node]← poset minimal(candidate labels) end if if node is not a leaf then child1, child2← children(node) LABELASSIGNMENT(child1) LABELASSIGNMENT(child2) end if end function root node of t∗. A general formulation of Fitch-Hartigan algorithm may produce multiple optimal labelings of the internal nodes, and there are existing tools to capture the space of all optimal labelings: for exam- ple, MACHINA by El-Kebir and colleagues enumerates optimal solutions allowing restrictions on the types of transition events [29], and FitchCount by [30] generates a stochastic matrix summarizing all transition events in the optimal solution space. However, some of these optimal labelings may adhere to our prior understanding of the sequential-progression (a.k.a. metastatic-cascade) model of tumor evolution [86] better than others. One can encode such prior understanding as a partial order over the lesion of origin labels of single cells. For example, in this work, we leverage the following partial order: 29 Normal Tissue ≺ Primary Tumor ≺ Lymph Nodes ≺ Distant Metastases. In our modification to the Fitch-Hartigan algorithm, during the top-down phase, when a node cannot inherit the label of its parent, we assign it the minimal element in its candidate label set accord- ing to the partial order, breaking ties arbitrarily when the elements in the label set cannot be compared (Algorithm 2). This returns a maximally parsimonious labeling of the internal nodes in the tree that also acknowledges the sequential-progression model. With a fully labeled tree, we produce a directed multigraph that corresponds to the tumor mi- gration history as follows. First, we generate a vertex for each distinct lesion represented in the tree. Then, as we traverse down the tree from the root, for each label change from a parent to a child in the tree, we add one directed edge to the multigraph, from the lesion vertex corresponding to the parent label to that corresponding to the child label. Parallel edges between a pair of vertices correspond to a polyclonal seeding event, and any back edge indicates a reseeding event. Note that our proposed method resolves equally optimal labelings of the internal nodes with prior knowledge on the likely migration patterns among sampled sites. We have that prior knowledge for the primary dataset we analyze in this work, and the method provides a basis on which we compare single- cell lineage trees with lesion labeled leaves constructed with Sgootr or alternative methods. In the case where no prior knowledge is available and the goal is to survey the whole space of parsimonious migration histories possibly suggested by a lesion-labeled lineage tree, users may instead use tools like MACHINA [29] or FitchCount [30]. It may also be intuitive to more directly encode such prior knowledge by assigning different weights to migration events between distinct lesions, and solve the weighted version of the Small Parsimony Problem with Sankoff’s algorithm [87]; however, the output of Sankoff’s algorithm is highly sensitive to the input weights, and it is not immediately clear which 30 set of weights would be most appropriate for our purpose. 2.3 Results The performances of Sgootr’s distinct components on simulated data are described in the Sup- plemental Sections A.1 (biclustering), A.2.1 (distance calculation), and A.2.2 (iterative pruning; Sup- plemental Figure A.7). Each of these components improves upon the respective baseline approach on a wide range of settings, and the results of the iterative pruning component demonstrate that Sgootr can capture the migration history of a tumor accurately. Furthermore, we applied Sgootr to scRRBS data of single cells from an U87MG glioblastoma cell line ex vivo clone tree experimentally generated by [88]. We show in Supplemental Section A.3 that, even though the original study fails to construct an accurate lineage tree from microsatellite allelotype data (see Supplemental Figure S10 of the RETrace study), Sgootr is able to reconstruct a CpG methylation-based lineage tree that accurately captures the ground truth clonal relationships (Supplemental Figure A.9). In the remainder of the chapter we focus on the application of Sgootr to the scBS-seq dataset generated by Bian and colleagues [43] on 9 metastatic CRC patients1, among whom 2 (CRC01 and CRC04) have matching scRNA-seq data (via the scTrio-seq2 protocol) from both normal and tumor cells with which copy number calling can be performed. We highlight results from patient CRC01 since in addition to having copy number calls available, they have the largest number of distinct tissue types, sampling locations, and treatment conditions. Full details and results for the other 8 patients can be found in Supplemental Section A.4 (Supplemental Table A.3, Supplemental Figure A.10). We additionally present in Supplemental Section A.7 Sgootr’s results on a GBM patient by Chaligne and 1Out of 12 total patients in the Bian cohort, we excluded those without scBS-seq data (CRC03 and CRC06), and those without metastasis information (CRC09). 31 colleagues [44], which demonstrate the applicability of our approach to MscRRBS data (Supplemental Figure A.19). To the best of our knowledge, this is the only other single-cell bisulfite sequencing dataset with multi-regional tumor sampling that is publicly available. 2.3.1 Sgootr obtains a simple tumor migration history with scBS-seq and scRNA- seq data from patient CRC01. The single cells of CRC01 are sampled from 4 distinct lesions - PT, LN, ML, and MP - as well as NC adjacent to the primary lesion; furthermore, there are multiple sampling locations within each lesion (Figure 2.2A.). We only include cells with both scBS-seq and scRNA-seq data, and employ inferCNV2 to call copy number from scRNA-seq data. We apply Sgootr with κ = .1, terminating the iterative procedure after 40 rounds. A unique global minimum among the RF distances is identi- fied at iteration 8 (Figure 2.2C.), hence t∗ = t8 (Figure 2.2B.) is output as the lineage tree. The tumor migration history inferred by Sgootr (Figure 2.2D.) is simpler than the CNA-based results previ- ously reported (see Supplemental Figure S7 of the work by Bian and colleagues): NC grows into PT, followed by a polyclonal migration to LN, which appears to proceed to seed both ML and MP. While the migration history also suggests a polyclonal reseeding from ML to LN, both edges in the graph have low support: they are both due to a subtree of a particular cellular origin harboring a (potentially mislabeled) singular cell of a different origin. For comparison, we also present in Supplemental Section A.6 the tumor lineage tree we con- structed on mutations called from the matching scRNA-seq data from patient CRC01 using the Trisicell toolkit [91]. Due to the high level of sparsity in the mutation calls, we clustered single cells prior to constructing the mutation tree. We observed a high level of heterogeneity in terms of lesion of origin 2from the Trinity CTAT Project https://github.com/broadinstitute/inferCNV 32 https://github.com/broadinstitute/inferCNV A. ML1 ML2 ML3 ML4 Treatment-naive Liver Metastasis (ML) Post-treatment Liver Metastasis (MP) Lymph Node Metastasis (LN) PT1 MP1 MP2 MP3 MP4 MP5 LN1 LN2 LN3 1 2 3 4 1 2 3 5 4 1 2 3 B. Normal Colon (NC) PT2 PT3 PT4 Primary Tumor (PT) 1 2 3 4 C. D. E. F. Figure 2.2: Application of Sgootr to patient CRC01 scTrio-seq data by Bian and col- leagues. A. The multi-regional patient data consists of single cells sampled from 4 distinct lesions (in addition to normal tissue) and multiple sampling locations within each lesion. B. Tumor lineage tree constructed by Sgootr with κ = .1. Each single cell is represented as a leaf, colored by its sampling location. C. RF distances between the trees of consecu- tive iterations of the pruning procedure, with the global minimum occurring at t∗ = t8. D. Tumor migration history inferred by Sgootr for patient CRC01. E. Fraction of CpG sites located in CGI (regions with over-representation of CpG sites) [89], CpG shore (2kb-long regions flanking both sides of CGIs) [90], CpG shelf (2kb-long regions adjacent to CpG shores on the side away from CGIs) [90], and inter-CGI regions in the CRC01 trees at each iteration of the pruning procedure of Sgootr, from t0 to t∗. F. Top 10 GO terms with significant (< .05) q-values in enrichment analysis of non-pseudo, protein-coding genes spanning lineage-informative sites in intra-CGI and inter-CGI regions respectively in CRC01. See Supplemental Figure A.12A. for a full list of enriched terms. 33 and sampling locations in the cell clusters (Supplemental Figure A.18), which makes inferring sim- ple tumor migration history from trees constructed on them unlikely. Together with the CNA-based results(see Supplemental Figure S7 of the work by Bian and colleagues), this mutation-based result further highlights the usefulness of CpG methylation in reconstructing single cell lineage trees when sparse SNV and CNA data fail to provide sufficient signal for lineage reconstruction. 2.3.2 Sgootr infers migration histories simpler than alternative methods for the Bian CRC cohort. We benchmark Sgootr against (i) a naı̈ve baseline distance-based tree construction method (column “baseline” in Figure 2.3A.), (ii) IQ-TREE [92] with the two-state general time reversible model (GTR2), an instance of the popular maximum-likelihood-based tool used for inferring lineage trees from single-cell bisulfite sequencing data in prior studies [42, 44] (column “IQ-TREE” in Fig- ure 2.3A.), and (iii) IQ-TREE (with GTR2model) preceded by FG, a lineage-informative site-selection method previously described by Gaiti and colleagues [42], which returns a subset of input CpG sites with lower than expected epimutation rate (column “FG+IQ-TREE” in Figure 2.3A.). Note that it is IQ-TREE preceded by FG that was employed exactly in previous studies [42, 44]. However, as discussed in more detail below, the FG step is not only very slow, it also leads to highly complex mi- gration histories. These observations lead us to additionally benchmark against IQ-TREE without the FG step, even though it had not been directly applied to single-cell bisulfite sequencing data in prior studies. See Supplemental Section A.10 for further experiment details. 34 B . C . A . Fi gu re 2. 3: O ve rv ie w of re su lts fr om th e B ia n m et as ta tic C R C co ho rt . A .N um be r of m ig ra tio n ev en ts an d B A M D PO va lu es of m ig ra tio n hi st or ie s ob ta in ed vi a (i )n aı̈ ve ba se lin e m et ho d, (i i) IQ -T R E E ,( iii )I Q -T R E E on FG -s el ec te d C pG si te s, (iv ) S g o o t r in te rm ed ia te tr ee pr io r to ite ra tiv e pr oc ed ur e (t 0 ), an d (v ) fin al tr ee ou tp ut by S g o o t r (t ∗) . IQ -T R E E re su lt is re pr es en te d by th e m ea n ac ro ss 10 ru ns w ith di ff er en t ra nd om se ed s, w ith th e er ro r ba rs de no tin g th e 95 % co nfi de nc e in te rv al of th e m ea n. T he la ck of da ta po in ts in “I Q -T R E E ” co lu m ns m ea ns IQ -T R E E fa il to fin is h m od el op tim iz at io n to st ar tt re e se ar ch af te r2 4 ho ur s, an d th at in “F G +I Q -T R E E ” co lu m ns m ea ns FG an al ys is fa il to te rm in at e af te r1 00 ho ur s. B . R un tim e co m pa ri so n be tw ee n S g o o t r an d IQ -T R E E .E ac h ex pe ri m en ti s pe rf or m ed on 4 co re s. T he nu m be r in br ac ke ts ne xt to “I Q -T R E E ” on th e x- ax is re pr es en ts th e nu m be r of IQ -T R E E ru ns (o ut of 10 ) th at ha ve co nv er ge d w ith in 24 ho ur s. T he nu m be ra bo ve ea ch IQ -T R E E ba rr ep re se nt s th e to ta ln um be ro fr un s (o ut of 10 )w ith an y tr ee se ar ch ou tp ut (c on ve rg ed or in te rm ed ia te af te r2 4 ho ur s) ,a nd he nc e co nt ri bu te s to th e ru n tim e an d m ig ra tio n hi st or y co m pa ri so n. C .F ra ct io n of C pG si te s lo ca te d in C pG is la nd ,C pG sh or e, C pG sh el f, an d in te r- C G I re gi on s, be fo re (t 0 ) an d af te r (t ∗) th e ite ra tiv e pr un in g co m po ne nt of S g o o t r . 35 For each benchmarking experiment, we take as input the cell-by-site read count matrices result- ing from first-pass heuristic filtering step of Sgootr, namely, removing low-quality cells, CpG sites with coverage in less than 2 3 of remaining cells, sites on sex chromosomes, and sites within Chromo- some 21 peri-centromeric regions. Then, for each CpG site covered in each cell, call whether it is methylated or not by the 90% rule [42–44]. This binarization step is necessary in generating input for the GTR2 model. For the FG+IQ-TREE experiments, we further perform FG (Supplemental Sec- tion A.10, Supplemental Figures A.20, A.21) to select CpG sites to input to IQ-TREE. Additionally, we further filter the input to the benchmarking experiments so that all experiments (Figure 2.3A.) on a particular patient act on the same set of input cells. This calibration is to ensure fair comparisons among results from different methods using the two performance measures we will introduce below. To facilitate fair runtime comparison between IQ-TREE and Sgootr, we also filter to ensure the input to IQ-TREE for each patient is the same set as those to Sgootr, so that the runtime measurements in Figure 2.3B. are taken on the same set of cells and CpG sites for each patient. For each patient in the Bian cohort, we generate a baseline lineage tree by first computing the Hamming distance between every pair of cells over their binarized CpG site methylation status vectors, averaging across the sites covered in both cells, then applying NJ to the distance matrix. As IQ- TREE is a stochastic method, we perform 10 separate runs of each IQ-TREE experiment with different random seeds, and for each instance report the intermediate tree at convergence or after 24 hours if convergence has yet to be achieved by then. If none of the 10 runs of IQ-TREE for a particular experiment have finished model optimization, started tree search, and produced some intermediate result, we do not report IQ-TREE result for that experiment. Similarly, if FG does not terminate in 100 hours, we do not report FG+IQ-TREE result for that experiment. We root the baseline and IQ- TREE trees with the same normal cells we used to root the trees by Sgootr, and we infer the tumor 36 migration histories from them via Sgootr’s modified Fitch-Hartigan algorithm. We compare the migration histories inferred via Sgootr and the three alternative approaches by two measures: (i) the total number of migration events, namely the number of edges in our multigraph representation, and (ii) one we name BAMDPO, which is the Hamming distance between the binary adjacency matrix of the simple directed acyclic graph induced by the lesion partial order and that obtained by collapsing all parallel edges in the migration history multigraph. Intuitively, the first measure (output from Fitch-Hartigan algorithm) captures the conventional definition of parsimony, and the latter measures the degree the inferred history deviates from the sequential-progression model of tumor evolution. Together, they offer complementary perspectives on the complexity of the inferred histories. For both measures, a lower value would correspond to a simpler history. Figure 2.3A. compares the tree t∗ obtained by Sgootr against the alternatives with respect to the two measures. The figure also includes results of t0 for each patient, the tree obtained by Sgootr prior to its iterative site pruning procedure. Comparisons of the measures for t0 against those for t∗ show Sgootr’s iterative procedure almost always reduces the complexity of the inferred migration history, provided such a reduction is at all possible. In the case of patient CRC15, even though the total number of migration events increases from t0 to t∗, BAMDPO decreases; this suggests what first appeared to be deviations from the simple sequential-progression model could potentially be resolved as polyclonal seeding events more in concordance with the model, once the tree structure becomes more refined through the iterative procedure. Compared to naı̈ve baseline and the two IQ-TREE-based methods, Sgootr typically infers as simple, if not simpler, migration histories; the only exception is patient CRC10, where IQ-TREE (without FG site-selection) infers a migration history with the lowest BAMDPO. Note that all patients except CRC01 and CRC15 have single cells sampled from only 2 lesions 37 in addition to the normal tissue (Supplemental Table A.3); among them, patients CRC02, CRC04, CRC12, and CRC14 have low number of input cells. For that reason, one may expect a lower number of migration events detected from the constructed lineage trees. Furthermore, Sgootr has achieved the minimal possible values in terms of BAMDPO prior to the iterative procedure in CRC02, CRC04, CRC11, CRC12; for CRC02, CRC04, and CRC12, Sgootr has also achieved the minimal possible value for the number of events prior to the iterative procedure. In those cases, there are no changes in terms of those two measures between the initialized tree (t0) and the optimal tree (t∗). 2.3.3 Sgootr-identified lineage-informative CpG sites construct lineages suggest- ing simpler migration histories than those identified via FG. Specifically, comparisons between the measures from running IQ-TREE on FG-selected sites and those from Sgootr show that Sgootr consistently outperforms the former in terms of the num- ber of events and BAMDPO where further improvements were possible. This suggests the sites se- lected by Sgootr may be more meaningful for the purpose of lineage reconstruction than those by FG. FG is also time-consuming: for patient CRC11 and CRC12, FG failed to complete after 100 hours, and no experimental result was reported (Figure 2.3A.). 2.3.4 Sgootr is orders-of-magnitude faster than IQ-TREE. While Sgootr produces as simple, if not simpler migration history than IQ-TREE (without FG site-selection) in terms of the two measures in all but one case, Sgootr obtains the results in a fraction of time IQ-TREE took given the same input dimensions. Each IQ-TREE run was allotted 24 hours. If the program converged within 24 hours, we report the wall clock time elapsed; if the program 38 did not converge within 24 hours but produced some intermediate output from tree search, we report the program run time as 24 hours; and if IQ-TREE failed to produce any intermediate result from tree search within 24 hours, we do not include the elapsed time from that run. Within the allotted time, IQ-TREE produces converged results in 6 out of 9 patients, some (intermediate) results for 8 out of 9 patients. IQ-TREE is not able to finish model optimization and start tree search within 24 hours for any of the 10 runs for CRC11 (Figure 2.3B.). Sgootr is deterministic and obtains lineage trees suggesting simpler histories in a mere fraction of the time when given the same number of cores and memory as IQ-TREE, positioning itself as the better option in more time-frugal settings. 2.3.5 Sgootr-identified lineage-informative CpG sites are enriched in inter-CGI regions. In all 9 CRC patients in the Bian cohort, as well as the multi-regional GBM patient MGH105 in the [44] cohort, the set of lineage-informative sites identified by Sgootr show enrichment in inter- CGI regions (Figures 2.2E., Supplemental Figures 2.3C., A.19F.). The median distances between CpG sites and their closest CGIs also increase as the iteration