ABSTRACT

Title of Dissertation: STUDYING TUMOR EVOLUTION
WITH EMERGING SEQUENCING
TECHNOLOGIES

Yuelin Liu
Doctor of Philosophy, 2025

Dissertation Directed by: Professor Mihai Pop
Department of Computer Science

Doctor S. Cenk Sahinalp
National Institutes of Health

The path along which heterogeneous subpopulations within a tumor arise can be modeled as

an evolutionary process, and a better understanding of such process may improve cancer diagnostics,

prognostics, and treatment outcome. How to construct more accurate tumor phylogenies and how to

leverage the inferred phylogenetic relations among tumor subclones in understanding the disease are

both key in the study of tumor evolution. The advances in novel sequencing technologies present op-

portunities to examine crucial aspects of the evolution of a tumor, beyond the types of information

short-read genomic and transcriptomic sequencing technologies offer. As such, this work leverages

three such technologies - single-cell bisulfite sequencing, long-read sequencing, and spatial transcrip-

tomics - in answering key questions in tumor evolution.

In the first part of the thesis we present the first distance-based method to jointly construct a


tumor phylogeny and identify lineage-informative CpG sites from single-cell methylation sequencing

data. The tool provides a fast and accurate way to examine tumor evolution through the methylome at a

single-cell resolution and allows the incorporation of matched RNA sequencing information whenever

applicable.

Next, we present a novel framework to study both the genomic and epigenomic changes during

tumor evolution using Nanopore long-read sequencing data of single-cell derived tumor sublines. Sin-

gle nucleotide variants, structural variants, copy number alterations, as well as CpG methylation were

profiled. The clonal resolution this data set uniquely affords allowed the inference of active mutational

processes, confident identification of parallel CNAs, and detection of differentially methylated regions,

all at a subclonal level.

Finally, we present a framework to examine tumor subclones at spatial resolution. Melanoma

tumors from mice in both control and anti-CTLA-4 treatment group were subjected to both Smart-seq2

single-cell full length transcriptomics sequencing and 10x Visium spatial transcriptomics sequencing.

Six tumor subclones were identified using mutations profiled from the Smart-seq2 data, and their

proportion were estimated from the spatial transcriptomics slides via cell type deconvolution. The

associations between tumor subclones with various stromal and immune cell types were accessed with

established spatial statistics. By contrasting subclone-immune activities between responding and non-

responding tumors, we show that there exist subclonal differences in their response to the immune

microenvironment.


STUDYING TUMOR EVOLUTION WITH EMERGING SEQUENCING
TECHNOLOGIES

by

Yuelin Liu

Dissertation submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment

of the requirements for the degree of
Doctor of Philosophy

2025

Advisory Committee:
Professor Mihai Pop, Chair/Advisor
Dr. S. Cenk Sahinalp, Co-Advisor
Professor Rob Patro
Professor David M. Mount
Professor Stephen M. Mount


© Copyright by
Yuelin Liu

2025


Acknowledgments

The PhD journey has been full of challenges in ways I could not even have previously under-

stood, and there’s a long list of people I would like to acknowledge who have helped me get to where

I am today.

First, I want to acknowledge my collaborators and members of the Cancer Data Science Labo-

ratory at the National Cancer Institute. In particular, I would like to thank members of the Sahinalp

lab, with whom I have had countless productive discussions even when we are not working on the

same project. Especially Chih Hao, John, and Jake, who have consistently driven me to and from the

Bethesda Campus of the National Institutes of Health. I for sure owe them dinner, and probably a lot

of gas money.

I would like to thank past and present students and faculty of the Center for Bioinformatics and

Computational Biology at the University of Maryland, who made me immediately at home when I first

set foot in IRB-3112. Specifically, I would like to thank Zinat and Noor, who were in the same year

as me and have supported each other through losing our first advisors, and Hirak, who imparted with

me so much wisdom in my first year when he left for a postdoc, and is doing that once again now as

he settles in a faculty position at Vanderbilt.

I would like to thank the past and present members of CS GradCo at the University of Maryland,

and all the friends I have met while organizing for it. It is amazing to be among such community-

minded people. From GradCo Day, to fall picnic, and to the very brief time we tried to start a jazz

band, I cannot believe what we were able to pull off.

I would like to thank my past and present housemates at 48th Avenue who have witnessed

ourselves at our best as much as at our worst, all in between Costco runs and Biryani City orders.

ii


I would like to thank the DC swing dance community for bringing me so much joy through our

collective love for the art form.

I would like to thank the very dear friends in my life for inspiring me to be the best version of

myself, Yang, Tori, Yujia, Daniel, Ann, and especially Jwi, Shramay, and Gibson for making sure I did

not forget to eat in the weeks leading up to my defense.

There are three mentors who have played instrumental roles in this process. I would like to

thank Dr. Mihai Pop, who has been tremendously supportive during Max’s departure from UMD and

over the years. He has been a kind mentor and have set an example for all of us students in CBCB.

I would like to thank Dr. Lenore Cowen, my undergrad advisor, who changed my life in the literal

sense by introducing me to computational biology and being the first to say “I think you can be good

at this”. And, of course, I would like to thank Dr. S. Cenk Sahinalp, who supervised all the work in

this thesis, who has not only taught me a lot about science, but also a whole lot about being a scientist.

His guidance, both spoken and unspoken, has been truly invaluable in this journey.

Last but not least, I would like to thank my parents, Fei Dai and Zhuangcheng Liu, who have

sacrificed the unimaginable in making me who I am today. Without their unconditional love and

support, none of this would ever have been possible.

iii


Table of Contents

Acknowledgements ii

Table of Contents iv

List of Tables viii

List of Figures ix

List of Abbreviations xii

Chapter 1: Introduction 1
1.1 Cancer as a heterogeneous disease . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Modeling how tumor heterogeneity arises as an evolutionary process . . . . . . . . . . 1
1.3 Advancements in sequencing technologies further understanding of tumor evolution . . 3
1.4 Phylogeny inference using multi-sampled bulk data . . . . . . . . . . . . . . . . . . . 5
1.5 Phylogeny inference using single-cell data . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 DNA methylation as phylogenetic marker . . . . . . . . . . . . . . . . . . . . . . . . 7
1.7 Long-read sequencing technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.8 Sequencing-based spatial transcriptomics . . . . . . . . . . . . . . . . . . . . . . . . 10
1.9 Outline of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

Chapter 2: Reconstructing tumor phylogenies with single-cell CpG methylation data using
Sgootr 14

2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.2.1 Biclustering of cells and sites for reducing sparsity-induced noise . . . . . . . 17
2.2.2 Likelihood-based sequencing error correction accounting for copy number es-

timates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.3 Expected distance calculation between cell pairs for tree construction . . . . . 21
2.2.4 Pruning of CpG sites according to a tree-based methylation status persistence

measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.5 Inference of migration history from lesion-labeled tumor lineage tree leverag-

ing prior knowledge on migration patterns . . . . . . . . . . . . . . . . . . . . 28
2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

2.3.1 Sgootr obtains a simple tumor migration history with scBS-seq and scRNA-
seq data from patient CRC01. . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.3.2 Sgootr infers migration histories simpler than alternative methods for the
Bian CRC cohort. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

2.3.3 Sgootr-identified lineage-informative CpG sites construct lineages suggest-
ing simpler migration histories than those identified via FG. . . . . . . . . . . 38

2.3.4 Sgootr is orders-of-magnitude faster than IQ-TREE. . . . . . . . . . . . . . 38

iv


2.3.5 Sgootr-identified lineage-informative CpG sites are enriched in inter-CGI
regions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

2.3.6 Genes associated with Sgootr-identified lineage-informative CpG sites are
enriched for pan-cancer and CRC-related terms. . . . . . . . . . . . . . . . . . 39

2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

Chapter 3: Studying melanoma evolution with Nanopore long-read sequencing data 42
3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.2.1 Long-read sequencing of single-cell derived subclones of B2905 mouse melanoma
model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.2.2 Constructed phylogeny reveals phenotypically-distinct major clades . . . . . . 46
3.2.3 Phylogeny-guided harmonization with TreeHarmonizer provides evolu-

tionary timing of genomic variants . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2.4 Evolutionary dynamics of somatic SNVs and mutational processes . . . . . . . 49
3.2.5 Somatic SVs dynamics suggest early drivers and late genomic instability . . . . 51
3.2.6 Recurrent CNAs in independent lineages suggest parallel adaptation process . . 53
3.2.7 Methylation changes contribute to differentiation of an aggressive clonal lineage 56
3.2.8 Systematic analysis of multiple variation types provides insights into B2905

lineages differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.2.9 Curated set of subclonal variant calls enables benchmarking of long-read tools . 62

3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

Chapter 4: Studying melanoma subclones at spatial resolution 66
4.1 Spatial transcriptomics profiling of the control and anti-CTLA-4 treated tumors. . . . . 66
4.2 Identifying tumor subclones from Smart-seq2 full-length scRNA-seq data of the con-

trol and anti-CTLA-4 treated tumors. . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3 Estimating tumor subclone proportions on the ST slides via deconvolution using SpaCET 68
4.4 Validation of SpaCET-deconvolved tumor subclone proportions . . . . . . . . . . . . . 69
4.5 Analysis of SpaCET-deconvolved tumor subclone proportions . . . . . . . . . . . . . 73

4.5.1 SpaCET-deconvolved proportions reveal differences in subclone growth dy-
namics between treatment responders and non-responders . . . . . . . . . . . . 73

Chapter 5: Concluding remarks 76

Appendix A: Supplemental Material for Chapter 2 80
A.1 Biclustering: benchmarking on simulated data and a case study on CRC01 . . . . . . . 80

A.1.1 Running time and memory benchmark on simulated data . . . . . . . . . . . . 81
A.1.2 Biclustering case study on CRC01 from the Bian cohort . . . . . . . . . . . . . 85

A.2 Benchmarking Sgootr on simulated data . . . . . . . . . . . . . . . . . . . . . . . . 89
A.2.1 Expected distance calculation on simulated data . . . . . . . . . . . . . . . . . 89
A.2.2 Iterative pruning on simulated data . . . . . . . . . . . . . . . . . . . . . . . . 94

A.3 Application of Sgootr on single cells from U87MG ex vivo tree from the RETrace
study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

v


A.4 Application of Sgootr on the complete set of 9 metastatic CRC patients made avail-
able by Bian and colleagues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

A.5 Application of Sgootr on the cohort, replacing NJ with FastME 2.0 . . . . . . . . . . 111
A.6 Patient CRC01 tumor lineage tree constructed using mutation . . . . . . . . . . . . . . 115
A.7 Application of Sgootr to multi-regional GBM patient MGH105 MscRRBS data by

Chaligne and collegues. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
A.8 How to choose p, P(1c), P(0c), and P(mixed) for Sgootr . . . . . . . . . . . . . . . . 118
A.9 How to choose κ for Sgootr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
A.10 Experiment details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

A.10.1 Sgootr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
A.10.2 FastME 2.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
A.10.3 IQ-TREE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
A.10.4 FG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

Appendix B: Supplemental Material for Chapter 3 124
B.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

B.1.1 Constructing tumor phylogeny using long read sequencing data from 23 sublines124
B.1.2 Determining the phenotypes of the 4 major clades . . . . . . . . . . . . . . . . 125
B.1.3 Assessing the effectiveness of TreeHarmonizer in determining the place-

ment of variants impacted by filtering during tree construction . . . . . . . . . 128
B.1.4 Mutational signature analysis along the tumor phylogeny . . . . . . . . . . . . 129
B.1.5 Computing dN/dS values along the tumor phylogeny . . . . . . . . . . . . . . 130
B.1.6 Pathway enrichment analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
B.1.7 Determining genes impacted by SVs . . . . . . . . . . . . . . . . . . . . . . . 131
B.1.8 Determining the set of cancer-relevant mouse genes-of-interest according to

the COSMIC Gene Census . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
B.1.9 Timing analysis of chr4 copy-neutral LOH . . . . . . . . . . . . . . . . . . . . 134
B.1.10 Subclone-specific differential methylation analysis . . . . . . . . . . . . . . . 137
B.1.11 Permutation test on the number of DMRs obtained . . . . . . . . . . . . . . . 137
B.1.12 Correlation between the number of DMRs and cumulative SNVs at a branch

in the phylogeny . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
B.1.13 Differential gene expression analysis of subline scRNA-seq data . . . . . . . . 138
B.1.14 Discovering key regions where changes in methylation is monotonic along the

evolutionary trajectory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
B.1.15 Permutation test on the number of genes associated with monotonic methyla-

tion changes along the evolutionary trajectory . . . . . . . . . . . . . . . . . . 140
B.1.16 Calling copy number from bWES and Smart-Seq2 full-length scRNA-seq data

on the 23 sublines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
B.2 Technology comparison: long-read, bWES, and scRNA-seq . . . . . . . . . . . . . . . 141
B.3 Additional Supplementary Figures and Tables . . . . . . . . . . . . . . . . . . . . . . 145

Appendix C: Supplemental Material for Chapter 4 161
C.1 SpaCET deconvolution results on 17 ST slides . . . . . . . . . . . . . . . . . . . . . . 161

C.1.1 Control STs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
C.1.2 Treated STs (Non-responders) . . . . . . . . . . . . . . . . . . . . . . . . . . 165

vi


C.1.3 Treated STs (Responders) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
C.2 Validation of SpaCET-deconvolved tumor subclone proportions on 17 ST slides . . . . 171

C.2.1 Control STs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
C.2.2 Treated STs (Non-responders) . . . . . . . . . . . . . . . . . . . . . . . . . . 174
C.2.3 Treated STs (Responders) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

Bibliography 180

vii


List of Tables

A.1 Feature sizes of the biclustering formulation model as a function of the number of cells,
v, and the number of CpG sites, u in the input. . . . . . . . . . . . . . . . . . . . . . . 81

A.2 Densities of the original binary matrix and the subsampled matrices. . . . . . . . . . . 87
A.3 Data summary of the Bian metastatic CRC cohort. . . . . . . . . . . . . . . . . . . . . 102
A.4 Overlap coefficients of GO terms associated with different genomic locations (CpG

island and inter-CGI) discovered by GOrilla and DAVID. . . . . . . . . . . . . . . . . 110
A.5 Spearman correlation of CpG site distance from closest CpG island and homogeneity. . 111

B.1 Phenotypes of the 4 major clades. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
B.2 DMRs associated with the red clade that overlap putative promoter or putative proxi-

mal enhancers of known genes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

viii


List of Figures

1.1 Illustration of the Nowell’s theory of tumor evolution. . . . . . . . . . . . . . . . . . . 3
1.2 Illustration of tumor phylogeny reconstruction from multi-sampled bulk data. . . . . . 5
1.3 Illustration of tumor phylogeny reconstruction from single-cell data. . . . . . . . . . . 7

2.1 Overview of Sgootr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Application of Sgootr to patient CRC01 scTrio-seq data by Bian and colleagues. . . 33
2.3 Overview of results from the Bian metastatic CRC cohort. . . . . . . . . . . . . . . . . 35

3.1 An overview of the long-read approach and analysis of 23 subclones of a mouse
melanoma tumor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

3.2 Phylogenetic tree and mutational landscape of 23 single-cell derived sublines. . . . . . 47
3.3 Patterns and dynamics of somatic SVs. . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4 Genomic regions recurrently affected by CNAs suggest parallel evolution. . . . . . . . 54
3.5 DMRs detected at subclonal resolution. . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.6 A model of B2905 melanoma evolution informed by analyses of long-read sequencing

data on the 23 sublines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.7 Subline-derived pseudobulk samples enables benchmarking the performances of long-

read variant callers at various clonality. . . . . . . . . . . . . . . . . . . . . . . . . . . 63

4.1 Growth dynamics of the 17 tumors from which the ST slides were derived. . . . . . . . 67
4.2 Mutations profiled from Smart-seq2 full-length scRNA-seq data of control and treated

tumors reveal 6 major tumor subclones. . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3 Validation of tumor subclone proportions estimated by SpaCET using subclone-specific

mutations detected on ST slides. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4 Deconvolved subclone proportions inform differences in tumor growth dynamics. . . . 74

A.1 Patient CRC01 cell and site coverate level distribution . . . . . . . . . . . . . . . . . . 80
A.2 Running time and memory behavior on 200 simulated square binary matrices. . . . . . 84
A.3 Mean and standard deviation of wall clock time-to-convergence in hours, optimality

gap at 20 hours, and resulting submatrix density across the 10 subsamples, over a
range of (α, β) value pairs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

A.4 Probability mass functions of A. ZTPois(λ), where λ = {1, 2, 3, 4, 5, 6} and B. the
empirical per-site read coverage distribution from patient CRC01. . . . . . . . . . . . 91

A.5 Baseline distance measure and expected distance measure on 13, 500 sets of read count
vectors, simulated from cell pairs with known normalized ground truth distance π =
{.1, .2, .3, .4, .5, .6, .7, .8, .9}, with level of read coverage depth modeled by ZTPois(λ),
where λ = {1, 2, 3, 4, 5, 6}. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

A.6 Baseline distance measure and expected distance measure on 2, 250 sets of read count
vectors, simulated from cell pairs with known normalized ground truth distance π =
{.1, .2, .3, .4, .5, .6, .7, .8, .9}, with level of read coverage depth modeled by the empir-
ical read coverage distribution gathered from the CRC01 sample. . . . . . . . . . . . . 94

A.7 Topologies of the simulated clone trees. . . . . . . . . . . . . . . . . . . . . . . . . . 95
A.8 Results on 10 simulated clone trees from Sgootr and baseline method. . . . . . . . . 98

ix


A.9 Application of Sgootr to scRRBS data from U87MG ex vivo tree by the RETrace
study. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

A.10 Sgootr-constructed (κ = .1) tumor lineage trees (t∗s) and Sgootr-inferred tumor
migration histories for the 9 patients in the Bian cohort . . . . . . . . . . . . . . . . . 104

A.11 RF distances across iterations for each patient in the metastatic CRC cohort applying
Sgootr with different κ values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

A.12 GOrilla -identified GO terms with significant (<.05) q-values in enrichment analysis
of non-pseudo, protein coding genes whose gene bodies and 1000bp upstream pro-
moter regions contain the Sgootr-identified lineage-informative CpG sites, stratified
by whether they locate in CpG island or inter-CGI regions. . . . . . . . . . . . . . . . 107

A.13 DAVID-identified GO terms with significant (<.05) q-values in enrichment analysis
of non-pseudo, protein coding genes whose gene bodies and 1000bp upstream pro-
moter regions contain the Sgootr-identified lineage-informative CpG sites, stratified
by whether they locate in CpG island or inter-CGI regions. . . . . . . . . . . . . . . . 109

A.14 Median number of bp to the closest CGI from CpG sites used to construct the methy-
lation lineage tree at each iteration for all CRC patients and GBM patient MGH105. . . 110

A.15 Sgootr-constructed (κ = .1) tumor lineage trees (t∗s) and Sgootr-inferred tumor
migration histories for the 9 patients in the Bian cohort, replacing NJ with FastME 2.0. 113

A.16 Number of migration events and BAMDPO of migration histories obtained via apply-
ing Sgootr with FastME 2.0 in comparison to NJ. . . . . . . . . . . . . . . . . . . . 113

A.17 RF distances across iterations for each patient in the Bian metastatic CRC cohort ap-
plying Sgootr - substituting NJ with FastME 2.0 - with different κ values. . . . . . . 114

A.18 Tumor lineage tree based on SNVs/INDELs obtained from scRNA-seq data from pa-
tient CRC01. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

A.19 Application of Sgootr to patient MGH105 MscRRBS data by Chaligne and colleagues.117
A.20 FG frequencies (averaged across 5 runs) for the Bian cohort. . . . . . . . . . . . . . . 122
A.21 Pearson product-moment correlation coefficients between FG frequencies across sites

between downsampled runs for the Bian cohort. . . . . . . . . . . . . . . . . . . . . . 123

B.1 DGI of sublines in C57BL/6 and nude mice. . . . . . . . . . . . . . . . . . . . . . . . 127
B.2 Mutational signature activities along the tumor phylogeny. . . . . . . . . . . . . . . . 130
B.3 SNVs along the tumor phylogeny. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
B.4 SNV evidence for the timing clonal CNA events. . . . . . . . . . . . . . . . . . . . . 135
B.5 Comparison among long-read, bWES, and scRNA-seq data of the 23 sublines. . . . . . 142
B.6 SNVs and SVs placed by TreeHarmonizer. . . . . . . . . . . . . . . . . . . . . . 145
B.7 SNV density and SV events present in individual sublines in the green clade. . . . . . . 146
B.8 SNV density and SV events present in individual sublines in the blue clade. . . . . . . 147
B.9 SNV density and SV events present in individual sublines in the red clade. . . . . . . . 148
B.10 SV-disrupted genes and gene fusion annotated by Padfoot. . . . . . . . . . . . . . . 149
B.11 Mouse homologs of COSMIC genes impacted by (1) non-synonymous SNV, (2) SV

breakpoint, (3) CNA, or (4) (hypo/hyper-) DMRs in their (putative) enhancers or pro-
moters in 23 sublines (1/7). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

B.12 Mouse homologs of COSMIC genes impacted by (1) non-synonymous SNV, (2) SV
breakpoint, (3) CNA, or (4) (hypo/hyper-) DMRs in their (putative) enhancers or pro-
moters in 23 sublines (3/7). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

x


B.13 Mouse homologs of COSMIC genes impacted by (1) non-synonymous SNV, (2) SV
breakpoint, (3) CNA, or (4) (hypo/hyper-) DMRs in their (putative) enhancers or pro-
moters in 23 sublines (4/7). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

B.14 Mouse homologs of COSMIC genes impacted by (1) non-synonymous SNV, (2) SV
breakpoint, (3) CNA, or (4) (hypo/hyper-) DMRs in their (putative) enhancers or pro-
moters in 23 sublines (5/7). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

B.15 Mouse homologs of COSMIC genes impacted by (1) non-synonymous SNV, (2) SV
breakpoint, (3) CNA, or (4) (hypo/hyper-) DMRs in their (putative) enhancers or pro-
moters in 23 sublines (6/7). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

B.16 Mouse homologs of COSMIC genes impacted by (1) non-synonymous SNV, (2) SV
breakpoint, (3) CNA, or (4) (hypo/hyper-) DMRs in their (putative) enhancers or pro-
moters in 23 sublines (7/7). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

B.17 Permutation tests on the number of DMRs obtained for all subtree sizes of the phy-
logeny in Figure 3.2a. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

B.18 Associating truncating Dnah6 mutation with (haplotype-specific) methylation of Dnah6
promoter in major clades. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

B.19 CNA profiles of 23 sublines obtained using long-read, bWES, and scRNA-seq data. . . 160

xi


xii


List of Abbreviations

AID Activation-induced cytidine deaminase
ANOVA Analysis of variance
ASM Allele-specific methylation

BAMDPO Binary adjacency matrix distance from partial order
bp Base pair(s)
bWES Bulk whole-exome sequencing

cDNA complementary DNA
CGC Cancer Gene Census
CGI CpG island
chr Chromosome
CLL Chronic lymphocitic leukemia
COSMIC Catalogue Of Somatic Mutations In Cancer
CpG Cytosine-guanine
CNA Copy number aberration
CRC Colorectal cancer

dbSNP The Single Nucleotide Polymorphism Database
DGI Delay of growth initiation
DMR Differentially methylated region
DNA Deoxyribonucleic acid

FG Four gametes

GATK Genomic Analysis Toolkit
GBM Glioblastoma
GENCODE Encyclopedia of genes and gene variants
GO Gene Ontology

HSD Honestly Significant Difference

ICB Immune checkpoint blockade
INDEL Insertion-deletion

kb kilo-base pairs

LN Lymph node metastasis
LOH Loss of heterozygosity

xiii


ML Liver metastasis
MP Post-treatment liver metastasis
mRNA Messenger RNA
MRCA Most-recent common ancestor
MscRRBS Multiplexed single-cell reduced representation bisulfite sequencing

NC Normal colon tissue
ND Not designated
NJ Neighbor joining

ONC (Proto-)Oncogene
ONT Oxford Nanopore Technologies

PacBio Pacific Biosciences
PT Primary tumor

RF Robinson-Foulds
RNA Ribonucleic acid
ROS Reactive oxygen species

SBS Single base substitution
scBS-seq Single-cell bisulfite sequencing
scRNA-seq Single-cell RNA sequencing
scRRBS Single-cell reduced representation bisulfite sequencing
scTrio-seq Single-cell triple omics sequencing
SNP Single nucleotide polymorphism
SNV Single nucleotide variant
ST Spatial transcriptomics
SV Structural variant

TCGA The Cancer Genome Atlas Program
TSG Tumor Suppressor Gene
T2T Telomere-to-Telomere

UV Ultraviolet (radiation)

WGS Whole-genome sequencing

5mC 5-methylcytosine

xiv


Chapter 1: Introduction

1.1 Cancer as a heterogeneous disease

Cancer is a group of diseases characterized by the abnormal growth of cells with the potential to

spread to other parts of the body. It is highly heterogeneous. Different tumor types are characterized by

the primary tissue or cell type of origin. Treatment decisions for different tumor types are frequently

specific to the primary site, since tumors in different tissues likely are caused by different endogenous

or exogenous processes. Within each tumor type, there could be further molecular subclassifications

based on histological or genomic profiling. Distinguishing among different tumor subtypes is im-

portant, as they may differ in invasiveness, metastatic potential, proneness to relapse, prognosis, and

response to specific treatment. These types of differences we call inter-tumor heterogeneity.

Looking within a single tumor, there is extensive heterogeneity as well. There may be cellular

subpopulations that differ by genetic or non-genetic characteristics. We call each group of these cells

subclones. From a more clinically relevant perspective, how well a cancer patient responds to a type of

treatment may be explained by the tumor’s subclone composition. For example, treatment resistance

and relapse may be caused by the presence of a stubborn subclone unaffected by treatment. If we can

better characterize such intra-tumor heterogeneity, we may achieve more accurate cancer diagnostics,

prognostics, and more effective and targeted treatment.

The latter, intra-tumor heterogeneity, will be the focus of this thesis.

1.2 Modeling how tumor heterogeneity arises as an evolutionary process

The model with which we study how such intra-tumor heterogeneity arises has its root in species

evolution. In his revolutionary work in 1859, The Origin of Species, Charles Darwin provided a

1


framework which enabled the understanding of somatic selection, diversification and extinction [1].

His famous 1837 drawing of an evolutionary tree diagram was the first to describe the phylogenetic

relations among multiple species.

Incidentally, exactly a hundred years after Darwin’s publication of The Origin of Species, in

1959, a predoctoral fellow by the name of David Hungerford was working in a genetics lab in Philadel-

phia, when he observed under the microscope that the white blood cells from patients with chronic

myelogenous leukemia had an abnormally short chromosome 22. Together with a pathologist at Uni-

veristy of Pennsylvania named Peter Nowell, they published this observation in 1960 [2], and named

the abnormality the Philadelphia chromosome after where the discovery was made.

It may be an indisputable fact at the time of the writing of this thesis, but back then, cancer was

not believed to be a genetic disease, and no consistent theory of how cancer arises was recognized. The

discovery of Philadelphia chromosome provided the first piece of evidence that, alterations in one’s

DNA may lead to cancer.

More than a decade later, using a novel staining technology, Dr. Janet Rowley at the University

of Chicago observed that the part missing from the end of chromosome 22 seems to be relocated to

the end of one copy of chormosome 9, causing a translocation event that resulted in a gene fusion [3].

This discovery eventually led to the development of the targeted therapy Imatinib in the early two

thousands, which is a landmark drug shown to be very effective in maintaining remission in CML

patients.

With the increasingly well established idea that cancer is caused by aberrations in one’s DNA,

Nowell, who co-discovered the Philadelphia chromosome, went on to formalize the foundation of the

theory of tumor evolution in 1976 [4]. Much like species evolution where heritable traits were passed

down from ancestor to offsprings, somatic alterations that are heritable from parent to daughter cells

2


Figure 1.1: Illustration of the Nowell’s theory of tumor evolution. Clonal alterations
occur in the cancer founder cell, which initiates the abnormal cell growth. As the clonal
population expands, additional subclonal alterations also occur and are inherited by daugh-
ter cells during cell replication. The accumulation of such somatic alterations over time
results in a heterogeneous tumor.

during cell replication accumulate over the course of tumor development, resulting in a heterogeneous

cell population (Figure 1.1). Under the “infinite sites assumption” [5], that states that, as the length

of the genome is large, the probability of the same nucleotide change at a particular locus in distinct

samples happening independently is negligibly small, the observation of shared somatic mutations in

distinct samples at different frequencies serve as a piece of evidence for the relative timing of when

the mutation occurred along the course of the development of the tumor.

1.3 Advancements in sequencing technologies further understanding of tumor evo-

lution

While traditionally, tumor evolution has been conceptualized as the gradual accumulation of

point mutations under selective pressures [4], more recently, distinct mutational processes that lead to

the rapid accumulation of structural variants during catastrophic events — such as chromothripsis and

3


chromoplexy — have been increasingly recognized as critical drivers of tumor initiation and adapta-

tion [6, 7]. Furthermore, tumor evolution can also be shaped by non-genetic, or even non-heritable,

determinants—especially epigenetic alterations such as DNA methylation or histone acetylation [8].

The advancement in our understanding of tumor evolution largely owes to the advances in sequencing

technologies, which have allowed us to study somatic alterations at the base pair level. Novel sequenc-

ing technologies afford us with more, and new types of information about tumor samples, all the while

with decreasing sequencing cost.

Specifically, the three main pieces of work discussed in this thesis will respectively leverage

tumor data generated with three novel sequencing technologies introduced within the past decade.

First, single-cell bisulfite sequencing [9], introduced in 2014, generates CpG methylation information

within individual cells in a bisulfite-treated tumor sample. This technology is further augmented by the

scTrio-seq protocol in 2016 [10], that additionally profiled RNA-seq data from the same single cells.

Next, Oxford Nanopore Technology long read sequencing, which was made commercially available in

2015, profiles ultra long reads at the kilo basepair to mega basepair scale while allowing simultaneous

detection of genetic and epigenetic base modifications. Lastly, spatial transcriptomics, first reported

in 2016 [11], provides gene expression read outs from spot with known spatial coordinates on a tumor

slide.

The discovery of the Philadelphia chromosome, discussed in the previous section, not only was

an important event in the development of the theory of tumor evolution, the story also highlights the

importance of constantly adopting and leveraging novel technologies, such as the staining method

used to discover the translocation event. This thesis will discuss how we leverage new opportunities

and address the challenges brought by the three novel sequencing technologies in our study of tumor

evolution.

4


Figure 1.2: Illustration of tumor phylogeny reconstruction from multi-sampled bulk
data.

1.4 Phylogeny inference using multi-sampled bulk data

One way to study the evolutionary process that is tumor evolution is through phylogeny recon-

struction. The problem of algorithmically inferring a phylogenetic tree that may be interpreted as a

tumor progression history from multiple bulk samples per patient has been widely studied in the past

decade. In such experiments, the tumor samples obtained from the patient are subjected to bulk se-

quencing, and in each sample, for each unique point mutation, the number of sequencing reads with

the reference allele, and that with the variant allele are obtained. Under some reasonable assumptions

that the mutations are heterozygous from a diploid genome, the observed frequency of each mutation

per sample can be computed. Then, the subclone phylogeny could be obtained by either - first explic-

itly clustering the mutations by frequencies then determining their order along the phylogeny - or -

incorporating some error model to find the most likely tree that resulted in the observed variant allele

frequencies (Figure 1.2).

Methods developed to address this problem include: TuMult [12], PhyloSub [13], PyClone [14],

5


SciClone [15], MEDICC [16,17], PhyloWGS [18], AncesTree [19], CITUP [20], Canopy [21], Treeomics [22]

and others. Tumor phylogenies and the implied intra-tumor heterogeneity are occasionally used to

make clinical decisions [23]. For understanding basic cancer biology, a key question is whether tumor

progression is more branched (leading to a wider and shallower tree) or more linear (leading to a nar-

rower and deeper tree) [24]. For metastases, results on the question of branched vs. linear evolution

are mixed (e.g., [25–27]). Hong and colleagues showed that results are sensitive to the method used

for phylogeny inference [28]. Complicating matters further is that most of the methods listed above

use bulk sequencing data as input, and El-Kebir and colleagues showed that under reasonable and

models, the optimal phylogeny from bulk sequencing data may not be unique [29]. Therefore, recent

efforts have looked beyond bulk sequencing data to elucidate the progression history of a tumor. In

a major step towards this goal, Quinn and colleagues introduced technologies for cell lineage tracing

and applied those technologies to study single-cell data in xenograft tumor models. To our knowledge,

lineage tracing is not yet possible in human cancer samples, but analysis of single-cell data of other

types is possible [30].

1.5 Phylogeny inference using single-cell data

The recent rise of single-cell sequencing technology empowers more accurate tumor lineage

inference by allowing the examination of intratumor heterogeneity at a cellular resolution [31–36]

(Figure 1.3). However, since single-cell sequencing data are derived from a limited amount of genetic

material, the signals obtained are more scarce than those from bulk sequencing [37]. Genomic events

commonly used in bulk tumor lineage inference, such as SNVs and CNAs, are observed with less fre-

quency or measured with less confidence in single cell datasets. Single-cell tumor lineage inferences

6


Figure 1.3: Illustration of tumor phylogeny reconstruction from single-cell data.

using SNVs often suffer from false negatives in the datasets, which calls for data imputation or ag-

gregation; on the other hand, tumor lineage inference using CNAs called from single-cell sequencing

data often involves ambiguities in the determination of chromosomal breakpoints that may mislead the

conclusion.

1.6 DNA methylation as phylogenetic marker

More recently, there have been growing interests in examining heritable and non-genetic deter-

minants in tumor evolution [38], as while genomic alterations such as SNVs and CNAs have been con-

ventionally used to infer tumor phylogenies, other non-genomic alterations can also influence clonal

phenotype and fitness. DNA methylation, the addition of a methyl group to cytosine, which results

in the formation of 5mC especially in the context of CpG sites, is an epigenetic marker that has been

extensively studied for its role in regulating gene expression and maintaining cellular memory [39].

As Gaiti and colleagues importantly noted, the methylation maintenance machinery has been shown

to have an error rate four orders of magnitude higher than that for DNA replication [40, 41] and there-

7


fore may provide a greater amount of observable evidence for tumor evolution than alternative data

modalities, which is a critical asset especially in the single-cell context [42]. Recent studies have

leveraged CpG methylation in single cells sampled from CRC [43], CLL [42], GBM and IDH-mutant

glioma [44] tumors to construct lineages, showing CpG methylation to be a valuable signal for lineage

inference.

That said, using CpG methylation to construct tumor lineage tree has its challenges. First, single

cell methylation data has shallow read coverage. In each cell, most of the covered sites have only

a couple of read support. The shallow read coverage also means it is very likely that the reads do

not cover all alleles present. Furthermore, methylation is known to be a generally more transient

signal than mutations, and in a recent work by Meir and colleagues [45], they show that, while the

methylation statuses at some CpG sites are persistently inherited during cell replication, others seem to

be independently resampled at each generation. For the goal of constructing tumor phylogenies using

methylation as features, it is intuitive that only sites of the persistency model could harbor changes in

methylation status that would be lineage-informative, while those of the mixture model add noise at

best.

As such, in Chapter 2 of this thesis, we aim to answer the following questions. Can we leverage

single-cell methylation data to infer tumor evolutionary trajectories? Can we in the meantime identify

CpG sites that are lineage-informative? And finally, can we uncover biological insights from these

results?

8


1.7 Long-read sequencing technology

Long read sequencing technologies, such as those from PacBio HiFi and ONT, have pushed the

frontier of genomic analysis: with read lengths ranging from tens to hundreds of kilobases, with ultra-

long reads reaching a maximum read length > 4 million bases, long read sequencing data has enabled

more accurate characterization of genomes, especially of the highly repetitive regions and large SVs,

which have been difficult to assess with short reads with lengths of the order of hundreds of bases. In-

deed, long-read technology has empowered the near-complete assembly of the human genome by the

T2T Consortium, elucidating centromeric satellite arrays, recent tandem duplications, and short arms

of acrocentric chromosomes [46]. Similarly, it has further empowered accurate read-based haplotype

phasing, which is the disambiguation of maternally- and paternally-inherited alleles using heterozy-

gous SNVs. This is intuitive because long reads are likely to each span multiple polymorphic sites,

leading to the construction of phase blocks of more substantial lengths, resolving allele-specific char-

acteristics across a longer range of genomic coordinates. New tools have been developed specifically

for phasing long reads [47–51].

The results discussed in Chapter 3 were based on data generated by ONT PromethION R10 flow

cells. As a piece of genetic molecule passes through the protein nanopore, the change in electrical

current caused by the sequence of nucleic acid bases is monitored and recorded. Neural network-based

basecallers, such as guppy, and hidden Markov model-based caller, such as Nanopolish, have been

developed to translate the recorded raw signals back into the genetic sequences that generated them.

Furthermore, many basecallers have been trained to call base modifications like the methylation of

the 5mC directly from the Nanopore raw signals. The advantage presented by the ability to directly

call the methylation status of CpG sites from raw sequencing signals is two folds: first, it allows the

9


simultaneous assessment of many characteristics of the genetic material; secondly, the data will not

have bias and artifact introduced to the genetic material by the additional round of bisulfite-treatment,

which has so far been the gold-standard of methylation sequencing.

The capability of ONT long-read sequencing to jointly profile different types of genomic and

epigenomic variants motivates the work discussed in Chapter 3. Expanding upon what we estab-

lished in Chapter 2, that epigenomic variants also play a part in tumor evolution, the main goal of this

proof-of-concept study is to showcase how cohering different types of somatic variants along a tumor

phylogeny may reveal a holistic model of tumor evolution that describe the interplay among them.

1.8 Sequencing-based spatial transcriptomics

Spatial transcriptomics techonologies have enabled studying the spatial organizations of het-

erogeneous cell populations within a tissue. In particular, the 10x Genomics Visium platform - an

extension of a previous generation of barcoding-based spatial transcriptomics technology [11] - is one

that offers transcriptome-scale information at the resolutions of 1 to 10 cells per spot. One of the

most cited success stories of the use of the Visium technology was produced by Maynard and col-

leagues [52]. In their work, Maynard and colleagues used the spatial information afforded by the

Visium technology to study the six-layered human dorsolateral prefrontal cortex tissue and identified

clinically relevant layer-enriched gene expression signatures, overcoming the difficulties in associating

whole-cell expression with anatomical locations with single-cell, single-nucleus RNA-sequencing, or

microscopy-based technologies. The spatial information adds another dimension to genomic analysis.

During sequencing, the prepared tumor tissue sample is overlaid over the 6.5mm-by-6.5mm

capture area, which is composed of a field of around 5,000 spots 55µm in diameter and 100µm apart.

10


Spatially barcoded oligonucleotides present on each spot then binds to the mRNA from the tissue,

and the reverse transcribed barcoded cDNA are collected to be processed for downstream Illumina

short-read sequencing. It is important to note that, the Visium technology uses the 3’ instead of whole

transcript method; as a result, while gene expression can be reliably quantified, one cannot attempt to

call expressed variants from Visium data due to the lack of full transcript information.

In Chapter 4 of this thesis, we examine how the addition of spatial information further our

understanding of intra-tumor heterogeneity. Specifically, recent works have established that spatial

technologies can further elucidate cell-cell interactions, which previously can only be inferred from

co-expression of gene markers [53–58], by confirming their physical contact evidenced by their spatial

co-location [59–67]. We ask the question, with additional information from full-length scRNA-seq

data, can we extend our current knowledge on tumor-immune interaction to the subclonal level? This

work explores how heterogeneous subclones, characterized by various types of somatic alterations

described in Chapter 3, may interact with the tumor microenvironment differently, and therefore may

have differential response to immunity.

1.9 Outline of Thesis

First, in Chapter 2, we establish the possibility to reconstruct tumor phylogenies using data

from single-cell bisulfite-treated sequencing technology that profiles CpG methylation information on

a single-cell level, highlighting the valuable phylogenetic signals in epigenetic data that have been un-

derexplored. The newly developed method, Sgootr, jointly infers a tumor phylogeny and the amount of

phylogenetic information in each CpG site from single-cell CpG methylation data. The effectiveness

of the tool in inferring tumor phylogenies that suggest parsimonious migration histories was demon-

11


strated in simulated, U87MG glioma cell line, and multi-regionally sampled colorectal cancer data

sets, respectively. The migration hisotry Sgootr inferred from the metastatic colorectal cancer patient,

specifically, was more parsimonious than the one inferred from copy number changes in the original

work.

Next, in Chapter 3, we present a novel framework to study both the genomic and epigenomic

changes during tumor evolution using Nanopore long-read sequencing data of single-cell derived tu-

mor sublines. From the B2905 cell line derived from the M4 mouse melanoma model, which is known

to represent RAS-mutant transitory TCGA human melanoma with varied responses to anti-CTLA-

4 treatment, 23 single cells were isolated and derived into sublines, which were then subjected to

Nanopore long-read sequencing. SNVs, SVs, CNAs, as well as CpG methylation were profiled. A

tumor phylogeny was constructed using a filtered set of mutations, revealing 4 major clades with dis-

tinct phenotypes validated via orthogonal experiments. The evolutionary timing of the various types of

somatic variants was resolved against the tree, allowing the characterization of key alterations in terms

of their clonality, and of the major clades in terms of the key alterations that may have contributed to

their phenotypes. Critically, the clonal resolution this data set uniquely affords allowed the inference of

active mutational processes, confident identification of parallel CNAs, and detection of differentially

methylated regions, all at a subclonal level. Besides the biological insights, this data set also serves as

a valuable resource for the community to generate pseudobulk of various clonality for benchmarking

computational methods developed for heterogeneous cancer data.

In Chapter 4, we present a framework to examine tumor subclones at spatial resolution. Us-

ing the B2905 mouse malanoma cell line discussed in the previous section, heterogenous mixtures of

cells from the cell line culture are injected into mice that are divided into control and treatment group.

Anti-CTLA-4 immune checkpoint blockade therapy was administered to the mice in the treated group,

12


and tumors are resected for sequencing after various amounts of time. For each treatment condition,

some tumors were subject to scRNA-seq via the Smart-seq2 full length transcriptomics sequencing,

and the rest were subject to 10x Genomics Visium spatial transcriptomics sequencing. Six tumor

subclones were identified using mutations profiled from the Smart-seq2 data, and subclone-specific

expression profiles were used in cell type deconvolution of the spatial data by SpaCET. We validated

the estimation of subclonal proportions at each spatial spot by examining their association with the

few clone-specific mutations detectable directly on the spatial slides. The associations between tumor

subclones with various stromal and immune cell types were accessed with established spatial statis-

tics. By contrasting subclone-immune activities between treatment responders and non-responders,

we show that differences in tumor response to anti-CTLA-4 treatment may relate to subclone-specific

activities in its microenvironment.

Chapter 5 provides the conclusion to the thesis.

13


Chapter 2: Reconstructing tumor phylogenies with single-cell CpG methylation data

using Sgootr

2.1 Motivation

There are two main challenges in constructing single-cell CpG methylation-based tumor lineage

trees. First, detailed examination reveals that data obtained via single-cell bisulfite sequencing (scBS-

seq) [43], single-cell reduced representation bisulfite sequencing (scRRBS) [68, 69], or multiplexed

scRRBS (MscRRBS) [42] protocols exhibit high level of sparsity. Cells with subpar bisulfite conver-

sion rates have information at few CpG sites, and of the millions of unique CpG sites across all cells

in a dataset, less than one out of a hundred are sequenced in a sufficient number of cells to be useful

in lineage reconstruction. Furthermore, even when a CpG site is sequenced in a cell, it is oftentimes

covered by less than a few sequencing reads, which is not likely to capture heterozygosity in aneuploid

cancers.

Besides sparsity, another key challenge is that not all CpG sites have their methylation statuses

stably retained during tumor evolution. In particular, Meir and colleagues posited two models of

methylation dynamics: mixture, where the methylation status of a CpG site is resampled (from the

parental status) in each cell replication, and persistency, where that is an exact copy of the parent

cell [45]. It is the CpG sites whose status follows the persistency model would be informative in

tumor lineage reconstruction because the maintenance of information from the parental generation is

the necessary condition for the infinite sites assumption [5,70] and Dollo parsimony [71], which form

the basis for tumor lineage inference tools based on mutation profiles. However, as of now, there is

no known set of CpG sites that follow either inheritance model, highlighting the necessity to perform

14


intentional CpG site selection while reconstructing tumor lineages by CpG methylation.

To address these two challenges, in this work, we introduce Sgootr (Single-cell Genomic

methylatiOn tumOr Tree Reconstruction, Figure 2.1), the first distance-based computational method

to jointly select informative CpG sites and reconstruct tumor lineages from single-cell methylation

data.

2.2 Methods

Sgootr (Figure 2.1) consists of five components: (i) optional biclustering of cells and sites

for reducing sparsity-induced noise; (ii) likelihood-based sequencing error correction accounting for

copy number estimates; (iii) expected distance calculation between cell pairs for tree construction;

(iv) pruning of CpG sites according to a tree-based methylation status persistence measure; and (v)

inference of migration history from the lesion-labeled tree leveraging prior knowledge on migration

patterns. Components (iii) and (iv) are iteratively applied. Code for this work is available on GitHub as

a Snakemake [72] workflow. Results from this work are reproducible, and intermediate experiment

outputs of Sgootr are available for further analysis.

15


Inp
ut

ce
ll-

by
-s

ite
 s

in
gl

e-
ce

ll C
pG

 
m

et
hy

lat
io

n 
re

ad
 c

ou
nt

 m
at

rix

Cells

C
pG

 s
ite

s

m

m
m

m N/
A

ce
ll-

by
-s

ite
 c

op
y 

nu
m

be
r 

es
tim

at
es

Cells

C
pG

 s
ite

s

5
2 1

Optional

Sg
oo

tr
 (S

ing
le-

ce
ll G

en
om

ic 
m

et
hy

lat
iO

n 
tu

m
O

r T
re

e 
Re

co
ns

tru
ct

ion
)

O
ut

pu
t

t*
Li

ne
ag

e 
tre

e 
t*

(v)
 T

um
or

 m
ig

ra
tio

n 
hi

st
or

y 
in

fe
rre

d 
fro

m
 

les
io

n-
lab

ele
d 

lin
ea

ge
 tr

ee

(i) 
O

pt
io

na
l b

icl
us

te
rin

g 
of

 c
ell

s 
an

d 
Cp

G
 s

tie
s 

fo
r r

ed
uc

in
g 

sp
ar

sit
y-

in
du

ce
d 

no
ise

M̂

Cells

C
pG

 s
ite

s

m
m

m
m

m
m

m

m
m

m
m

m
m

m
m

(ii)
 L

ike
lih

oo
d-

ba
se

d 
se

qu
en

cin
g 

er
ro

r c
or

re
ct

io
n 

ac
co

un
tin

g 
fo

r 
co

py
 n

um
be

r

(iii
) E

xp
ec

te
d 

di
st

an
ce

 c
alc

ul
at

io
n 

be
tw

ee
n 

ce
ll p

air
s 

fo
r l

in
ea

ge
 tr

ee
 

co
ns

tru
ct

io
n

Ev
alu

at
io

n 
of

 R
ob

in
so

n-
Fo

ul
ds

 (R
F)

 
di

st
an

ce
 b

et
w

ee
n 

tre
es

 in
 

co
ns

ec
ut

ive
 it

er
at

io
ns

t i
t i−

1

Af
te

r u
se

r-
de

fin
ed

 
nu

m
be

r o
f 

ite
ra

tio
ns

De
te

ct
io

n 
of

 it
er

at
io

n 
*

(iv
) P

ru
ni

ng
 o

f C
pG

 s
ite

s 
ac

co
rd

in
g 

to
 a

 tr
ee

-b
as

ed
 m

et
hy

lat
io

n 
st

at
us

 
pe

rs
ist

en
ce

 m
ea

su
re

Fi
gu

re
2.

1:
O

ve
rv

ie
w

of
S
g
o
o
t
r

.
S
g
o
o
t
r

le
ve

ra
ge

s
si

ng
le

-c
el

l
m

et
hy

la
tio

n
se

-
qu

en
ci

ng
da

ta
fr

om
tu

m
or

sa
m

pl
es

,
in

co
rp

or
at

in
g

co
py

nu
m

be
r

in
fo

rm
at

io
n

w
he

n
av

ai
la

bl
e,

to
jo

in
tly

in
fe

r
a

si
ng

le
-c

el
lt

um
or

lin
ea

ge
tr

ee
an

d
id

en
tif

y
C

pG
si

te
s

th
at

m
ay

ha
rb

or
lin

ea
ge

-i
nf

or
m

at
iv

e
m

et
hy

la
tio

n
ch

an
ge

s.

16


2.2.1 Biclustering of cells and sites for reducing sparsity-induced noise

Sgootr requires a fairly accurate estimate of the differences between the methylation status of

distinct CpG sites across all cell pairs. Prior to submitting the input dataset to Sgootr, it is advisable

that one filters out cells for which the number of CpG sites with non-zero read coverage is low and

CpG sites with non-zero coverage in only a few cells. For example, for the Bian metastatic CRC

dataset [43] we will analyze further in depth in the Results section, we apply the following heuristic

filters to the cells and sites: (i) among all input cells, remove low-quality ones with coverage in fewer

than 4 million CpG sites; (ii) remove CpG sites covered in less than 2
3

of the remaining cells. For that

particular dataset, we additionally removed CpG sites on sex chromosomes, which may confound the

findings, and Chromosome 21 peri-centromeric regions (hg19, Chr21:9825000-9828000), which we

discovered to be an overlooked alignment artifact. After filtering the input for reducing sparsity, we

perform sequencing error correction as described in the following subsection. Since this may further

introduce sparsity, in an additional post error correction step, we may remove sites covering < 2
3

of

cells, and then remove cells with coverage in < 2
3

of the remaining CpG sites. Overall, our filtering

approach aims to make sure that the cells remaining are well-covered, and that each cell pair has at least

1
3

of the selected CpG sites as shared dimensions along which cell-to-cell distance can be measured.

Our main results on the Bian [43] dataset demonstrate the heuristic filtering scheme described

above is effective in practice; however, in principle, the filtering of cells and CpG sites should ideally

be coordinated in order to minimize information loss. For that purpose, we present an integer linear

program (ILP)-based biclustering formulation as follows. Let the site coverage data be represented in

a cell-by-site matrix Mv×u, where v is the number of cells and u is the number of CpG sites; mij = 1 if

site j is covered (by at least one read) in cell i, and mij = 0 otherwise. Let α and β respectively be the

17


specified fraction of cells and sites to be kept. Given M,α, and β, we wish to compute a biclustering

- namely, a selection of rows of columns - of M , M̂ , so that we maximize the number of CpG sites

covered in the resulting ⌊αv⌋ × ⌊βu⌋ submatrix in which rows and columns respectively correspond

to the selected cells and sites.

To solve this biclustering problem, let C ∈ {0, 1}v, S ∈ {0, 1}u be (unknown) binary vectors

respectively indicating whether a cell or site is kept. In addition, let A ∈ {0, 1}v×u denote (unknown)

binary matrix such that aij = 1 if and only if ci = 1 and sj = 1 (i.e., if cell i and site j are both kept).

The ILP formulation is:

maximize:
∑v

i=1

∑u
j=1 aijmij

subject to: aij ∈ {0, 1}

ci ∈ {0, 1}

sj ∈ {0, 1}

aij ≤ ci

aij ≤ sj

ci + sj − 1 ≤ aij∑v
i=1 ci = ⌊αv⌋∑u
j=1 sj = ⌊βu⌋

The output submatrix M̂ is obtained by taking cells {i|ci = 1} and sites {j|sj = 1} from M .

We implement the above formulation as an optional step in our software using the Gurobi ILP

solver, a licensed software which is free to use for academic purposes [73]. The performance of our

implementation in terms of run time, memory, and accuracy (i.e. optimality gap) with respect to the

input size and other parameters, as well as a strategy on setting α and β are explored in detail in

18


Supplemental Section A.1 (Supplemental Tables A.1,A.2, Supplemental Figures A.2,A.1,A.3). For a

typical computational environment, the current implementation of the biclustering formulation may

not be practical for the scale of the single-cell CpG methylation datasets analyzed in this study (see

Supplemental Section A.1). However, since for given α and β values the ILP formulation produces

a submatrix no sparser than that produced by the heuristic filters, it should be the preferred choice

whenever it is feasible. In Supplemental Section A.1, we show that memory consumption is likely to be

the bottleneck, and provide a way to estimate memory consumption of our biclustering implementation

given the size of the input so that the users may determine whether performing biclustering may be

appropriate for their own datasets of interest. It is intuitive that, the single-cell tumor lineage tree

constructed from input data resulting from biclustering is as reliable, if not more reliable, than the tree

constructed from heuristically filtered data, since the former will be constructed from at least as much

information from the unfiltered data as the latter. Improving the run time and memory efficiency of our

ILP implementation and devising alternative strategies to perform biclustering are among our future

objectives.

2.2.2 Likelihood-based sequencing error correction accounting for copy number es-

timates

Errors in sequencing may provide false evidence for an underlying allele methylation status

different from the truth. A common strategy employed in prior works to mitigate the effect of such

sequencing errors in downstream tasks is to employ the 90% rule, which assigns a CpG site in a cell

the methylation status with support from 90% of the reads available for the site in the cell, and discards

the site if no status can be assigned [42–44]. This approach has two drawbacks. First, the binarization

19


of methylation status disallows heterozygosity. Second, while there are high levels of copy number

gains in some cancer types, this approach does not distinguish between having reads sampled from

two underlying alleles and having reads sampled from many. For a CpG site in a cell, we would like

a copy number-aware way to determine, if we observe both reads with the site methylated and reads

with it unmethylated, whether they are truly sampled from heterozygous alleles, or they are sampled

from homozygous alleles and the methylation statuses on the disagreeing reads need to be corrected.

To this end, we present a maximum log-likelihood-based approach to correct likely errors in

methylation reads, incorporating site copy number estimates in each cell (Algorithm 1). For a CpG

site in a cell, given n ≥ 0 reads with a methylated status, m ≥ 0 reads with an unmethylated status, its

copy number c, and sequencing error probability 0 ≤ ϵ < 1 (which we set to .01 in our experiments

to proxy the per-base sequencing error rate in Illumina sequencing instruments), we can enumerate

the likelihood of having drawn the observed reads from all possible underlying allele statuses, letting

0 ≤ γ ≤ c be the number of alleles with the CpG site methylated. Note that in Algorithm 1, 1c and 0c

respectively denote the events that all c alleles are methylated and unmethylated; 1γ0c−γ denotes the

event where γ alleles are methylated and c− γ are not.

Sequencing error correction takes place when we have n > 0 and m > 0, but the allele status

with the highest log-likelihood is homozygous. In that case, we correct the reads with the minority

methylation status to the majority one. When c = 1 the site will implicitly be identified as homozygous

and the methylation status of any read that does not conform to the majority allele will be corrected.

In the case where sequencing error is needed, yet n = m, which will likely happen for c = 1, the CpG

site is discarded. In case copy number information is not available, one can assume an appropriate

(uninformative) copy number for all sites in all cells (e.g. c = 2 for diploid). Sex chromosomes are

always excluded from our analysis. We use the sequence error-corrected reads for pairwise distance

20


Algorithm 1 Sequencing error correction accounting for copy number
n,m← number of methylated and unmethylated reads for a CpG site in a cell, respectively
c← copy number for the CpG site in the cell
ϵ← sequencing error probability

function SEQUENCINGERRORCORRECTION(n,m, c, ϵ)
if c = 0 then return None ▷ site is discarded in case of deletion event
L[1c],L[0c]← ln[(1− ϵ)nϵm], ln[ϵn(1− ϵ)m]
for 0 < γ < c do
L[1γ0c−γ]← ln[γ

c
n(1− γ

c
)m]

end for
status← argmaxL
if [status = 1c OR status = 0c] AND [n ̸= 0 AND m ̸= 0] then

if n > m then n,m← n+m, 0
else if n < m then m,n← n+m, 0
else return None ▷ site is discarded if correction is impossible

end if
return n,m

end function

estimation between cells, as described in the following section.

2.2.3 Expected distance calculation between cell pairs for tree construction

In this section, we present a formulation for computing the expected distance between two cells

given their respective copy number and (sequencing error-corrected) reads. We consider such a for-

mulation with the goal of correcting for the potential bias contributed by low coverage CpG sites.

Intuitively, this distance measure serves as an estimate for the expected proportion of methylation

statuses altered across all CpG sites between a cell pair.

Given a cell, let the copy number at a CpG site be c, and we have 0 ≤ γ ≤ c alleles with the

CpG site methylated, and c−γ alleles with the CpG site unmethylated. To allow modeling preferential

allele sampling based on site methylation status as previously observed [74], we additionally introduce

parameter p, probability of drawing a read from the allele with the CpG site methylated given a pair of

21


alleles heterozygous for the site. We set p to the uninformative .5 in our experiments (see Supplemental

Section A.8 for when to choose alternative values). Then, we can compute the probability of drawing

from an allele with the CpG site methylated by normalizing over all alleles (Equation 2.1):

pc,γ =
γp

γp+ (c− γ)(1− p)
(2.1)

It follows that the probability of drawing from an allele with the CpG site unmethylated is 1− pc,γ .

Consider the case where we observe n reads at a CpG site for a cell, and all n reads are methy-

lated for the site. Let the copy number at the site for the cell be c, the probability of sampling all n

methylated reads from c alleles with the CpG site methylated (i.e. γ = c) is: P(reads | 1c) = 1; here

1c is the event that in all c copies the CpG site is methylated, and reads is the event that all n sampled

reads are methylated. The probability of drawing all n methylated reads from c alleles with the CpG

site unmethylated (i.e. γ = 0) is: P(reads | 0c) = 0. To compute the probability of drawing n methy-

lated reads from c alleles with mixed methylation status for the site, we need to sum the probabilities

over all other possible values of γ, the number of alleles with the site methylated, assuming that all

other possible values of γ are equally likely; and when there is a copy number loss (i.e. c = 1), the

formulation assigns only non-zero probability to a homozygous combined allele status:

P(reads | mixed) =


1

c−1

∑c−1
γ=1 p

n
c,γ , c ≥ 2

0 , c = 1

Given user-defined prior probabilities P(1c), P(0c), and P(mixed), which we all set to the uninfor-

mative .33 in our experiments (see Supplemental Section A.8 for when to choose alternative values),

22


let a = P(reads | 1c)P(1c)+P(reads | mixed)P(mixed)+P(reads | 0c)P(0c), then we apply Bayes’ The-

orem to get the likelihood of any allele status given the observed reads (Equations 2.2):

P(1c | reads) =
P(reads |1c)P(1c)

a
=

P(1c)
P(1c)+P(mixed)

c−1

∑c−1
γ=1 p

n
c,γ

P(mixed | reads) =
P(reads | mixed)P(mixed)

a
=

P(mixed)
c−1

∑c−1
γ=1 p

n
c,γ

P(1c)+P(mixed)
c−1

∑c−1
γ=1 p

n
c,γ

(2.2)

P(0c | reads) =
P(reads | 0c)P(0c)

a
= 0

In the case where the n reads all have the CpG site unmethylated, the three values can be com-

puted similarly with P(reads | 1c) = 0 and P(reads | 0c) = 1. In the case we observe among the n

reads both ones with the CpG site methylated and ones with the site unmethylated, we have P(mixed |

reads)=1, and P(1c| reads) = P(0c| reads) = 0. Then, for a given CpG site s in Cell A and Cell B, the

respective copy numbers cA,s, cB,s, and respective reads, we can compute the expected distance over

the possible combinations of allele status between the two cells at s (Equation 2.3):

dist(readsA,s, cA,s, readsB,s, cB,s) =
∑

statusA,s

∈{1cA ,
mixed,
0cA}

∑
statusB,s
∈{1cB ,
mixed,
0cB}

P(statusA,s|readsA,s)P(statusB,s|readsB,s)dist(statusA,s, statusB,s)

(2.3)

where dist(11,11)=dist(10,10)=dist(00,00)=0, dist(11,10)=dist(10,11)=dist(00,10)=dist(10,00)=0.5, and

dist(11,00)=dist(00,11)=1.

The total expected distance between cells A and B can now be computed with some distance

function over the vector of expected distances over all shared sites. The L1 norm normalized by the

23


number of shared sites is computed via Equation 2.4. The use of shared CpG sites between pairs of

single cells to estimate their distances was established by pdclust:

dist(A,B) =

∑
s∈sitesA∩sitesB

dist(readsA,s, cA,s, readsB,s, cB,s)

|sitesA ∩ sitesB|
(2.4)

A comparison of our distance formulation against a baseline distance measure commonly used

in prior studies! [42,44] on simulated data can be found in Supplemental Section A.2.1 (Supplemental

Figures A.4,A.5,A.6). After computing the distance between each pair of cells, one can incorporate

any distance-based tree reconstruction method into Sgootr to obtain the lineage tree and arbitrarily

choose a single cell from the normal tissue as the root. We use the scikit-bio implementation of

the NJ algorithm [75] for our main analysis. It is possible to replace NJ with an alternative distance-

based tree construction method in Sgootr. We compare Sgootr’s performance while using FastME

2.0 [76], a popular alternative, against that using NJ in Supplemental Section A.5 and show that they

lead to similar results (Supplemental Figures A.15,A.16, A.17).

2.2.4 Pruning of CpG sites according to a tree-based methylation status persistence

measure

The main body of Sgootr consists of an iterative procedure: at each iteration, it (i) computes

the pairwise distances among single cells to form the tumor lineage tree using a distance-based algo-

rithm, then (ii) measures the methylation persistence score of each CpG site along the tumor lineage

tree and prunes away a fraction of CpG sites that have the least scores before continuing onto the next

iteration, and (iii) outputs the tumor lineage tree of the particular iteration where distance from the

tumor lineage trees obtained in consecutive iterations are minimum possible.

24


We have described how to compute pairwise distance among single cells in the previous section,

and we leverage the widely used RF distance [77, 78] to measure the differences among the tumor

lineage trees constructed on the same set of cells in consecutive iterations. In this section, we focus

on how we measure the methylation persistence of CpG sites at each iteration given the tumor lineage

tree obtained for these CpG sites. In particular, we define the methylation persistence score for a CpG

site given a tumor lineage tree (Equation 2.6): the higher the methylation persistence score for a CpG

site, the more stably maintained its methylation status change along a tumor lineage tree.

To facilitate the analysis, we assign to each CpG site covered in each cell its most likely methy-

lation status: given sequencing error-corrected reads from component (ii) of Sgootr, we call a site

homozygously methylated if all of its reads have the site methylated, homozygously unmethylated if

all unmethylated, and heterozygous if the read status for the site is mixed. The status remains unknown

if there are no reads covering the CpG site. After this step, we measure the persistence of each CpG

site independently: first at each branch of the tree, then for the overall tree.

2.2.4.1 Methylation persistence score for a CpG site at a particular branch in the

lineage tree

Each branch in the methylation tumor lineage tree induces a bipartition of the tree and subse-

quently of the leaf nodes, which represent single cells. In other words, let TN denote the full set of leaf

nodes in a methylation tumor lineage tree t, cutting a branch b in the tree creates disjoint subsets of leaf

nodes TNb and TNb. Let Qs,tn denote the probability distribution across three possible methylation

statuses - homozygously methylated, heterozygously methylated, and homozygously unmethylated - at

CpG site s across a subset of cells tn, and let Is represent the set of cells with status information at site

25


s. Then, we defineMPs,b, the methylation persistence score of CpG site s at branch b (Equation 2.5):

MPs,b =
√

JSD
(
Qs,TNb∩Is , Qs,TNb∩Is

)
. (2.5)

Here JSD(X,Y ) denotes the Jensen-Shannon divergence [79, 80] for a pair of distributions

X, Y , and is defined as JSD(X, Y ) = 1
2
D(X∥Z) + 1

2
D(Y ∥Z) with Z = 1

2
(X + Y ) and D(K∥L)

denoting the Kullback-Leibler divergence measure for any arbitrary distributions K and L [81]. It is

worth noting that the square root of the Jensen-Shannon divergence measure, which computesMPs,b

(Equation 2.5), is metric [82, 83] and is commonly referred to as Jensen-Shannon distance in popular

implementations.

The intuition is that, if the change in methylation status of CpG site s has an observed persistent

effect in tumor progression - namely, s is lineage-informative - there exists a branch b∗ in the methyla-

tion tumor lineage tree such that the leaf nodes in the two subtrees induced by b∗ show very different

distributions of methylation statuses. In other words, MPs,b∗ will be large for such b∗. In contrast,

suppose the bipartitions contain the same distributions for methylation statuses for s - either that most

cells share the same status, or that both bipartitions are similarly heterogeneous - the score will be low.

2.2.4.2 Overall methylation persistence score for a CpG site in the lineage tree

We define the overall methylation persistence score of CpG site s in methylation tumor lineage

tree t as the maximum methylation persistence score the site s has across all valid bipartitions (Equa-

tion 2.6). We recognize that (i) an extreme difference in the number of cells between the bipartitions,

or (ii) a severe lack of cells with status information could both lead to meaningless divergence mea-

surements. Therefore, we only consider bipartitions induced by branch b such that (i) both partitions

26


contain no fewer than a user-defined fraction δ of total number of leaves, and (ii) there is read infor-

mation in no fewer than a user-defined fraction ω of cells in both partitions. In our experiments, we

set δ = .05 and ω = .5.

MPs,T = max
b∈{b||TNb|≥δ|TN |∧|TNb|≥δ|TN |

∧|TNb∩Is|≥ω|TNb|∧|TNb∩Is|≥ω|TNb|}

MPs,b (2.6)

2.2.4.3 Iterative joint tumor lineage tree reconstruction and lineage-informative site

identification

In a given iteration i, Sgootr first computes the tumor lineage tree ti with the distances between

pairs of cells based on the persistent sites identified in iteration i− 1 (for iteration i = 0 the entire set

of sites remaining after component (i) and (ii) of Sgootr are used). Then, for each CpG site s used in

computing ti, Sgootr calculates its overall methylation persistence scoreMPs,ti . Among the CpG

sites with overall methylation persistence scores, Sgootr prunes away a user-defined fraction κ of

the CpG sites with the lowest scores, along with those with tying scores at the threshold. It outputs the

remaining CpG sites to be used in iteration i + 1. The process continues for a user-defined maximum

number of iterations, and Sgootr outputs t∗ = ti where i is the last iteration with RF (ti, ti−1) equals

the global minimum across iterations.

Intuitively, we would like to detect an approximate point in the iterative procedure where most

non-lineage-informative CpG sites have been pruned out (leading to the initial roughly decreasing

trend of RF distance) but most lineage-informative CpG sites still remain (whose further elimination

will lead to increasingly inaccurate distance measurements between cell pairs and therefore increas-

ingly unstable tree topologies, which is reflected in a once-again increasing trend of RF distance). We

27


demonstrate empirical observations corroborate with such intuition (Figure 2.2C., Supplemental Fig-

ures A.11,A.17) and provide practical recommendations for choosing κ and the maximum number of

iterations in Supplemental Section A.9.

2.2.5 Inference of migration history from lesion-labeled tumor lineage tree leverag-

ing prior knowledge on migration patterns

Given a rooted tumor lineage tree t∗ produced by Sgootr through the procedures described

above, provided that the single cells in the tree are labeled by their lesion of origin, we can infer the

underlying tumor migration history, which we represent as a directed multigraph (without self loops)

where each vertex represents a distinct lesion and each edge represents a distinct migration event

from the source lesion to the target lesion. This intuitive representation of metastatic migration events

was introduced as a “migration graph” by El-Kebir and colleagues [29]. We accomplish this by first

adapting the well-known Fitch-Hartigan algorithm [84,85] with slight modification to obtain a unique,

maximally parsimonious labeling of the internal nodes of t∗, then identifying the migration events.

The Fitch-Hartigan algorithm was originally intended to solve the Small Parsimony Problem,

namely, labeling the internal nodes of a character-based evolutionary tree in which every leaf is labeled

with a single character and the objective is to minimize the total number of changes (i.e., mutations

from parent node to child node) [84, 85]. In our reuse, the bottom-up phase of the Fitch-Hatigan

algorithm computes in a tree the number of migration events - in other words, the change of lesion of

origin labels from a parent node to a child - in the most parsimonious labeling of its internal nodes.

This phase also produces a set of possible labels for each internal node in the tree. The top-down

phase, then, assigns each internal node a label from its set, as described in Algorithm 2, starting with

28


Algorithm 2 Partial order-guided modification to the Fitch-Hartigan top-down node label assignment

label set(node) ← maps a node to the set of labels it is given by the bottom-up phase of Fitch-
Hartigan algorithm fitch,Hartigan1973
parent(node), children(node)← returns the parent and the two children of node, respectively
poset minimal(labels) ← returns the minimal element in subset labels of a poset L, breaking
ties arbitrarily
assignment[node]← records the final label assignments of all nodes
node← a node in the tree

function LABELASSIGNMENT(node)
candidate labels← label set(node)
if node is not root AND assignment[parent(node)] is in candidate labels then

assignment[node]← assignment[parent(node)]
else

assignment[node]← poset minimal(candidate labels)
end if
if node is not a leaf then

child1, child2← children(node)
LABELASSIGNMENT(child1)
LABELASSIGNMENT(child2)

end if
end function

root node of t∗.

A general formulation of Fitch-Hartigan algorithm may produce multiple optimal labelings of

the internal nodes, and there are existing tools to capture the space of all optimal labelings: for exam-

ple, MACHINA by El-Kebir and colleagues enumerates optimal solutions allowing restrictions on the

types of transition events [29], and FitchCount by [30] generates a stochastic matrix summarizing all

transition events in the optimal solution space. However, some of these optimal labelings may adhere

to our prior understanding of the sequential-progression (a.k.a. metastatic-cascade) model of tumor

evolution [86] better than others. One can encode such prior understanding as a partial order over the

lesion of origin labels of single cells. For example, in this work, we leverage the following partial

order:

29


Normal Tissue ≺ Primary Tumor ≺ Lymph Nodes ≺ Distant Metastases.

In our modification to the Fitch-Hartigan algorithm, during the top-down phase, when a node

cannot inherit the label of its parent, we assign it the minimal element in its candidate label set accord-

ing to the partial order, breaking ties arbitrarily when the elements in the label set cannot be compared

(Algorithm 2). This returns a maximally parsimonious labeling of the internal nodes in the tree that

also acknowledges the sequential-progression model.

With a fully labeled tree, we produce a directed multigraph that corresponds to the tumor mi-

gration history as follows. First, we generate a vertex for each distinct lesion represented in the tree.

Then, as we traverse down the tree from the root, for each label change from a parent to a child in the

tree, we add one directed edge to the multigraph, from the lesion vertex corresponding to the parent

label to that corresponding to the child label. Parallel edges between a pair of vertices correspond to a

polyclonal seeding event, and any back edge indicates a reseeding event.

Note that our proposed method resolves equally optimal labelings of the internal nodes with prior

knowledge on the likely migration patterns among sampled sites. We have that prior knowledge for the

primary dataset we analyze in this work, and the method provides a basis on which we compare single-

cell lineage trees with lesion labeled leaves constructed with Sgootr or alternative methods. In the

case where no prior knowledge is available and the goal is to survey the whole space of parsimonious

migration histories possibly suggested by a lesion-labeled lineage tree, users may instead use tools

like MACHINA [29] or FitchCount [30]. It may also be intuitive to more directly encode such prior

knowledge by assigning different weights to migration events between distinct lesions, and solve the

weighted version of the Small Parsimony Problem with Sankoff’s algorithm [87]; however, the output

of Sankoff’s algorithm is highly sensitive to the input weights, and it is not immediately clear which

30


set of weights would be most appropriate for our purpose.

2.3 Results

The performances of Sgootr’s distinct components on simulated data are described in the Sup-

plemental Sections A.1 (biclustering), A.2.1 (distance calculation), and A.2.2 (iterative pruning; Sup-

plemental Figure A.7). Each of these components improves upon the respective baseline approach on

a wide range of settings, and the results of the iterative pruning component demonstrate that Sgootr

can capture the migration history of a tumor accurately. Furthermore, we applied Sgootr to scRRBS

data of single cells from an U87MG glioblastoma cell line ex vivo clone tree experimentally generated

by [88]. We show in Supplemental Section A.3 that, even though the original study fails to construct an

accurate lineage tree from microsatellite allelotype data (see Supplemental Figure S10 of the RETrace

study), Sgootr is able to reconstruct a CpG methylation-based lineage tree that accurately captures

the ground truth clonal relationships (Supplemental Figure A.9).

In the remainder of the chapter we focus on the application of Sgootr to the scBS-seq dataset

generated by Bian and colleagues [43] on 9 metastatic CRC patients1, among whom 2 (CRC01 and

CRC04) have matching scRNA-seq data (via the scTrio-seq2 protocol) from both normal and tumor

cells with which copy number calling can be performed. We highlight results from patient CRC01

since in addition to having copy number calls available, they have the largest number of distinct tissue

types, sampling locations, and treatment conditions. Full details and results for the other 8 patients

can be found in Supplemental Section A.4 (Supplemental Table A.3, Supplemental Figure A.10). We

additionally present in Supplemental Section A.7 Sgootr’s results on a GBM patient by Chaligne and

1Out of 12 total patients in the Bian cohort, we excluded those without scBS-seq data (CRC03 and CRC06), and those
without metastasis information (CRC09).

31


colleagues [44], which demonstrate the applicability of our approach to MscRRBS data (Supplemental

Figure A.19). To the best of our knowledge, this is the only other single-cell bisulfite sequencing

dataset with multi-regional tumor sampling that is publicly available.

2.3.1 Sgootr obtains a simple tumor migration history with scBS-seq and scRNA-

seq data from patient CRC01.

The single cells of CRC01 are sampled from 4 distinct lesions - PT, LN, ML, and MP - as well

as NC adjacent to the primary lesion; furthermore, there are multiple sampling locations within each

lesion (Figure 2.2A.). We only include cells with both scBS-seq and scRNA-seq data, and employ

inferCNV2 to call copy number from scRNA-seq data. We apply Sgootr with κ = .1, terminating

the iterative procedure after 40 rounds. A unique global minimum among the RF distances is identi-

fied at iteration 8 (Figure 2.2C.), hence t∗ = t8 (Figure 2.2B.) is output as the lineage tree. The tumor

migration history inferred by Sgootr (Figure 2.2D.) is simpler than the CNA-based results previ-

ously reported (see Supplemental Figure S7 of the work by Bian and colleagues): NC grows into PT,

followed by a polyclonal migration to LN, which appears to proceed to seed both ML and MP. While

the migration history also suggests a polyclonal reseeding from ML to LN, both edges in the graph

have low support: they are both due to a subtree of a particular cellular origin harboring a (potentially

mislabeled) singular cell of a different origin.

For comparison, we also present in Supplemental Section A.6 the tumor lineage tree we con-

structed on mutations called from the matching scRNA-seq data from patient CRC01 using the Trisicell

toolkit [91]. Due to the high level of sparsity in the mutation calls, we clustered single cells prior to

constructing the mutation tree. We observed a high level of heterogeneity in terms of lesion of origin
2from the Trinity CTAT Project https://github.com/broadinstitute/inferCNV

32

https://github.com/broadinstitute/inferCNV


A.
ML1
ML2
ML3
ML4

Treatment-naive
Liver Metastasis
(ML)

Post-treatment
Liver Metastasis

(MP)

Lymph Node
Metastasis

(LN)

PT1

MP1
MP2
MP3
MP4
MP5

LN1
LN2
LN3

1
2 3 4

1
2 3

5

4

1

2
3

B.
Normal Colon

(NC)

PT2
PT3
PT4

Primary Tumor
(PT)

1

2 3
4

C.

D.

E.

F.

Figure 2.2: Application of Sgootr to patient CRC01 scTrio-seq data by Bian and col-
leagues. A. The multi-regional patient data consists of single cells sampled from 4 distinct
lesions (in addition to normal tissue) and multiple sampling locations within each lesion.
B. Tumor lineage tree constructed by Sgootr with κ = .1. Each single cell is represented
as a leaf, colored by its sampling location. C. RF distances between the trees of consecu-
tive iterations of the pruning procedure, with the global minimum occurring at t∗ = t8. D.
Tumor migration history inferred by Sgootr for patient CRC01. E. Fraction of CpG sites
located in CGI (regions with over-representation of CpG sites) [89], CpG shore (2kb-long
regions flanking both sides of CGIs) [90], CpG shelf (2kb-long regions adjacent to CpG
shores on the side away from CGIs) [90], and inter-CGI regions in the CRC01 trees at
each iteration of the pruning procedure of Sgootr, from t0 to t∗. F. Top 10 GO terms
with significant (< .05) q-values in enrichment analysis of non-pseudo, protein-coding
genes spanning lineage-informative sites in intra-CGI and inter-CGI regions respectively
in CRC01. See Supplemental Figure A.12A. for a full list of enriched terms.

33


and sampling locations in the cell clusters (Supplemental Figure A.18), which makes inferring sim-

ple tumor migration history from trees constructed on them unlikely. Together with the CNA-based

results(see Supplemental Figure S7 of the work by Bian and colleagues), this mutation-based result

further highlights the usefulness of CpG methylation in reconstructing single cell lineage trees when

sparse SNV and CNA data fail to provide sufficient signal for lineage reconstruction.

2.3.2 Sgootr infers migration histories simpler than alternative methods for the

Bian CRC cohort.

We benchmark Sgootr against (i) a naı̈ve baseline distance-based tree construction method

(column “baseline” in Figure 2.3A.), (ii) IQ-TREE [92] with the two-state general time reversible

model (GTR2), an instance of the popular maximum-likelihood-based tool used for inferring lineage

trees from single-cell bisulfite sequencing data in prior studies [42, 44] (column “IQ-TREE” in Fig-

ure 2.3A.), and (iii) IQ-TREE (with GTR2model) preceded by FG, a lineage-informative site-selection

method previously described by Gaiti and colleagues [42], which returns a subset of input CpG sites

with lower than expected epimutation rate (column “FG+IQ-TREE” in Figure 2.3A.). Note that it

is IQ-TREE preceded by FG that was employed exactly in previous studies [42, 44]. However, as

discussed in more detail below, the FG step is not only very slow, it also leads to highly complex mi-

gration histories. These observations lead us to additionally benchmark against IQ-TREE without the

FG step, even though it had not been directly applied to single-cell bisulfite sequencing data in prior

studies. See Supplemental Section A.10 for further experiment details.

34


B
.

C
.

A
.

Fi
gu

re
2.

3:
O

ve
rv

ie
w

of
re

su
lts

fr
om

th
e

B
ia

n
m

et
as

ta
tic

C
R

C
co

ho
rt

.
A

.N
um

be
r

of
m

ig
ra

tio
n

ev
en

ts
an

d
B

A
M

D
PO

va
lu

es
of

m
ig

ra
tio

n
hi

st
or

ie
s

ob
ta

in
ed

vi
a

(i
)n

aı̈
ve

ba
se

lin
e

m
et

ho
d,

(i
i)

IQ
-T

R
E

E
,(

iii
)I

Q
-T

R
E

E
on

FG
-s

el
ec

te
d

C
pG

si
te

s,
(iv

)
S
g
o
o
t
r

in
te

rm
ed

ia
te

tr
ee

pr
io

r
to

ite
ra

tiv
e

pr
oc

ed
ur

e
(t

0
),

an
d

(v
)

fin
al

tr
ee

ou
tp

ut
by

S
g
o
o
t
r

(t
∗)

.
IQ

-T
R

E
E

re
su

lt
is

re
pr

es
en

te
d

by
th

e
m

ea
n

ac
ro

ss
10

ru
ns

w
ith

di
ff

er
en

t
ra

nd
om

se
ed

s,
w

ith
th

e
er

ro
r

ba
rs

de
no

tin
g

th
e

95
%

co
nfi

de
nc

e
in

te
rv

al
of

th
e

m
ea

n.
T

he
la

ck
of

da
ta

po
in

ts
in

“I
Q

-T
R

E
E

”
co

lu
m

ns
m

ea
ns

IQ
-T

R
E

E
fa

il
to

fin
is

h
m

od
el

op
tim

iz
at

io
n

to
st

ar
tt

re
e

se
ar

ch
af

te
r2

4
ho

ur
s,

an
d

th
at

in
“F

G
+I

Q
-T

R
E

E
”

co
lu

m
ns

m
ea

ns
FG

an
al

ys
is

fa
il

to
te

rm
in

at
e

af
te

r1
00

ho
ur

s.
B

.
R

un
tim

e
co

m
pa

ri
so

n
be

tw
ee

n
S
g
o
o
t
r

an
d

IQ
-T

R
E

E
.E

ac
h

ex
pe

ri
m

en
ti

s
pe

rf
or

m
ed

on
4

co
re

s.
T

he
nu

m
be

r
in

br
ac

ke
ts

ne
xt

to
“I

Q
-T

R
E

E
”

on
th

e
x-

ax
is

re
pr

es
en

ts
th

e
nu

m
be

r
of

IQ
-T

R
E

E
ru

ns
(o

ut
of

10
)

th
at

ha
ve

co
nv

er
ge

d
w

ith
in

24
ho

ur
s.

T
he

nu
m

be
ra

bo
ve

ea
ch

IQ
-T

R
E

E
ba

rr
ep

re
se

nt
s

th
e

to
ta

ln
um

be
ro

fr
un

s
(o

ut
of

10
)w

ith
an

y
tr

ee
se

ar
ch

ou
tp

ut
(c

on
ve

rg
ed

or
in

te
rm

ed
ia

te
af

te
r2

4
ho

ur
s)

,a
nd

he
nc

e
co

nt
ri

bu
te

s
to

th
e

ru
n

tim
e

an
d

m
ig

ra
tio

n
hi

st
or

y
co

m
pa

ri
so

n.
C

.F
ra

ct
io

n
of

C
pG

si
te

s
lo

ca
te

d
in

C
pG

is
la

nd
,C

pG
sh

or
e,

C
pG

sh
el

f,
an

d
in

te
r-

C
G

I
re

gi
on

s,
be

fo
re

(t
0
)

an
d

af
te

r
(t
∗)

th
e

ite
ra

tiv
e

pr
un

in
g

co
m

po
ne

nt
of
S
g
o
o
t
r

.

35


For each benchmarking experiment, we take as input the cell-by-site read count matrices result-

ing from first-pass heuristic filtering step of Sgootr, namely, removing low-quality cells, CpG sites

with coverage in less than 2
3

of remaining cells, sites on sex chromosomes, and sites within Chromo-

some 21 peri-centromeric regions. Then, for each CpG site covered in each cell, call whether it is

methylated or not by the 90% rule [42–44]. This binarization step is necessary in generating input

for the GTR2 model. For the FG+IQ-TREE experiments, we further perform FG (Supplemental Sec-

tion A.10, Supplemental Figures A.20, A.21) to select CpG sites to input to IQ-TREE. Additionally,

we further filter the input to the benchmarking experiments so that all experiments (Figure 2.3A.) on

a particular patient act on the same set of input cells. This calibration is to ensure fair comparisons

among results from different methods using the two performance measures we will introduce below.

To facilitate fair runtime comparison between IQ-TREE and Sgootr, we also filter to ensure the input

to IQ-TREE for each patient is the same set as those to Sgootr, so that the runtime measurements in

Figure 2.3B. are taken on the same set of cells and CpG sites for each patient.

For each patient in the Bian cohort, we generate a baseline lineage tree by first computing the

Hamming distance between every pair of cells over their binarized CpG site methylation status vectors,

averaging across the sites covered in both cells, then applying NJ to the distance matrix. As IQ-

TREE is a stochastic method, we perform 10 separate runs of each IQ-TREE experiment with different

random seeds, and for each instance report the intermediate tree at convergence or after 24 hours if

convergence has yet to be achieved by then. If none of the 10 runs of IQ-TREE for a particular

experiment have finished model optimization, started tree search, and produced some intermediate

result, we do not report IQ-TREE result for that experiment. Similarly, if FG does not terminate in

100 hours, we do not report FG+IQ-TREE result for that experiment. We root the baseline and IQ-

TREE trees with the same normal cells we used to root the trees by Sgootr, and we infer the tumor

36


migration histories from them via Sgootr’s modified Fitch-Hartigan algorithm.

We compare the migration histories inferred via Sgootr and the three alternative approaches by

two measures: (i) the total number of migration events, namely the number of edges in our multigraph

representation, and (ii) one we name BAMDPO, which is the Hamming distance between the binary

adjacency matrix of the simple directed acyclic graph induced by the lesion partial order and that

obtained by collapsing all parallel edges in the migration history multigraph. Intuitively, the first

measure (output from Fitch-Hartigan algorithm) captures the conventional definition of parsimony,

and the latter measures the degree the inferred history deviates from the sequential-progression model

of tumor evolution. Together, they offer complementary perspectives on the complexity of the inferred

histories. For both measures, a lower value would correspond to a simpler history.

Figure 2.3A. compares the tree t∗ obtained by Sgootr against the alternatives with respect to

the two measures. The figure also includes results of t0 for each patient, the tree obtained by Sgootr

prior to its iterative site pruning procedure. Comparisons of the measures for t0 against those for t∗

show Sgootr’s iterative procedure almost always reduces the complexity of the inferred migration

history, provided such a reduction is at all possible. In the case of patient CRC15, even though the

total number of migration events increases from t0 to t∗, BAMDPO decreases; this suggests what first

appeared to be deviations from the simple sequential-progression model could potentially be resolved

as polyclonal seeding events more in concordance with the model, once the tree structure becomes

more refined through the iterative procedure. Compared to naı̈ve baseline and the two IQ-TREE-based

methods, Sgootr typically infers as simple, if not simpler, migration histories; the only exception is

patient CRC10, where IQ-TREE (without FG site-selection) infers a migration history with the lowest

BAMDPO.

Note that all patients except CRC01 and CRC15 have single cells sampled from only 2 lesions

37


in addition to the normal tissue (Supplemental Table A.3); among them, patients CRC02, CRC04,

CRC12, and CRC14 have low number of input cells. For that reason, one may expect a lower number

of migration events detected from the constructed lineage trees. Furthermore, Sgootr has achieved

the minimal possible values in terms of BAMDPO prior to the iterative procedure in CRC02, CRC04,

CRC11, CRC12; for CRC02, CRC04, and CRC12, Sgootr has also achieved the minimal possible

value for the number of events prior to the iterative procedure. In those cases, there are no changes in

terms of those two measures between the initialized tree (t0) and the optimal tree (t∗).

2.3.3 Sgootr-identified lineage-informative CpG sites construct lineages suggest-

ing simpler migration histories than those identified via FG.

Specifically, comparisons between the measures from running IQ-TREE on FG-selected sites

and those from Sgootr show that Sgootr consistently outperforms the former in terms of the num-

ber of events and BAMDPO where further improvements were possible. This suggests the sites se-

lected by Sgootr may be more meaningful for the purpose of lineage reconstruction than those by

FG. FG is also time-consuming: for patient CRC11 and CRC12, FG failed to complete after 100 hours,

and no experimental result was reported (Figure 2.3A.).

2.3.4 Sgootr is orders-of-magnitude faster than IQ-TREE.

While Sgootr produces as simple, if not simpler migration history than IQ-TREE (without

FG site-selection) in terms of the two measures in all but one case, Sgootr obtains the results in a

fraction of time IQ-TREE took given the same input dimensions. Each IQ-TREE run was allotted 24

hours. If the program converged within 24 hours, we report the wall clock time elapsed; if the program

38


did not converge within 24 hours but produced some intermediate output from tree search, we report

the program run time as 24 hours; and if IQ-TREE failed to produce any intermediate result from tree

search within 24 hours, we do not include the elapsed time from that run. Within the allotted time,

IQ-TREE produces converged results in 6 out of 9 patients, some (intermediate) results for 8 out of

9 patients. IQ-TREE is not able to finish model optimization and start tree search within 24 hours

for any of the 10 runs for CRC11 (Figure 2.3B.). Sgootr is deterministic and obtains lineage trees

suggesting simpler histories in a mere fraction of the time when given the same number of cores and

memory as IQ-TREE, positioning itself as the better option in more time-frugal settings.

2.3.5 Sgootr-identified lineage-informative CpG sites are enriched in inter-CGI

regions.

In all 9 CRC patients in the Bian cohort, as well as the multi-regional GBM patient MGH105 in

the [44] cohort, the set of lineage-informative sites identified by Sgootr show enrichment in inter-

CGI regions (Figures 2.2E., Supplemental Figures 2.3C., A.19F.). The median distances between CpG

sites and their closest CGIs also increase as the iteration