ABSTRACT Title of Dissertation: THREE VARIATIONS OF PRECISION MEDICINE: GENE-AWARE GENOME EDITING, ANCESTRY-AWARE MOLECULAR DIAGNOSIS, AND CLONE- AWARE TREATMENT PLANNING Sanju Sinha, Doctor of Philosophy, 2021 Dissertation directed by: Dr. Eytan Ruppin, Department of Computer Science; Prof. Steve Mount, Department of Biology During my Ph.D., I developed several computational approaches to advance precision medicine for cancer prevention and treatment. My thesis presents three such approaches addressing these emerging challenges by analyzing large-scale cancer omics data from both pre-clinical models and patients datasets. In the first project, we studied the cancer risk associated with CRISPR-based therapies. Therapeutics based on CRISPR technologies (for which the chemistry Nobel prize was awarded in 2020) are poised to become widely applicable for treating a variety of human genetic diseases. However, preceding our work, two experimental studies have reported that genome editing by CRISPR?Cas9 can induce a DNA damage response mediated by p53 in primary cells hampering their growth. This could lead to an undesired selection of cells with pre-existing p53 mutations. Motivated by these findings, we conducted the first comprehensive computational and experimental investigation of the risk of CRISPR-induced selection of cancer gene mutants across many different cell types and lineages. I further studied whether this selection is dependent on the Cas9/sgRNA-delivery method and/or the gene being targeted. Importantly, we asked whether other cancer driver mutations may also be selected during CRISPR-Cas9 gene editing and identified that pre-existing KRAS mutants may also be selected for during CRISPR- Cas9 editing. In summary, we established that the risk of selection for pre-existing p53 or KRAS mutations is non-negligible, thus calling for careful monitoring of patients undergoing CRISPR- Cas9-based editing for clinical therapeutics for pre-existing p53 and KRAS mutations. In the second project, we aimed to delineate some of the molecular mechanisms that may underlie the observed differences in cancer incidences across cancer patients of different ancestries, focusing mainly on lung cancer. We found that lung tumors from African American (AA) patients exhibit higher genomic instability, homologous recombination deficiency, and aggressive molecular features such as chromothripsis. We next demonstrated that these molecular differences extend to many other cancer types. The prevalence of germline homologous recombination deficiency (HRD) is also higher in tumors from AAs, suggesting that at least some of the somatic differences observed may have genetic origins. Importantly, our findings provide a therapeutic strategy to treat tumors from AAs with high HRD, with agents such as PARP and checkpoint inhibitors, which is now further explored by our experimental collaborators. In the third project, we developed a new computational framework to leverage single-cell RNA-seq from patients? tumors to guide optimal combination treatments that can target multiple clones in the tumor. We first showed that our predicted viability profile of multiple cancer drugs significantly correlates with their targeted pathway activity at a single-cell resolution, as one would expect. We apply this framework to predict the response to monotherapy and combination treatment in cell lines, patient-derived-cell lines, and most importantly, in a clinical trial of multiple myeloma patients. Following these validations, we next charted the landscape of optimal combination treatments of the existing FDA-approved drugs in multiple myeloma, providing as a resource that could be used to potentially guide combination trials. Taken together, these results demonstrate the power of multi-omics analysis of cancer data to identify potential cancer risks and a strategy to mitigate, to shed light on molecular mechanisms underlying cancer disparity in AA patients, and point to possible ways to improve their treatment, and finally, we developed a new approach to treat cancer patients based on single-cell transcriptomics of their tumors. THREE VARIATIONS OF PRECISION MEDICINE: GENE-AWARE GENOME EDITING, ANCESTRY-AWARE MOLECULAR DIAGNOSIS, AND CLONE-AWARE TREATMENT PLANNING by Sanju Sinha Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2021 Advisory Committee: Professor Steve Mount, Chair (Co-Advisor) Dr. Eytan Ruppin (Advisor) Dr. Brid Ryan (Co-Advisor) Dr. Rob Patro Professor Sridhar Hannenhalli Professor Najib M. El-Sayed ? Copyright by Sanju Sinha 2021 Dedicated to my parents, Sukhnandan Prasad Sinha and Munni Devi, whose love, dedication, and sacrifices provided me the platform to reach this point in life. ii Acknowledgments Firstly, I would like to thank Eytan Ruppin, my chief advisor, for this opportunity, mentorship, and support throughout this journey. With his critical and energetic guidance, this journey has been an exciting challenge with many scientific and non-scientific lessons to keep and cherish for a long time. His care, kindness, and support extended beyond academic life throughout the highs and lows during this period and I am deeply grateful and fortunate to receive them. Secondly, I would like to thank Brid Ryan, my co-advisor, for her invaluable mentorship, support and patiently teaching me countless fundamentals. Her confidence in me has provided me the essential mental support throughout this journey and I am indebted to her for this. I would cherish our discussions for a long time. I am extremely blessed to have such a great pair of mentors. As much as I am excited to take the next steps of my career, my heart is sad that I have to leave soon. During my Ph.D., I have had the honor to work with many excellent scientists, including Curtis Harris, Ze?ev Ronai, Ani Deshpande, Moshe Oren, Kenneth Aldape, Max Leiserson, Silvio Gutkind, and many more, and have learned a wide and diverse set of scientific lessons from them, and I would like to thank them all. With no less enthusiasm than before, I would like to thank my colleagues and friends at the Cancer Data Science Lab at NIH and UMD. It might be impossible to name everyone and show my gratitude for their support, however, to name a few - Sanna Madan, Fiorella Schischlik, Kun Wang, David Crawford, Ethan Iverson, and many more. Including our lunches, hikes, and happy hour, I would keep these memories close to my heart. It has been a true pleasure and a gem of an experience to work closely with Kuoyuan Cheng during this time. I have learned iii numerous lessons while working closely with Alejandro Schaffer, Joo Sang Lee, and Rahul Vegesna and would also like to thank them. Starting from day 1, the staff at UMD have been actively kind, helpful, and truly caring, including Michelle Brook, Gwen Warman, Zakiya Whatley, Hadiya Woodham & Barbara Lewis. At NIH, a shoutout and special thanks to Nadiya Nimley for her help and kindness during the last few years. Looking back, a critical part of this journey was the companionship of two of my close friends, Anshuman Bhanja and Jurrian Van Haaren, and am very grateful and fortunate to have this. Importantly, I would like to thank my girlfriend Samprita for her support, encouragement, and unconditional love throughout the last two years. She has shared my highs with celebrations and lows with support, care, and encouragement throughout this time. Lastly and most importantly, I would like to thank my family, especially my parents for their incomparable love, support, and guidance throughout my life. I would not have reached here without them and their sacrifices and support. iv Table of Content Chapter 1: Higher prevalence of homologous recombination deficiency in tumors from African Americans versus European Americans 1 Chapter 2: A systematic genome-wide mapping of oncogenic mutation-selection during CRISPR-Cas9 genome editing 39 Chapter 3: Designing optimal combination treatment targeting the clonal architecture of the tumor using scRNA-seq 82 Conclusion 112 v Chapter 1: Higher prevalence of homologous recombination deficiency in tumors from African Americans versus European Americans To improve our understanding of longstanding disparities in incidence and mortality in lung cancer across ancestry, we performed a systematic comparative analysis of molecular features in tumors from African Americans (AAs) and European Americans (EAs). We find that lung squamous cell carcinoma tumors from AAs exhibit higher genomic instability?the proportion of non-diploid genome?aggressive molecular features such as chromothripsis and higher homologous recombination deficiency (HRD). In The Cancer Genome Atlas, we demonstrate that high genomic instability, HRD, and chromothripsis among tumors from AAs are found across many cancer types. The prevalence of germline HRD (that is, the total number of pathogenic variants in homologous recombination genes) is higher in tumors from AAs, suggesting that the somatic differences observed have genetic ancestry origins. We also identify AA-specific copy-number-based arm-, focal- and gene-level recurrent features in lung cancer, including higher frequencies of PTEN deletion and KRAS amplification. These results highlight the importance of including under-represented populations in genomics research. 1 Introduction In the United States (US), African Americans (AAs) have the highest cancer incidence and lowest survival across multiple cancer types 1. The reasons for these persistent trends are not clear. Our current understanding of the molecular mechanisms of tumorigenesis is primarily from analyses of tumors derived from European ancestry patients, including The Cancer Genome Atlas (TCGA) where only 8.5% of samples are from AAs. This raises a question about whether there are differences in tumor evolution and molecular features by genetic ancestry. Recently, Yuan et al. compared somatic copy number alteration (SCNA)-based genomic instability (GI) across genetic ancestry in TCGA and found that invasive breast carcinoma (BRCA), head and neck squamous cell carcinoma (HNSC), and uterine corpus endometrial carcinoma (UCEC) tumors from AAs had significantly increased GI compared with tumors from European Americans (EAs) 2. Further, recent work demonstrated that the African pan-genome contains numerous large insertions?whose total length comprises ~10% of the genome?that are not present in the current human reference genome (GRCh38) 3, which was primarily derived from a small number of individuals, primarily of European descent 4. Together these data highlight the need for studies specifically investigating the molecular landscape of cancer in minority and under-represented populations. Lung cancer, the second most common cancer in the US and the leading cause of cancer-related death 5, has persistent disparities in both incidence and mortality. AAs have the highest lung cancer incidence and mortality rates when compared with other racial or ethnic groups 1,6. These disparities persist even after considering tobacco smoking exposure, the strongest risk factor for lung cancer development 6. 2 Population-specific molecular patterns in tumor biology and cancer genomics have been reported in recent years 7-10 with limited power and coverage. Here, we systematically identified ancestry- specific genome-wide copy number features in a racially balanced (EA and AA) cohort of 222 lung tumors. Our analysis reveals higher GI and homologous-recombination deficiency (HRD) in LUSC tumors from AAs compared with EAs. This suggests an ancestry-associated disparity in deficiency of the HR-pathway, which we confirmed by finding a higher prevalence of germline HRD in AA compared with EA patients in LUSC. In the TCGA cohort, we further found the increased GI, HRD, and chromothripsis (CHTP) among AAs across multiple cancer types and pan-cancer. Further, we identify ancestry-specific arm, focal, and gene-level features in LUAD and LUSC. Our results highlight the importance of including minority and under-represented populations in cancer genomics research and may have therapeutic implications. Results LUSC tumors from African Americans have higher GI and HRD We generated genome-wide copy number profiles of 222 non-small cell lung cancer tumor samples from the NCI-MD study (105 LUAD [AA=63, EA=42] and 117 LUSC [AA=63, EA=54] (Supp Table 1) using the OncoScan platform 11, which provides comprehensive coverage of 50?100 kb copy number resolution in cancer genes and 300 Kb across the rest of the genome. Sample characteristics for the patients in this study are shown in Supp Table 1. Based on these copy-number alterations profiles, we first quantified GI?defined as the proportion of the genome with non-diploid copy number?for each sample. We found that LUSC tumors from AAs had significantly higher GI compared with EAs (Figure 1A-top panel; 3 Wilcoxon Rank-Sum (shortened to ?Wilcoxon? henceforth) P<6E-03). In contrast, we did not find significantly higher GI in lung adenocarcinoma (LUAD) in AAs (Figure 1A-middle panel). We tested the hypothesis that higher GI across tumors from AAs is due to a higher prevalence or extent of HRD, which was previously identified as a key contributor to GI in cancer 12. We quantified HRD in tumors using four independent measures of HRD: Loss of heterozygosity (LOH), which is the number of LOH segments 13,14; telomere allelic imbalance (AIL), which is the number of regions of allelic imbalance that extend to one of the sub-telomeres but do not cross the centromere; large-scale state transitions (LST), which is the number of breakpoints between regions longer than 10 Mb after filtering out regions shorter than 3 Mb 13 and lastly, the sum of these three features. All four scores are scaled within the range of 0 to 1. In the NCI-MD study, we observed a strong positive correlation between GI and HRD across the whole cohort for all four features (P<2E-16 for all; Spearman Rho=0.64 for LOH, 0.31 for LST, 0.44 for AIL, 0.51 for the sum), where, in AA tumors, the correlation observed is stronger than in EA tumors (Spearman Rho for AA=0.57, for EA=0.48, P<2.2E-16 for both) (Supp Table 2). Next, we observed significantly higher HRD in AAs with LUSC (FDR adjusted P-value<2E-04 for LOH, Figure 1B-top panel; <2.0E-02 for LST; <3.9E-02 for AIL; P<7.1E-03 for net sum), but not LUAD, which is consistent with our GI-based findings outlined above (Figure 1B-middle panel). This suggests that HRD contributes to the ancestry-specific pattern of higher GI burden in LUSC among AAs. To account for potential confounding factors, we performed multivariate linear regression to model separately GI and HRD in the NCI-MD cohort as a function of ancestry adjusting for tumor stage, patient age, sex, smoking status, pack-years of cigarettes and tumor purity. Here, we found AA ancestry strongly positively associated with GI and HRD in LUSC, but not LUAD, 4 consistent with our previous observations (LUSC: FDR<3E-02 and FDR <5.35E-05, respectively, LUAD: FDR< 0.24 and FDR<0.09, respectively, Supp Table 3). We initially determined ancestry by self-report; however, it is possible that miss-report could have confounded our results 15. Therefore, we inferred ancestry in an unsupervised manner via principal component analysis (PCA) of ancestry-informative SNPs (Methods) followed by classification of the first two PCs via support vector classification (SVC), which identified two classes of ancestry. We found that inferred ancestry class is concordant with self-reported ancestry for 98.6% of subjects; four samples were potentially misclassified (Supp Table 3, Column B). We removed these samples and repeated the analyses above and found consistent results with comparable significance (higher GI and LOH-HRD in LUSC among AAs with Wilcoxon P<6E-03 and P<2E-04). To validate the relationship between GI and the extent of HRD that we found in the NCI-MD cohort, we quantified GI and HRD using the four signatures described above in the TCGA cohort. Both GI and HRD were higher in tumors from AAs compared with EAs in LUSC, but the differences did not reach statistical significance (Figure 1A-B, bottom Panel). This could be due to the limited number of AA tumor samples in TCGA (29 AA compared with 346 EA), which is supported by a power analysis of TCGA samples across ancestry (Methods). Lung tumors from African Americans have more frequent complex structural variants The observed deficiencies in DNA damage repair related with GI in LUSC prompted us to chart the landscape of complex structural variants recently reported to be related with HRD 16. We studied chromothripsis (CHTP), which was first described as a catastrophic event that leads to chromosome shattering and tens to hundreds of simultaneously acquired oscillatory copy number 5 aberrations on one chromosome 17,18. Therefore, we represented CHTP as a binary variable indicating presence/absence. Using the classical definition, i.e., many oscillatory copy number events clustered on a chromosome (Methods) 19, tumor samples with CHTP had significantly higher HRD than samples without CHTP (Wilcoxon P<9E-04) in the NCI-MD lung cancer cohort (Supp Table 2). Further, we observed higher frequency of CHTP in tumors from AAs compared with EAs in LUSC (Figure 1C-Top panel, P<0.12, OR=1.24) and in LUAD, but to a weaker extent (Figure 1C-Middle panel, P<0.49, OR=1.15). These patterns are consistent when adjusted for age, sex stage, smoking status and pack-years of cigarettes (multivariate regression P for ancestry<2.8E-03, Supp Table 3). The same result held qualitatively when an alternative quantification of CHTP, defined by the allowance for two oscillation states in the affected region, was used (Methods) (Supp Table 2- Column AD). Next, we quantified CHTP in the TCGA-LUSC cohort and observed a consistent pattern of higher frequency in tumors from AA (Figure 1C-bottom panel, P<0.12, OR=1.45). We further analyzed the chromosome frequency distribution of CHTP, which varied by histological subtype and ancestry. 6 Figure 1: Differences in GI, HRD and chromothripsis across AA and EA lung cancer patients from the NCI-MD and TCGA cohort. (A) Genomic instability, (B) homologous recombination deficiency and (C) chromothripsis are quantified and presented stratified by genetic ancestry in LUSC (top, n=105 patients (AA=63, EA=42)) and LUAD (middle, and n=117 patients (AA=63, EA=54)) samples from the NCI-MD cohort and LUSC from the TCGA 7 cohort (bottom, and n=375 patients (AA=29, EA=346)). LUAD denotes lung adenocarcinoma and LUSC denotes lung squamous cell carcinoma. Significance for comparison of medians in A) and B) is calculated via one-sided Wilcoxon rank-sum tests and significance for comparison of frequency in C) is calculated via one-sided Fisher?s exact test. The violin plots in A) and B) show the data distribution where the center line denotes the median, the box indicating the interquartile range and the black line represents the rest of the distribution, except for points that are determined to be ?outliers? using a method that is a function of the interquartile range, as in box plots. The landscape of arm- and focal-level SNCAs in AA and EA lung cancer To identify SCNA-based ancestry-specific features in detail, we examined population-specific SCNA profiles in lung cancer for chromosome arm- and focal- level (shorter than half a chromosome arm) events in the NCI-MD study where statistical power for both populations was available. Further support for key observations was demonstrated in TCGA. Recurrent arm-level and focal-level SCNA events were identified for both populations separately using GISTIC 20 (Methods, FDR<0.1) and used to map genome-wide SCNA across histology and ancestry (Figure 2, Supp Tables 4-5). For each chromosome arm, the alteration frequency and the recurrence significance by ancestry for both amplifications and deletions were plotted for patients in the NCI-MD cohort (Figure 2). We identified known cancer-specific arm-level SCNA events, including amplification of 3q and 5p and deletion of 3p 21, in both populations (Supp Tables 4-5). Similarly, 19p deletion, a molecular signature of LUAD, was recurrent in EAs and AAs at similar frequencies of ~45% (Figure 2, Supp Table 5). Recurrent population-specific arm-level SCNA differences 8 were observed, including 4p and 4q arm level deletions in LUSC and 7p and 7q amplifications in LUAD, both occurred at higher frequency in AAs compared with EAs. These observations were replicated in TCGA (Figure 2). Figure 2: Characterization of arm-level SCNA events across EA and AA lung cancer in the NCI-MD cohort. Frequency distribution of aberrant SCNA events on autosomal chromosome arms in LUAD and LUSC for the NCI-MD and TCGA cohorts (LUSC n=375 patients [AA=29, EA=346], LUAD n= 432 patients [AA=51, EA=381]). The diagonal dashed line represents equal AA and EA frequencies, with points falling away from this line indicating chromosome arms with alteration frequency differences between populations. A color code is provided to denote population-specific recurrent SCNA events with statistical significance. Del=deletion, amp=amplification. Statistical significance of recurrence was computed via GISTIC, which provides arm-level FDR-corrected significance with a threshold of 0.1. To visualize genome-wide focal-level SCNA events across populations, including co-occurrence and mutual-exclusivity, we created an SCNA map showing genome-wide SCNA frequency 9 distributions for both LUSC and LUAD in the NCI-MD cohort (Figure 3-left panel). The overlaps in recurrent focal regions among EAs and AAs were 59% and 70% for LUAD and LUSC, respectively (Figure 3-left panel). Further, we observed population-specific patterns of co-occurring and mutually exclusive SCNA events (Figure 3-left panel). To identify potential novel AA-specific copy number-driven focal-level regions, we selected high-confidence recurrent focal-level regions from GISTIC that met the following criteria; 1) alteration frequency greater than 5% in AAs, 2) frequency at least two times higher in AAs than EAs 3) recurrent only in AAs and 4) no recurrent peak of the same type (amplification or deletion) was present in EAs within the region or an extended additional 10% on both sides of the region length. We identified eight potential AA-specific potential driver regions. The top hit ranked by significance is a 22q11.23 deletion in LUSC (Figure 3-right panel) with a frequency of 27% in AAs and 13% in EAs. Following a previous study 22-24, we tested whether this deletion event is somatic or germline by profiling matched-normal tissue samples with genome-wide copy number; we observed that 2/5 normal samples from AAs also have a deletion of 22q11.23, suggesting that this event could be germline (Supp Table 6). This 22q11.23 region deleted in LUSC is disjointed from the nearby region on 22q11.21 that is hemizygously deleted in DiGeorge syndrome 23-24. The region with the second-highest fold change in alteration frequency in LUSC, 12p12.1 (Figure 3-right panel), is a short region including KRAS and is discussed in detail in the next section. Thirdly, common to both LUAD and LUSC, the 20p12.1 region is deleted >4 times as often in AAs compared to EAs. This region includes the genes FLTR3 and MACROD2. We also identified several SCNA events previously linked with AA ancestry in cancer and assessed the relationship between copy number and gene expression (Supp Tables 7-10). We observed an AA-specific amplification of the oncogene KAT6A in LUAD, which was previously 10 observed in 24. We also identified a recurrent deletion of 4q35.2 extending to the telomere in LUSC that includes FBXW7, previously shown to be deleted in colorectal cancer and triple- negative breast cancer among AAs 25,26 (Supp Table 7). In LUAD, a region in near 8q24 was significantly recurrently amplified in AAs only (frequency=33% and 18% in AAs and EAs, respectively). Within a sub-region, 8q24.21, the PVT1 copy number profile was significantly associated with expression (P<7E-03), while in 8q24.3, HSF1, DGAT1, and BOP1 copy number were also significantly associated with gene expression (P<7E-03) (Supp Tables 8 & 10). Figure 3: Global SNCA map across EA and AA lung cancer in the NCI-MD cohort. Segmental deletions and amplifications are shown in blue and green, respectively in the left panel circos plot. In this plot, the top 50 (|Pearson-Rho| > 0.50) highly positively (co-occurring) and negatively (mutually exclusive) correlated copy number segment pairs are connected with 11 yellow and blue arcs, respectively. The overlap and unique recurrent regions between AAs and EAs in LUSC and LUAD are shown via Venn diagrams at the top and bottom. Steps provided in the central box are used to identify SNCA-driven AA-specific potential-driver regions, where the list of regions passing these steps are provided with corresponding frequency in AAs and EAs on the right via bar plots for LUAD and LUSC. Recurrence significance for each focal region was computed via GISTIC in AAs and EAs separately with an FDR-corrected significance threshold of 0.1. The landscape of driver genes SCNAs in AA and EA lung cancer We analyzed the recurrence and alteration frequency of known lung cancer driver genes mined from the cancer gene census of COSMIC (Figure 4A). We identified population-specific SCNA patterns of drivers (Figure 4A) significantly correlated with gene expression (Figure 4B and Figures S3-4). In LUSC, one of the key cancer driver genes, KRAS, is amplified in both populations but is significantly recurrent (FDR<0.1, Methods) and has a higher frequency in AAs (KRAS amp frequency: 23% in EAs compared to 51% in AAs, Figure 4). Similarly, PTEN deletion is significantly recurrent and more frequent in AAs (PTEN del frequency: 32% in EAs compared to 53% in AAs, Figure 4). Another key driver, CDKN2A, was recurrently deleted in both populations, but the frequency was 35% in AAs compared with 64% in EAs (Figures 4A). These three population-specific patterns in frequency were also observed in TCGA. 12 Figure 4: Landscape of SCNA of lung cancer drivers across EA and AA lung cancer in the NCI-MD cohort. (A) Amplification and deletion frequency of lung cancer driver genes across population and histology. Recurrence significance for each gene was computed via GISTIC in AAs and EAs separately with an FDR-corrected significance threshold of 0.1. The diagonal dashed line denotes the null line with points falling away from this line indicating chromosome arms with alteration frequency differences across populations. A color code at the top of panel 4A is provided to denote gene-level population-specific statistically significant recurrent SCNA events, where a gene name being in black implies no statistically significance SCNA recurrence in either population. Del=deletion, amp=amplification. (B) Effect of copy number changes on expression profile (n=91 patients) of drivers with population-specific patterns. Here, at the top of each panel, we have provided the corresponding gene name, with the two-sided Spearman significance (P) and Rho (?) of the mRNA expression of the gene and its SCNA profile. Only 13 genes significantly correlated with expression are plotted (P<0.01 & Spearman Rho>0.2). Here, the centerline denotes the median, the box indicating the interquartile range, and the black line represents the rest of the distribution, except for points that are determined to be ?outliers? using a method that is a function of the interquartile range, as in box plots. A pan-cancer survey of GI, HRD, and chromothripsis in AA vs EA tumors To determine whether a higher prevalence of aggressive molecular features, including GI, HRD, and CHTP, extends to other cancer types, we mined TCGA SCNA profiles of 6,492 tumor samples with available self-reported ancestry from AAs and EAs originating from 23 tumor types (Supp Tables 11 & 12). Consistent with previous observations 3, we initially observed an overall significantly higher GI burden in AA tumors (pan-cancer Wilcoxon P<6.9E-07, Figure 5A). These differences were most significant in breast (BRCA), head and neck (HNSC), stomach adenocarcinoma (STAD), cervical squamous cell carcinoma, and endocervical adenocarcinoma (CESC) cancers, with a general trend towards higher GI burden in 17 out of the 23 cancer types. We repeated this analysis separately for SNCA-loss and -gain-based GI and observed a consistent pattern ( Methods). We quantified HRD using the four measures previously used, i.e., LOH, TIL, LST and normalized sum of the three and observed a strong correlation between GI and HRD in pan- cancer (P<2E-16 for all; Spearman Rho=0.56, Spearman Rho=0.47 for LST, 0.60 for AIL, 0.58 for sum) and cancer type-specific analyses (Supp Table 13), where, in AA tumors, the correlation observed is stronger than in EA tumors for both pan-cancer (LOH-based measure: Rho for AA=0.57, for EA=0.48, P<2.2E-16 for both; AIL-based measure: Rho for AA=0.66, for EA=0.60, P<2.2E-16 for both; LST-based measure: Rho for AA=0.51, for EA=0.47, P<2.2E-16 14 for both) and cancer type-specific analyses (Supp Table 13). Moreover, HRD is significantly higher in AAs in pan-cancer for all four measures (Wilcoxon P<1.5E-02; P<7.7E-02 for LST; P<2.2E-02 for AIL; P<1.9E-02 for sum). This further suggests that HRD contributes to the ancestry-specific pattern of higher GI burden in AAs across cancer types. When analyzed by specific cancer type, we find that BRCA and HNSC have significantly higher HRD across all four measures in AAs compared with EAs (Table S12). A trend towards increased HRD among AAs was observed in 11 out of 17 cancer types where increased GI was also observed. The remaining six cancer types had an inverse trend, including KIRP and KIRC, which have significantly lower GI and HRD. We confirmed these results by quantifying HRD using a somatic mutation profile-based signature 27, i.e., mutational signature (mutSig) 3. This signature is typified by a C>G/A transversion and is strongly associated with HRD 27-29. We leveraged the mSignatureDB database where mutation signatures are profiled 27 on 7,042 tumors from 30 different cancer types and found the mutSig 3 contributions to be higher in tumors from AAs compared with EAs in pan-cancer (Wilcoxon P<1E-03). Testing each cancer type specifically for a higher mutSig 3 in AAs, we found that BRCA and HNSC have a higher prevalence of this HRD-related signature, which is consistent with the SNCA hallmarks-based quantification of HRD described above (Wilcoxon, P<0.01 and P<0.10, respectively). We additionally performed a multivariate regression modeling GI and HRD in pan-cancer as a function of ancestry adjusting for stage, sex, age and smoking status in TCGA samples, and found AA ancestry strongly positively associated with these genomic features, i.e. higher GI (FDR<2.2E-07) and HRD (FDR <4.5E-06 for LOH, <7.8E-07 for AIL, <3.8E-05 for LST, <2E- 01 for mutSigs3) (Supp Table 3). 15 Similar to the NCI-MD cohort, we tested for possible confounding by mislabeled self- reported race. We accessed genotype information of 906,600 SNPs in matched PBMCs that were downloaded from the controlled access part of TCGA (Methods) and inferred unsupervised ancestry (Methods). The overall concordance of our computed inferred ancestry with self- reported ancestry is high (94.7%). Using this inferred ancestry, we removed the possibly misclassified samples and repeated the above analysis, with consistent significant results. Next, we quantified CHTP in TCGA samples and observed that tumor samples with CHTP have significantly higher HRD than samples without CHTP (Wilcoxon P<3.2E-10, for all five HRD markers) in both pan-cancer and cancer type specifically. Consistently, we observed a higher frequency of CHTP in tumors from AAs compared with EAs in pan-cancer samples (Figure 5C, Fisher?s one-sided test P<0.028, odds ratio (OR)=1.25) and in LUSC samples from TCGA (Figure 5C, P<0.11, OR=1.4). These patterns were consistent when adjusted for age, sex, and stage across both cohorts (multivariate regression P for ancestry <2.8E-03) and further when another CHTP definition was used (Methods). Similar to the NCI-MD cohort, we observed chromosome enrichment of CHTP on chromosome 12 among AAs in LUSC (Extended Data Figure 2). 16 Figure 5: Landscape of GI, HRD, and chromothripsis across AA and EA pan-cancer in the TCGA cohort. (A) Genomic instability, (B) homologous recombination (HR) deficiency, and (C) chromothripsis is quantified and provided across genetic ancestry in pan-cancer TCGA samples (n=6,256 patients [AA=692, EA=5,563]). Similar to Figure 1, significance for comparison of medians in A) and B) was calculated via one-sided Wilcoxon rank-sum tests, and significance for comparison of frequency in C) was calculated via one-sided Fisher?s exact test. The violin plot shows the data distribution where the centerline denotes the median, the box indicating the interquartile range, and the black line represents the rest of the distribution, except for points that are determined to be ?outliers? using a method that is a function of the interquartile range, as in box plots. AA tumors have a higher germline prevalence of HRD Given the higher prevalence of HRD in AA tumors across LUSC and pan-cancer, we asked whether the increase of HRD in tumors could be driven by germline factors? We accessed the TCGA database of germline pathogenic variants across 10,389 adult-tumors 30. This study 30 performed whole-exome sequencing on PBMCs and then cataloged pathogenic variants (Methods). Using this dataset, we first counted the total number of pathogenic variants in HR- 17 genes (Supp Table 14) in each patient and defined it as germline HRD. Next, we asked whether AA patients have a higher extent of germline HRD than EA patients. In TCGA pan-cancer and LUSC, but not LUAD, we found that AAs had significantly higher germline HRD than EAs (Figure 6-left panel, OR=1.2; P<0.02 for pan-cancer; Figure 6-right panel, OR=6; P<8E-04 for LUSC; Extended Data Figure 10, P<0.23 for LUAD). Repeating this analysis in LUSC patients for individual genes of the HR pathway, we found predicted pathogenic variants in canonical HR-pathway genes BRCA1, BRCA2, and POLD1 to be enriched in AAs (Supp Table 14) (hypergeometric P<0.15, 0.01, 0.08, respectively). Similarly, in pan-cancer we found predicted pathogenic variants of BARD1, FANCM, BRIP1, PALB2, POLD1, and BRCA2 to be more enriched in AA patients (P<0.06, 0.12, 0.12, 0.19, 0.2, 0.25, respectively). Since some of these genes are mutated in hereditary predisposition syndromes it is possible that AAs in TCGA have a higher incidence of such syndromes. However, the known syndromes do not necessarily match the observed LUSC cancer type. BRCA2 mutations most commonly predispose to breast and ovarian cancers, although there is some evidence of association with lung cancer 31. Mutations in POLD1 have been associated with colorectal cancer 32, but not lung cancer, to our knowledge. We also found BLM and RECQL predicted pathogenic variants to be more enriched in EA patients (P<0.06, 0.22). 18 Figure 6: Landscape of germline HRD across EA and AA pan-cancer and LUSC TCGA cohort. Prevalence of germline HRD in AAs and EAs calculated using total frequency of germline pathogenic variants in HR-pathways genes in pan-cancer (left, total n=8,920 patients [AA=919, EA=8,001]) and LUSC (right, total n=382 patients [AA=31, EA=351]). Significance for comparison of the frequency of germline HRD was calculated via one-sided Fisher?s exact test where exact p-value values are provided. Discussion Here, we mapped molecular features of tumors from EAs and AAs across many cancer types, with greater depth and power in lung cancer. We observed that consistent with previous reports 2, GI is higher in AAs across multiple cancer types. This higher GI is unlikely to be related to the 19 recently identified unmapped 10% of the genome that is found in populations of African ancestry 3 as we found both copy number gain- and copy number loss-based GI to be higher in AAs. We hypothesized, and confirmed, that this higher GI is likely due to a higher prevalence of HRD in tumors from AAs. We also identified a significantly higher prevalence of mutational signature 3?closely associated with HRD 27-29?among a wide range of tumors from AAs (Extended Data Figure 8). We further show that tumors from AAs have a higher frequency of aggressive molecular features, including structural variants. HRD was not uniformly higher among AAs in some cancers, including KIRP and KIRC, where HRD was significantly lower. Higher SNCA-based GI and HRD in tumors from AAs raises the question of whether underlying defective DNA repair mechanisms could drive this observation. While HRD has been linked with germline and somatic mutations in BRCA1/2 33, no striking differences in the somatic mutation frequencies of these genes have been demonstrated in cancer between EAs and AAs 2,34. To investigate whether the increased HRD could be driven by a germline event, we analyzed germline pathogenic variants 30 and identified a higher proportion of HRD-related pathogenic variants among AAs compared with EAs, suggesting that GI/HRD events and tumor evolution could be shaped by these features. The observation that several cancer types occur at an earlier age among AAs 35 and evidence that germline pathogenic events are associated with early-onset disease 30 are consistent with these data. Higher HRD in LUSC and many other cancer types suggests that these tumors could respond to PARP inhibitors and that perhaps, AAs may be more likely to respond. Most trials do not report and/or are not powered to compare differences in response by ancestry group. PARP inhibitors are not commonly used in lung cancer treatment, though in combination with chemotherapy they have shown promising efficacy in both cell lines 36 and a clinical trial 37. In the latter, the benefit 20 from the combination treatment was primarily restricted to LUSC tumors. Further, a recent retrospective analysis of clinical trial data found that response to platinum compounds and survival was significantly better in patients with hallmarks of HRD 38. Thus, future preclinical and clinical studies could include biomarkers of HRD either in the study design or as a covariate in the data analysis. We next identified multiple ancestry-specific chromosome alterations with unknown relevance, including chromosomes 7p and 7q (AA frequency twice than EA). We also observed ancestry-specific patterns of co-occurrence and mutually exclusive events and recurrent focal region alterations. Further, only one out of eight potential AA-specific driver regions identified in this study have previously known driver genes (i.e., KRAS). Next, we found AA-specific recurrent alterations previously linked with ancestry disparities in other cancer types 39-41, including focal deletion of 4q35.2 comprising FBXW7, and amplification of oncogene KAT6A 42 (18% in AA vs. 0% in EA). In summary, we have identified population differences in molecular features, including GI, HRD, and CHTP. As these features are related to therapy response 13,43,44, our findings could have therapeutic implications. We also find higher GI and HRD in LUSC among African Americans and highlight some granular differences at the SNCA level in canonical lung cancer genes, such as CDKN2A, KRAS, and PTEN. As our study used the same platform to compare SCNA events across EAs and AAs, it largely removes the possibility that technical artifacts could confound our observations. Defining these differences in both genome-wide and more focal regions highlights distinct differences in lung tumor biology between AAs and EAs and supports recent work showing that inherited variants and thereby, genetic ancestry, can shape 21 tumor evolution at a molecular level and influence the somatic nature of a tumor 45. Finally, our work highlights the importance of including under-represented populations in balanced genomic studies of molecular patterns and cancer evolution. Acknowledgments We sincerely thank C. Harris for many insightful discussions. SS gratefully acknowledges the support of the NCI-UMD Cancer Research Training Fellowship. This research was supported by the Intramural Research Program of the National Institutes of Health, National Cancer Institute. References 1. DeSantis, C. E., Miller, K. D., Goding Sauer, A., Jemal, A. & Siegel, R. L. Cancer statistics for African Americans, 2019. CA Cancer J Clin, doi:10.3322/caac.21555 (2019). 2. Yuan, J. et al. Integrated Analysis of Genetic Ancestry and Genomic Alterations across Cancers. Cancer cell 34, 549-560 e549, doi:10.1016/j.ccell.2018.08.019 (2018). 3. Sherman, R. M. et al. Assembly of a pan-genome from deep sequencing of 910 humans of African descent. Nat Genet, doi:10.1038/s41588-018-0273-y (2018). 4. Green, R. E. et al. A draft sequence of the Neandertal genome. Science 328, 710-722, doi:10.1126/science.1188021 (2010). 5. Siegel, R. L., Miller, K. D. & Jemal, A. Cancer statistics, 2019. CA Cancer J Clin 69, 7-34, doi:10.3322/caac.21551 (2019). 6. Ryan, B. M. Lung cancer health disparities. Carcinogenesis 39, 741-751, doi:10.1093/carcin/bgy047 (2018). 22 7. Mitchell, K. A., Zingone, A., Toulabi, L., Boeckelman, J. & Ryan, B. M. Comparative Transcriptome Profiling Reveals Coding and Noncoding RNA Differences in NSCLC from African Americans and European Americans. Clin Cancer Res 23, 7412-7425, doi:10.1158/1078- 0432.CCR-17-0527 (2017). 8. Wallace, T. A. et al. Tumor immunobiological differences in prostate cancer between African- American and European-American men. Cancer Res 68, 927-936, doi:10.1158/0008-5472.CAN- 07-2608 (2008). 9. Guda, K. et al. Novel recurrently mutated genes in African American colon cancers. Proc Natl Acad Sci U S A 112, 1149-1154, doi:10.1073/pnas.1417064112 (2015). 10. Chaisaingmongkol, J. et al. Common Molecular Subtypes Among Asian Hepatocellular Carcinoma and Cholangiocarcinoma. Cancer Cell 32, 57-70 e53, doi:10.1016/j.ccell.2017.05.009 (2017). 11. Foster, J. M. et al. Cross-laboratory validation of the OncoScan(R) FFPE Assay, a multiplex tool for whole genome tumour profiling. BMC Med Genomics 8, 5, doi:10.1186/s12920-015-0079-z (2015). 12. Knijnenburg, T. A. et al. Genomic and Molecular Landscape of DNA Damage Repair Deficiency across The Cancer Genome Atlas. Cell Rep 23, 239-254 e236, doi:10.1016/j.celrep.2018.03.076 (2018). 13. Telli, M. L. et al. Homologous Recombination Deficiency (HRD) Score Predicts Response to Platinum-Containing Neoadjuvant Chemotherapy in Patients with Triple-Negative Breast Cancer. Clin Cancer Res 22, 3764-3773, doi:10.1158/1078-0432.CCR-15-2477 (2016). 14. Swisher, E. M. et al. Rucaparib in relapsed, platinum-sensitive high-grade ovarian carcinoma (ARIEL2 Part 1): an international, multicentre, open-label, phase 2 trial. Lancet Oncol 18, 75-87, doi:10.1016/S1470-2045(16)30559-9 (2017). 23 15. Jin, Y., Schaffer, A. A., Feolo, M., Holmes, J. B. & Kattman, B. L. GRAF-pop: A Fast Distance- Based Method To Infer Subject Ancestry from Multiple Genotype Datasets Without Principal Components Analysis. G3 (Bethesda) 9, 2447-2461, doi:10.1534/g3.118.200925 (2019). 16. Ratnaparkhe, M. et al. Defective DNA damage repair leads to frequent catastrophic genomic events in murine and human tumors. Nat Commun 9, 4760, doi:10.1038/s41467-018-06925-4 (2018). 17. Zhang, C. Z. et al. Chromothripsis from DNA damage in micronuclei. Nature 522, 179-+, doi:10.1038/nature14493 (2015). 18. Zhang, C. Z., Leibowitz, M. L. & Pellman, D. Chromothripsis and beyond: rapid genome evolution from complex chromosomal rearrangements. Gene Dev 27, 2513-2530, doi:10.1101/gad.229559.113 (2013). 19. Stephens, P. J. et al. Massive genomic rearrangement acquired in a single catastrophic event during cancer development. Cell 144, 27-40, doi:10.1016/j.cell.2010.11.055 (2011). 20. Mermel, C. H. et al. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol 12, R41, doi:10.1186/gb- 2011-12-4-r41 (2011). 21. Taylor, A. M. et al. Genomic and Functional Approaches to Understanding Cancer Aneuploidy. Cancer cell 33, 676-689 e673, doi:10.1016/j.ccell.2018.03.007 (2018). 22. Polimanti, R. et al. Haplotype differences for copy number variants in the 22q11.23 region among human populations: a pigmentation-based model for selective pressure. Eur J Hum Genet 23, 116-123, doi:10.1038/ejhg.2014.47 (2015). 23. Burnside, R. D. 22q11.21 Deletion Syndromes: A Review of Proximal, Central, and Distal Deletions and Their Associated Features. Cytogenet Genome Res 146, 89-99, doi:10.1159/000438708 (2015). 24. Baell, J. B. et al. Inhibitors of histone acetyltransferases KAT6A/B induce senescence and arrest tumour growth. Nature 560, 253-257, doi:10.1038/s41586-018-0387-5 (2018). 24 25. Brim, H. et al. Genomic aberrations in an African American colorectal cancer cohort reveals a MSI-specific profile and chromosome X amplification in male patients. PLoS One 7, e40392, doi:10.1371/journal.pone.0040392 (2012). 26. Craig, D. W. et al. Genome and transcriptome sequencing in prospective metastatic triple- negative breast cancer uncovers therapeutic vulnerabilities. Mol Cancer Ther 12, 104-116, doi:10.1158/1535-7163.MCT-12-0781 (2013). 27. Alexandrov, L. B. et al. Signatures of mutational processes in human cancer. Nature 500, 415- 421, doi:10.1038/nature12477 (2013). 28. Nik-Zainal, S. et al. Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature 534, 47-54, doi:10.1038/nature17676 (2016). 29. Ma, J., Setton, J., Lee, N. Y., Riaz, N. & Powell, S. N. The therapeutic significance of mutational signatures from DNA repair deficiency in cancer. Nat Commun 9, 3292, doi:10.1038/s41467- 018-05228-y (2018). 30. Huang, K. L. et al. Pathogenic Germline Variants in 10,389 Adult Cancers. Cell 173, 355-370 e314, doi:10.1016/j.cell.2018.03.039 (2018). 31. Wang, Y. et al. Rare variants of large effect in BRCA2 and CHEK2 affect risk of lung cancer. Nat Genet 46, 736-741, doi:10.1038/ng.3002 (2014). 32. Palles, C. et al. Germline mutations affecting the proofreading domains of POLE and POLD1 predispose to colorectal adenomas and carcinomas. Nature genetics 45, 136-144, doi:10.1038/ng.2503 (2013). 33. Timms, K. M. et al. Association of BRCA1/2 defects with genomic scores predictive of DNA damage repair deficiency among breast cancer subtypes. Breast Cancer Res 16, 475, doi:10.1186/s13058-014-0475-x (2014). 34. Campbell, J. D. et al. Comparison of Prevalence and Types of Mutations in Lung Cancers Among Black and White Populations. JAMA Oncol 3, 801-809, doi:10.1001/jamaoncol.2016.6108 (2017). 25 35. Robbins, H. A., Engels, E. A., Pfeiffer, R. M. & Shiels, M. S. Age at cancer diagnosis for blacks compared with whites in the United States. J Natl Cancer Inst 107, doi:10.1093/jnci/dju489 (2015). 36. Jiang, Y. et al. PARP inhibitors synergize with gemcitabine by potentiating DNA damage in non- small-cell lung cancer. Int J Cancer 144, 1092-1103, doi:10.1002/ijc.31770 (2019). 37. Ramalingam, S. S. et al. Randomized, Placebo-Controlled, Phase II Study of Veliparib in Combination with Carboplatin and Paclitaxel for Advanced/Metastatic Non-Small Cell Lung Cancer. Clin Cancer Res 23, 1937-1944, doi:10.1158/1078-0432.CCR-15-3069 (2017). 38. Kadouri, L. et al. Homologous recombination in lung cancer, germline and somatic mutations, clinical and phenotype characterization. Lung Cancer 137, 48-51, doi:10.1016/j.lungcan.2019.09.008 (2019). 39. Yeh, C. H., Bellon, M. & Nicot, C. FBXW7: a critical tumor suppressor of human cancers. Mol Cancer 17, 115, doi:10.1186/s12943-018-0857-2 (2018). 40. Zhang, Q. et al. FBXW7 Facilitates Nonhomologous End-Joining via K63-Linked Polyubiquitylation of XRCC4. Mol Cell 61, 419-433, doi:10.1016/j.molcel.2015.12.010 (2016). 41. Yumimoto, K. et al. F-box protein FBXW7 inhibits cancer metastasis in a non-cell-autonomous manner. J Clin Invest 125, 621-635, doi:10.1172/JCI78782 (2015). 42. Kytola, V. et al. Mutational Landscapes of Smoking-Related Cancers in Caucasians and African Americans: Precision Oncology Perspectives at Wake Forest Baptist Comprehensive Cancer Center. Theranostics 7, 2914-2923, doi:10.7150/thno.20355 (2017). 43. Farmer, H. et al. Targeting the DNA repair defect in BRCA mutant cells as a therapeutic strategy. Nature 434, 917-921, doi:10.1038/nature03445 (2005). 44. McCabe, N. et al. Deficiency in the repair of DNA damage by homologous recombination and sensitivity to poly(ADP-ribose) polymerase inhibition. Cancer Res 66, 8109-8115, doi:10.1158/0008-5472.CAN-06-0140 (2006). 26 45. Carter, H. et al. Interaction Landscape of Inherited Polymorphisms with Somatic Events in Cancer. Cancer Discov, doi:10.1158/2159-8290.CD-16-1045 (2017). Methods Statistics & reproducibility While generating genome-wide copy number profiles of NCI-MD via OncoScan, two aliquots from the same sample were used to test the reproducibility of the assay for three samples by the company (available on reasonable request). In the NCI-MD study design, no statistical method was used to predetermine sample size. In the additional cohort, TCGA, we mined copy number profile of samples publicly available and excluded cancer types with less than five tumor samples with AA ancestry to provide a minimum statistical power. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment though samples were run on the OncoScan assay in a blinded manner. The processed tables to reproduce our figures and conclusions are provided (Data availability). In this work, we used non-parametric tests using R, including Wilcoxon rank-sum tests to compare the difference in medians, Fisher?s tests to compare frequency, and Spearman?s correlation, with an FDR-corrected P threshold of <0.1 indicating statistical significance. Wherever GISTIC was used, the FDR-corrected significance threshold of < 0.1 was applied to identify significantly recurrent regions. While identifying chromothripsis, the distance between events on a chromosome is compared to the overall distance between events in the samples to identify clustered events using an FDR-corrected P threshold of <0.1. 27 Samples preparation and processing Sample characteristics Patients living in the Baltimore metropolitan area with histologically confirmed cases of lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) were prospectively recruited to the ongoing NCI-MD Case-Control Study 46. Institutional Review Boards at seven participating Baltimore hospitals and the NCI approved the study with written informed consent obtained from all patients. All samples were collected from an NCI IRB-approved study. We conducted a retrospective study of eligible participants that self- reported as AA or EA, with non-Hispanic ethnicity. Additional clinical and sociodemographic data for each patient were obtained from medical records and pathology reports. Macro-dissected primary lung tumor tissues were obtained from patients directly after surgical removal. Samples were placed in collection tubes, flash-frozen, and stored at ?80?C until the OncoScan analyses were performed. Sample characteristics for the patients in which tumor DNAs were extracted can be found in Supp Table 1 (n=142 AA, 108 EA). DNA extraction DNA was extracted from fresh frozen micro-dissected primary lung tumor tissues using the Qiagen DNeasy Blood and Tissue kit spin-column procedure according to the manufacturer's 28 protocol (Qiagen, Valencia, CA). Isolated primary lung tumor DNAs were initially quantified using a DS-11 spectrophotometer (DeNovix, Wilmington, DE). Subsequent Qubit fluorometer analyses were performed to assess DNA integrity and ensure the presence of intact double- stranded DNA of all samples (Invitrogen, Carlsbad, CA). DNA with an A260/A280 ratio between 1.8 and 2.0, a minimum concentration of 12 ng/?L, and a total concentration of 80 ng was used for further analysis. Preprocessing of raw files Genome-wide copy number analysis and data quality control DNA samples were sent for genome-wide copy number analysis using the Affymetrix OncoScan copy number array and run according to suggested manufacturer protocols. The OncoScan array is based on molecular inversion probe technology and provides comprehensive high-resolution copy number detection across the genome and at pan-cancer driver genes. OncoScan fluorescence array intensity (CEL) files were converted to OSCHP files using the hg19 reference (OncoScan_CNV.Ref103.na33.r1.REF_MODEL reference file included with the Affymetrix OncoScan Console software, version 1.3). Manual re-centering of samples was performed by adjusting the TuScan log2 R using the OncoScan Console. Clonality analysis was performed with the Affymetrix OncoClone Composition tool. Segmentation of NCI-MD and TCGA intensity files For these samples, the Chromosome Suite Analysis (CHAS) was used for segmentation of intensity files at the default hyperparameters for output of segments with their copy number, 29 log2R, and B-Allele frequency (BAF) information. For TCGA samples, Level 3 segmented files were retrieved from the firehose pipeline where a consistent version of reference (hg19) was used. Quantifying GI, HRD, and Chromothripsis Quantification of GI Taking the output of segmentation results from the above for every sample in NCI-MD where we have copy number information for each segment, GI was defined by the ratio of the total length of regions with copy number other than two to a constant of 3.3E9, based on previous studies 47,48. We repeated this calculation for TCGA samples where we only selected cancer types with at least five AA samples. Quantification of HRD We identified five independent signatures to define somatic-level HRD (somatic HRD) across tumor samples, where four use copy number profiles and one uses the mutation profile of the tumor. We also used one signature to identify germline-level HRD (germline HRD) using germline variants in blood samples of the patients (detailed methods below). Here, we have described each one of them in detail. 30 I. Somatic HRD quantification Based on loss of heterozygosity (LOH) regions Using the output of allele-specific segmentation, we identified and calculated a total sum of the number of LOH events, segments with only one allele, in each sample. Then, we normalized the value to be in the range [0,1] and termed it as LOH-HRD 13,14 . Based on telomere allelic imbalance (AIL) regions Again using the output of segmentation, we identified and counted the sum of regions with allelic imbalance, an unequal allele copy number, and extension to a sub-telomere without crossing the centromere. Again, we normalized the sum to be in the range [0,1] and termed the normalized sum as AIL-HRD. Based on large-scale state transitions (LST) regions Here also, using the output of allele-specific segmentation, we identified and counted the total number of breakpoints between regions longer than 10 Mb after filtering out regions shorter than 3 Mb 13. Again, we normalized the breakpoint counts to be in [0,1] and termed it as LST-HRD. We defined the fourth method as (LOH-HRD+AIL-HRD+LST-HRD)/3, scaled to 0-1, for each sample. The division by 3 puts the value in the range [0,1]. These four signatures were quantified and used in both NCI-MD and TCGA samples. Exposures for each sample, the proportion of mutations assigned to mutation signature 3, known to be associated with HRD, was mined from mSignatureDB 49, a database of mutation signatures for more than 15,000 tumor samples from more than 73 projects, where only TCGA samples are considered for calculations. 31 II. Germline HRD quantification Using the predicted pathogenic germline variants information in patients from TCGA 30, we calculated the total number of pathogenic variants in HRD-genes in each sample (Supp Table 14) and performed a Fisher?s exact test to identify whether AAs in comparison to EAs have a significantly higher frequency of pathogenic variants. We repeated this analysis for each HRD gene as well. Purity and ploidy calculation Using the OncoClone tool provided by Affymetrix, which uses the algorithm ASCAT 50, we computed the purity and ploidy of samples from the NCI-MD cohort (Supp Table 2). Further, intratumor heterogeneity (ITH) was calculated using TuScan algorithm, a further extension of OncoClone. Accessing variant calls of TCGA patients? blood samples from dbGAP TCGA collection includes non-tumor biospecimen (blood samples were preferred if available, or adjacent non-tumor) for 10,224 patients with informed consent under the authorization of local institutional review boards of the sequence where whole-exome sequencing was performed 30. We requested permission for these data from the database of Genotypes and Phenotypes (dbGaP) and after receiving permission, downloaded the variants from the controlled access part of the TCGA portal. 32 Quantification of chromothripsis With an aim to identify whether an autosomal chromosome had undergone chromothripsis using SNCA profile data, we used four copy number based-hallmark traits of regions that underwent chromothripsis. Some of these hallmarks of chromothripsis have undergone an evolution since the first description, hence we used two partially overlapping hallmarks to identify chromothripsis based on the conventional 20 and an alternative more recent 51,52 description. Chromosomes that had all four hallmark properties were considered to have undergone chromothripsis. We modeled the four hallmarks of chromothripsis via two tests for each sample. First, we filtered for chromosomes with significantly more events than the sample?s background, derived from all other autosomes. Specifically, a chromosome must have a higher number of copy number events than the median number of copy number events per chromosome in the sample. Second, for every chromosome that passed the first test, the distance between the event breakpoints on the chromosomes should be significantly lower than the background distribution of copy number event breakpoints within the rest of the chromosomes. To this end, we tested whether the distances between the breakpoints of events of a given chromosome were lower than the background distribution of distances between the breakpoints of events on the rest of the chromosomes. If not, we removed the terminal event with a higher breakpoint distance from the penultimate and repeated the above step. The above iteration was repeated for a chromosome until we found a region with greater than five events with significantly lower breakpoint distance (clustered, FDR-corrected P<0.1), and the region comprised only one type of copy number event (oscillatory copy number state). We repeated the above steps with a single modification to model and detect CHTP based on the 33 recent definition, wherein a CHTP region can have two oscillatory copy number states or two types of copy number events. Association of copy number change with expression For this study, total RNA sequencing was performed for 56 out of 222 samples with SNCA profiles (31 LUAD & 25 LUSC). The association of copy number with expression was calculated via a one-tailed Wilcoxon rank-sum test, where samples were divided into two categories by thresholding on the median gene copy number to test, in a genome-wide fashion for each gene, whether samples with copy number higher than the gene median copy number in the cohort has expression significantly higher than the rest of the samples. Focal and arm events by GISTIC Generating a copy number map with focal-, arm- level events via GISTIC The GISTIC algorithm was used to find recurrent regions of amplification, deletion, or LOH from the segmented file generated from CHAS. We used the following hyperparameter configuration throughout the study to find recurrent regions ?-genegistic 1 -smallmem 1 -broad 1 -brlen 0.5 -conf 0.90 -armpeel 1 -savegene 1?. Based on this configuration, a gene GISTIC algorithm was used where arm level events are defined as aberrant regions with at least the length of half an arm, and regions below this threshold are defined as focal. The confidence level used to calculate the region was 0.90 and the q-value was the default of 0.25. 34 Unsupervised ancestry inference via principal component analysis (PCA) for NCI-MD cohort Genotypes for 217,611 SNPs were generated from OncoScan OSCHP file via apt-tools for the samples from the NCI-MD cohort. We identified 46,217 SNVs variants likely to be ancestry- associated and not somatically acquired that are found to be present in at least 25% of the AAs or EAs in our cohort. In this matrix, where each row represents a patient and each column represents a SNP, we performed a PCA with rank two, constraining the number of principal components (PC) to two (Extended Data Figure 1). Next, we performed a classification using the two PCs using support vector classification (SVC) with a linear kernel to identify two classes. The predominant self-reported race in the class is assigned to be its identity. These two classes were then tested for concordance with self-reported ancestry. Unsupervised ancestry inference via principal component analysis (PCA) for TCGA cohort Genotype information of 906,601 SNPs from the SNP6 array performed on matched PBMC samples of TCGA patients called using BirdSeed, a SNP genotyping algorithm, were downloaded. We requested permission for these data from dbGaP and, after receiving permission, downloaded the variants from the controlled access part of the TCGA portal. To infer ancestry, methods similar to NCI-MD were employed, where after removing low-variance SNPs, we inferred 300,000 SNPs likely to be ancestry-associated that are found to be present in at least 25% of the AAs or EAs in our cohort. Following the methods described above for NCI- MD, we identified two classes of ancestry. 35 Statistical power analysis of TCGA samples from various populations We observed a negative correlation between the FDR-corrected significance for AAs having higher GI and the proportion of AA samples included per cancer type, which was higher than expected when permuted a million times. (Spearman Rho=-0.34, P<0.15; empirical P<1E-04), suggesting that under-representation of samples from AAs is a limiting factor in terms of statistical power when comparing these two populations in certain tumor types in TCGA. SNCA-gain and -loss based genomic instability (GI) analysis For TCGA Pan-Cancer We calculated SNCA-gain and SNCA-loss based GI and consistently observed both GI measures to be higher in AAs (Wilcoxon rank-sum P<5.2E-06 and P<1.5E-06). Further, the trend of higher GI was observed in 16 out of 23 cancer types for both SNCA-gain based and SNCA-loss based GI (Extended Figure 2A-B). For NCI-MD LUSC SNCA-gain and SNCA-loss based GI is calculated for LUSC from the NCI-MD cohort. We observed only SNCA-loss (Wilcoxon P<4.5E-06) and not SNCA-gain (Wilcoxon P<0.34) to be significantly higher in AAs (P<4.5E-06 and <0.34, respectively). Qualitative characterization of NCI-MD cohort tumor samples Purity?the percentage of the tumor cell fraction within a sample?was successfully resolved in 194 out of 222 samples (Supp Table 2) where the mean purity was 34%. LUSC tumor samples 36 (38.5% mean purity) had a significantly higher (Wilcoxon P<0.009) purity than LUAD (30.5%), consistent with the purity differences observed TCGA. The overall mean ploidy was 2.22. Arm-level aberration frequency negatively correlated with the number of genes present on the chromosome arm (NCI-MD) Broad level events across chromosome arms were quantified and plotted against the number of proteins expressing genes. We observed a general trend of negative correlation between the frequency of an aberration on a chromosome arm and the number of genes present on the same arm (median Spearman Rho=0.41). Data availability Human TCGA cohort mutation data were derived from the publicly available mSignatureDB database: [http://tardis.cgu.edu.tw/msignaturedb/]. For the corresponding samples, copy number profiles, Level 3 segmented files were retrieved from the firehose pipeline [https://gdac.broadinstitute.org/] where a consistent version of reference (hg19) was used. The NCI-MD data were derived from patients enrolled in the ongoing NCI-MD Case-Control Study and all relevant data in this work is available on reasonable request, except for the TCGA pathogenic variant calls that required dbGaP controlled access and any sequence information that would make it possible to identify study subjects. Anonymized Level 3 segmented files for each sample, in addition of the raw files for copy number profiles of the NCI-MD patients and their corresponding expression profile, are deposited in dbGAP. 37 Code availability We used open source R v3.6 throughout our work to generate figures. Wherever required, commercially available Adobe illustrator 23.0.3 (2019) was used to create figure grids. All the scripts for analysis and reproducing figures and panels are built in-house and are provided on github here: https://github.com/sanjusinha7/Scripts_MolCharAAvsEA. Methods-only References 46. Enewold, L. et al. Serum concentrations of cytokines and lung cancer survival in African Americans and Caucasians. Cancer Epidemiol Biomarkers Prev 18, 215-222, doi:10.1158/1055- 9965.EPI-08-0705 (2009). 47. Jamal-Hanjani, M. et al. Tracking the Evolution of Non-Small-Cell Lung Cancer. N Engl J Med 376, 2109-2121, doi:10.1056/NEJMoa1616288 (2017). 48. Andor, N. et al. Pan-cancer analysis of the extent and consequences of intratumor heterogeneity. Nat Med 22, 105-113, doi:10.1038/nm.3984 (2016). 49. Huang, P. J. et al. mSignatureDB: a database for deciphering mutational signatures in human cancers. Nucleic Acids Res 46, D964-D970, doi:10.1093/nar/gkx1133 (2018). 50. Van Loo, P. et al. Allele-specific copy number analysis of tumors. Proc Natl Acad Sci U S A 107, 16910-16915, doi:10.1073/pnas.1009843107 (2010). 51. Yi, K. & Ju, Y. S. Patterns and mechanisms of structural variations in human cancer. Exp Mol Med 50, 98, doi:10.1038/s12276-018-0112-3 (2018). 52. Maciejowski, J., Li, Y., Bosco, N., Campbell, P. J. & de Lange, T. Chromothripsis and Kataegis Induced by Telomere Crisis. Cell 163, 1641-1654, doi:10.1016/j.cell.2015.11.054 (2015). 38 Chapter 2: A systematic genome-wide mapping of oncogenic mutation- selection during CRISPR-Cas9 genome editing 39 Abstract Recent studies have reported that genome editing by CRISPR?Cas9 induces a DNA damage response mediated by p53 in primary cells hampering their growth. This could lead to a selection of cells with pre-existing p53 mutations. In this study, employing an integrated computational and experimental framework, we systematically investigated the possibility of selection of additional cancer driver mutations during CRISPR-Cas9 gene editing. We first confirm the previous findings of the selection for pre-existing p53 mutations by CRISPR-Cas9. We next demonstrate that similar to p53, wildtype KRAS may also hamper the growth of Cas9-edited cells, potentially conferring a selective advantage to pre-existing KRAS-mutant cells. These selective effects are widespread, extending across cell-types and methods of CRISPR-Cas9 delivery and the strength of selection depends on the sgRNA sequence and the gene being edited. The selection for pre-existing p53 or KRAS mutations may confound CRISPR-Cas9 screens in cancer cells and more importantly, calls for monitoring patients undergoing CRISPR-Cas9-based editing for clinical therapeutics for pre-existing p53 and KRAS mutations. Introduction CRISPR enables targeted gene-disruption and editing, a powerful technology that expands our understanding of fundamental biological processes1. Beyond its impact on biological research, CRISPR-based approaches have been considered for various applications in medicine, from reparative editing of primary cells to the development of new strategies to treat a variety of genetic diseases, including cancer. However, several clinical trials based on CRISPR technology 40 have been deferred due to significant potential risks, including off-target effects2,3,4, generation of unexpected chromosomal alterations5 and potential immunogenicity6. Other studies have demonstrated that double stranded breaks (DSBs) induced during CRISPR-Cas9-based gene knockout (CRISPR-KO) can lead to DNA damage response, whose level is associated with the copy number of the targeted gene7?10. Recent studies have shown that the DNA damage response following CRISPR-KO can be mediated by p53, a tumor-suppressor gene mutated in over 50% of all human cancers11,12. Genome-wide CRISPR screening in immortalized human retinal pigment epithelial (RPE1) cells12 revealed that a p53-mediated DNA damage response, followed by cell cycle arrest, is induced upon generation of DSBs by the Cas9 endonuclease, favoring the survival of cells that have inactivated the p53 pathway. Most recently, a study showed that exogenous expression of Cas9 can also activate this p53-mediated DNA damage response13. While these studies indicate that CRISPR-Cas9 genome editing techniques may select for p53 mutated cells 11,12,13, several outstanding questions remain unaddressed: First, since most of these p53 studies have involved only a small number of primary or stem cells11,12, it is unclear whether p53 selection can happen broadly across multiple different cell types including transformed cancer cells. Second, it is not clear whether stronger p53 selection can happen when certain genes or parts of the genome are targeted, or the level of selection is more homogenous regardless of the genes being edited. And finally, it remains to be investigated whether this selection is limited to p53 only or that other cancer driver genes can also be selected for during CRISPR-Cas9 genome editing. To address these questions, here we employ a computational framework coupled with experimental validations to conduct a comprehensive evaluation of each cancer driver mutation selection associated with CRISPR-Cas9. We first demonstrate that CRISPR-KO-induced mutant 41 p53 selection can be observed in transformed and non-transformed cells of diverse lineages via both lentivirus and ribonucleoprotein-based Cas9 delivery. More importantly, we systematically characterized mutation selection in other cancer driver genes during CRISPR-Cas9 identifying that KRAS mutants can also be selected for, as demonstrated in large-scale genetic screens and Cas9-expressing cell lines. We further identified the underlying pathways that are likely to mediate this selection. Results CRISPR-Cas9 gene-knockouts selects for p53 mutations in a vast variety of transformed and non-transformed cell types We first sought to address two important gaps in our understanding of CRISPR-KO- driven mutant p53 selection ? firstly, we wanted to investigate whether this selection generalizes across cell types. Secondly, we wanted to understand what type of sgRNAs, genes and gene- networks drive this selection. We analyzed the DepMap15 genome-wide gene essentiality data across 248 cancer cell lines (Table S1), where both CRISPR-Cas9 (AVANA10) and shRNA-based (Achilles15) genetic screens were conducted. We searched for genes whose CRISPR-Cas9-based knockout (CRISPR-KO) reduced cell viability more (i.e. more essential) in p53-wildtype (WT; N=75) than p53-mutant (N=173) cell lines, but do not exhibit such differential essentiality in the shRNA-based screens (Methods). The KO of such genes may select for p53 mutants specifically during CRISPR-Cas9 editing. In the CRISPR-Cas9 screen, we find many more genes (981) that are more essential in p53-WT vs p53-mutant cell lines, compared to the genes that are more essential in p53-mutant cells (237 genes). In contrast, the numbers of such differentially essential genes in the shRNA screens were balanced (~1500 each). Such significantly different patterns 42 between CRISPR-Cas9 and shRNA screens (Figure 1a left panel, Chi-squared test P<1.4E-284) points to a bias that knockout/knockdown of a gene is more likely to impair the fitness of p53- WT cells specifically with CRISPR-Cas9 but not with shRNA. Potential confounding factors including gene copy number, functional impact of p53 variants and phenotype difference between gene knockout/knockdown are discussed and controlled for in this analysis (Supp. Note 1, Figure S1). Among the 981 genes that are more essential in p53-WT cells with CRISPR-KO, 861 genes (87%) do not exhibit this differential essentiality in shRNA screens. We hence termed these CRISPR-specific differentially essential positive (CDE+) genes (Figure 1a right panel; genes listed in Table S2A). We find that these CDE+ genes are preferentially located in chromosomal bands containing common fragile sites (CFSs; hypergeometric P<2.3E-4, Figure 1b, Table S3), which are prone to replicative stress, fork collapse and DNA breaks that cause genomic instability16. As CRISPR-KO could induce kilobase-scale structural alterations near the targeted site17, this finding suggests that CRISPR-targeting near CFSs may enhance DNA damage, promote the p53-dependent cell death response and provide a selective advantage to p53 mutant cells. The sgRNAs of the CDE+ genes also tend to target highly accessible chromatin (hypergeometric P<0.02; Methods), thus inducing a strong damage response as recently reported18. The top pathways enriched within CDE+ genes include DNA damage response, DNA repair and Fanconi anemia (FA; hypergeometric test adjusted P<0.01, Table S2B). This is consistent with the recent report that the FA pathway is involved in repairing Cas9-induced DNA double-strand breaks (DSBs)19 and that their KO may further enhance DNA damage. Analogous to CDE+, we defined CDE- genes, which are more essential in p53-mutant (vs WT) cells with CRISPR-KO, but not showing such difference in shRNA screens (185 genes, 43 right panel of Figure 1a). CDE- genes are involved in cellular processes that engage p53, including mitotic checkpoints, DNA replication and cell cycle (Table S2B, Figure 1d, hypergeometric test adjusted P<0.1), with the top hit being the key cell cycle regulator CDKN1A11,12 (a.k.a. p21, Wilcoxon rank-sum P<1.85E-08, Figure 1c). Transiently inhibiting CDE- genes during CRISPR-KO may mitigate p53 mutation selection and could be of interest from a translational point of view. Top CDE+/- genes are highlighted in Figure 1c. We repeated this CDE+/- identification process using an independent CRISPR-Cas9 screen in 326 cancer cell lines20 and observed concordant results (Supp. Note 2, Figure S2). 44 Figure 1. A genome wide view of p53-mutant selection. (a) Upper panel: number of genes whose essentiality is significantly associated with p53 mutation status in CRISPR and shRNA screens (one sided Wilcoxon rank-sum has been performed with FDR threshold of 0.1). Lower panel: the definition of CDE+ and CDE- genes. (b) Enrichment of p53 CDE+ genes in common fragile sites (CFSs). The x-axis denotes the chromosomal position; the scatter plot (y-axis on the left-hand side) shows the difference of median post-CRISPR-KO cell viability values in p53 mutant vs p53 WT cell lines for p53 CDE+ genes (red dots) and all other genes (grey dots); the density plot (colored orange, y-axis on the right-hand side) shows the fraction of p53 CDE+ genes among all genes per DNA segments of 10 Mbp along the genome; the vertical blue bars indicate the chromosomal bands of CFSs, and prominent sites where peaks of high CDE+ gene density coincide with CFSs are marked by arrows on the top. (c) The distribution of predicted level of CRISPR-Cas9 p53-mutant selection across the genome. Significant CDE+ genes that are part of FA pathway are marked in red and significant CDE- genes that are part of cell cycle regulation in blue. (d) Visualization of the pathways enriched for p53 CDE- genes where significance is calculated using the GSEA method as implemented in the R package fgsea [21]. Only significantly enriched pathways (FDR<0.1) specific to CRISPR (and not in genes showing differential essentiality in the shRNA screens are shown). Pathways are depicted as nodes whose sizes correlate with pathway lengths and colors represent enrichment significance (the darker, the more significant). Pathway nodes are connected and clustered based on their functional similarities, and higher-level functional terms are given for each of the clusters (Methods). For clarity, only the largest clusters are shown. We next performed our own CRISPR-Cas9 essentiality screen, employing CRISPRi- based essentiality screens as a control in a pair of p53-isogenic MOLM13 leukemia cell lines 45 (WT and p53 R248Q mutant). We used a deep (10 guides per gene) and focused sgRNA library targeting top p53 CDE+ and CDE- genes (Methods; Figure 2a, details in Supp. Note 10, Table S4). Here, we observed in the CRISPR-Cas9 screen that the CDE+ genes are more essential in p53-WT vs mutant cell, and vice versa for the CDE- genes (Wilcoxon signed-rank test P<0.08 and P=0.03 for CDE+/- genes respectively). Reassuringly, we do not see such differential essentiality in the CRISPRi screens (Wilcoxon signed-rank P=0.32 and 0.29; Top 10% CDE+/- genes are depicted in Figure 2b). To further assess whether such selection effects can be observed in non-transformed cells, we next tested and observed that indeed our p53 CDE+ genes have a higher essentiality in WT vs isogenic-mutant cells in published CRISPR-Cas9 [12] but not shRNA [47] genome-wide screens performed in non-transformed RPE1 cells (Figure S3, screens quality control discussed in Supp. Note 9 and Figure S4). This finding is further confirmed by mining seven CRISPR-KO genome-wide screens22, including two p53-null and five p53-WT RPE1 cells screens (Figure S5, Supp. Note 3). 46 Figure 2. Validation of p53 CDE genes in isogenic MOLM13 cell lines via pooled CRISPR screens. (a) A flowchart showing the experimental procedure of CRISPR-KO and CRISPRi screening of pooled p53 CDE+/- genes in a pair of p53-isogenic MOLM13 cell lines. See Methods for details. (b) The day 30 to day 0 fold-change (converted to rank) of reads corresponding to the sgRNAs for p53 CDE+ genes (upper panel) and CDE- genes (lower panel), in p53 WT MOLM13 cells (gray boxes) vs the isogenic p53 mutant cells (red boxes) for the CRISPR-KO and CRISPRi screenings, respectively. The bottom P values are for Wilcoxon signed-rank tests comparing p53 WT and mutant cells, the upper ones are P values of non- parametric tests comparing the difference of p53 mutant and WT rank values between CRISPR- KO and CRISPRi experiments. In the boxplots, the center line, box edges and whiskers denotes the median, interquartile range and the rest of the distribution in respective order, except for points that were determined to be outliers using a method that is a function of the interquartile range, as in standard box plots. 47 A competition assay shows selection for p53 mutant over wildtype cells following CRISPR-Cas9 knockout of CDE+ genes To test whether the CRISPR-KO of CDE+ genes leads to selection of p53 mutant cells in a competitive setting23-25, we silenced the top five predicted CDE+ genes using CRISPR-Cas9 and CRISPRi in the p53-isogenic MOLM13 cells (Methods). Following a lentiviral sgRNA transduction, the WT and mutant cells were mixed at an initial ratio of 95:5, and monitored by flow cytometry for up to 25 days (illustrated in Figure 3a; Table S5A). Silencing 2/5 CDE+ genes (NDUFB6 and NDUFB10) induced a strong progressive p53 mutant enrichment of up to five folds over WT at day 25 specifically in CRISPR-KO, across several independent sgRNAs and not for NTC (Figure 3b, blue lines). No inverse enrichment in p53 WT cells was observed in the competitive assays involving the three other CDE+ genes (Figure S6). We observed that sgRNAs targeting NDUFB6 induced significantly higher DNA damage compared to NTC- treated cells specifically in p53 WT cells (Figure S7, despite editing efficiency being higher in the mutant cells as shown in Figure 4c), demonstrating that the DNA damage was not just due to Cas9 expression. This may partly explain their selective competitive advantage upon the CDE+ gene KO. Testing the robustness of this competitive selection advantage for p53 mutant cells, we repeated the CRISPR-KO competitive assay for a larger number of 18 top CDE+ genes with up to 4 unique sgRNAs per gene and monitored the assay up to 15 days (Table S5B). Using the non-targeting sgRNA as a baseline, we observed the competitive outgrowth of p53 mutant cells for 15 out of 28 sgRNAs and 10 out of 18 CDE+ genes tested (Figure 3c). 48 Figure 3. Selection for p53 mutant cells under CRISPR-Cas9 knockout of CDE+ genes in a co-culture of p53 WT/mutant cells. (a) An illustration showing the experimental design of the competition assay where isogenic p53 WT/mutant MOLM13 cell lines were mixed with a ratio of 5:95 and top p53 CDE+ genes were knocked out by CRISPR-Cas9. Population ratio was monitored for 25 days at a five-day interval starting from the day of sgRNA transduction. (b) 49 Change in the percentage of p53 mutant cells in the p53 mutant-WT cells (Y-axis) co-culture with time (X-axis, number of days in co-culture), under the CRISPR-KO or CRISPRi of individual selected top p53 CDE+ genes or with non-targeting control sgRNA. The p-values are calculated using two-sided Wilcoxon Rank Sum tests. (c) The difference of the percentage of p53 mutant cells between Day 15 and Day 0 in co-culture (Y-axis), under the CRISPR-KO of a larger set of top p53 CDE+ genes (X-axis, by individual sgRNAs, specified by number suffixes after gene symbols). Error bars represent standard error across replicates for each sgRNA. The horizontal dashed line represents the value for non-targeting control sgRNA (NTC). p53 mutation selection phenomena extends to transient knockout and primary cells We next asked whether our top CDE+ genes may also select for p53 mutants under CRISPR-Cas9 transient knockout. We delivered Cas9 and the sgRNA as a ribonucleoprotein (RNP). We observed that upon Cas9-RNP mediated transfection of a sgRNA targeting our top CDE+ gene from our pooled and competition assays, NDUFB6 (see Methods), there was a higher loss of edited cells in the p53 WT vs isogenic p53 mutant MOLM13 cells over 10 days of culture, as measured by change in ICE scores (Figure 4a top panel). Using an orthogonal method of proliferation monitoring by dye-dilution26, we observed that there was a progressive slowing down in cell proliferation of p53 WT, but not p53 mutant MOLM13 cells upon Cas9- RNP based KO of NDUFB6 vs respective non-targeting controls (Figure 4b,c). Similar to the lentiviral system, this is likely due to the DNA damage induced by the NDUFB6 sgRNA compared to the Cas9 only or Cas9 with NTC controls in p53 WT cells (Figure 4b and S8). We repeated this transient knockout in non-transformed cells (RPE1) and consistently observed an increased loss of edited p53 WT over p53 mutant cells (Figure 4a, bottom panel). Notably, we 50 also observed a selection of p53 mutant over WT in patient tumors profiles (TCGA) based on the copy number alteration patterns of CDE+ genes (details in Supp. Note 11A). Figure 4: Transient knockout in both MOLM13 and RPE1 leads to preferential loss of p53 WT over mutant cells. (a) The editing efficiency of NDUFB6 around the sgRNA cut site was 51 determined in p53 mutant and wildtype isogenic pair of MOLM13 (top panels, transformed cells) and RPE1 cells (bottom panels, primary cells) using ICE protocol (see Methods) at day 0 (orange) and day 10 (green) after Cas9-RNP-sgRNA nucleofection. Differences in day 0 compared to day 10 editing efficiency can be used as a measure of relative fitness of edited compared to non-edited cells. The p-values are calculated using two-sided Wilcoxon Rank Sum tests. In the boxplots, the center line, box edges and whiskers denote the median, interquartile range and the rest of the distribution in respective order, except for points that were determined to be outliers using a method that is a function of the interquartile range, as in standard box plots. (b) DNA damage is quantified from gamma H2AX staining images and measured by gH2AX staining in p53 wildtype MOLM13 cells with no Cas9 , Cas9 + sgRNA for a non-targeting control (NTC) or NDUFB6. gH2AX foci (y-axis) in all three conditions (x-axis) are enumerated in the violin plot. (c) Mean fluorescence intensity of the CellTrace?? dye (APC) in MOLM13 p53 mutant (top panel) vs wildtype cells (bottom panel) is shown for NTC (light blue) or NDUFB6 targeting sgRNAs (dark blue). CellTrace?? APC fluorescence is inversely correlated with proliferation. The error bars denote standard error (mean +/- standard deviation) across three replicates. The p-values are calculated using a two-sided t-test given a small number of data points (n=3). (d) Proliferative effects of NDUFB6 editing in RNP-transfected p53 mutants compared to wildtype cells. A histogram of MOLM13 p53 wildtype (top panel) cells transfected with an NTC or a NDUFB6 sgRNA are shown with the fluorescence intensity of the CellTrace?? dye (APC) on the x-axis. Similarly, MOLM13 p53 mutant cells are plotted in the bottom panel. KRAS mutant cell lines exhibit selection advantage in large-scale genetic screens To determine whether additional cancer driver mutations may be selected for following CRISPR-KO, we focused on a list of 61 cancer driver genes from Vogelstein et al.27 that are 52 mutated in at least 10 of the cell lines screened in the AVANA10 and Achilles15 datasets. For each of these cancer genes, we identified the differentially essential genes between its WT and mutant cell lines in the CRISPR-Cas9 (AVANA) and shRNA (Achilles) screens, as described above for p53. We ranked the cancer genes by the significance of skewness in the numbers of differentially essential genes from CRISPR-Cas9 vs shRNA screens similar to that shown in Figure 1a for p53 (with Fisher?s exact tests, Methods; results shown in Figure 5a and Table S6). The mutants of these genes may be selected for during CRISPR-KO, as their WT cells are overall more vulnerable during CRISPR-KO compared to the mutants. We term these genes ?(potential) CRISPR-selected cancer drivers'' (CCDs). The top significant CCD in addition to p53 is the oncogene KRAS. Like for p53, potential confounding factors including copy number were controlled for (Supp. Note 1, Figure S1b), and there is no significant correlation between the mutation profiles of KRAS and p53 (Fisher?s test P=0.67), suggesting that KRAS might be a CCD independent of p53. We thus next focused on investigating the selection of mutant KRAS as another major CCD. KRAS is a major oncogene whose gain of function mutation is known to activate various DNA repair pathways and may override the trigger of cell death upon DNA damage28,29, supporting its role as a CCD. We computationally identified the CDE+ and CDE- genes of KRAS in a similar way described above for p53 (Figure 5b, Table S2A). KRAS has high numbers of CDE+/- genes, while only very few KRAS mutation-associated genes are identified in the shRNA screen. The predicted median mutant selection levels are comparable to those of p53 (Supp. Note 4), i.e. the CRISPR-KO of its CDE+ genes is likely to drive comparable levels of mutant selection as the KO of the CDE+ genes of p53. Fourteen genes are CDE+ genes of both p53 and KRAS, and thus their CRISPR-KO may impose considerable selection for both KRAS and p53 53 mutants (Table S2A). Consistent with the knowledge of downstream pathways regulated by activated KRAS28,29, its CDE- genes are significantly enriched for DNA DSB repair pathways (FDR<0.02, Methods, visualized in Figure 5c, Table S2E). Figure 5. Large-scale genetic screening identifies KRAS as a second major cancer driver whose mutation can be potentially selected for by CRISPR-Cas9. (a) A scatter plot showing the number of identified CDE+ genes (X-axis) and the negative log10-transformed P values of Fisher?s exact test (Y-axis) testing for the imbalance in the number of differentially essential 54 genes in CRISPR and shRNA screens for the 61 major cancer driver genes from Vogelstein et al.27. p53 and KRAS are identified as the top two significant cancer genes with the higher number of CDE+ genes. (b) The number of genes whose essentiality is significantly associated with KRAS mutational status in CRISPR and shRNA screens (one sided Wilcoxon rank-sum has been performed with FDR threshold of 0.1). (c) Visualization of pathways enriched for KRAS CDE- genes where significance is calculated using the GSEA method as implemented in the R package fgsea21 . Only significant pathways (FDR<0.1) specific to CDE and not to the genes showing differential essentiality in the shRNA screens are included. Pathways are shown as nodes whose sizes correlate with pathway lengths and colors represent the significance of their enrichment (the darker the more significant). Pathway nodes are connected and clustered based on their functional similarities, and higher-level functional terms are given for each of the clusters (Methods). For clarity, only the largest clusters are shown. Similar to p53, we next performed our own CRISPR-Cas9 and a control CRISPRi gene essentiality screens, but on a smaller scale, in a pair of isogenic KRAS WT and KRAS G12D mutant MOLM13 cell lines using a sgRNA library targeting top KRAS CDE+ and CDE- genes (details in Supp. Note 10, Table S7). Here, we observed that the KRAS CDE+ genes are more essential in WT than mutants and vice versa for CDE- genes specifically in CRISPR-Cas9 screens (Wilcoxon signed-rank test P<0.074 and P<0.042 for CDE+/- respectively, Figure 6a left panel), but not in CRISPRi screens (Wilcoxon signed-rank P<0.22 and P<0.49; Figure 6a right panel). Similar results were obtained from analyzing published genome-wide CRISPR- Cas930 and shRNA genetic screens31 performed in a different pair of KRAS isogenic cell lines (WT and G12D mutation in DLD1 cell line; Figure 6b). Similar to p53, we also observed a 55 selection of KRAS mutants in patient tumor profiles (TCGA32) based on the copy number alteration patterns of its CDE+ genes (details in Supp. Note 11B). A competition assay shows selection for KRAS mutant over wildtype cells following CRISPR- Cas9 knockout of KRAS CDE+ genes To test whether, like the p53 case, CRISPR-KO of KRAS CDE+ genes can confer a selective advantage to KRAS mutant over WT cells in co-culture, we conducted a similar competition assay using a pair of WT and KRAS G12D mutant isogenic MOLM13 cell lines. As in the experiment for p53, we mixed the WT and KRAS mutant cells at an initial ratio of 95:5 following KRAS CDE+ sgRNA transduction and monitored the population for 15 days to track the percentage of KRAS mutant cells (TdTomato+) with flow cytometry (Methods; Table S8). A total of 10 KRAS CDE+ genes were tested, in addition to a non-targeting control (NTC). In the control group, the KRAS mutant cell fraction decreased with time, indicating that the mutant cells have lower baseline fitness levels than the WT cells. In comparison, in 8 out of 10 CDE+ genes tested, there is a gain in fitness of the mutant cells (Figure 6c; Methods), testifying that even though the KRAS mutant cells have a lower baseline fitness level, the CRISPR-KO of the majority of CDE+ genes can enhance their fitness and in a subset of cases lead to selective outgrowth of KRAS mutant over WT cells in a mixed population. Cas9 expression in cancer cell lines selects for KRAS mutations Multiple studies have reported a higher editing efficiency of Cas9 in p53 mutated versus p53 WT cell lines11,12,14,13. We first asked if this may also extend to KRAS as an equally important CCD. Analyzing induced exogenous Cas9 activity in 1601 cancer cell lines from DepMap (1375 and 226 KRAS WT and mutant, respectively)10, we find that, like p53, Cas9 56 activity is significantly higher in KRAS mutant cells than in KRAS WT cells (P=2.9E-05, Figure 6d; Methods). We repeated the above analysis modeling Cas9 activity vs KRAS status adjusting for p53 status in a linear model, yielding concordant findings (P=2.43E-04). Importantly, across all the 61 cancer driver genes we analyzed above from Vogelstein et al.27, KRAS and p53 are the only ones showing such a significant difference in Cas9 activity between WT and mutant cells after FDR correction (FDR=9.1E-04 for KRAS & 7.1E-06 for p53, Figure 6e). This further shows that in addition to p53, KRAS WT status can also hamper the efficiency of CRISPR-Cas9. Based on the above findings and a recent report13 of selection for p53 mutant due to DNA damage upon Cas9 expression (without sgRNA), we asked whether DNA damage induced by Cas9 alone can also lead to a mutation selection of KRAS and/or other cancer drivers. To this end, we re-analyzed deep sequencing profiles from 42 Cas9-expressed vs matched parental (i.e. without Cas9) cell lines (Methods) from Enache et al.13, and identified a total of 9 cases involving 5 unique KRAS mutations, occurring in 7 different cell lines with moderate to high Cas9 activity (Methods). Seven out of these 9 cases show increased mutant allele frequency after induced Cas9 expression (Wilcoxon signed rank test P=0.027, Figure 6f). Four of the 5 KRAS mutations are missense mutations; the other mutation is an intronic mutation occurring 100bp from the splicing site. This mutation is not present in the parental cell lines but emerges independently after Cas9 expression in four different cell lines. While the results suggest that CRISPR-Cas9 may select for KRAS mutations, the functional role of these mutations needs to be interpreted with caution. Among the 61 cancer driver genes from Vogelstein et al.27, KRAS is a top gene (ranked the second) along with p53 (ranked fourth) that show significant mutant sub- clonal expansion (Figure 6g). Notably, the top genes identified to be involved in mutant sub- 57 clonal expansions have a significant overlap with our previously identified top CCD genes (Fisher?s exact test P=0.04). 58 Figure 6. Beyond p53: experimental evidence identifying KRAS as another major cancer driver whose mutation can be potentially selected for by CRISPR-Cas9. (a) CRISPR-Cas9 and CRISPRi screens of the top KRAS CDE gene knockouts were performed in isogenic MOLM13 and MOLM13-KRAS-G12D cell lines. The box plot shows the trend that the sgRNAs of the KRAS CDE+ genes are more depleted in KRAS WT cells vs KRAS mutant cells and vice versa for KRAS CDE- genes in CRISPR-Cas9 screens, but there is no such trend in the CRISPRi screens. The P values shown are of one-tailed Wilcoxon signed-rank tests. (b) Analysis of published genome-wide CRISPR-Cas9 and shRNA screens in KRAS-isogenic DLD1 cell line. The box plot shows the trend that the CRISPR-KO of KRAS CDE+ genes reduces cell viability more in KRAS WT cells than KRAS mutant cells, while there is no such trend in the shRNA screen. The P values of one-tailed Wilcoxon signed-rank tests are shown. (c) The difference in the percentage of KRAS mutant cells between Day 15 and Day 0 in co-culture (Y-axis), under the CRISPR-KO of different KRAS CDE+ genes (X-axis). Error bars represent standard error. NTC: non-targeting control sgRNA. (d) A box plot showing the comparison of stable Cas9 activity measured via GFP reporter assay10 in 1375 WT vs 226 KRAS mutant cancer cell lines, Wilcoxon rank-sum test P value 5.5E-05. (e) A scatter plot showing the difference in Cas9 activity between cell lines with a driver WT vs mutant for each driver (effect size, X-axis) and a corresponding Wilcoxon two-side significance for this difference (negative log10-P value, Y-axis). (f) The change in mutant allele frequency (X-axis) of the KRAS mutations detected in different cell lines (cell line-mutation pair on the Y-axis) after induced Cas9 expression, compared to the corresponding parental cell lines, based on data from Enache et al.13. The starts and ends of arrows correspond to the mutant allele frequencies in the parental and the Cas9-expressed cell lines, respectively. Cases of increased allele frequency are colored in red, and those with 59 decreased frequency are colored in blue. (g) A scatter plot showing the median change in mutant allele frequency after induced Cas9 expression across all cell lines in Enache et al.13 (X-axis) and the corresponding Wilcoxon signed rank test significance (negative log10-P value, Y-axis) for the 61 major cancer driver genes from Vogelstein et al.27. The p-values are calculated using two- sided Wilcoxon Rank Sum tests unless not specified otherwise. In the boxplots of panels a, b and d, the center line, box edges and whiskers denotes the median, interquartile range and the rest of the distribution in respective order, except for points that were determined to be outliers using a method that is a function of the interquartile range, as in standard box plots. KRAS mutant cells downregulate G2M checkpoint pathway in response to Cas9 induction To investigate the mechanism underlying the potential selective advantage of KRAS mutant vs WT cells during CRISPR-KO, we analyzed gene expression data of 163 pairs of parental (without Cas9) and the corresponding Cas9-expressed cell lines (138 KRAS WT, 25 KRAS mutant)13, and identified the pathways that are differentially regulated upon Cas9 expression between KRAS WT and mutant cells (i.e. up/down-regulated in KRAS mutant but inversely or non-significantly regulated in KRAS WT cells; Figure S9a, Supp. Note 5 & Figure S10 for p53). A major differentially regulated pathway was G2M checkpoint with the highest difference in the normalized enrichment score (Figure S9b), which is strongly downregulated (rank 2/50) in KRAS mutants but strongly upregulated in KRAS WT cells (4/50). This is a canonical pathway that serves to prevent the cells with genomic DNA damage from entering mitosis (M-phase) and thus its downregulation in KRAS mutant cells may provide them with a proliferative advantage [51]. Another top pathway, E2F Targets, which primarily regulates G1/S transition and DNA replication was also found to be downregulated in KRAS mutant cells but upregulated in KRAS WT cells upon Cas9 expression (Figure S9b). Thus, Cas9-induction may 60 similarly underlie the selective advantage of KRAS mutant cells by selectively activating cell cycle checkpoint pathways in response to DNA damage. Discussion In this study, we systematically investigated the possibility of selection of pre-existing cancer driver mutations during CRISPR-Cas9 gene editing. First, we confirmed and extended upon previous findings that selection11,12,13 of pre-existing p53 mutations by CRISPR-Cas9 can happen, showing it in a large set of transformed and non-transformed cell lines. We identified the specific CDE+ genes whose CRISPR-KO is likely to mediate such selection, and further tested and validated some of these predictions in new screens and competitive assays that we have performed. After studying and validating our integrated computational and experimental pipeline in the known case of p53, we turned to apply it to study a collection of major cancer driver genes, and discovered that KRAS is another major cancer driver gene whose pre-existing mutants have a selective advantage during CRISPR-Cas9 gene editing. We demonstrated the selective advantage of KRAS mutant cells performing a CRISPR-KO/CRISPRi screen in isogenic cells with pooled CDE+ gene-targeting sgRNAs, and further in competition assays during the CRISPR-KO of top predicted KRAS CDE+ genes. We also observed that KRAS WT cells have lower Cas9 activity and thus a lower editing efficiency, similar to that observed for p53, which may limit CRISPR-mediated gene-editing in such cells26. Analyzing recently published KRAS screens, we also find a subclonal expansion of KRAS mutant cancer cells following Cas9 expression. Finally, our study also shows that the introduction of the Cas9 protein downregulates the G2M checkpoint and E2F targets in KRAS mutant, but not KRAS WT cells, which may confer selective advantage to KRAS-mutant cells. 61 Multiple factors can contribute to the identity of CDE+ genes, including involvement in DNA repair and cell cycle pathways, being located in chromosomal fragile sites or highly accessible chromatin regions, supporting that their CRISPR-KO can lead to augmented DNA damage. We find that these factors can together account for up to 15% of our CDE+ genes. We also observed that a gene targeted by highly off-target guides can also lead to high DNA damage, which reassuringly only accounts up to 10% of the CDE+ genes (details in Supp. Note 6). Taken together, these three putative mechanisms can explain about 25% of the CDE+ genes we have identified, however the mechanisms underlying the rest are yet open to further studies. Overall, our results point to a need for accounting for CDE effects in the analysis of dependencies in CRISPR screens. More importantly, our studies point to the need for careful selection of sgRNAs for therapeutic genome editing, and recommend cautionary monitoring of KRAS status in addition to that of p53 during therapies utilizing CRISPR-Cas9. Lastly, in the publicly available CRISPR-Cas9 screens that we have analyzed, the current small numbers of cell lines with mutations in other cancer drivers, such as VHL, limits our ability to reliably determine whether these cancer genes could also be selected during CRISPR-Cas9 genome editing. The investigation of the latter thus awaits specifically designed screens in designated isogenic cell-lines. Methods CRISPR and shRNA essentiality screen data We obtained CRISPR-Cas9 essentiality screen (or dependency profile) data in 436 cell lines from Meyers et al.10 for 16,368 genes, whose expression, CNV and mutation data are available via CCLE portal33. We obtained the shRNA essentiality screen data in 501 cell-lines 62 from DepMap portal34 for 16,165 genes, whose expression, CNV and mutation data is available publicly via CCLE portal33. The 248 cell-lines and 14,718 genes that appear in both datasets were used in this analysis (Table S1). For mutation data, only non-synonymous mutations were considered. Synonymous (silent) mutations were removed from the pre-processed MAF files downloaded from CCLE portal33. Identifying CRISPR specific differentially essential genes of a potential CRISPR-selected cancer driver For a given CCD (e.g. p53 or KRAS), we checked which gene?s essentiality (viability after knockout) is significantly associated with the mutational status of the CCD using a Wilcoxon rank sum test in the CRISPR and shRNA datasets, respectively (FDR<0.1). CRISPR- specific differentially essential positive (CDE+) genes are those whose CRISPR-KO is significantly more viable when the CCD is mutated while their shRNA silencing is not, whereas analogously CDE- genes are those whose CRISPR-KO is significantly more viable when the CCD is WT while their shRNA silencing is not. We filtered out any candidate CDE genes whose copy number was also significantly associated with the given mutation to control for potentially spurious associations coming from copy number (we removed genes showing significant association (FDR<0.1)) ? the exact procedure used is described below in the section titled ?Identifying potential CRISPR-selected cancer drivers?). Identifying CDEs considering functional impact of mutations Out of a total of 248 cell lines that we analyzed, 173 cell lines (69.7%) have p53non- synonymous mutations. In addition to identifying CDEs by considering all non-synonymous mutations, we additionally employed a more conservative approach where we aimed to consider 63 only p53 loss-of-function (LOF) mutations in the CDE identification process. To this end, we considered a mutation to be LOF if it was classified as non-sense, indel, frameshift, or among the 4 most frequent non-functional hotspot mutations (R248Q, R273H, R248W and R175H within the DNA-binding domain, determined as pathogenic by COSMIC35). Using this definition we obtained new mutation profiles for p53 and identified CDE genes via the same method described in the section titled ?Identifying CRISPR specific differentially essential genes of a potential CRISPR-selected cancer driver?. We repeated a similar process with the top three known gain- of-function hotspot mutation variants of KRAS. Identifying potential CRISPR-selected cancer drivers of CRISPR-KO To identify additional CCD genes like p53, we considered 121 cancer driver genes identified by Vogelstein et al.27, whose nonsynonymous mutation is observed in at least 10 cell lines (N=61). We determined whether each of these genes is a CCD as follows: for each of the 61 candidate genes, we tested the association between the essentiality of each of genes in the genome (reflected by post-KO cell viability) with the mutational status of the candidate CCD gene using a Wilcoxon rank sum test. We then counted the number of genes, whose essentiality is: (i) significantly positively associated with the candidate CCD mutational status (FDR- corrected p-value<0.1, median essentiality of WT>mutant of the cancer gene), (ii) significantly negatively associated with the candidate CCD mutational status (FDR-corrected p-value<0.1, median essentiality of WT0.1) with the candidate CCD mutation status; we performed this computation separately for the CRISPR and the shRNA screens, respectively. This computation results in a 3-by-2 contingency table for each candidate CCD gene. We then checked whether the distribution of the above three counts in the CRISPR dataset significantly deviates from that in the shRNA dataset 64 via a Fisher?s exact test on the contingency table. If each of the values in the contingency table was greater than 30, we used the chi-squared approximation of the Fisher?s exact test. We further filtered out any candidate CDE genes whose copy number was also significantly associated with the given mutation to control for potentially spurious associations coming from copy number (we removed genes showing significant association (FDR<0.1)). We performed this procedure for all 61 candidate genes one by one and selected those with FDR corrected Fisher?s exact test <0.1. We further filtered out the candidate CCD whose mutation profile is correlated with p53 mutation profile via a pairwise Fisher test of independence (FDR<0.1). We finally report the CCD genes that have a substantial number of CDE+ genes (N>300). Pathway enrichment analysis of CDE+/CDE- genes We analyzed the CDE+/CDE- genes of each of the CCDs for their pathway enrichment with annotations from the Reactome database36 in two different ways. First, we tested for significant overlap between our CDE genes with each of the pathways with hypergeometric tests (FDR<0.1). Second, we ranked all the genes in the CRISPR-KO screen by the differences in their median post-KO cell viability values in mutant vs WT cells, and the standard GSEA method21 was employed to test whether the genes of each Reactome pathway have significantly higher or lower ranks vs the rest of the genes (FDR<0.1). We repeated the GSEA analysis with the genes ranked by differential post-KD cell viability in the shRNA screen, and only reported significant pathways specific to CRISPR but not shRNA screens. We confirmed that for p53, the GSEA method was able to recover the top significant pathways identified by the hypergeometric test (e.g. those in Figure 1d), although extra significant pathways were identified (Table S2). For p53 and KRAS CDE- genes respectively, the enriched pathways were clustered based on the 65 Jaccard index and the number of overlapping genes with Enrichment Map37, and the largest clusters were visualized as network diagrams with Cytoscape38. To study the potential enrichment of CDE genes in common fragile sites (CFSs), we obtained chromosomal band locations of CFS16, and defined the CFS gene set as the set of all genes located within these chromosomal bands (obtained from Biomart39). We tested for a significant overlap between our CDE genes and the CFS gene set with a hypergeometric test, and also confirmed the lack of significant overlap with the corresponding shRNA-DE genesets. Similarly, for the common highly accessible chromatin (HAC) regions, we obtained a list of these regions defined by a consensus of DNAsel and FAIRE across seven different cancer cell lines from a previous study40. Next, we identified sgRNAs which are expected to target such HAC regions (see Calculating off-target scores section) and ranked genes based on the number of targeting such sgRNAs. Taking the top genes equal to the number of p53 CDE+ genes, we computed the enrichment for p53 CDE+ genes via a hypergeometric test. Testing the clinical relevance of copy number alterations of the p53 or KRAS CDE genes We tested the hypothesis that copy number alterations in CDE+ genes (as a possible surrogate for the number of DSBs in these genes) can reduce the fitness of the CCD (p53 or KRAS) WT tumors with patient data. The cancer genome atlas (TCGA)32 data of somatic copy number alteration (SCNA) and patient survival of 7.547 samples in 26 tumor types were downloaded from the UCSC Xena browser (https://xenabrowser.net/). In these tumor types p53 is mutated in more than 5% of the samples. For each sample, the copy number alterations (genomic instability, GI) of a given set of genes, which quantifies the relative amplification or deletion of genes in a tumor based on SCNA was computed as follows41: 66 where si is the absolute log ratio of SCNA of gene i in a sample relative to normal control, and I() is the indicator function. Wilcoxon rank-sum test was then used to test whether the GI of CDE+ geneset is significantly lower than that of control non-CDE genes in CCD-WT but not in CCD-mutant tumors. Further, we tested if higher absolute levels of SCNA of the CDE+ genes are associated with increased rate of CCD (p53 or KRAS) mutation accumulation with cancer stage, as this would further testify that such amplification/deletion events in the CDE+ genes can drive the selection for CCD mutants. To this end, the following logistic regression model was used to identify the genes whose high absolute SCNA computed as above is associated with higher rate of CCD mutation accumulation with cancer stage, while controlling for cancer type and overall mutation load: logit(P(CCD)) =?0 + ?k ? kcaner_type cancer_typek + ?mutation_load mutation_load + ?GI GIi + ?stage stage + ?interact GIi * stage where CCD denotes the binary CCD mutational status of the patient, logit(P(CCD)) is the logit function of the probability of the CCD being mutant; cancer_typek is the dummy variable for the category of the kth cancer type; GIi denotes the absolute value of SCNA levels of the given gene i as computed above; GIi * stage is the interaction term between the GI of gene i and cancer stage, that latter is made into a binary variable whose value is 0 for early stages (I and II) and 1 for late stages (III and IV). We tested the enrichment of CDE+ genes among the genes whose high absolute SCNA levels are significantly associated with higher rate of CCD mutation accumulation with cancer stage (i.e. genes with significantly positive ?interact coefficients in the above model) using a hypergeometric test. 67 Constructs and stable cell lines MOLM13 cells were obtained from DSMZ (Cat. ACC-554) and maintained in RPMI- 1640 medium (Life Technologies, Carlsbad, CA) supplemented with 10% v/v heat-inactivated fetal bovine serum (Sigma-Aldrich, Saint Louis, MI), 2 mM L-Glutamine (LifeTechnologies) and 100 U/mL penicillin/streptomycin (LifeTechnologies). p53 R248Q was PCR amplified from a bacterial expression plasmid (kind gift of Dr. Shannon Lauberth, UCSD) and KRASG12D the pBabe-KRASG12D plasmid (Addgene plasmid 58902, from Dr. Channing Der) using the Kappa Hi-fidelity DNA polymerase (Kappa Biosystems). These PCR amplicons were separately cloned into the MSCV-IRES-tdTomato (pMIT) vector (a kind gift from Dr. Hasan Jumaa, Ulm) using Gibson Assembly. We first generated high-efficiency Cas9-editing MOLM13 leukemia cells by transducing these cells with the pLenti-Cas9-blasticidin construct (Adggene plasmid 52962 - from Dr. Feng Zhang) and selecting stable clones using flow-sorting. Clones were then tested for editing efficiency by performing TIDE analysis42. These MOLM13-Cas9 cells were then transduced retrovirally with the pMIT-p53R248Q or pMIT-KRASG12D mutants and sorted for tdTomato using flow-cytometry (LSR Fortessa, BD Biosciences) to generate isogenic mutant MOLM13-Cas9 cell lines. Immortalized hTERT RPE1 cells were obtained from ATCC? (Cat. CRL-4000??) and maintained in DMEM-F12 medium (Life Technologies, Carlsbad, CA) supplemented with 10% v/v heat-inactivated fetal bovine serum (Sigma-Aldrich, Saint Louis, MI), 2 mM L-Glutamine (LifeTechnologies) and 100 U/mL penicillin/streptomycin (LifeTechnologies). 68 Generation of pooled sgRNA libraries For pooled library cloning, 10 sgRNAs per gene were designed using the gene perturbation platform (https://portals.broadinstitute.org/gpp/public/analysis-tools/sgrna-design) Genetic Perturbation Platform. Guides targeting p53 CDE+ and CDE- genes were synthesized as pools using array-based synthesis and cloned in the Lentiguide puro vector (Addgene plasmid 52963 - kind gift from Dr. Feng Zhang) using Golden Gate Assembly. In each assay, we have used ~240 unique non-targeting sgRNAs and 49 not expressing non-essential genes. A similar approach was used for the KRAS CDE libraries. Pooled sgRNA library screen 30 million MOLM13-Cas9 cells or their isogenic MOLM13-p53 or KRAS mutant counterparts were transduced with the pooled CDE library virus in RPMI medium supplemented with 10% fetal bovine serum, antibiotics and 8 ?g/ml polybrene. The medium was changed 24 hours after transduction to remove the polybrene and cells were plated in fresh culture medium. 48 hours after transduction, puromycin was added at a concentration of 1 ?g/ml to select for cells transduced with the sgRNA library. Puromycin was removed after 72 hours and then cells were cultured for up to 30 days. 7 days after transduction, approximately 4 million cells were collected, and genomic DNA was prepared for the time zero (T0) measurement and also from time 30 (T30). Genomic DNA from these cells was used for PCR amplification of sgRNAs and sequenced using a MiSeq system (Illumina). Fold depletion or enrichment of sgRNAs from the NGS data was calculated using PinAplPy software43. 69 CDE+/- genes identified in isogenic experiments From the read counts per million for each sgRNA at Day 0 and Day 30 from the above pooled CRISPR screens across two replicates, we removed all the sgRNAs with read count < 20 at Day 0. We calculated an average fold change (FC) of reads from Day 0 to Day 30. For each sgRNA, we calculated this FC-rank difference in p53 WT vs mutant in both CRISPR-KO and CRISPRi screens. For consistent comparison with AVANA, we only considered sgRNAs used in both libraries. The top and bottom genes are differentially essential (DE) from each screen. Taking the top ranked genes based on the difference of this score in two screens, we identify the CDE+ and CDE- genes. CRISPR Competition experiments sgRNAs were cloned using standard cloning protocols and lentiviral supernatants were made from these sgRNAs in the 96-well arrayed format. 100,000 MOLM13 cells or tdTomato- positive isogenic mutants were plated in a 96 well pate and transduced with the sgRNA viral supernatants by spinfection with polybrene-supplemented medium. After selection of sgRNA transduced cells with puromycin for 48 hours, sgRNA transduced MOLM13 cells or mutants were mixed together in a ratio of 95:5 respectively, and the percentage of p53 WT or p53 mutant cells was monitored progressively up to 25 days using high-throughput flow-cytometry as described previously23. Quality control of publicly mined genetic screens used in the study We first obtained gold-standard essential and non-essential geneset from Hart et al.44 . To test the quality of each genetic screen we computed an area under the precision-recall curve 70 (AUPRC) using the average logFC across replicates and cell lines. In this study, we only considered the genetic screens with an AUROC > 0.6 (random model AUPRC=0.5). We also employed this method to test the quality of our in-house generated genetic screens. CRISPR-Cas9 Ribonucleoprotein transfection experiments We generated sgRNAs by in vitro transcription using the HiScribe?? T7 Quick High Yield RNA Synthesis Kit (New England Biolabs, Beverley, MA) and performed the Ribonucleoprotein (RNP) complex formation using TrueCut Cas9 Protein v2 (ThermoFisher Scientific, Waltham, MA) according to published protocols45. MOLM13 cells without Cas9 and expressing pMIT- p53R248Q or pMIT-KRAS were generated as described in the Constructs and stable cell lines section. 1M cells were transfected with NFDUFB6 sgRNA or NTC sgRNA in triplicates with 1mg of Cas9 and 1mg of RNA in 10 ml of Buffer R using the Neon?? transfection system (ThermoFisher Scientific; 1500V, 20 ms, single pulse). Cells were maintained in culture for 48 hours before harvest for imaging, dye-dilution and editing estimation assays. For NDUFB6 and NTC editing estimation, we used the Synthego Performance Analysis ICE tool according to the instructions, using un-transfected parental MOLM13 samples as controls and samples from 48 hours post-transfection as the Day 0 initial timepoint and Day 10 as a final time point, in triplicates. For RPE1 experiments, mutant cells with p53R248Q or pMIT-KRAS similarly and transfections were performed using Lipofectamine?? CRISPRMAX?? Cas9 Transfection Reagent (ThermoFisher Scientific) according to the manufacturer?s instructions for 12-well plate format, in triplicates. Cell harvesting time-points were similar to those of MOLM-13. 71 Dye-dilution experiments We used the CellTrace?? Violet Cell Proliferation Kit (ThermoFisher Scientific) to stain MOLM13-WT and MOLM13-p53 mutant cells transfected with Cas9 RNP complexed with NTC or NDUFB6 RNA, according to the manufacturer?s instructions. Cells were maintained in culture in the dark and assayed by flow cytometry using the LSR Fortessa every 2 days for 14 days. FCS files were analyzed using FlowJo software. Analysis of ?-H2AX foci in MOLM13-Cas9 and MOLM13-p53 mutant cells MOLM13-WT and MOlM13-p53 mutant cells were left untreated or treated with 1?M doxorubicin for 2 h at 37?C 5% CO2, which served as negative and positive controls for DNA damage mediated ?-H2AX foci formation, respectively. MOLM13-WT and MOLM13-p53 mutant cells transfected with Cas9 RNP complexed with NTC or NDUFB6, and negative control cells were pelleted at 400g for 5 min at 4?C, washed two times in PBS and fixed in 4% paraformaldehyde in PBS for overnight at 4?C. The cells were washed two times in PBS and permeabilized in 0.25% triton X-100 in PBS for 5 min at room temperature. Following two washes with PBS, the cells were incubated in blocking buffer (3% BSA in PBS) for 30 min at room temperature and subsequently incubated with APC conjugated H2AX phospho (Serine 139) antibody (BioLegend; Cat # 613415) at an antibody dilution of 1:200 in blocking buffer for overnight at 4?C in dark. Cells were washed two times with PBS and resuspended in 150 ?l PBS. Cell suspensions were spotted on poly-lysine coated glass slides using cytospin (Cytospin 4; Thermo Scientific) centrifugation at 800 rpm for 4 min. Coverslips were mounted onto the slides using ProLong Gold antifade reagent with DAPI (Invitrogen) and cured for overnight at room temperature in dark. Slides were imaged in Nikon A1R HD confocal microscope. Sequential z- 72 sections were imaged using a 60x oil objective and maximum projection images were obtained using the Nikon NIS-Elements platform. Cas9 activity in cancer cell lines with KRAS (or other cancer driver) WT vs mutant We downloaded the exogenous Cas9 activity of 1601 cancer cell lines from DepMap portal and their KRAS mutation status considering only non-synonymous variants profiled using whole exome sequencing (1375 and 226 KRAS WT and mutant, respectively)15. We tested whether the Cas9 activity is higher in KRAS mutant vs KRAS WT cell lines using one-sided wilcoxon rank-sum test. We repeated this process for each cancer driver gene and used the FDR corrected significance to rank them in addition to the fold change of Cas9 expression. Subclonal expansion of KRAS mutant in parent vs high Cas9-expressed cell lines We downloaded the deep targeted sequencing of cancer driver genes performed on 42 parental and matched Cas9-expressed cancer cell lines from Enache et al.13. In this analysis, we discarded the cell lines with <20% Cas9 activity and thus low DNA damage. We asked whether mutant allele frequency of a cancer driver (e.g. KRAS) significantly increased in Cas9-expressed cell lines compared to matched parental cell lines using Wilcoxon signed-rank test. In this analysis, we have considered both intronic and exonic variants provided from sequencing. Analysis of differentially expressed pathways in KRAS wildtype and mutant cells in response to Cas9 induction Gene expression profiles of 163 pairs of parental (without Cas9) and the corresponding Cas9-expressed cell lines (138 KRAS WT, 25 KRAS mutant) were obtained from Enache et al.13. Differential expression analysis between the Cas9-expressed cells and the parental cells was 73 performed for the KRAS WT and mutant cells separately, and GSEA analysis21 (genes ranked by logFC) was performed to identify the hallmark pathways from MSigDB46. We next identified pathways that are differentially regulated upon Cas9 expression between KRAS WT and mutant cells. These include the pathways that are up-regulated in the KRAS mutant cells but down- regulated or non-significantly altered in the KRAS WT cells, and vice versa. The pathways are ranked by the difference of normalized enrichment score in WT vs mutant cells. This analysis is performed using the fgsea R package21. Data and Code availability Statement We have provided the scripts and data from both previously published and in-house screens, in their raw and processed form to reproduce each step of results and plots in a GitHub repository which can be accessed here: https://github.com/ruppinlab/crispr_risk Acknowledgements We acknowledge and thank the National Cancer Institute for providing financial and infrastructural support. We thank Curtis Harris, Andre Nussenzweig, Sridhar Hannenhalli and the members of Cancer Data Science Lab for insightful feedback. This research was supported in part by the Intramural Research Program of the National Institutes of Health, NCI. S.S and K.C. are supported by the NCI-UMD Partnership for Integrative Cancer Research Program. AJD would like to acknowledge the support of the National Cancer Institute of the National Institutes of Health under Award Number P30 CA030199, the Rally Foundation for Childhood Cancer Research and Luke Tatsu Johnson Foundation under Award Number 19YIN45, an Emerging 74 Scientist Award from the Children?s Cancer Research Fund, and the V Foundation for Cancer Research (TVF) under Award Number DVP2019-015. References 1. Hsu, P. D., Lander, E. S. & Zhang, F. Development and Applications of CRISPR-Cas9 for Genome Engineering. Cell 157, 1262?1278 (2014). 2. Kim, D. et al. Digenome-seq: genome-wide profiling of CRISPR-Cas9 off-target effects in human cells. Nat. Methods 12, 237?43, 1 p following 243 (2015). 3. Tsai, S. Q. et al. GUIDE-seq enables genome-wide profiling of off-target cleavage by CRISPR-Cas nucleases. Nat. Biotechnol. 33, 187?197 (2015). 4. Fu, Y. et al. High-frequency off-target mutagenesis induced by CRISPR-Cas nucleases in human cells. Nat. Biotechnol. 31, 822?826 (2013). 5. Cullot, G. et al. CRISPR-Cas9 genome editing induces megabase-scale chromosomal truncations. Nat. Commun. 10, 1136 (2019). 6. Charlesworth, C. T. et al. Identification of Pre-Existing Adaptive Immunity to Cas9 Proteins in Humans. doi:10.1101/243345 7. Wang, T. et al. Identification and characterization of essential genes in the human genome. Science 350, 1096?1101 (2015). 8. Aguirre, A. J. et al. Genomic Copy Number Dictates a Gene-Independent Cell Response to CRISPR/Cas9 Targeting. Cancer Discov. 6, 914?929 (2016). 75 9. Munoz, D. M. et al. CRISPR Screens Provide a Comprehensive Assessment of Cancer Vulnerabilities but Generate False-Positive Hits for Highly Amplified Genomic Regions. Cancer Discov. 6, 900?913 (2016). 10. Meyers, R. M. et al. Computational correction of copy number effect improves specificity of CRISPR?Cas9 essentiality screens in cancer cells. Nature Genetics 49, 1779?1784 (2017). 11. Ihry, R. J. et al. p53 inhibits CRISPR-Cas9 engineering in human pluripotent stem cells. Nat. Med. 24, 939?946 (2018). 12. Haapaniemi, E., Botla, S., Persson, J., Schmierer, B. & Taipale, J. CRISPR?Cas9 genome editing induces a p53-mediated DNA damage response. Nature Medicine 24, 927?930 (2018). 13. Enache, Oana M., et al. "Cas9 activates the p53 pathway and selects for p53-inactivating mutations." Nature Genetics (2020): 1-7. 14. Schiroli, Giulia, et al. "Precise gene editing preserves hematopoietic stem cell function following transient p53-mediated DNA damage response." Cell Stem Cell 24.4 (2019): 551-565. 15.Tsherniak, A. et al. Defining a Cancer Dependency Map. Cell 170, 564?576.e16 (2017). 16. Lukusa, T. & Fryns, J. P. Human chromosome fragility. Biochim. Biophys. Acta 1779, 3?16 (2008). 17. Kosicki, M., Tomberg, K. & Bradley, A. Repair of double-strand breaks induced by CRISPR?Cas9 leads to large deletions and complex rearrangements. Nature Biotechnology 36, 765?771 (2018). 18. van den Berg, J. et al. DNA end-resection in highly accessible chromatin produces a toxic 76 break. doi:10.1101/691857 19. Richardson, C. D. et al. CRISPR?Cas9 genome editing in human cells occurs via the Fanconi anemia pathway. Nature Genetics 50, 1132?1139 (2018). 20. Behan, Fiona M., et al. "Prioritization of cancer therapeutic targets using CRISPR?Cas9 screens." Nature 568.7753 (2019): 511. 21. Sergushichev, Alexey A. "An algorithm for fast preranked gene set enrichment analysis using cumulative statistic calculation." BioRxiv (2016): 060012. 22. Brown, Kevin R., et al. "CRISPR screens are feasible in TP 53 wild?type cells." Molecular systems biology 15.8 (2019): e8679. 23. Deshpande, A. et al. Investigation of Genetic Dependencies Using CRISPR-Cas9-based Competition Assays. J. Vis. Exp. (2019). doi:10.3791/58710 24. Boettcher, Steffen, et al. "A dominant-negative effect drives selection of TP53 missense mutations in myeloid malignancies." Science 365.6453 (2019): 599-604. 25. Giacomelli, Andrew O., et al. "Mutational processes shape the landscape of TP53 mutations in human cancer." Nature genetics 50.10 (2018): 1381 26. Tario, Joseph D., et al. "Monitoring cell proliferation by dye dilution: considerations for probe selection." Flow Cytometry Protocols. Humana Press, New York, NY, 2018. 249-299. 27. Vogelstein, B. et al. Cancer Genome Landscapes. Science 339, 1546?1558 (2013). 28. H?hnel, P. S. et al. Targeting components of the alternative NHEJ pathway sensitizes KRAS mutant leukemic cells to chemotherapy. Blood 123, 2355?2366 (2014). 77 29. Jinesh, G. G., Sambandam, V., Vijayaraghavan, S., Balaji, K. & Mukherjee, S. Molecular genetics and cellular events of K-Ras-driven tumorigenesis. Oncogene 37, 839?846 (2018). 30. Luo, Ji, et al. "A genome-wide RNAi screen identifies multiple synthetic lethal interactions with the Ras oncogene." Cell 137.5 (2009): 835-848. 31. Martin, T. D. et al. A Role for Mitochondrial Translation in Promotion of Viability in K-Ras Mutant Cells. Cell Rep. 20, 427?438 (2017). 32. Weinstein, J. N. et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics 45, 1113?1120 (2013). 33. Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603?607 (2012). 34. McFarland, J. M. et al. Improved estimation of cancer dependencies from large-scale RNAi screens using model-based normalization and data integration. Nat. Commun. 9, 4610 (2018). 35. Forbes, S. A. et al. COSMIC: exploring the world?s knowledge of somatic mutations in human cancer. Nucleic Acids Research 43, D805?D811 (2015). 36. Matthews, L. et al. Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Res. 37, D619?22 (2009). 37. Merico, D., Isserlin, R., Stueker, O., Emili, A. & Bader, G. D. Enrichment map: a network- based method for gene-set enrichment visualization and interpretation. PLoS One 5, e13984 (2010). 38. Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular 78 interaction networks. Genome Res. 13, 2498?2504 (2003). 39. Smedley, D. et al. The BioMart community portal: an innovative alternative to large, centralized data repositories. Nucleic Acids Res. 43, W589?98 (2015). 40. Song, L. et al. Open chromatin defined by DNaseI and FAIRE identifies regulatory elements that shape cell-type identity. Genome Res. 21, 1757?1767 (2011). 41. Bilal, E. et al. Improving breast cancer survival analysis through competition-based multidimensional modeling. PLoS Comput. Biol. 9, e1003047 (2013). 42. Brinkman, E. K., Chen, T., Amendola, M. & van Steensel, B. Easy quantitative assessment of genome editing by sequence trace decomposition. Nucleic Acids Res. 42, e168 (2014). 43. Spahn, P. N. et al. PinAPL-Py: A comprehensive web-application for the analysis of CRISPR/Cas9 screens. Sci. Rep. 7, 15854 (2017). 44. Hart, Traver, et al. Measuring error rates in genomic perturbation screens: gold standards for human functional genomics. Molecular systems biology 10.7 (2014). 45. Brunetti, L., Gundry, M.C., Kitano, A., Nakada, D., Goodell, M.A.. Highly Efficient Gene Disruption of Murine and Human Hematopoietic Progenitor Cells by CRISPR/Cas9. J Vis Exp. 134, 57278 (2018). 46. Liberzon, A., Birger, C., Thorvaldsd?ttir, H., Ghandi, M., Mesirov, J.P., Tamayo, P. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 6, 417-425 (2015). 79 80 Chapter 3: Designing optimal combination treatment targeting the clonal architecture of the tumor using scRNA-seq Abstract A combination treatment designed to target the multiple clones of a patient?s tumor could decrease the likelihood of resistant emergence. The ability to learn and predict the response of a drug at a single-cell resolution could help to design such optimal combinations. A lack of large- scale patients? datasets with single-cell (sc) expression hinders our ability to build such models. To overcome this limitation, we built a precision oncology framework for personalized single- cell expression-based planning for treatments in oncology (PERCEPTION) that utilizes large- scale drug screens in cancer cell lines and matched bulk and single-cell transcriptome profiles to build response models that can be translated to patients. We first showed that our predicted viability profile of multiple drugs with known mechanisms of action strongly correlates with the targeted pathway activity at a single-cell resolution, demonstrating our ability to predict at this resolution. We next predict response to monotherapy and combination treatment in three independent screens performed across cancer cell lines and patient-tumor-derived cell lines using their sc-expression profiles. Translating to the clinical context, we successfully stratify responders of combination therapy in a multiple myeloma clinical trial using the tumor?s sc- expression profile. For individuals in this trial, we next find the optimal combination treatments (pairs and triplets) from the existing FDA-approved drugs set, where the pair of gefitinib and niraparib, an EGFR and a PARP inhibitor, is one of the top candidates. In sum, we provide a first-of-its-kind framework to utilize the tumor?s sc-expression to identify responders to combination therapy or to design an optimal combination for an individual patient. 81 Introduction Tumors are typically heterogeneous and composed of numerous different clones, making treatments targeting multiple clones more likely to diminish the likelihood of resistance emerging due to clonal selection, enhancing the overall patient?s response (Castro et al. 2021). Bearing this goal in mind, large-scale combinatorial pharmacological screens in patient-derived cell lines, xenografts, and organoids have given rise to numerous combination treatment candidates (Wensink, et al. 2021, Yao et al. 2020, de Witte et al. 2020). However, these studies are limited by the large number of combinations that are needed to be tested in various genomic contexts and thus, there is a need for in silico methods to narrow down the search space. The characterization of the tumor microenvironment via single-cell omics has already led to countless important insights regarding the complex network of tumor-immune interactions involving many different cell types (Castro et al. 2021). It also offers a promising way to learn and predict drug response at a single-cell resolution. The latter, if successful, could guide the design of drug combinations that target multiple tumor clones disjointly (Shalek et al 2017, Adam et al 2020 & Zhu et al 2017). However, building such predictors of drug response at a single cell (SC) resolution is very challenging due to the paucity of large-scale preclinical or clinical training datasets. Previous efforts have been primarily restricted to pre-clinical models in just a few cell lines (Kim et al 2016, Suphavilai et al 2020). Yet, efforts to identify biomarkers of response and resistance at the patient level using single-cell expression are rapidly emerging for both targeted- and immuno- therapies (Cohen et al 2021, Ledergor et al 2018, Sade-Feldman et al 2018). 82 Aiming to address this challenge systematically, here we present a precision oncology framework for PERsonalized single-Cell Expression-based Planning for Treatments In ONcology (PERCEPTION) that builds upon the recent availability of large-scale pharmacological screens and SC expression data in cancer cell lines to construct machine learning based predictors of SC drug response. First, using these predictors, we show that the predicted viability for drugs with known mechanisms of action correlates with the pathway activity it is targeting at a single-cell resolution. Second, we show that the SC-based models can successfully predict the response to single and combination treatments in three independent screens performed in cancer and patient-tumor-derived cell lines based on their SC-expression profiles. Thirdly, we show that SC-based models successfully stratify responders to combination therapy in a multiple myeloma clinical trial based on their tumor?s SC-expression data. Finally, we identify combination treatments (pairs and triplets) of existing FDA-approved oncology drugs that kill tumor clones as effectively as possible. Results Overview of PERCEPTION To predict patients' response to a therapy using their tumor?s single-cell SC-expression profile, we built a machine learning pipeline called PERCEPTION (Figure 1A, detailed description is provided in Methods). PERCEPTION builds drug response models from large-scale pharmacological screens performed in cancer cell lines where bulk and SC-expression are available. As there is currently a paucity of large-scale matched response and single-cell data either in pre-clinical or patients, we designed a prediction pipeline that first is trained on large- scale bulk-expression profiles of cancer cell lines and then, in a second step, its performance is 83 further optimized by tuning the available SC-expression profiles of cancer cell lines. To this end, we mined bulk-expression (Ghandi et al 2020) and drug response profiles (PRISM) of 100s of different cancer cell lines (N=488, Table S1) from the DepMap database (Tsherniak. et al 2017). The SC-expression profiles of these cell lines (N=205, Table S1) are obtained from Kinker et al. 2020. Drug efficacy is measured via area under the curve (AUC) viability-dosage curve, where lower AUC values indicate increased sensitivity to treatment (Table S1). Briefly, for a given drug X, PERCEPTION performs the following two steps: 1. It first builds a regularized linear response prediction model from the bulk expression and drug response data available for ~300 cancer cell lines. 2. In the second step, the hyperparameters of PERCEPTION were tuned further to maximize its ability to predict the response from SC- expression data i.e. ~160 cancer cell lines with matched SC-expression and response to the drug. The output of our pipeline is a response model and a quantification of its predictive accuracy from SC-expression in never seen before leave-one-out test data. Illustration of PERCEPTION?s ability to predict viability at single-cell resolution via two case studies To visualize PERCEPTION?s ability to predict cell killing at single-cell resolution, we examined our predicted killing for two drugs, where the mechanism of action pathway of the drug is well characterized (Nutlin-3 and Erlotinib). We applied the PERCEPTION pipeline described above to build SC-based predictors for these two drugs and studied them further, as follows. The first case involves the canonical antagonist, Nutlin-3, whose mechanism of killing involves the inhibition of the interaction between MDM2 and tumor suppressor p53; thus, 84 MDM2 high activity is a known response biomarker to nutlin-3 treatment (Arya et al. 2010). Via PERCEPTION, we built a response model for Nutlin-3, whose correlation with the observed response at the bulk expression was 0.598, P=1.2E-16, (with MDM2 expression being one top- ranked predictive features). We predicted the killing post its treatment for 3566 single cells across nine p53 wild-type lung cancer cell lines. Across these single-cells, we observed that that the predicted killing after nutlin-3 treatment and MDM2 expression are strongly correlated across the individual cells screened (Pearson Rho= 0.50, P<2E-16, as visualized in Figure 1D), as expected. We also find that we can identify sub-clones with pre-existing nutlin-3 resistance (Figure 1D-arrow highlight). In the second case, we repeated this analysis to study and visualize PERCEPTION?s ability to predict the response to erlotinib, which targets the gain-of- function mutation in epidermal growth factor (EGFR) and is mainly used to treat lung cancer patients with activating EGFR mutations. We found that individual cells with low EGFR pathway activity signature are predicted to be resistant to this treatment, as expected, and that the predicted and observed killing levels are correlated across individual cells (Pearson Rho= 0.42, P<2E-16) as visualized in Figure 1E. Analogous findings for other EGFR inhibitors developed more recently than erlotinib are provided in Extended Figure 1B. Testing PERCEPTION predictions for FDA approved drugs We applied PERCEPTION to build SC predictors of response for 133 U.S FDA-approved oncology drugs available in the drug screen (PRISM) (Table S2). The predictive performances for these drugs are provided in Figure 1B. We defined models as predictive if the Spearman correlation between their predicted vs observed viability was greater than 0.3. This threshold was chosen as it corresponds to the mean cross-screen replicate correlation observed among three major pharmacological screens (average cross-platform correlation across GDSC, CTD & 85 PRISM ~ 0.30 (Corsello et al. 2020)). We were able to build models for 33% (44 out of 133 drugs, Table S2) of the total drugs tested whose prediction accuracy exceeds this threshold (Figure 1B). Studying the predictive accuracy of these 44 predictive models in a cross-validation manner for different transcriptomics inputs, including SC, bulk, and pseudo-bulk-expression (generated by summing up the gene-mapped reads across single cells, Methods), reassuringly we find that the predictive performance based on SC-expression is comparable to that obtained using bulk-expression or pseudo-bulk (Figure 1C). 86 Figure 1. PERCEPTION-based precision oncology framework. (A) The PERCEPTION precision oncology framework is composed of two steps: (i) Build response models using 87 large-scale drug screens performed on cancer cell lines and matched bulk-expression. (ii) Refine the model parameters i.e., number of features by utilizing sc-expression in cancer cell lines to enable our model for the usage of sc-expression. (B) The number of predictive models for FDA-approved drugs for cancer generated by PERCEPTION during cross-validation (y-axis) when sc-expression (blue), bulk-expression (red), and pseudo-bulk are used for a Pearson correlation threshold (x-axis, Predictability) (C) The distribution of predictive performance (x-axis) of drugs with a predictive model, defined by Pearson Rho>0.3. In the boxplots, the center line, box edges, and whiskers denote the median, interquartile range, and the rest of the distribution, respectively, except for points that were determined to be outliers using a method that is a function of the interquartile range, as in standard box plots. D) In the left-most panel, killing a canonical MDM2 antagonist, Nutlin-3 and expression of MDM2 are provided for every single cell (each point) in the top and bottom tSNE plot, respectively. The intensity of the color denotes the extent of killing in the left panel and MDM2 expression in the right pane, where the respective legends are provided. In this panel, we provide 3566 single- cells from nine p53 WT lung cancer cell lines. The tSNE clustering is performed using the expression profile of all the genes. PERCEPTION predictive performance in an independent large-scale GDSC drug screen We next tested the performance of our models on an independent screen, GDSC (Garnett et al. 2012). To this end, we first identified drugs shared between the PRISM and GDSC screens (N=191, Table S3, quality control and model building steps in Methods). We were able to build PRISM-based predictive models for 16 of these common drugs. The mean correlation between experimental viability reported in GDSC vs PRISM (screen 88 concordance) across 80 shared cell lines serving as an unseen testing set was 0.44. For the same testing set, the mean correlation between the predicted vs observed viability in 0.38 and 0.28 in PSISM and GDSC, respectively (Figure 2A). As expected, the prediction performance of a model in the GDSC test set is correlated with the concordance between the drug?s viability profiles in GDSC and PRISM datasets (Pearson Rho=0.49, P=5.89E-02; Figure 2B, Table S4). As the range of predicted values is smaller than observed (Extended Figure 2), we use a scaled predicted AUC (z-score) in further analyses reported below. Figure 2: PERCEPTION?s performance in the GDSC screen. (A) The correlation 89 measure via Pearson Rho (y-axis) comparing ?GDSC vs PRISM?, ?PRISM vs PERCEPTION?, and ?GDSC vs PERCEPTION?. Drug response predictions were performed at a single-cell resolution and the cell line level response (mean response across single cells) was used to compare. (B) Relationship between the concordance in ?GDSC vs PRISM? (green) and ?GDSC vs PERCEPTION? (orange) across 16 drugs. These are the drugs with pharmacological screens in both PRISM and GDSC platforms and that have a substantial positive correlation between their AUC values (cor > 0.3 and p-value < 0.5 in cell lines excluding the above 80 cell lines). The size of the dots represents the Pearson correlation-based p-value in -log10 scale. The drugs are ordered on the x-axis from left to right in the decreasing order of their correlation between GDSC and PRISM responses. SC-based PERCEPTION models prediction of monotherapy and combination response in a lung cancer cell lines screen To study PERCEPTION in another independent screen, we tested its predictive performance in a recent drug screen in lung cancer cell lines (Nair et al. 2021). We focused on cancer 40 drugs that are FDA-approved or in clinical trials for which we could build predictive PERCEPTION models. We assessed their predictive performance vs drug screen data measured for monotherapy and two-drug combinations of 14 of these drugs (whose viability profiles passed quality control) across 21 lung cancer cell lines in five dosages (Table S5, methods, Supp Note 1). Matched SC-expression was mined for these lung cancer cell lines from Kinker et al. 2020, including about 300 cells per cell line. 90 Given this data, we used the PERCEPTION models to predict the response to each drug in each cell line by computing the mean predicted viability across all the single cells of that from a cell line. The predicted viability is significantly higher in resistant vs sensitive cell lines (Figure 3A, top vs bottom 33% cell lines ranked by viability, Wilcoxon rank-sum P=2E-06, FC=1.53), and can stratify responders vs non-responders (ROC-AUC = 0.72, Figure 3B). The predictive performance of high-confidence screen results (see Methods) is considerably higher (AUC=0.88, Figure 3C-blue curve, FC=1.95, Figure 3B-right panel). The overall mean Pearson correlation between predicted vs observed viability for these drugs is 0.33; P<7.4E-09 (Extended Figure 3A). Detailed drug level comparisons between observed vs predicted viability are provided in Extended Figure 3B. We next tested PERCEPTION?s ability to predict the response to 42 combinations of these 14 drugs studied in this screen (Table S5). A combination response in a given cell line was predicted by adopting the independent drug action (IDA) model across all the single cells from that cell line (Ling et al. 2020) i.e. the combination response of two drugs is simply the effect of the single most effective drug in the combination (Supp Note 2). The predicted combination viability is significantly higher in resistant vs sensitive cell lines (Wilcoxon rank-sum P=8.3E- 03, FC=1.54, Figure 3D) and can stratify the responders vs nonresponders (AUC=0.69, # of resistant data points=28, # of sensitive data points=24, Figure 3E). Like in the monotherapies case, the observed effect size is considerably higher when considering only high-confidence screen results (P<8.8E-03, FC=1.77, Figure 3D-right panel, AUC=0.87, Figure 3E-blue curve, Pair-level comparison between observed vs predicted are provided in Extended Figure 3D). 91 Taken together, these results to the ability of PERCEPTION models to predict single and combination therapies in independent screens without any further training. Figure 3: PERCEPTION?s predicted response successfully stratifies resistant vs sensitive lung cancer cell lines to monotherapy and combination response. A) Illustration of our method defining sensitive and resistant cell lines for treatment from viability profile, where for each drug 92 the top and bottom 33% of the cell lines ranked by viability in ascending order are defined as sensitive and resistant, respectively. B) The predicted viability from PERCEPTION (x-axis) for top vs bottom 33 % cell lines defined as resistant (N=72) vs sensitive (N=84) cell lines, considering all the data points (left panel) or only high-confidence points (right panel) via a standard boxplot. A respective significance computed using a one-tailed Wilcoxon rank-sum test is provided. C) A receiving operator curve is plotted showing the relationship between sensitivity and specificity, where the area under the curve denotes the power of stratification. The colors of the curves - black and blue, represent whether all data points or only high-confidence data points are used, respectively. The area under this curve is provided at the right corner and similarly color-coded. The area under the dashed diagonal line denotes a random-model performance. Like B) and C), panel D) and E), respectively shows our ability to predict combination viability (# of resistant data points=28, # of sensitive data points=24). SC-based PERCEPTION prediction in patient-derived Head and Neck cancer cell lines To test the ability of our models to predict response in patient-derived cell lines (PDC), we used SC-expression of head and neck cancer cell lines derived from five different patients treated with eight different drugs at two concentrations (Table S6), with both monotherapy and combination therapy (Suphavilai et al. 2020). We were able to build predictive PERCEPTION response models for 4 out of the 8 drugs (docetaxel, epothilone-b, gefitinib, and vorinostat; Pearson Rho threshold > 0.25). For monotherapy treatments, the predicted viability is significantly higher in resistant vs sensitive cell lines (N=16 each, Figure 4A), with an AUC of 0.64 (Figure 4B). The predicted viability is correlated with the observed viability (Pearson Rho=0.46; P<0.03, Extended Figure 5A), and individual drug-level correlations are provided in Extended Figure 4C. Higher predicted viability in resistant cell lines is also observed for 93 combination treatments (Figure 4C), with an AUC of 0.86 (Figure 4D). The predicted viability after gefitinib treatment is illustrated at a single-cell resolution in Figure 4E. The predicted vs experimental correlations obtained for all data points and drug levels are provided in Extended Figure 4B, D. These results demonstrate the ability of PERCEPTION models to predict response in patient-derived single cells. 94 Figure 4: Prediction monotherapy and combination response in patient-derived cell lines. (A) The predicted viability from PERCEPTION in resistant (n=16) vs sensitive (n=16) cell lines. (B) ROC curves depicting the prediction power (sensitivity and specificity) of the predicted viability to stratify resistant vs sensitive cell lines. The area under this curve is provided at the right corner and denotes overall model prediction power. The area under the dashed diagonal line denotes a random-model performance. In (C) and (D), we repeated the analysis for combination treatment (Number of resistant vs sensitive cell lines=12 vs 12). The boxplots provided are standard and the significances are computed using the one-tailed Wilcoxon rank- sum test. E) Viability after treatment with gefitinib, a canonical EGFR mutation inhibitor, and activity of the EGFR pathway is provided for every single cell (each point) in the left and right UMAP plot, respectively. The intensity of the dot color denotes the EGFR pathway activity in the left panel and viability in the right panel, where the respective legends are provided. In this panel, we analyze 1116 single-cells from 5 PDCs. The UMAP clustering is performed using the expression profile of all the genes. PERCEPTION Prediction of combination treatment response in a Multiple myeloma clinical trial We turn to test the ability of PERCEPTION models to predict patients? responses based on SC transcriptomics from their tumors, our main goal. Performing a literature search for clinical trials of targeted or chemotherapy that report both patient?s tumor SC-expression data and response labels, we found one such dataset with 41 multiple myeloma patients. The cells were clustered in the original paper to three clones (median) based on their expression profiles. These patients were treated with a DARA?KRD combination of four drugs - daratumumab, 95 carfilzomib, lenalidomide, and dexamethasone (Cohen et al. 2021). SC-expression and response labels were available for 28 of these patients, whose pretreatment sub-clonal distribution is shown in Figure 5A. Patient response was measured via tumor size estimates in radiological images. We built PERCEPTION response models for two out of four of these drugs (carfilzomib and lenalidomide) whose response profiles are available in either PRISM or CTD. Using these models, we predicted a combination response in a given patient via the following two steps. We first predicted the combination response for each clone of a patient using their average expression profile. We observed that the response observed in the most resistant clone of a patient best stratifies the responders vs non-responder patients (P<1.9.E-03, Figure 5C). Therefore, the predicted response of a patient is the response of the most resistant clone available (Methods). The resulting predicted response of a patient can successfully stratify the responders vs non-responders with an AUC of 0.827 (Figure 5D). In comparison, repeating the above analysis using pseudo-bulk expression (computed here as a mean expression over all the cells in the tumor) yields an AUC of 0.56, testifying to the marked benefit of harnessing SC data from patients' tumors to predict their response. 96 Figure 5: PERCEPTION stratifies responders vs non-responders of the combination DACA- KRD therapy regime in multiple myeloma patients. A) Distribution of abundance of malignant 97 sub-clones (y-axis) in each multiple myeloma patient (x-axis) in the trial identified using sc- expression, where the color code for the sub-clones is provided at the top. One-tailed Wilcoxon rank-sum p-value denoting the significance of the higher median predicted viability in resistant cell lines. B) Predicted viability of the combination (z-score) at a clonal level for each patient where response status is provided at the top-strip of each facet. The left to right order of patients is the same as in panel A. C) The predicted combination response in 28 multiple myeloma patients stratified by responders vs non-responders status. D) Receiver Operating curve based on predicted combination response shows the relationship between specificity (x-axis) and sensitivity (y-axis) is provided. The area under this curve, provided at the right bottom corner, denotes the stratification power to distinguish responders vs non-responders. Charting the drug combinations landscape of FDA-approved drugs in multiple myeloma After showing PERCEPTION?s ability to stratify myeloma patients to the DARA?KRD treatment, we next study its application for identifying combination treatments of FDA-approved cancer drugs for the multiple myeloma patients studied in the above trial that targets multiple clones in the tumor disjointly and thus, have a low likelihood for resistance emergence (Figure 6A). To this end, we first identified the set of FDA-approved cancer drugs that have predictive PERCEPTION models (N=44). We then searched for drug combinations that kill disjoint sets of clones in the tumor i.e. those that achieve maximal tumor coverage and killing (Figure 6A). We began with combinations of two drugs (946 possible pairs), ranking every pair by a score denoting the extent of their disjoint killing, termed its Improvement Score (IS). This score quantifies the fold increase in killing compared to expected (Methods, Figure 6B) predicted to be induced by the specific combination. Out of the 946 possible combinations scanned, 842 pairs show no improvement over the expected (Methods). The remaining combinations, with 98 Improvement score > 1, termed effective, are shown in Figure 6B. The extent of the killing of different tumor clones by a few top-ranked effective combinations is shown in Figure 6C including our top-ranked combination hit of gefitinib & ponatinib, an EGFR inhibitor, and a canonical BCR-ABL inhibitor, respectively (IS = 2.57, Empirical P value = 1E-04). Another top combination pair with the high improvement score is vinblastine and sunitinib (Improvement score = 1.55, Empirical P value = 1E-04), a tubulin polymerization and multi-targeted receptor tyrosine kinase inhibitor, respectively. Analogously, we next looked for all possible triplets of drug combinations (Figure 6D, N=13,244), where our top hits include the combination of etoposide + midostaurin + ponatinib (Improvement score = 2.09, Empirical P value = 1E-04) and etoposide + niraparib + ponatinib (Improvement score = 1.95, Empirical P value = 1E-04) (Figure 6E). This method can be utilized for optimal combination design targeted multiple clones in multiple myeloma patients where combinations with high improvement scores. 99 Figure 6: Identify optimal drug combinations for multiple myeloma patients. A) Overview of our method: Given a patient with multiple myeloma composed of three clones, our goal is to find 100 a drug combination that can disjointly kill separate clones, and thus their combination treatment can kill all the clones present and could be an optimal combination. B) For each pair (x-axis), a median improvement score (y-axis) is computed i.e. fold decrease of predicted viability compared to the baseline. Here, an improvement score is provided for 94 pairs where this score is >1. For each pair, the proportion of patients (N=12) where this score is greater than 1 is denoted by the intensity of the color, where the legend is provided at the top. Top pairs are labeled. C) Disjoint killing of clones is visualized using viability profile at a clone level for the top pairs from panel C. This is provided separately for each patient (a facet) where the intensity of the color denotes the viability after the treatment for each clone (x-axis) of a given drug (y- axis) where the legend is provided on the right. Panel D) and E) are analogous to panel B) and C), respectively, but for drug triplets. Discussion We present a framework to build drug response models to predict the drug response in cancer cells at single-cell resolution and demonstrate its application to predicting response to monotherapy and combination treatment at the level of cell lines, patient-derived-cell lines, and a clinical trial of multiple myeloma. The power of our analysis is limited by the availability of sc- expression patients? datasets with response labels. Consequently, we expect that our model would further refine and improve as such datasets would become more available. The current response signatures identified by PERCEPTION are pan-cancer and could be further refined by considering cancer type-specific cell lines, whenever a reasonable number of cell lines become available. We note that the quality of our response model would depend on the quality of the sc- expression profiles available e.g., depth, high drop-out rates, etc., and likely negatively impact our performance. To note, we chose not to impute our data after considering the recent progress 101 on the subject concluding that unexpected high-dropouts are limited to non-UMI-based sc- expression methods and likely due to biological variation (Svensson et al. 2020, Cao et al. 2021). The lack of toxicity and side effects screens in normal cells hindered us from learning these phenotypes. More availability of such screens in the future may enable our models to predict response to drugs in both tumor and normal cells and thus refine our combinations considering their toxicity as well. In summary, this study is the first to harness the high resolution of information from scRNA-seq technology to build drug response models that can be translated to clinical context to identify responders to combination therapy or design an optimal combination for an individual patient. Methods Data collection We first collected the bulk-transcriptomics and drug response profiles generated in cancer cell lines curated in the DepMap (Tsherniak et al. 2017) consortium from Broad Institute (version 20Q1, https://depmap.org/portal/download/). The drug response is measured via area under the viability curve (AUC) across eight dosages and measures via a sequencing technique called PRISM (Corsello et al. 2020). In total, we mined 488 cancer cell lines with both bulk- transcriptomics and drug response profiles. We next mined sc-expression of 205 cancer cell lines (280 cells per cell line) generated in Kinker et al. 2020 from the Broad Single-cell Portal (https://singlecell.broadinstitute.org/single_cell/study/SCP542/pan-cancer-cell-line- heterogeneity#study-download). The metadata, identification, and clustering information were also mined from the same portal. 102 The PERCEPTION pipeline For each FDA-approved drug (N=133), we run the following steps to build a response model. Step 1: Learn from Bulk A. Identifying gene bulk-expression features correlated with viability profile: We first divided cell lines available to us into two sets, where the first set is used in step 1 and set 2 is used in step 2. We chose the cell lines where sc-expression is not available (N=318) and available (N=170) as sets A and B, respectively. Considering the first set of cell lines, we computed a Pearson correlation strength between the expression of each gene and for a query drug d. We considered this score as a measure of information in a gene expression profile and ranked each gene based on the correlated magnitude. B. Build models: Considering the top X gene expression features (where X is a hyperparameter optimized later) ranked based on the above-calculated correlation magnitude, we built a linear model regularized via the elastic net in five-fold cross-validation. Step 2: Optimize using sc-expression. A. Hyperparameter Optimization using sc-expression: We built the above model using a Bayesian- like grid of various X values (range 10-500), where the model with the best performance using sc-expression input of 169 cell lines (left one out for testing) is chosen. B. Performance in Cross- Validation: In the left-out cell line, which has not been used in either model building or hyperparameter optimization, we perform the error estimation. The error estimation is done by repeating the above steps for 170 times, wherein each instance a different cell line is left out. The final predicted values in this leave-one-out testing are compared to observed values via Pearson Correlation. 103 Quality control of in-house pharmacological screen by comparing them to existing screens To test the quality of our in-house screen generated, we compared our screen to a previous high-quality screen from Broad and Sanger Institute, PRISM (Corsello et al. 2020). Specifically, we leveraged the fact that screens for these drugs are also performed in the same cell lines, at least their monotherapy. For each drug, we computed a correlation between our in- house screen and PRISM in matched cell lines. We reasoned that the drugs with correlated profiles in the two screens (Pearson Rho>0.3) are consistent across the two screens suggesting that they are high quality. Independently, we note that the concordance score of drugs? response profile across screens is correlated with our predictive performance (Pearson Rho>0.39; P<0.019, Extended Figure 2A), suggesting that our model is capturing the robust signal across screens of these drugs. Cross-platform comparison of PERCEPTION performance The pharmacological drug screens performed by PRISM and GDSC studies are based on two independent platforms. The GDSC data was downloaded from the DepMap portal (Downloaded: April 15, 2020, https://depmap.org/portal/download/). To compare the performance of PERCEPTION across two independent screening platforms and test if the expression signature captured by our drug response models can be translated across the domains: 1. Of the 347 cell lines in common with drug response in GDSC and PRISM, there are 120 cell lines with sc-expression data. We selected at random 80 cancer cell lines with sc-expression data and pharmacological screens in GDSC and PRISM, 2. We considered all the drugs (N=191) which were screened in both PRISM and GDSC, from 104 which we selected a subset of drugs (N=28) with a concordant response between PRISM and GDSC (Pearson rho > 0.3 and p-value < 0.05; at least 20 cell lines with responses per drug in both GDSC and PRISM) in the 267 cell lines in common between the two screen excluding the cell lines in the testing set. 3. For each of the selected drugs we ran the PERCEPTION pipeline, in Step 2 of the pipeline the parameters were optimized on sc- expression of 90 cell lines (excluding the 80 test cell lines) instead of the default 170 cell lines. 4. Finally, we applied the resulting response models to the testing dataset and compared the predicted AUC values to the experimental response from GDSC and PRISM. We used the Pearson correlation coefficient as the measure to compare the performance between the screens and predicted responses. Generating models for PDC cell lines The single-cell expression of the five HNSC patient-derived cell lines and their treatment response for eight drugs and combination therapy at two different dosages obtained from Suphavilai et al. 2020. For these drugs, PERCEPTION was unable to build drug response models using PRISM screens. Therefore, we incorporated two main changes to the PERCEPTION pipeline: 1. drug response from GDSC screens (response from > ~800 cell lines for these drugs) were used to build models, 2. the expression levels of all the genes in the cancer cell line datasets were not found in the PDC sc-expression datasets and the frequency of dropouts in the PDC dataset is higher? ~3500 genes have > 75% non-zero counts across all the cells in all five patients, as a result only the top 3000 genes which are common in both datasets and with fewer dropouts across all five patients are considered in the pipeline. For the drugs for which 105 PERCEPTION was able to build models, we applied the models on the PDC cell lines and obtained the predictions for each individual cell. The patient-level monotherapy response for a given drug is represented by the mean response of all the cells included in a patient?s PDC sample. In the case of drug combinations, for a given cell, its combination response was the minimum among the two drug responses for the cell. The patient-level combination response was represented by the mean of the combined response of all the cells in a patient's PDC sample. Data availability The entire collection of the processed datasets used in this manuscript, including pre-clinical models of cancer cell lines and PDCs, can be accessed via a ZENODO repository which could be provided upon request or upon publication. Code availability We used open-source R version 4.0 to generate the figures. Wherever required, commercially available Adobe Illustrator 23.0.3 (2019) was used to create the figure grids. All of the scripts for analysis and figure production were built in-house and will be provided upon publication. Acknowledgments This research was supported in part by the Intramural Research Program of the National Institutes of Health, NCI. This work used the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov). We acknowledge and thank the National Cancer Institute for providing financial and infrastructural support. 106 Conflict of interest The authors declare that they have no conflict of interest. Author contributions References 1. Corsello, Steven M., et al. "Discovering the anticancer potential of non-oncology drugs by systematic viability profiling." Nature cancer 1.2 (2020): 235-248. 2. Kinker, Gabriela S., et al. "Pan-cancer single-cell RNA-seq identifies recurring programs of cellular heterogeneity." Nature Genetics 52.11 (2020): 1208-1218. 3. Ghandi, Mahmoud, et al. "Next-generation characterization of the cancer cell line encyclopedia." Nature 569.7757 (2019): 503-508. 4. Shalek, Alex K., and Mikael Benson. "Single-cell analyses to tailor treatments." Science translational medicine 9.408 (2017). 5. Adam, George, et al. "Machine learning approaches to drug response prediction: challenges and recent progress." NPJ precision oncology 4.1 (2020): 1-10. 6. Zhu, Sibo, et al. "Advances in single-cell RNA sequencing and its applications in cancer research." Oncotarget 8.32 (2017): 53763. 7. Kim, K.-T. et al. Application of single-cell RNA sequencing in optimizing a combinatorial therapeutic strategy in metastatic renal cell carcinoma. Genome Biol. 17, 80 (2016). 107 8. Suphavilai, Chayaporn, et al. "Predicting heterogeneity in clone-specific therapeutic vulnerabilities using single-cell transcriptomic signatures." bioRxiv (2020). 9. Cohen, Yael C., et al. "Identification of resistance pathways and therapeutic targets in relapsed multiple myeloma patients through single-cell sequencing." Nature Medicine 27.3 (2021): 491-503. 10. Ledergor, Guy, et al. "Single cell dissection of plasma cell heterogeneity in symptomatic and asymptomatic myeloma." Nature medicine 24.12 (2018): 1867- 1876. 11. Sade-Feldman, Moshe, et al. "Defining T cell states associated with response to checkpoint immunotherapy in melanoma." Cell 175.4 (2018): 998-1013. 12. Garnett, Mathew J., et al. "Systematic identification of genomic markers of drug sensitivity in cancer cells." Nature 483.7391 (2012): 570-575. 13. Tian, Luyi, et al. "Clonal multi-omics reveals Bcor as a negative regulator of emergency dendritic cell development." Immunity (2021). 14. Tsherniak, Aviad, et al. "Defining a cancer dependency map." Cell 170.3 (2017): 564-576. 15. Wensink, G. Emerens, et al. "Patient-derived organoids as a predictive biomarker for treatment response in cancer patients." NPJ precision oncology 5.1 (2021): 1- 13. 16. Yao, Ye, et al. "Patient-derived organoids predict chemoradiation responses of locally advanced rectal cancer." Cell stem cell 26.1 (2020): 17-26. 108 17. de Witte, Chris Jenske, et al. "Patient-derived ovarian cancer organoids mimic clinical response and exhibit heterogeneous inter-and intrapatient drug responses." Cell reports 31.11 (2020): 107762. 18. Cao, Yingying, et al. "Umi or not umi, that is the question for scrna-seq zero- inflation." Nature Biotechnology 39.2 (2021): 158-159. 19. Svensson, Valentine. "Droplet scRNA-seq is not zero-inflated." Nature Biotechnology 38.2 (2020): 147-150. 20. Ling, Alexander, and R. Stephanie Huang. "Computationally predicting clinical drug combination efficacy with cancer cell line screens and independent drug action." Nature communications 11.1 (2020): 1-13. 21. Castro, L. Nicolas Gonzalez, Itay Tirosh, and Mario L. Suv?. "Decoding Cancer Biology One Cell at a Time." Cancer Discovery 11.4 (2021): 960-970. 22. Arya, Arvind K., et al. "Nutlin-3, the small-molecule inhibitor of MDM2, promotes senescence and radiosensitises laryngeal carcinoma cells harbouring wild-type p53." British journal of cancer 103.2 (2010): 186-195. 109 Conclusion In this thesis, I provide an overview of the three computational approaches I developed during my Ph.D. to advance precision medicine for cancer prevention and treatment. The common axis among these three efforts is that we analyze large-scale cancer omics data from both pre-clinical models and patients datasets to generate the initial hypothesis and guide the study. We leveraged large-scale genetic screens in cancer cell lines to identify the cancer risk associated with CRISPR-based therapies i.e. an undesired selection of cells with pre-existing p53 and KRAS mutations and thus calling for carefully monitoring patients undergoing CRISPR- Cas9-based editing for clinical therapeutics for pre-existing p53 and KRAS mutations. We integrated publicly available and our own generated large-scale omics data from African American and European American cancer patients to delineate some of the molecular mechanisms that may underlie the observed differences in cancer incidences across cancer patients from these two ancestries. Focusing mainly on lung cancer, we found that lung tumors from African American (AA) patients exhibit higher genomic instability, homologous recombination deficiency, and aggressive molecular features such as chromothripsis. These molecular differences extend to other cancer types. We also showed that these somatic differences observed may have genetic origins by comparing germline variants from patients of these two ancestries. This provides a therapeutic unique strategy to treat tumors from an ancestry (AAs) with high HRD using PARP and checkpoint inhibitors. 110 Finally, I presented a computational framework to use single-cell RNA-seq from patients? tumors to find combination treatments that can target multiple clones in the tumor disjointly. Using this framework, we predict the response to monotherapy and combination treatment in cell lines, patient-derived-cell lines, and in a clinical trial of multiple myeloma patients and chart the landscape of optimal combination treatments of the existing FDA-approved drugs in multiple myeloma. In summary, we demonstrate the power and wide range of applications of multi-omics analysis to identify cancer risks associated with genetic editing and strategies to overcome it, to delineate molecular mechanisms contributing to the cancer disparity in AA patients, and a treatment strategy based on that, and finally, we built a framework based on single-cell transcriptomics of tumors to stratify responders and guide combination treatment. 111