ABSTRACT Title of Dissertation: ANALYTICAL APPROACHES FOR COMPLEX MULTI-BATCH -OMICS DATASETS AND THEIR APPLICATION TO NEURONAL DEVELOPMENT Theresa Ann Alexander, Doctor of Philosophy, 2023 Dissertation directed by: Assistant Professor Colenso M. Speer, Department of Biology Professor Najib M. El-Sayed, Department of Cell Biology and Molecular Genetics High-throughput sequencing methods are extremely powerful tools to quantify gene expression in bulk tissue and individual cells. Experimental designs often aim to quantify expression shifts to characterize developmental trajectories, disease states, or cellular drug responses. Experimental and genetic methods are also rapidly evolving to capture specific aspects of gene expression, such as targeting individual cell types, regulatory stages, and spatially resolved cell subcompartments. Such studies frequently involve a variety of experimental conditions that require many samples to ensure sufficient statistical power for subsequent analyses, and they are often processed in multiple batches due to limitations on the number of samples that can be collected, processed, and sequenced at once. To eliminate erroneous results in subsequent analyses, it is necessary to deconvolve non-biological variation (batch effect) from biological signal. Here, we explored variational contributions in multi-batch high-throughput sequencing experiments by developing new methods, evaluating contributors to heterogeneity in an axon-TRAP-RiboTag protocol case study, and highlighting biological results from this protocol. First, we describe iDA, a novel dimensionality reduction method for high-throughput sequencing data. High-dimensional data in complex, multi-batch experiments often result in discrete clustering of samples or cells. Existing unsupervised linear dimensionality reduction methods like PCA often fail to resolve this discreteness using projections of maximum variance.
We show that iDA can produce better projections for separating discrete clustering that correlates with known experimental biological and batch factors. Second, we provide a case study of special considerations for a complex, multi-batch high-throughput experiment. We investigated the multi-faceted heterogenic contributions of a study using the axon-TRAP-RiboTag translatomic isolation protocol in a neuronal cell type. We show that popular batch-correction methods may reduce signal due to true biological heterogeneity in addition to technical noise. We offer metrics to help identify biological signal-driven batch differences. Lastly, we employ our understanding of variational contributions in the intrinsically photosensitive retinal ganglion cell (ipRGC) -omics case study to explore transcriptomic and translatomic coordination. Our analysis revealed that ipRGCs participate in subcompartment-specific local protein translation. Genetic perturbations of photopigment-driven neuronal activity led to global tissue transcriptomic shifts in both the retina and brain targets, but the ipRGC axonal-specific translatome was unaltered. ANALYTICAL APPROACHES FOR COMPLEX MULTI-BATCH -OMICS DATASETS AND THEIR APPLICATION TO NEURONAL DEVELOPMENT by Theresa Ann Alexander Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2023 Advisory Committee: Dr. Colenso M. Speer, Co-Chair Dr. Najib M. El-Sayed, Co-Chair Dr. Joshua Singer Dr. Michael P. Cummings Dr. Robert Patro, Dean’s Representative © Copyright by Theresa Ann Alexander 2023 Preface The work in this dissertation was conducted across two research labs. The first part of my dissertation work was spent working with Dr. Héctor Corrada Bravo. Chapter 1 introduces the work conducted in the Corrada Bravo group. The algorithm adapted as a result of the work with Dr.
Corrada Bravo is presented in Chapter 2 of this dissertation, which is peer-reviewed published work. After Dr. Corrada Bravo left the University of Maryland, I transitioned to co-mentorship under the advisors Dr. Colenso M. Speer and Dr. Najib M. El-Sayed. The second half of my dissertation work is presented in Chapters 3, 4, and 5. At the time of this writing, Chapter 4 is under preparation for submission, and Chapter 5 is a portion of a paper which will be submitted within the calendar year pending integration with data which are currently being collected. For these two chapters, all the experimental data collection was performed by Rashmi Gupta and Shyrice M. Mitchell, and all the data analysis was performed by Theresa Alexander. I would like to thank all of my interdisciplinary co-authors, whose contributions made this work impactful and far-reaching. Chapter 2: ● Theresa A Alexander, Rafael A Irizarry, Héctor Corrada Bravo, Capturing discrete latent structures: choose LDs over PCs, Biostatistics, Volume 24, Issue 1, January 2023, Pages 1–16, https://doi.org/10.1093/biostatistics/kxab030. My individual contributions include: 1) literature review, 2) algorithm development, 3) experimental design, 4) algorithm assessment, 5) writing the manuscript, and 6) manuscript revisions. Chapter 4: ● Theresa A Alexander, Rashmi Gupta, Ashton T. Belew, Shyrice M. Mitchell, Najib M. El-Sayed, Colenso M. Speer. “Validating the efficacy of Axon-TRAP RNA-Sequencing Results in Multi-Batch Studies”. This work will be submitted to eNeuro in 2022. My individual contributions include: 1) literature review, 2) data analysis, 3) experimental design for “forensic bioinformatic” analysis, and 4) writing the manuscript. Chapter 5: ● Theresa A Alexander*, Rashmi Gupta*, Ashton T. Belew, Shyrice M. Mitchell, Najib M. El-Sayed, Colenso M. Speer. “The effects of melanopsin on ipRGC cell autonomous and cell non-autonomous programming”.
This work will be submitted to a peer-reviewed journal in 2022 in conjunction with ongoing work. My individual contributions include: 1) literature review, 2) data pre-processing, 3) data analysis, and 4) writing this portion of the manuscript. Other publications that I have completed or contributed to during my PhD work include: ● Alexander, T.A., Machiela, M.J. LDpop: an interactive online tool to calculate and visualize geographic LD patterns. BMC Bioinformatics 21, 14 (2020). https://doi.org/10.1186/s12859-020-3340-1. ● Gomez MA, Belew AT, Navas A, Rosales-Chilama M, Murillo J, Dillon LAL, Alexander TA, Martinez-Valencia A and El-Sayed NM (2021). Early Leukocyte Responses in Ex-Vivo Models of Healing and Non-Healing Human Leishmania (Viannia) panamensis Infections. Front. Cell. Infect. Microbiol. 11:687607. doi: 10.3389/fcimb.2021.687607. ● Sam B. Choi*, Theresa A. Alexander*, Tarlan Vatan*, Chenghang Zhang, Shyrice M. Mitchell, Colenso M. Speer, Peter Nemes. Ultrasensitive Capillary Electrophoresis Trapped Ion Mobility Time-of-Flight Mass Spectrometry Reveals a Proteome Transition During Development of the Brain’s Circadian Pacemaker. Submitted to Mol. Cell. Prot. December 2022. ● Li CY, Bowers JM, Alexander TA, Lawrence K, Coyne JM, Li Y, Juntti SA. Olfactory receptor for female reproductive pheromonal signaling in an African cichlid. In Preparation. ● Gómez, M.A.*, Belew, A.T., Vargas, D.A., Rebellon, D., Rosales-Chilama, M., Alexander, T.A., Parra, L.F.G., Saravia, N., El-Sayed, N.M.* (2022). Innate biosignatures of treatment outcome in patients with cutaneous leishmaniasis. In Preparation. ● Fernandez, O., Gómez, M.A.*, Belew, A.T., Rosales-Chilama, M., Alexander, T.A., El-Sayed, N.M., Saravia, N. (2021). Genotypic and transcriptomic profiling of virulent and avirulent strains of L. (V.) panamensis. In Preparation. ● Gómez, M.A.*, Belew, A.T., Fernandez, O., Rosales-Chilama, M., Alexander, T.A., Saravia, N., El-Sayed, N.M.* (2022).
Human macrophage response and intrinsic resistance of L. (V.) panamensis to pentavalent antimony. In Preparation. Dedication For my family and friends, two-legged and four, who have walked this journey with me. Acknowledgements I am grateful for the many mentors, advisors, and friends who have helped and supported me through my doctoral journey. Firstly, I would like to thank my family for all their support over the past 5 years. Louis, Mom, Dad, Colleen, and Ricky: thanks for celebrating my wins and cushioning my losses. I am so fortunate to have gone through this process with such a supportive group behind me. I will always be grateful that my first introduction to my doctoral journey was with Dr. Héctor Corrada Bravo's group. During my initial time at UMD, Héctor gave me the independence and the confidence I needed to find myself as a student and researcher. Along with Héctor, I feel indebted to the Corrada Bravo lab members: Jay, Dom, Mohamed, and Aya, for setting the tone for a work environment I will now forever chase after. Dr. Najib El-Sayed and Dr. Colenso Speer took me in as an orphaned graduate student and made the transition feel seamless even in the midst of a global pandemic. I am so thankful for the mentorship and support each of them has given me over the past 3 years. The skills I have acquired for interdisciplinary research and communication would not exist if not for Najib and Colenso. I am also grateful for the members of my thesis committee, who each offered guidance towards my growth as a researcher as well as in navigating my extremely nonlinear doctoral journey (in alphabetical order): Dr. Michael Cummings, Dr. Rob Patro, and Dr. Josh Singer. Each of you has challenged and supported me while making me feel like a valuable member of the scientific community. I am also greatly appreciative of all the mentorship, guidance, support, and friendship from Dr. Ashton Trey Belew.
I learned so much from getting to work with Trey but most importantly was the importance of fun in research. I would like to thank the members of the Speer and Singer labs for their support, help, and friendship throughout my work. I would specifically like to thank (in alphabetical order): Jackie Minehart, Morgan Musgrove, Andrea Orpia, Kat Pizano, Tarlan Vatan. I would also like to send my love and thanks to members of the BISI and CBCB community, specifically Alexis Boleda, Jason Fan, and Ashley Della Fera. Without the many hours of talks for research and otherwise, I would not be where I am today. ix Table of Contents Preface........................................................................................................................... ii Dedication .................................................................................................................... vi Acknowledgements ..................................................................................................... vii Table of Contents ......................................................................................................... ix List of Tables .............................................................................................................. xii List of Figures ............................................................................................................ xiii List of Abbreviations .................................................................................................. xv Chapter 1: Introduction to investigating sources of variational contributions in RNA- seq data.......................................................................................................................... 1 1.1 Variability in high dimensional data ................................................................... 1 1.1.1 Genome/Transcriptome-wide association studies........................................ 
2 1.1.2 Single cell RNA-sequencing variability ...................................................... 3 1.2 Linear dimensionality reduction methods ........................................................... 3 1.2.1 Unsupervised: Principal Component Analysis (PCA) ................................. 3 1.2.1 Supervised: Linear Discriminant Analysis (LDA) ...................................... 4 1.3 Non-Linear dimensionality reduction methods .................................................. 5 1.4 Contributions....................................................................................................... 6 Chapter 2: A Novel Dimensionality Reduction Method for high-throughput RNA- sequencing data ............................................................................................................. 7 2.1 Introduction ......................................................................................................... 7 2.2 iDA Algorithm .................................................................................................. 11 2.3 Methods for Performance Assessment.............................................................. 13 2.3.1 Convergence Assessment........................................................................... 14 2.3.2 Cumulative F-Statistic ............................................................................... 14 2.3.3 Cumulative R2 ............................................................................................ 16 2.3.4 Clustering Validation ................................................................................. 17 2.4 Results ............................................................................................................... 17 2.4.1 iDA preserves cluster structure better than PCA ....................................... 
17 2.4.2 iDA directions are highly correlated with known PBMC cell type markers ............................................................................................................................. 24 2.4.3 iDA directions are highly correlated with SNPs in eQTL ......................... 27 2.4.4 iDA directions are highly correlated with known technical noise ............. 29 2.4.5 iDA Converges........................................................................................... 33 2.5 Discussion ......................................................................................................... 35 2.6 Software Availability ........................................................................................ 41 x Chapter 3: Introduction to investigating -omics data types in neurons ...................... 42 3.1 Central dogma of biology within neurons ........................................................ 42 3.2 Regulation of local protein translation .............................................................. 43 3.3 ipRGCs to study activity-dependent local protein translation .......................... 44 3.3.1 ipRGCs subcompartments are physically separated .................................. 45 3.3.2 ipRGCs are defined by functional, developmental, and regional types ..... 46 3.3.3 ipRGCs are genetically tractable ............................................................... 47 3.4 Neuronal Translatomes: Axon-TRAP-RiboTag ............................................... 48 3.5 Contributions..................................................................................................... 50 Chapter 4: Validating the efficacy of Axon-TRAP RNA-Sequencing Results in Multi- Batch Studies .............................................................................................................. 52 4.1 Background ....................................................................................................... 
52 4.1.1 Multi-batch processing in high-throughput studies ................................... 52 4.1.2 Assumptions made by batch-correction methods may not often be met ... 53 4.1.3 Case Study of axon-TRAP RNA-seq data from long range neuronal connections from eye-to-brain ............................................................................ 54 4.2 Results ............................................................................................................... 59 4.2.1 Experimental batch contributes to unwanted batch effect ......................... 59 4.2.2 Experimental batch sequencing depth does not correlate with unique genes observed .............................................................................................................. 62 4.2.3 Batch effect can be driven by true biological signal .................................. 65 4.2.4 Ribosomal protein expression is variable between experimental batches . 70 4.3 Discussion ......................................................................................................... 73 4.4 Conclusion ........................................................................................................ 78 4.5 Materials and Methods ...................................................................................... 78 4.5.1 Experimental Design .................................................................................. 79 4.5.2 Preprocessing of RNA-Sequencing Reads................................................. 80 4.5.3 Cumulative F-Statistic with SVA Dimensions .......................................... 80 4.5.4 Variance Partition Analysis ....................................................................... 82 4.5.5 Sequencing Depth Contributional Assessment .......................................... 
82 Chapter 5: The role of melanopsin on ipRGC cell autonomous and cell non- autonomous programming .......................................................................................... 84 5.1 Introduction ....................................................................................................... 84 5.1.1 Intrinsically photosensitive retinal ganglion cells...................................... 84 5.1.2 Melanopsin and its role in development .................................................... 85 5.1.3 Experimental Design .................................................................................. 87 5.2 Results ............................................................................................................... 88 5.2.1 Gene expression is tissue location and Opn4 genotype specific................ 88 5.2.2 Local translation occurs in ipRGC axons at postnatal day 15 ................... 96 5.2.3 Local translation in ipRGC somas is melanopsin-dependent at P15 ....... 102 xi 5.2.4 Local translation in ipRGC axonal projections in the dLGN is melanopsin- independent at P15 ............................................................................................ 103 5.3 Discussion ....................................................................................................... 105 5.4 Conclusions ..................................................................................................... 110 5.5 Future Directions ............................................................................................ 111 5.6 Materials and Methods .................................................................................... 113 5.6.1 Mouse lines .............................................................................................. 113 5.6.2 Tissue Collection ..................................................................................... 
114 5.6.3 Axon- Translation ribosome affinity purification (TRAP) ...................... 115 5.6.4 Axon-TRAP-RiboTag RNA Isolation ..................................................... 116 5.6.5 Axon-TRAP-RiboTag Library preparation and sequencing .................... 117 5.6.6 Axon-TRAP-RiboTag RNA-sequencing read preprocessing .................. 118 5.6.7 Tissue Collection for Bulk RNA sequencing (RNA-seq) ........................ 119 5.6.8 RNA isolation for Bulk RNA-seq ............................................................ 119 5.6.9 Library preparation and sequencing of Bulk RNA-seq ........................... 120 5.6.10 Bulk RNA-sequencing read preprocessing ............................................ 121 5.6.11 Bioinformatic Analyses ......................................................................... 122 5.7 Supplementary Material .................................................................................. 124 Bibliography ............................................................................................................. 129 xii List of Tables Table 2.1 Confusion matrix between iDA and PCA+Louvain Clustering………….. 20 Table 2.2 Comparison of mean expression of canonical PBMC cell type markers… 24 Table 2.3 Minor Allele Frequencies per population of top eQTL SNP/gene pairs…. 27 Table 2.4 Adjusted rand index for clusters from iDA versus PCA embeddings……. 33 Table 4.1 F-statistic for each SV dimension resulting from SVA…………………... 60 Table 5.1 Significant DE genes dysregulated between Opn4 genotypes……………. 
91 Table 5.2 DE results for receptor/ligand pairs in the PDGF family in the retina…… 92 Table 5.3 DE results for receptor/ligand pairs in the Ephrin family in the retina…… 93 Table 5.4 Sample numbers for P15 axon-TRAP experiment……………………… 118 Table 5.5 Sample numbers for P8 bulk RNA-seq experiment……………………...121 Supplementary Table 1 Jaccard index of sample SNPs…………………………… 125 xiii List of Figures Figure 2.1 PCA versus LDA projection for data where data is clustered……………. 8 Figure 2.2 Elbow plots to determine the optimal number of dimensions to use in the PCA embedding……………………………………………………………………... 15 Figure 2.3 Cumulative F-statistic and Cumulative R2 for the PBMC and Geuvadis datasets………………………………………………………………………………. 19 Figure 2.4 tSNE and UMAP visualizations for the PCA and iDA PBMC cell embeddings………………………………………………………………………….. 22 Figure 2.5 Dunn Index calculated for bootstrapped samples for each dataset………. 23 Figure 2.6 Putative cell type clusters found with iDA are separated both by known cell type markers and in the iDA embedding………………………………………...26 Figure 2.7 iDA projection 3 separates populations………………………………….. 28 Figure 2.8 tSNE of iDA embedding for ERCC negative control dataset…………… 30 Figure 2.9 Method comparison on pseudo-cell scRNA-seq dataset………………… 32 Figure 2.10 iDA Convergence………………………………………………………. 34 Figure 2.11 iDA runtime…………………………………………………………….. 35 Figure 3.1 RiboTag genetic approach with Opn4 Cre driver………………………...49 Figure 4.1 Batch effect in ipRGC case study………………………………………... 57 Figure 4.2 SVA vs COMBAT batch-correction…………………………………….. 59 Figure 4.3 SVA versus known experimental factors………………………………... 61 Figure 4.4 Mapped read counts per sample…………………………………………. 63 Figure 4.5 Relationship of mapped reads versus genes observed…………………… 64 Figure 4.6 Bootstrapped downsampling of sequencing counts within each sample…65 Figure 4.7 Non-zero genes quantified per sample by subcompartment location/genotype……………………………………………………………………. 
67 Figure 4.8 Non-zero genes observed between axon-TRAP and bulk RNA-seq……. 69 Figure 4.9 Non-zero genes observed in the original axon-TRAP-RiboTag………… 70 Figure 4.10 Variance partition analysis of axon-TRAP-RiboTag samples…………. 71 Figure 4.11 Hierarchical clustering of the expression profiles for all ipRGC axon- TRAP samples………………………………………………………………………. 72 Figure 5.1 ipRGC subcompartment and Opn4 (melanopsin) genotypic differences in global transcriptomics……………………………………………………………….. 90 Figure 5.2 Differential gene expression (DE) analysis of the transcriptome……….. 91 Figure 5.3 GSEA of ipRGC transcriptome………………………………………….. 96 Figure 5.4 Non-zero genes observed in study samples……………………………… 98 Figure 5.5 ipRGC subcompartment-specific mRNA translation…………………... 100 Figure 5.6 GSEA of axon-enriched mRNAs………………………………………. 101 Figure 5.7 Retina Opn4-dependent translation…………………………………….. 103 Figure 5.8 dLGN Opn4-independent translation………………………………….. 104 xiv Supplementary Figure 5.1 Heterogeneity in Opn4 expression in bulk RNA-seq….. 124 Supplementary Figure 5.2 IGV view of Opn4 aligned reads………………………. 125 Supplementary Figure 5.3 Comparison of SNP sets between bulk RNA-seq samples……………………………………………………………………………... 127 Supplementary Figure 5.4 Dendrogram of clustered bulk RNA-seq samples……... 
128 List of Abbreviations scRNA-seq Single cell RNA-seq RNA-seq Bulk RNA-sequencing GWAS/TWAS Genome/transcriptome-wide association studies SVA Surrogate variable analysis PBMC Peripheral blood mononuclear cells PCA Principal component analysis PCs Principal components t-SNE t-Distributed Stochastic Neighbor Embedding UMAP Uniform Manifold Approximation and Projection iDA iterative Discriminant Analysis LDA Linear discriminant analysis LD Linear discriminant QDA Quadratic discriminant analysis TSS Total sum of squares RSS Residual sum of squares DI Dunn Index NK Natural killer MAF Minor allele frequency EUR European YRI Yoruban SNP Single nucleotide polymorphism eQTL Expression quantitative trait loci ERCC External RNA Controls Consortium GEMs Gel beads in emulsion ARI Adjusted rand index FDR False discovery rate RGCs Retinal ganglion cells cRGCs Conventional retinal ganglion cells ipRGCs intrinsically photosensitive retinal ganglion cells SCN Suprachiasmatic Nucleus dLGN dorsal Lateral Geniculate Nucleus P0, P8, P15, P60 Postnatal day 0, 8, 15, 60 E20 Embryonic day 20 TRAP Translating ribosome affinity purification DE Differential expression GSEA Gene set enrichment analysis Opn4 Melanopsin SC Superior colliculus bp Base pairs BBI-ACGT Brain & Behavior Institute-Advanced Genomic Technologies Core UMIs Unique molecular identifiers Het Heterozygote KO Knock-out CPM Counts per million log2FC Log2 Fold Change Eph Ephrin receptor Efn Ephrin ligand GO Gene ontology KEGG Kyoto Encyclopedia of Genes and Genomes VEGF Vascular endothelial growth factor PDGF Platelet-derived growth factor HIF-1a Hypoxia-inducible factor 1a RTKs Receptor tyrosine kinases DAG Diacylglycerol BCL Binary base call IGV Integrative Genome Viewer Chapter 1: Introduction to investigating sources of variational contributions in RNA-seq data 1.1 Variability in high dimensional data The number of studies that rely on high dimensional biological data collection continues to grow.
This trend has been intensified recently by the advent of single cell RNA-seq (scRNA-seq), which simultaneously measures genome-wide expression profiles for tens of thousands of cells. In these studies, dimensionality reduction techniques are commonly used to capture the underlying discrete structure of the data in a small number of dimensions. This structure can be imposed by sources of variation that could be strategic (experimental), biological but unknown and potentially interesting, or unwanted and unintentionally introduced by laboratory techniques. In high dimensional transcriptomic and translatomic datatypes, accurate dimensionality reduction techniques are essential for visualizing the data structure in exploratory analyses, identifying any potential variation attributed to noise in the data, and cutting computational costs for downstream analysis such as clustering. Low-dimensional representations need to capture the underlying structure of the samples from true separable differences between study groups as well as potential unwanted batch variation. In the case of scRNA-seq, investigators are often interested in discovering previously unknown cell types based on their gene expression profile. In bulk RNA-sequencing (RNA-seq), investigators commonly vary experimental conditions, introducing batch effects that induce unwanted variability which may cause spurious associations in downstream analysis if not corrected for (Leek et al., 2010). Whether heterogeneity across classes arises from unwanted sources of variation, variation of response to an experimental intervention, or different cell types in scRNA-seq experiments, it is vital to be able to identify a low dimensional embedding which describes the separation of classes. This embedding can be used as an interpretable reduction to characterize and investigate features which drive class separation or to correct for unwanted sources of variation in downstream analysis.
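The simplest form of such a correction, when the batch factor is known, is to include it as a covariate in a linear model of expression. The following is a minimal numpy sketch on simulated data; the gene, the effect sizes, and the batch structure are all hypothetical, and this is not surrogate variable analysis (which addresses the harder case where the factor is unobserved):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy example: expression of one simulated gene across 200 samples, with a
# known batch shift partially confounded with the condition of interest.
n = 200
batch = rng.integers(0, 2, n).astype(float)                     # 0/1 batch label
condition = (rng.random(n) < 0.3 + 0.4 * batch).astype(float)   # confounded design
expr = 1.0 * condition + 2.0 * batch + rng.normal(0, 0.5, n)    # true effect = 1.0

# Naive model: expression ~ condition (batch ignored).
X_naive = np.column_stack([np.ones(n), condition])
beta_naive = np.linalg.lstsq(X_naive, expr, rcond=None)[0]

# Adjusted model: expression ~ condition + batch.
X_adj = np.column_stack([np.ones(n), condition, batch])
beta_adj = np.linalg.lstsq(X_adj, expr, rcond=None)[0]

print(beta_naive[1])   # inflated by the batch shift
print(beta_adj[1])     # close to the true effect of 1.0
```

Because condition and batch are correlated here, the naive condition coefficient absorbs part of the batch shift, while the adjusted model recovers an estimate near the simulated effect size.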
Varying experimental conditions commonly introduce batch effects that induce unwanted variability (Leek et al., 2010). For example, the Geuvadis Project expression data (Lappalainen et al., 2013) shows clear evidence of batch effect, where the laboratory in which the samples were processed contributes to a large part of variation in the data. If unaccounted for, this batch effect may cause spurious associations in downstream analysis. This is particularly problematic when the outcome of interest is confounded with the variable responsible for the batch effect (Leek et al., 2010). 1.1.1 Genome/Transcriptome-wide association studies In genome or transcriptome-wide association studies (GWAS/TWAS), where each locus or transcript is tested for association with an outcome, standard statistical tests assume observations are independent. When there is underlying structure in the data, such as genetic population structure, the independence assumption does not hold (Brown et al., 2018). To correct for this latent structure, state-of-the-art methods include population structure as a covariate in the model. In GWAS, self-reported ancestry is not reliable (Mersha & Abebe, 2015), and the latent structure attributed to ancestry must be estimated. Similarly, in TWAS, sources of technical variability are impossible to predict, and methods that estimate these unknown factors, such as surrogate variable analysis (SVA) (Leek & Storey, 2007), have been proposed. 1.1.2 Single cell RNA-sequencing variability A common motivation of studies using scRNA-seq is to detect novel cell types based on the natural grouping of the transcriptomic profiles and identify marker genes that define these cell types (Jain, 2010). The current standard approach is to define cell types using unsupervised clustering, followed by statistical tests used to identify genes with greater expression in one cell type compared to the others.
While this may give researchers valuable information for a particular gene of interest, the method fails to provide a broad summary of a set of genes which define cell type separation. A more useful approach would be to find an embedding space which defines the latent structure of the separation of cell types. Often in scRNA-seq, the boundaries between putative cell types are ill-defined, making it difficult to delineate them definitively (Kiselev et al., 2019). Defining the latent structure that captures the features which best separate cell types from each other in low-dimensional space would be preferred. 1.2 Linear dimensionality reduction methods 1.2.1 Unsupervised: Principal Component Analysis (PCA) Principal Component Analysis (PCA) is commonly used as a dimensionality reduction method in these applications since it finds principal components (PCs) that explain variation in the data. For sources of variation such as batch effects or genetic ancestry, the variation in the data set is largely attributed to factors not associated with the outcome variable, and PCA is used to find an embedding to estimate and account for these latent factors. In other data types like scRNA-seq, PCA is used to reduce the computational cost of downstream analyses like clustering. The application of PCA to detect latent discrete structure assumes that latent classes will in fact account for most of the dataset variance. However, since PCA is unsupervised, it does not necessarily separate classes (Lever et al., 2017). When the clusters formed by latent classes fall in directions of maximum variance, PCA will indeed find an embedding where clusters separate, but if these clusters do not fall in directions of maximum variance, embeddings computed by PCA will fail to separate these clusters. Since PCA seeks to find transformations that explain most variance in the data, it inherently favors separating clusters with large variance.
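This failure mode is easy to demonstrate on simulated data. In the numpy sketch below (hypothetical data, not from any dataset in this dissertation), two classes are well separated along a low-variance axis but share a high-variance, uninformative axis; the top principal component tracks the uninformative axis, while a supervised discriminant direction of the kind described in the next section recovers the separation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes separated along a low-variance axis (y), swamped by a shared
# high-variance axis (x) that carries no class signal.
n = 500
x = rng.normal(0.0, 10.0, size=2 * n)
y = np.concatenate([rng.normal(-2.0, 0.5, n), rng.normal(+2.0, 0.5, n)])
X = np.column_stack([x, y])
labels = np.repeat([0, 1], n)

# PCA: top eigenvector of the covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
pc1 = eigvecs[:, np.argmax(eigvals)]

# Two-class LDA direction: w = Sw^{-1} (mu1 - mu0), with Sw the pooled
# within-class scatter matrix.
mu0, mu1 = X[labels == 0].mean(axis=0), X[labels == 1].mean(axis=0)
Sw = np.cov(X[labels == 0], rowvar=False) + np.cov(X[labels == 1], rowvar=False)
w = np.linalg.solve(Sw, mu1 - mu0)
w /= np.linalg.norm(w)

def class_gap(direction):
    # Distance between class means after projecting onto `direction`.
    proj = Xc @ direction
    return abs(proj[labels == 1].mean() - proj[labels == 0].mean())

print(abs(pc1[0]))     # ~1: PC1 aligns with the uninformative x-axis
print(class_gap(pc1))  # small relative to the ~10-unit spread along PC1
print(class_gap(w))    # ~4: the discriminant direction recovers the y-axis
```

The principal component captures nearly all of the variance yet almost none of the class separation, which is exactly the scenario in which a discriminant-based embedding is preferable.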
A well-separated cluster with low variance will likely not be separated by the top embedding directions since it does not contribute much to the total variance of the data set. For data sets with an innate discrete latent structure containing clusters with unbalanced variances, an alternative method for recognizing latent variables is needed. 1.2.2 Supervised: Linear Discriminant Analysis (LDA) Another method used for dimensionality reduction, classification, and data visualization is linear discriminant analysis (LDA). LDA, like PCA, is a linear reduction method, but class labels are used to find optimal separating boundaries between the classes. The class labels are used to maximize the ratio of between-class variation to within-class variation. In instances where the major sources of variation within a dataset are known (such as experimental variables and batches), LDA produces weighted vectors which describe the data separation and variability. Because the discriminants are linear, the contribution of each feature to the separability along each dimension is interpretable. 1.3 Non-Linear dimensionality reduction methods Visualization techniques, such as t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP), use a different approach to retain the local structure of the original high-dimensional data in the low-dimensional embedding (McInnes et al., 2018; van der Maaten & Hinton, 2008). t-SNE first computes joint probabilities from pairwise similarity scores in both the full-dimensional data space and the low-dimensional projection space. Next, the algorithm minimizes the Kullback-Leibler divergence between the joint probabilities and iterates until convergence. This method preserves small pairwise distances/local similarities between samples, but because the cost function is not convex, convergence to the global minimum is not guaranteed.
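The objective just described can be sketched in heavily simplified form. The code below is an illustrative toy, not the full t-SNE algorithm: it uses a single fixed Gaussian bandwidth for the high-dimensional affinities (real t-SNE calibrates a per-point bandwidth to a target perplexity) and plain fixed-step gradient descent on the Kullback-Leibler divergence, with the Student-t kernel in the low-dimensional space:

```python
import numpy as np

rng = np.random.default_rng(4)

# Tiny dataset: two 10-point clusters in 5 dimensions.
X = np.vstack([rng.normal(0, 0.3, (10, 5)),
               rng.normal(2, 0.3, (10, 5))])
n = len(X)

def pairwise_sq_dists(Y):
    diff = Y[:, None, :] - Y[None, :, :]
    return (diff ** 2).sum(-1)

# High-dimensional affinities: Gaussian kernel with a fixed bandwidth.
P = np.exp(-pairwise_sq_dists(X) / 2.0)
np.fill_diagonal(P, 0.0)
P /= P.sum()

def q_matrix(Y):
    # Low-dimensional affinities: Student-t kernel, as in t-SNE.
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl(P, Q):
    mask = P > 0
    return float((P[mask] * np.log(P[mask] / Q[mask])).sum())

# Fixed-step gradient descent on KL(P || Q) over a 2-D embedding Y.
Y = rng.normal(0, 1e-2, (n, 2))
kl_start = kl(P, q_matrix(Y))
for _ in range(300):
    # t-SNE gradient: 4 * sum_j (p_ij - q_ij)(y_i - y_j)/(1 + ||y_i - y_j||^2)
    W = (P - q_matrix(Y)) / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(W, 0.0)
    Y -= 1.0 * 4 * ((np.diag(W.sum(1)) - W) @ Y)
kl_end = kl(P, q_matrix(Y))
print(kl_start, kl_end)   # the KL objective decreases
```

Even this stripped-down version illustrates the two points made in the text: the embedding is obtained by iterative optimization of a non-convex objective, and the resulting coordinates are not a linear function of the input features.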
Similarly to tSNE, UMAP is a neighborhood-preserving non-linear method, but it offers improvements in computational complexity over tSNE. These methods often yield visualizations that capture latent discrete structure in data well. However, they are computationally expensive and in practice are usually applied to data already reduced to a smaller dimension using PCA as a first step, thus inheriting the issues raised about PCA above. More importantly for biological interpretation, the embeddings obtained by t-SNE and UMAP are nonlinear, making it hard to interpret feature weights to determine which features contribute to the signal separating latent discrete structure in the data.

1.4 Contributions

For datasets with an innate discrete latent structure containing clusters with unbalanced variances, an alternative method for recognizing latent variables is needed. If the experimental design is such that one may expect unequal variances between groups of interest, the resulting PCA + tSNE or UMAP visual representation will not accurately separate differences between distinct groups with small variance between them. In particular, when a large portion of the variation between study groups is attributed to one group's large variation, the difference between the groups with smaller distances between them becomes unobservable. In Chapter 2, we develop a novel dimensionality reduction method, iterative discriminant analysis (iDA). iDA is an iterative extension of linear discriminant analysis (LDA). This new method mitigates this bias and preserves the distinct separation between low variance groups.

Chapter 2: A Novel Dimensionality Reduction Method for High-Throughput RNA-Sequencing Data

2.1 Introduction

In this chapter, we propose iterative Discriminant Analysis (iDA), an unsupervised dimensionality reduction technique that addresses the problems of PCA and nonlinear methods like t-SNE and UMAP discussed in Chapter 1.
iDA uses linear transformations to produce an embedding space that contains discriminatory information that optimally separates latent clusters. The method leverages linear discriminant analysis (LDA) to find transformations that separate classes from each other while also minimizing variance within each class. It improves on PCA in that it defines the latent space while maintaining unobserved clustered architecture. Denote the observed data matrix as $X \in \mathbb{R}^{n \times p}$, with $n$ observations and $p$ features. LDA, like PCA, is a dimensionality reduction technique since it finds a linear projection of the data features to best fit the data in a lower dimension, producing an embedding $Y$. A major difference between these two linear projection techniques is that LDA is a supervised method, while PCA is unsupervised. Unsupervised PCA finds projections which maximize the total variance in the data, which may or may not separate classes in these projections (Figure 2.1). If the class assignments are known a priori, LDA can be used to find projections which best separate the classes. In Figure 2.1, the direction which maximizes variation is uninteresting since it does not separate the latent clustering; the LDA projection, however, finds a distinct boundary between the classes. As in this example, a direction with low variation, but potentially valuable class-defining information, is not guaranteed to be preserved in the first few principal components. When determining the number of PCs to use, it is common practice to evaluate the eigenvalues for the top PCs in an elbow plot to determine the best cutoff for which eigenvectors to include in the reduced embedding. Since the eigenvalues indicate the amount of variation attributed to their respective eigenvectors, the top PCs which capture the majority of the variation in the data are commonly used, meaning the lowly variable but well defined clusters will not be preserved in the PCA embedding. Figure 2.1.
PCA versus LDA projection where the data is clustered. The PCA projection fails to separate the classes because they do not fall on the projection which maximizes variation for the data. The supervised LDA projection maximizes the ratio of between to within cluster variation.

PCA maximizes the total variance of the embedded data by maximizing the objective function

$$\max_{W \,:\, W^\top W = I}\; \operatorname{tr}\!\left(\Sigma_Y\right) \qquad (2.1)$$

where $\Sigma_Y$ is the empirical covariance of the embedded data $Y$,

$$Y = XW \qquad (2.2)$$

$$\Sigma_Y = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})(y_i - \bar{y})^\top \qquad (2.3)$$

and $y_i$ is the $i$th row of the embedding $Y$, corresponding to the low-dimensional representation of the $i$th data observation ($y_i = W^\top x_i$). LDA uses observed assignments $z_i$ to one of $C$ classes to find projections which separate classes by maximizing the objective function:

$$\max_{w}\; \frac{w^\top S_B\, w}{w^\top S_W\, w} \qquad (2.4)$$

where $S_B$ is the between class empirical covariance matrix and $S_W$ is the within class empirical covariance matrix:

$$S_W = \sum_{c=1}^{C} \sum_{i\,:\,z_i = c} (x_i - \mu_c)(x_i - \mu_c)^\top, \qquad S_B = \sum_{c=1}^{C} n_c\, (\mu_c - \mu)(\mu_c - \mu)^\top \qquad (2.5)$$

The within class scatter matrix is the covariance of the differences between each data point and the mean vector of its corresponding cluster, across all clusters. The between class scatter matrix is the covariance of the differences between each cluster's mean vector and the overall mean of all data points. While PCA maximizes global variation of the data, LDA maximizes the variation specific to the clustered architecture since class labels are known. Intuitively, by separating the data into clusters and maximizing the ratio of between to within cluster variation, the resulting projection will be one which maximizes the variance between clusters while minimizing the variance within each cluster. LDA assumes that all classes have equal covariance, but the extension to quadratic discriminant analysis (QDA) allows for unequal covariance matrices when calculating the within class scatter matrices, therefore allowing for more flexible, non-linear decision boundaries between classes. LDA is a supervised classification technique which requires class labels in order to maximize this ratio of between class variation to within class variation.
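The LDA objective (2.4) is typically solved as a generalized eigenproblem, $S_B w = \lambda S_W w$, built from the between and within class scatter matrices. A minimal sketch, assuming the standard scatter-matrix construction; the small ridge added to $S_W$ and the toy data are conveniences of this sketch, not part of the method description:

```python
import numpy as np
from scipy.linalg import eigh

def lda_directions(X, labels):
    """Maximize w^T S_B w / w^T S_W w by solving S_B w = lambda * S_W w."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    mu = X.mean(axis=0)
    p = X.shape[1]
    S_W = np.zeros((p, p))   # within-class scatter
    S_B = np.zeros((p, p))   # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        centered = Xc - mu_c
        S_W += centered.T @ centered
        d = (mu_c - mu).reshape(-1, 1)
        S_B += len(Xc) * (d @ d.T)
    # Generalized symmetric eigenproblem; ridge keeps S_W positive definite.
    evals, evecs = eigh(S_B, S_W + 1e-6 * np.eye(p))
    return evecs[:, np.argsort(evals)[::-1]]     # best-separating first

# Illustrative check: clusters offset along a low-variance axis (axis 1).
rng = np.random.default_rng(0)
a = rng.normal([0.0, 0.0], [5.0, 0.3], size=(200, 2))
b = rng.normal([0.0, 2.0], [5.0, 0.3], size=(200, 2))
W = lda_directions(np.vstack([a, b]), np.repeat([0, 1], 200))
ld1 = W[:, 0] / np.linalg.norm(W[:, 0])          # points along axis 1
```

The leading eigenvector recovers the class-separating direction even though it carries little of the total variance, which is the behavior contrasted with PCA above.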
The challenge of bridging the gap between PCA and LDA is defining class labels which capture the latent discrete structure of the data. The core idea of iDA is to use an unsupervised clustering method to define classes and use those in LDA to find a low dimensional embedding. This unsupervised method aims to find a latent clustered structure in the data while being unbiased with respect to factors such as batch, population, or cell type. This agnostic approach avoids using recorded experimental class labels, although the iDA algorithm does allow the user to supply an initial set of class labels if desired. Iteration in this case starts with the embedding step instead of the clustering step, using the given labels, but this will simply find an embedding that separates based on the provided labels. An approach that incorporates labels as soft assignments would be helpful, but we leave it for future work.

2.2 iDA Algorithm

The iDA algorithm determines the number of clusters to use in each iteration by maximizing the clustering's modularity, as used in the Walktrap clustering step. If, however, the user would like to find an embedding with a predetermined number of clusters, this can be achieved by setting a user-defined parameter as input to the iDA function. For batch correction, instead of only providing known batch indicators as input class labels to LDA, we assume variation may come from unmeasured sources, and therefore using a variable like batch would not fully capture groups arising from such variation. Moreover, in the case of scRNA-seq, cell types are unknown, and since known marker genes are not defined for all cell types and subtypes, inferring cell types is difficult.
The iDA algorithm is as follows:

1) Extract informative genes to identify those which may be potentially affected by unmeasured variables.

2) Cluster the samples using community detection (Walktrap by default) to get class labels for each sample.

3) Decompose the informative gene expression matrix via Linear Discriminant Analysis (LDA) or Quadratic Discriminant Analysis (QDA) using the class labels obtained from the clustering.

4) Update the class labels by clustering in the discriminant space. The new optimal class assignments can then be used for decomposition until the class assignments have stabilized in the embedding space.

iDA results in an embedding space of dimension $C - 1$, where $C$ is the number of clusters identified by iDA which maximizes the community structure modularity, and each dimension determines a decision boundary which best separates an individual class from the rest. To find reliable clusters which define the discrete latent structure, we use the Walktrap method for community detection in a nearest neighbor graph, which computes the similarities between neighbors based on random walks in the graph (Pons & Latapy, 2005). To compute the shared nearest neighbors graph, first a k-nearest neighbors graph is built with a default number of neighbors. Then, the pairwise Jaccard similarity is computed (the number of common neighbors each vertex pair shares divided by the number of vertices that are neighbors of at least one of the two vertices). The shared nearest neighbors graph is then pruned such that vertex pairs with similarities less than 1/15 are set to zero, and converted to an undirected graph, which is then the input to the Walktrap clustering algorithm. The Walktrap clustering returns a hierarchical clustering. The point at which to cut the tree is determined by computing the modularity at each level of the tree and cutting it where the modularity is maximized.
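The shared-nearest-neighbor graph construction described above can be sketched as follows. This is an illustrative sketch, not the package implementation: the neighbor count k = 10 is an assumed default (the text's default value is not reproduced here), the 1/15 pruning threshold is taken from the text, and the subsequent Walktrap step (available in, e.g., the igraph library) is omitted:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_graph(X, k=10, prune=1 / 15):
    """kNN graph -> pairwise Jaccard similarity of neighbor sets ->
    pruned, undirected shared-nearest-neighbor adjacency matrix."""
    n = X.shape[0]
    _, idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
    neighbor_sets = [set(row) for row in idx]
    adj = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(neighbor_sets[i] & neighbor_sets[j])
            union = len(neighbor_sets[i] | neighbor_sets[j])
            jaccard = shared / union
            if jaccard >= prune:                 # drop weak edges
                adj[i, j] = adj[j, i] = jaccard  # undirected graph
    return adj

# Two well-separated blobs: within-blob edges survive, cross-blob edges vanish.
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0.0, 0.1, size=(30, 2)),
                    rng.normal(5.0, 0.1, size=(30, 2))])
adj = snn_graph(points, k=10)
```

The resulting adjacency matrix would then be handed to the community detection routine, with the hierarchical tree cut at the modularity-maximizing level.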
For our experiments below, we check for convergence as explained in the Convergence Assessment section. Since iDA remains agnostic to the study design, we do not need to worry about recording errors in phenotype or batch data, or self-identification discrepancies. In our implementation we initialize the embedding by selecting features with the highest marginal variance.

2.3 Methods for Performance Assessment

To assess the performance of iDA, we used two data sets. The first is from the Geuvadis bulk RNA-sequencing project, which has expression data from 462 individuals of European and Yoruba ancestry, sequenced in 7 different laboratories (Brown et al., 2018). This data set exhibits a latent structure with an overwhelming amount of variation attributed to batch effects and some variation attributed to population structure. The second data set is the 10X Genomics peripheral blood mononuclear cell (PBMC) single cell RNA-sequencing (scRNA-seq) data set, which has expression from roughly 3,000 unlabeled immune cells (Zheng et al., 2017). We chose these datasets because of the known structure embedded in each (genetic population structure and batch in Geuvadis, and distinct cell types in the PBMC dataset). Note, however, that these labels are not used as input to the iDA algorithm.

2.3.1 Convergence Assessment

To assess the convergence of iDA, we inspect the stability of the cluster assignments resulting from the Walktrap community detection at each iteration. Since the cluster assignments dictate the scatter matrices calculated for the discriminant analysis, once the cluster assignments are stable, the discriminators will not change. To evaluate when the cluster assignments have stopped changing, we calculate the Adjusted Rand Index as a measure of concordance between the cluster assignments of consecutive iterations. The iterations are stopped when the Adjusted Rand Index between the cluster assignment vectors exceeds 0.98.
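The stopping rule can be sketched with scikit-learn's Adjusted Rand Index. The clustering step is passed in as a callable stub, since only the convergence logic described above is shown; the function names and toy labels are illustrative:

```python
from sklearn.metrics import adjusted_rand_score

def run_until_converged(cluster_step, initial_labels,
                        threshold=0.98, max_iter=20):
    """Iterate a clustering step until consecutive label vectors agree,
    judged by Adjusted Rand Index > threshold."""
    labels = initial_labels
    for _ in range(max_iter):
        new_labels = cluster_step(labels)
        if adjusted_rand_score(labels, new_labels) > threshold:
            return new_labels          # assignments have stabilized
        labels = new_labels
    return labels

# Toy step: the clustering immediately stabilizes on `target`,
# so the loop stops once two consecutive assignments agree.
target = [0, 0, 1, 1, 2, 2] * 10
result = run_until_converged(lambda labels: target, [0, 1] * 30)
```

In the full algorithm, `cluster_step` would be the embed-then-cluster iteration, so stable labels also imply stable discriminant directions.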
2.3.2 Cumulative F-Statistic

To compare the capacity of iDA and PCA to accurately find a subspace which describes the underlying clustering in the data, we adapt the F-statistic, as well as $R^2$, to assess models across dimensions. First, we perform linear regressions which model each dimension of PCA and iDA as a function of the observed variables that describe the known underlying group structure (for the Geuvadis data set, this is the interaction between population and laboratory; for the 10X Genomics PBMC scRNA-seq data, this is cell type). To determine the appropriate number of PCs to use in this comparison for each dataset, we evaluated the elbow plots to determine the eigenvectors accounting for the majority of the variation in each dataset (Figure 2.2).

Figure 2.2. Elbow plots to determine the optimal number of dimensions to use in the PCA embedding. To determine how many top PCs to use for each dataset, the eigenvalues are plotted for each PC dimension, and the cutoff is determined based on the "bend" in the plot.

To ensure an appropriate amount of variation was captured in the PC embedding for comparison to iDA, we use the first 10 PCs for both the Geuvadis and PBMC datasets. For each embedding dimension $d$, we fit

$$y_{id} = \beta_{0d} + \sum_{g} \beta_{gd}\,\mathbb{1}[c_i = g] + \varepsilon_{id} \qquad (2.6)$$

where $y_{id}$ is the embedding value of sample $i$ in dimension $d$ and $c_i$ is the known group of sample $i$. Then, to assess how well these known classes are separated across dimensions, we use an adaptation of the F-statistic. We define this cumulative F-statistic as:

$$F_{\text{cum}}(D) = \frac{\left(\sum_{d=1}^{D} \text{TSS}_d - \sum_{d=1}^{D} \text{RSS}_d\right) / (k-1)}{\left(\sum_{d=1}^{D} \text{RSS}_d\right) / (n-k)} \qquad (2.7)$$

where we compute the total sum of squares (TSS) and residual sum of squares (RSS) per regression model and sum over dimensions 1 through $D$. This statistic measures the ratio of the variance of the group means to the within-group variance, accumulated over dimensions, and indicates how well the dimensionality reductions from PCA and iDA describe group separation for known groups.
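One plausible implementation of this cumulative F-statistic, with a group-means fit standing in for the per-dimension regressions; the pooled degrees of freedom ($k-1$ and $n-k$ for $k$ groups and $n$ samples) and toy data are assumptions of this sketch:

```python
import numpy as np

def cumulative_f(embedding, groups):
    """Cumulative F over dimensions 1..D: pool per-dimension TSS and RSS
    from a group-means model before forming the F ratio."""
    groups = np.asarray(groups)
    n, k = embedding.shape[0], len(np.unique(groups))
    tss_sum = rss_sum = 0.0
    stats = []
    for d in range(embedding.shape[1]):
        y = embedding[:, d]
        tss = np.sum((y - y.mean()) ** 2)
        # residual SS after fitting one mean per known group
        rss = sum(np.sum((y[groups == g] - y[groups == g].mean()) ** 2)
                  for g in np.unique(groups))
        tss_sum += tss
        rss_sum += rss
        stats.append(((tss_sum - rss_sum) / (k - 1)) / (rss_sum / (n - k)))
    return stats

# Strong group separation in dimension 0, pure noise in dimension 1:
# adding the noise dimension dilutes the cumulative statistic.
rng = np.random.default_rng(0)
groups = np.repeat([0, 1], 50)
emb = np.column_stack([groups * 10.0 + rng.normal(0, 1, 100),
                       rng.normal(0, 1, 100)])
f_stats = cumulative_f(emb, groups)
```

Because each added dimension contributes both to the pooled TSS and the pooled RSS, the statistic can rise or fall as dimensions are added, which is why it is not monotonic.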
Because both the TSS and RSS change as dimensions are added, the cumulative F-statistic is not monotonic.

2.3.3 Cumulative $R^2$

We also compute a cumulative $R^2$ for dimensions 1 through $D$ to measure the total variance captured in the sub-spaces from each reduction technique over dimensions:

$$R^2_{\text{cum}}(D) = 1 - \frac{\sum_{d=1}^{D} \text{RSS}_d}{\sum_{d=1}^{D} \text{TSS}_d} \qquad (2.8)$$

This cumulative $R^2$ represents the proportion of variance over dimensions that is explained by the underlying grouping.

2.3.4 Clustering Validation

To validate that the iDA method identifies better latent clustering in the original data than the PCA dimensionality reduction, and is not creating spurious clustering, we assess the compactness and separation of clusters in the original space using the Dunn Index (DI). The DI is defined as the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance:

$$DI = \frac{\min_{i \neq j}\, \delta(C_i, C_j)}{\max_{k}\, \Delta(C_k)} \qquad (2.9)$$

where $\delta(C_i, C_j)$ is the inter-cluster distance between clusters $C_i$ and $C_j$, and $\Delta(C_k)$ is the intra-cluster distance (diameter) of cluster $C_k$. To ensure robustness, the Dunn Index is calculated using a bootstrapping approach: observations are sampled with replacement, and the Dunn Index is calculated over 1,000 bootstrap samples.

2.4 Results

2.4.1 iDA preserves cluster structure better than PCA

The cumulative F-statistic evaluates how well a low dimensional embedding captures the separation of clusters in the data. The iDA algorithm was applied to the 10X Genomics scRNA-seq PBMC data set and compared to the PCA reduction of the same dimension. For comparison of the embeddings from iDA and PCA, we used the cell identities assigned by the Seurat scRNA-seq pipeline as the cell types. Figure 2.3A provides evidence that the decision boundaries from iDA offer an embedding which better separates the putative cell types compared to PCA, as determined by the cumulative F-statistic.
As expected, the first dimension of PCA, which captures the most variation in the data, separates one cluster very well (likely the most variable cluster), but as dimensions are added, the projections do not capture the cluster separation. In the iDA embedding, each dimension equally captures the separation of one cluster, so the cumulative F-statistic does not degrade as we add dimensions. iDA also finds LDs which capture a higher proportion of variance explained over all dimensions than the PCs do when evaluating variance between cell types (Figure 2.3B). Although PCA's objective function is to maximize variance, it fails to find variation attributed to some cell types, likely ones with low variation. There is evidence of this in the acute drop-off in both the cumulative F-statistic and $R^2$ values.

Figure 2.3. Cumulative F-statistic and cumulative $R^2$ for the PBMC and Geuvadis datasets. A) As more dimensions are added for each of the reduction methods, the clustering as determined by the Seurat putative cell type assignments for each of the samples is better separated by the iDA embedding than by PCA. B) The iDA embedding captures more of the total variance attributed to the latent clustering than PCA. C) As more dimensions are added for each of the reduction methods, the clustering from the main effects and interaction between laboratory and population for each of the samples is better separated by the iDA embedding than by PCA. D) The iDA embedding captures more of the total variance attributed to the latent clustering than PCA.

We see a similar trend for the statistics computed using the Geuvadis data set. We evaluate the performance of iDA and PCA using the groups from the main effects and interaction between the population and the laboratory in which the samples were processed.
While these variables may not account for all latent clustering in this data, we know they contribute to a large portion of the clustering, and an embedding should capture these clusters (and potentially others). Here, the first dimension of PCA does a very poor job of accounting for the latent clustering attributed to population and laboratory; more evidence that maximizing the variation does not guarantee the best separation of clusters (Figure 2.3C). iDA again consistently finds better separation for this data, as well as better overall variation attributed to these groups (Figure 2.3D). Many of the clusters produced by both iDA and PCA + Louvain (Seurat clusters) are nearly identical between the two methods (Table 2.1). The main difference is that the boundary between Seurat Clusters 0 and 1 changes such that some Seurat Cluster 0 cells get grouped with cells in Seurat Cluster 1.

Table 2.1. Confusion matrix between iDA and PCA + Louvain clustering.

Based on both the tSNE and UMAP of these projections, the boundary between these two clusters is not well defined (Figure 2.4). To determine which clustering is optimal, we compute the bootstrapped Dunn Index (DI) in the full gene expression space for class assignments resulting from iDA compared to the class assignments from other methods over 1,000 iterations. The PCA and iDA embeddings were compared using multiple clustering techniques, including two community detection algorithms common in single cell clustering, Walktrap clustering (the default for the iDA algorithm) and Louvain community detection (Blondel et al., 2008), as well as K-means clustering, which has been shown to be closely linked to the unsupervised dimension reduction of PCA (Ding & He, 2004).

Figure 2.4. tSNE and UMAP visualizations for the PCA and iDA PBMC cell embeddings. Similarly to PCA, the iDA embedding can be used with both tSNE and UMAP to visualize cluster separation.
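The bootstrapped Dunn Index used in these comparisons can be sketched in Python, assuming Euclidean distances (the distance metric, function names, and toy data are assumptions of this sketch):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    """Smallest between-cluster distance / largest within-cluster diameter."""
    X, labels = np.asarray(X), np.asarray(labels)
    clusters = [X[labels == c] for c in np.unique(labels)]
    inter = min(cdist(a, b).min()
                for i, a in enumerate(clusters) for b in clusters[i + 1:])
    intra = max(pdist(c).max() for c in clusters if len(c) > 1)
    return inter / intra

def bootstrapped_dunn(X, labels, n_boot=1000, seed=0):
    """Resample observations with replacement and recompute the index."""
    X, labels = np.asarray(X), np.asarray(labels)
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))
        if len(np.unique(labels[idx])) < 2:
            continue                  # a valid index needs >= 2 clusters
        out.append(dunn_index(X[idx], labels[idx]))
    return np.array(out)

# Compact, well-separated toy clusters give a Dunn Index above 1.
rng = np.random.default_rng(2)
pts = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
                 rng.normal(5.0, 0.3, size=(20, 2))])
labs = np.repeat([0, 1], 20)
di = dunn_index(pts, labs)
boot = bootstrapped_dunn(pts, labs, n_boot=20)
```

Comparing the bootstrap distributions of this index across competing label sets, as done for the PCA and iDA clusterings, indicates which assignments are more compact and better separated in the original space.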
The clusters found in the PBMC scRNA-seq dataset using the iDA embedding with each of the community detection algorithms -- Walktrap (green) and Louvain clustering (light blue) -- both have Dunn Indices which are significantly increased compared to the clusters found with the PCA embedding and Louvain clustering (i.e., the clusters found with the Seurat pipeline) (yellow) (Figure 2.5A). In the Geuvadis dataset, the Dunn Indices of the clusters found by all clustering methods with the iDA embedding are significantly increased compared to the clusters found by the PCA embedding with Louvain clustering as well as by the recorded batch (laboratory) and population of each sample (Figure 2.5B). This indicates that the class assignments resulting from iDA define clusters in the original space which are compact and better separated than class assignments determined by either the Seurat pipeline for scRNA-seq or the measured covariates in Geuvadis.

Figure 2.5. Dunn Index calculated for bootstrapped samples for each dataset. Dunn Index computed for bootstrapped samples from A) PBMC cells and B) Geuvadis samples using either the PCA embedding or the iDA embedding and several different clustering methods. The sampling was performed by taking n = number of observations in each dataset, with replacement, bootstrapped 1,000 times. A) The iDA algorithm with the community detection algorithms (Walktrap and Louvain) both have significantly higher Dunn Indices than the PCA embedding with Louvain clustering. B) All clustering methods with the iDA algorithm generate significantly better clusters with respect to the Dunn Index when compared to the PCA embedding and the known batch and population.

2.4.2 iDA directions are highly correlated with known PBMC cell type markers

Canonical markers of particular cell types have been characterized in PBMCs and can be used to validate clustering methods (Zheng et al., 2017).
These markers have been experimentally elucidated such that cells with high expression of these markers are likely of the corresponding cell type based on their known function. CD14 and LYZ are known markers for monocytes, MS4A1 is a marker for B-cells, GNLY and NKG7 are markers for Natural Killer (NK) cells, and PPBP is a marker for platelets. If the clustering algorithm captures clusters indicative of cell type, the resulting clusters should reflect this in the marker genes' expression. For example, LD6 is the directional vector which separates cluster 1 (with cells in cluster 1 mapping to the lowest weights of the vector). We see that the cells in cluster 1 highly express the two markers for monocytes, CD14 and LYZ (Figure 2.6A). Quantitatively, both markers for monocytes have much higher mean scaled expression in these cells than in the cells in all other clusters (Table 2.2).

Table 2.2. Comparison of mean expression of canonical PBMC cell type markers.

The LD3 vector separates the cluster of cells (cluster 6) which highly express the markers of NK cells, GNLY and NKG7 (Figure 2.6B). These markers have much higher average scaled expression in this cluster than in cells outside it. Additionally, iDA is sensitive enough to separate the cluster of cells with high expression of both marker genes (as opposed to only one) from cells in cluster 2, which are high in NKG7 expression but not GNLY, indicating that cluster 6 is a cluster of NK cells. LD5 separates the cluster which highly expresses the marker for B-cells, MS4A1 (Figure 2.6C), also with a much higher overall mean scaled expression compared to cells in all other clusters (1.971 versus -0.291). This indicates cluster 5 is likely a population of B-cells. Lastly, a marker for platelets, PPBP, is highly expressed in the cells in cluster 9 identified by the iDA algorithm.
This cluster is separated by LD1 and shows a clear separation, with the platelet cells mapping to high LD weights (Figure 2.6D).

Figure 2.6. Putative cell type clusters found with iDA are separated both by known cell type markers and in the iDA embedding. A) The group of cells which highly express both LYZ and CD14, known markers for monocytes, is separated by the iDA dimension LD6 (salmon colored points). B) The group of cells which highly express GNLY and NKG7, known markers for NK cells, is separated by the iDA dimension LD3 (light blue colored points). C) The group of cells which highly express MS4A1, a known marker for B-cells, is separated by the iDA dimension LD5 (teal colored points). D) PPBP is a known marker for platelets. The group of cells which highly express PPBP is separated by the iDA dimension LD1 (pink colored points). E) FCER1A is a known marker for dendritic cells. The group of cells which highly express FCER1A is separated by the iDA dimension LD4 (purple colored points).

2.4.3 iDA directions are highly correlated with SNPs in eQTL

The Geuvadis data set has known sub-population features attributed to ancestry and the location where the samples were sequenced. iDA identifies an LD which separates the YRI population from the EUR populations (LD3). Interestingly, this LD purely captures the separation in population and groups together all samples sequenced in different laboratories. Brown et al. (2018) show that expression reflects population structure because alleles, which may have different frequencies among different populations, can affect the expression level of nearby genes. When this relationship occurs, these allele-gene pairs form expression quantitative trait loci, or eQTL. The Geuvadis data set has 116 SNP-gene pairs which are in eQTL only in YRI individuals and not in EUR individuals. The average difference between the YRI minor allele frequency (MAF) and the EUR population's MAF is a substantial 0.175.
Since these alleles are in eQTL with a gene pair and the allele frequencies differ between populations, the effect on expression level will also differ between populations. These SNP allele frequencies vary greatly between the YRI and EUR populations (Table 2.3).

Table 2.3. Minor allele frequencies per population of top eQTL SNP/gene pairs.

Figure 2.7. iDA projection 3 separates populations. SNP-gene pairs in eQTL affect expression dependent on the sample population since the SNP MAF is variable across populations. The iDA dimension LD3 separates the YRI population from the European ones. A) RAB5C, which is in eQTL with SNP rs11757158 in the YRI population, has higher average expression than in the European populations. B) SRF, which is in eQTL with SNP rs11413536 in the YRI population, has higher average expression than in the European populations. C) PSKH1, which is in eQTL with SNP rs143415501 in the YRI population, has higher average expression than in the European populations. D) PWAR6, which is in eQTL with SNP rs200846953 in the YRI population, has higher average expression than in the European populations.

The MAFs differ by as much as 0.36 (as seen in the rs11757158-RAB5C pair), and in some cases, the minor allele is not present at all in the EUR population (rs143415501). The genes in Figure 2.7A, B, and C clearly show increased expression of their respective eQTL genes in YRI, while panel D shows a decrease in expression of PWAR6 in YRI as compared to the EUR samples, mirroring the difference in MAF between the YRI population and the EUR populations.

2.4.4 iDA directions are highly correlated with known technical noise

Another point to consider is whether technical noise is being further entrenched in the embeddings found by iDA. To ensure iDA is able to capture variation in datasets with unknown technical noise, we performed iDA on a control dataset designed so that all variation is attributed to technical noise (Zheng et al., 2017).
This dataset consists of droplets which were loaded with the same ratio of 92 External RNA Controls Consortium (ERCC) synthetic RNAs into GEMs. Since there are no differences in cell size, RNA content, or transcriptional activity between these GEMs, the only variation is technical noise introduced by the differing number of UMIs per GEM. The embeddings iDA produces on the raw counts and on the normalized counts confirm that the entrenchment of technical noise can arise from the normalization technique (Townes et al., 2019). In this negative control ERCC dataset, the iDA embedding of the raw counts clearly separates GEMs by the technical variation introduced by the mean UMI count per GEM (Figure 2.8B), but the embedding of the log normalized counts does not produce this separation (Figure 2.8A). Therefore, any entrenchment of technical noise in the iDA embedding is a result of the normalization step, and not of the iDA algorithm itself.

Figure 2.8. tSNE of iDA embedding for ERCC negative control dataset. tSNE of the iDA embedding for the ERCC negative control dataset on both the scaled counts (A) and raw counts (B). iDA captures the variation attributed to the technical noise when present in the data (B). If normalization techniques have entrenched the technical noise (Townes et al., 2019), iDA will not be able to separate the technical variation (A).

To show further evidence of accurate iDA clustering and dimensionality reduction, we applied both iDA and PCA with Walktrap clustering to two single cell benchmarking datasets from the 10X Genomics platform with known cell type labels (Tian et al., 2019). Known populations of three (10X-3cl) and five (10X-5cl) cell lines, respectively, were sequenced using the 10X Genomics single cell RNA sequencing platform for these two datasets.
The tSNE plot of the PCA reduction of log counts for each cell reveals discrete clustering within each cell line, but there is also a moderate amount of heterogeneity in each cell line (particularly in H1975 and A549, which each separate into distinct sub-clusters) (Figure 2.9A). When iDA is applied to each of these datasets without any guidance on how many cell lines are in each dataset, the resulting clustering is sensitive to the cell type sub-clustering. Because of this sensitivity to within-cell-type heterogeneity, when compared to the known cell type labels, the Adjusted Rand Index (ARI) for iDA clustering is .519 on the 3 cell line dataset and .798 on the 5 cell line dataset. Comparatively, when unsupervised clustering is applied to the PCA embedding, the ARI between these clusters and the known cell type labels is .457 for 10X-3cl and .615 for 10X-5cl.

Figure 2.9. Method comparison on pseudo-cell scRNA-seq dataset. A) tSNE of the Tian et al. (2019) 10X Genomics single cell RNA sequencing benchmarking datasets. Distinct clustering is visualized for each of the known cell lines, but there is a substantial amount of heterogeneity within each cell line for both datasets. B) Cumulative F-statistics for the iDA supervised embedding, iDA unsupervised embedding, and PCA embedding. The iDA supervised embedding outperforms the other methods in capturing variance attributed to the known cell line labels. C) Pseudo-cell mixtures with different ratios of RNAs from 3 cell lines. Visually, PCA separates the homogeneous pseudo-cells (red, green, blue), but the mixed populations form a continuous spectrum between the 3 homogeneous ones. The cumulative F-statistic for each dataset shows that iDA unsupervised, iDA supervised, and PCA all perform comparably on these non-discrete datasets.

In the case where the number of cell lines in a sample is known, this parameter can be given to the Walktrap clustering.
In these two datasets, because there is such a great difference between the expression profiles of each cell line, the supervised clustering yields much better results. When iDA is applied with the number of known clusters given, the resulting ARIs are .997 and .986 for 10X-3cl and 10X-5cl, respectively. PCA with supervised clustering also does well, with ARIs of 1.00 for 10X-3cl and .987 for 10X-5cl (Table 2.4).

Table 2.4. Adjusted Rand Index for clusters from iDA versus PCA embeddings.

2.4.5 iDA Converges

When assessing the convergence of iDA, we evaluate the concordance of cluster assignments at each iteration with the Adjusted Rand Index between the two sets of class assignments. We set a concordance threshold of 0.98 as the stopping criterion for iDA. Each data set reaches the threshold within 3 iterations of iDA (Figure 2.10). Since the discriminants in iDA are determined based on the cluster assignments, if the assignments reach stability between iterations, the discriminants will be stable and unchanged between iterations as well.

Figure 2.10. iDA Convergence. Convergence of the iDA algorithm is measured by the concordance of the class assignments of each iteration with those of the previous one. The convergence threshold for iDA is set to concordance = 0.98. Both the Geuvadis and PBMC data sets converge within 3 iterations.

The speed of iDA is heavily dependent on the number of variable features used initially, since each iteration decomposes the gene-by-gene matrix. Although PCA is faster, iDA runs within 6 minutes for datasets with fewer than 3,000 variable genes and 500 samples. iDA identifies 2,829 and 3,000 variable genes in the Geuvadis and PBMC datasets and runs in five and three minutes, respectively (Figure 2.11). In the breakdown of time allocation for the iDA algorithm, the SVD takes the most time for each of the data sets, since a decomposition is computed in each iteration.

Figure 2.11. iDA runtime.
Run time of the iDA algorithm versus the PCA reduction for both the Geuvadis and PBMC data sets. PCA is much faster than iDA, as it only computes one decomposition, while iDA computes a decomposition and clustering in each iteration until convergence. The Geuvadis data set, with 462 samples and 1,802 iDA-identified variable genes, runs within 3 minutes. The PBMC data set, with 2,638 cells and 3,000 iDA-identified variable genes, runs within 4.47 minutes. 2.5 Discussion Discovering the latent structure in high dimensional data is essential in many applications. The latent structure can be attributed to batch, which needs to be corrected for before downstream analysis to avoid spurious associations as well as reduced power in association testing. The latent structure can also define biological clusters which researchers are interested in evaluating, such as in scRNA-seq, where identifying a set of features which define cell types is of main concern. In standard pre-processing using PCA, as well as in methods which use PCA to identify sources of variation in batch-effect cases, classes which are distinctly defined but have low variation will not be maximally separated. Although neither PCA nor iDA is specifically a batch correction method, the resulting low dimensional embedding which captures unwanted variation can in turn be used as covariates in downstream analysis to produce more accurate results. For example, the iDA embedding which captures the separation of batches in the scaled Geuvadis counts clearly separates the samples by the individual laboratories in which they were processed and could be used as covariates in downstream analysis to account for this separation (Figure 2.7C). This approach, namely, adding embeddings to control for unwanted structure in downstream analysis, is preferred to another common approach of removing the first few embedding directions from the original input data to produce corrected datasets for further downstream analysis.
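The covariate-adjustment approach described above can be illustrated with a small simulation. The sketch below uses pure Python with hypothetical variable names (a real analysis would typically use a modeling framework such as lm or limma in R): a latent embedding dimension is regressed out of both the outcome and the condition of interest, and the adjusted condition effect is then estimated on the residuals, which is equivalent to including the embedding as a covariate in a joint linear model.

```python
import random

def simple_slope(x, y):
    """OLS slope of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

def residualize(y, x):
    """Residuals of y after regressing out x (with intercept)."""
    b = simple_slope(x, y)
    n = len(x)
    a = sum(y) / n - b * sum(x) / n
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

random.seed(0)
n = 200
condition = [random.randint(0, 1) for _ in range(n)]   # effect of interest
ld1 = [random.gauss(0, 1) for _ in range(n)]           # hypothetical latent dimension
# Simulated expression: true condition effect = 2.0, latent effect = 3.0
expression = [2.0 * c + 3.0 * l + random.gauss(0, 0.1)
              for c, l in zip(condition, ld1)]

# Regress the latent covariate out of both sides, then estimate the
# adjusted condition effect on the residuals (close to the true 2.0).
effect = simple_slope(residualize(condition, ld1),
                      residualize(expression, ld1))
```

Without the residualization step, the estimate would absorb whatever chance correlation exists between the condition and the latent structure, which is exactly the spurious association the Discussion warns about.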
Accordingly, we have focused our analyses in this paper on the preferred approach. One advantage of PCA embeddings is the interpretability of the variability contribution each dimension adds when included in the embedding. Often, the respective eigenvalues of the top PCs are plotted to determine the "elbow" which determines the ideal number of top PCs to include in the final embedding (Figure 2.2). This offers a natural ordering of PCs based on the amount of variation each PC accounts for in the data. Because iDA also finds embeddings with an SVD, the order of the resulting LDs corresponds to “easy-to-separate” clusters. For example, in the PBMC dataset, LD1 separates the platelet cells, which are best separated from other cell types (Figure 2.5D), while the later LDs are not as well separated (Figure 2.5 A,C,E). Several neighbor-preserving non-linear dimensionality reduction techniques exist for data visualization, such as t-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP). These methods are particularly popular for scRNA-seq dimensionality reduction and visualization. tSNE constructs a probability distribution for all pairs of samples in the data such that samples close in Euclidean distance in the original data space have a higher probability of being placed near each other in the resulting lower dimensional space than those which are far apart in the original data space, therefore preserving the neighborhood structure in a two or three dimensional space (van der Maaten & Hinton, 2008). tSNE has a few shortcomings, particularly when applied to scRNA-seq. tSNE does not scale well, which is problematic with the rapidly growing number of cells being sequenced for scRNA-seq. iDA is more scalable for large numbers of samples since the time-consuming SVD step in iDA depends on the number of variable features and not the number of samples.
Since tSNE computes the k-nearest neighbor graph as an initial step, it is infeasible to compute this in the original high dimensional space, so a preprocessing step like PCA is used to reduce the number of dimensions before applying tSNE, which means it can fall victim to the downfalls of PCA as discussed before. The newer method, UMAP, is also a neighborhood-preserving method with several adaptations to tSNE, such as a cost function which attempts to preserve global structure (McInnes et al., 2018). While t-SNE or UMAP may suffice for data visualization, the embedding only offers visual information on dissimilarity between the samples and loses the interpretability that the iDA embedding offers. The tSNE and UMAP embeddings do not have weights of the features which define the differences between samples, which renders the embeddings unable to explain which features drive cluster separation or to define features that may be used as covariates in downstream analysis. The SVD in iDA yields a weight matrix of genes by linear discriminants. These gene weights allow for the interpretation of specific features which separate latent clusters. With these interpretable embeddings, we can visualize the separation of the clusters in each dimension of the embedding and validate these clusters with known properties of each group. For instance, to validate the putative cell types found in the scRNA-seq data set, we compare the expression of known markers to LDs which separate the identified cell type clusters. The cell type clusters identified by the iDA algorithm show corresponding levels of expression for known markers of these cell types. The iDA embedding and clustering is also sensitive to lowly-abundant cell types. In PBMCs, lymphocytes (T cells, B cells, NK cells) are the majority of the cell population, monocytes are the next most abundant, and dendritic cells and platelets are the rarest (Kleiveland, 2015).
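Because the weight matrix has one row per gene and one column per linear discriminant, the genes driving a given LD can be read off directly. A minimal sketch in Python, where the gene names and weights are purely illustrative rather than values from the actual analysis:

```python
def top_features(weights, feature_names, dim, k=3):
    """Return the k features with largest |weight| in one embedding dimension.

    weights: rows = features (genes), columns = embedding dimensions (LDs).
    """
    scored = sorted(
        ((abs(row[dim]), name) for name, row in zip(feature_names, weights)),
        reverse=True,
    )
    return [name for _, name in scored[:k]]

# Toy 4-gene x 2-LD weight matrix (illustrative values only); PPBP is a
# platelet marker, CD79A and MS4A1 are B cell markers, GNLY an NK marker.
genes = ["PPBP", "CD79A", "MS4A1", "GNLY"]
W = [[0.90, 0.10],
     [0.05, 0.70],
     [0.02, 0.60],
     [0.30, 0.05]]
```

With these toy weights, `top_features(W, genes, 0, k=1)` recovers the platelet marker as the dominant gene on the first dimension, mirroring how marker expression is compared against the LDs that separate the putative cell type clusters.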
In the iDA embedding of the PBMC dataset, the least abundant platelet cell population is the cell cluster which is best separated from the others (in LD1), and the dendritic cell population is captured in LD4 (Figure 2.5). Therefore, iDA will identify rare populations that are sufficiently separated. Expression is known to reflect population structure because of eQTLs (Brown et al., 2018), so it is essential to identify this source of variation as a covariate if it is not the desired response variable. In the Geuvadis data set, of the 116 SNPs which are in eQTL in only the YRI individuals, the average difference between the YRI MAF and the EUR population’s MAF is a substantial 0.175. This means that, on average, SNPs in eQTL in the YRI population but not in the EUR populations have a frequency difference in the population of 17.5%. Since these SNPs are in eQTL in only the YRI population, they are affecting the expression of the gene in eQTL with them much differently than in other populations. The difference in population structure affecting MAFs means that the expression of genes in eQTL will be affected by the population structure as well, as seen in Figure 2.7, causing a latent clustering of expression. The genes in eQTL with population-dependent alleles are recapitulated in the separation of populations by LD as well as gene expression. When comparing the embeddings that iDA produces to those of PCA, the cumulative F-statistic shows the iDA supervised embedding outperforms the other methods in both datasets with respect to the known cell types. The iDA unsupervised embedding outperforms its PCA counterpart in 10X-5cl and does just as well as the PCA embedding in 10X-3cl (Figure 2.9B). In instances where the nature of the latent structure is from discrete sources of variation such as batch effect, population stratification, and discrete cell types, iDA is able to capture these clusters well in the resulting embedding and clustering.
iDA leverages LDA in its algorithm to find the embedding and therefore is not as effective at finding an accurate embedding in scenarios where the latent structure is of a continuous fashion, such as in developmental biology. In such datasets, where cell types are differentiating and form a continuous trajectory, the iDA embedding may find natural or unnatural breaks in trajectories and the resulting embedding will produce a more discrete structure than actually exists. Because of this, the use of iDA is most appropriate in datasets where the latent structure is thought to consist of relatively discrete clusters. The "pseudo cell" mixture datasets from the CellBench experiments, which include 9 cell combinations of the same 3 cell lines yielding 34 unique groups with a continuous structure, show the case where cell lines are not discretely separated from each other (Figure 2.9C). The cumulative F-statistic shows that even in the case of continuous, trajectory-like structure between mixes of different cell type compositions, iDA unsupervised (no number of clusters given as input) and iDA supervised (the estimated number of latent clusters given as input) either outperform or match the performance of a PCA embedding with respect to the ratio of within- to between-cluster separation between cell type groupings. In standard pre-processing using PCA, classes which are distinctly defined but have low variation will not be separated on the axes of maximum variation. We have shown that iDA is sensitive to classes with unequal within-class variation, and that the iDA embedding better separates clusters and identifies the class assignments which determine a more optimal clustering in the original data space, as evidenced by the Dunn Index. The iDA embedding can be evaluated to determine which features contribute the most to the class separation and can be used as covariates in downstream analysis to correct for this latent clustering.
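The Dunn Index used as evidence above rewards clusterings whose clusters are compact and far apart: it is the smallest between-cluster distance divided by the largest within-cluster diameter, with larger values indicating a better partition. A brute-force sketch in Python, assuming Euclidean distance (practical implementations use more efficient pairwise-distance routines):

```python
from math import dist

def dunn_index(points, labels):
    """Dunn Index: min between-cluster distance / max within-cluster diameter."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    # Largest intra-cluster diameter (worst-case compactness)
    diam = max(
        dist(a, b)
        for pts in clusters.values()
        for i, a in enumerate(pts) for b in pts[i + 1:]
    )
    # Smallest distance between points in different clusters (separation)
    keys = list(clusters)
    sep = min(
        dist(a, b)
        for i, ka in enumerate(keys) for kb in keys[i + 1:]
        for a in clusters[ka] for b in clusters[kb]
    )
    return sep / diam
```

For example, two tight clusters of diameter 1 whose closest points are 10 apart score a Dunn Index of 10; a partition that splits a continuous trajectory scores much lower because the separation term collapses.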
2.6 Software Availability R (http://www.r-project.org) code for the iDA package is available at https://github.com/reesyxan/iDA. Chapter 3: Introduction to investigating -omics data types in neurons 3.1 Central dogma of biology within neurons Neurons are a unique classification of cell types in many ways, but one of the most notable differences between them and homogeneous, symmetric cells is their unique structure. Like other cell types, neurons have a centralized cell body housing the nucleus, which contains each cell's DNA. In addition, neurons possess long projections, dendrites and axons, that enable them to connect to and communicate with neighboring cells, as well as cells that are extremely far away. With this unique structure come unique regulatory mechanisms to molecularly support the growth, development, and homeostatic maturity of neurons and their subcompartments (Fernandopulle et al., 2021; Holt & Schuman, 2013). Neurons transport molecular packages and organelles bidirectionally, both from nucleus to axon (anterograde) and axon to nucleus (retrograde), using a series of polarized microtubule tracks with mechanisms known as fast axonal transport and slow axonal transport (Brady et al., 2014; Dalla Costa et al., 2021; Ganguly et al., 2021). The motor proteins kinesin and dynein work together to transport materials in the plus-end direction and minus-end direction, respectively, and offer an energetically efficient means for axonal transport and presynaptic replenishment (Guedes-Dias & Holzbaur, 2019). Fast axonal transport moves packages at a rate of up to 2 micrometers per second. Considering axons can be millimeters to meters long, it can take hours to days to transport vital proteins, organelles, and other molecules needed to support neuronal functions (Maday et al., 2014).
Recent studies have shown that in addition to organelle and protein axonal localization, some neurons localize translationally silenced mRNAs to axonal subcompartments where they can then be translated locally in order to meet the axonal local proteomic requirements during development and maturity (Dougherty, 2017; Glock et al., 2021; Heiman et al., 2008; Ostroff et al., 2019; Sanz et al., 2009; Shigeoka et al., 2016; Tushev et al., 2018). This process of local protein translation greatly reduces the time-sensitive anterograde burden and enables neurons to quickly translate the needed proteins locally, where they will be functionally incorporated. 3.2 Regulation of local protein translation Axonogenesis and synaptogenesis have been shown to be regulated by differential mRNA shuttling to axonal subcompartments where local protein translation takes place (Preußner & Heyd, 2016; Shigeoka et al., 2016). The local protein translation process can be modulated by extrinsic and intrinsic cues (Jung et al., 2012). An example of the coupling of external growth cues with a shift in local protein translation is beta-actin translation in response to the growth signaling cue Netrin-1, where attractive Netrin-1 causes 3’ untranslated region (3’ UTR)-mediated selective localization of beta-actin within axonal growth cones (Leung et al., 2006; Welshhans & Bassell, 2011; Yao et al., 2006). In parallel, past research has shown that local protein translation is an activity-dependent mechanism for rapid protein production to direct axon growth during development (Choe & Cho, 2020). In particular, increased neuronal activity leads to an increase in local translation mediated by microRNAs (miRNAs) and RNA binding proteins (RBPs) in the neuropil (Jung et al., 2014; Narayanan et al., 2007; Sutton et al., 2004). Cis-acting miRNAs and RBPs translationally silence localized mRNAs by binding to UTR regions, creating a physical block for translational machinery binding.
In the case of beta-actin, past studies have shown that neuronal activity “unmasks” beta-actin mRNA molecules by causing cis-acting UTR-bound elements to disassociate from the mRNA, therefore making them available to ribosomal machinery (Buxbaum et al., 2014). This UTR unmasking was positively correlated with the rate of beta-actin translation. Together, this suggests that shifts in local protein translation caused by a disruption of extrinsic or intrinsic cues could lead to developmental defects. 3.3 ipRGCs to study activity-dependent local protein translation Researchers have turned to developing new experimental methods to study the dynamic properties of local protein translation in both developing and mature neural circuits. To study the phenomenon of the dynamic shifts in the translatome populations across development as well as in perturbed activity states, we leveraged the unique properties of intrinsically photosensitive retinal ganglion cells (ipRGCs). ipRGCs are a class of retinal ganglion cells (RGCs), making up just 1%-3% of the total RGC population (Do & Yau, 2010), that express the photopigment melanopsin (Opn4). Unlike conventional RGCs, Opn4 expression in ipRGCs enables them to respond to light cues independent of rods and cones and to control functions such as circadian photoentrainment, the pupillary light reflex, and metabolism (Beier et al., 2020; D. Berson, 2003; Hattar, 2002; Preußner & Heyd, 2016; Renna et al., 2011; Berry et al., 2023). ipRGCs are responsible for relaying environmental photic information from the retina to several brain targets, and they possess three main properties that provide a model system for studying activity-dependent developmental local translation: 1. ipRGCs have physical separation between cell subcompartments, with cell bodies in the retina and long range axonal projections in the brain; 2. ipRGCs have cell type-specific functions and innervation patterns allowing the evaluation of cell type-specific effects; and 3.
ipRGCs are genetically tractable, and their intrinsic photosensitivity can be exploited to perturb neuronal activity levels. The aggregation of these properties provides an ideal system to study dynamic, activity-dependent translatomic shifts in neuronal cell types. 3.3.1 ipRGC subcompartments are physically separated Studying local subcompartment translational regulation is essential for understanding subcompartment-specific changes. In order to study subcompartment-specific local translation, cell subcompartments (somata versus axons) must be physically separable from one another. ipRGC somata are born in the ganglion cell layer during retinal cell type neurogenesis at embryonic days 10-12 (Lucas & Schmidt, 2019; McNeill et al., 2011). They then begin to extend projections during neurite outgrowth to several brain targets, including the suprachiasmatic nucleus (SCN) and the dorsal lateral geniculate nucleus (dLGN), where they form long-range connections from eye to brain (Do, 2019). At maturity, these axonal tracts run ~10 mm from retina to the SCN and ~15 mm from retina to the dLGN (K.-Y. Kim et al., 2019). Because these neurons innervate several brain regions that can be microdissected separately, the translatomes in the somata within the retinas can be harvested separately from axonal compartments in the SCN and dLGN for translatomic comparison. 3.3.2 ipRGCs are defined by functional, developmental, and regional types ipRGC types have been characterized as modulating a wide array of biological processes, such as regulation of sleep/wake cycles, metabolism, hunger patterns, and hormone production (Lazzerini Ospri et al., 2017). ipRGCs form parallel pathways to several brain targets to photoentrain these functional roles.
ipRGCs projecting to the brain’s master circadian pacemaker, the suprachiasmatic nucleus (SCN), play a pivotal role in coordinating the circadian rhythm with photic input, while ipRGCs projecting to the dorsal lateral geniculate nucleus (dLGN) play a role in visual perception. By postnatal day 0 (P0), ipRGC axons have significantly innervated multiple visual areas of the brain, including the dLGN (McNeill et al., 2011). However, this is not the case for the SCN. Although ipRGC axons pass by the SCN, they do not begin innervating the structure for several days (and do not finish doing so for several days more). The ipRGCs that innervate the SCN are known to be M1 ipRGCs, and even more specifically, are a molecularly defined subset of M1 ipRGCs because they lack the expression of Brn3b (Chen et al., 2011; Do, 2019). M1 ipRGCs innervating the SCN are developmentally delayed when compared to non-M1 ipRGCs innervating other areas of the visual brain, including the dLGN. Based on these findings, ipRGC types are developmentally and regionally unique; SCN-projecting M1 ipRGCs have a different developmental timeline than dLGN-projecting non-M1 ipRGCs. 3.3.3 ipRGCs are genetically tractable ipRGCs express the photosensitive protein melanopsin (Opn4). ipRGC axon development and early synaptogenesis in central targets occur during a period when the photosensitivity of the retina is determined solely by the presence of melanopsin (D. M. Berson, 2002; Fu et al., 2005; Hattar, 2002; Provencio et al., 1998; Renna et al., 2011). Melanopsin-driven phototransduction has been shown to be critical in maintaining developmental features including retinal wave properties, retinal vasculature development, and retinofugal projection development (discussed further in chapter 5) (Renna et al., 2011; Tiriac et al., 2018).
In addition, the melanopsin phototransduction pathway also drives behavioral and developmental phenomena such as neonatal photoaversion and synaptogenesis (Caval-Holme et al., 2022; Hu et al., 2022). Particularly, melanopsin-driven phototransduction in ipRGCs directly affects synaptogenesis in various brain targets by way of oxytocin release in the supraoptic nucleus and the paraventricular nucleus (PVN) (Hu et al., 2022). The direct link between melanopsin-driven activity in ipRGCs during development and intracellular translational regulation driving such developmental features has yet to be explored. Because of this evidence of the regulation of local translation by neuronal activity, melanopsin expression dictating action potentials evoked by photic stimuli may play an important role in regulating activity-dependent plasticity of local protein production in the development of ipRGC circuits. Since Opn4 is both a unique cell type marker and controls a whole host of phototransduction-dependent developmental features, it is an ideal candidate to study ipRGC-specific transcriptomics and translatomics. 3.4 Neuronal Translatomes: Axon-TRAP-RiboTag Evaluating the different regulatory processes throughout the central dogma pipeline from DNA to protein is critical for understanding developmental and homeostatic neuronal processes. Tight regulation of transcription and translation allows for not only cell type functional specificity, but also cell compartment functional specificity, especially in systems where transcription and translation occur in distant compartments of the cell, such as neural tissue. Specifically, translational regulation within cell compartments allows for complexity and precision of individual cell compartments.
Studying neuronal compartment translatomes has become vital for investigating axonal regulatory properties by quantifying the local protein translation in axons that enables rapid responses to cell-cell signaling cues during development, as discussed in previous sections. Axon-TRAP-RiboTag is a new experimental method developed by the Holt Lab to quantify the translatome in axons (Shigeoka et al., 2018). RiboTag is a genetic approach which allows for immunoprecipitation of ribosome-bound mRNA in the presence of a knocked-in Cre driver by labeling ribosomes with an HA epitope. RiboTag consists of a knocked-in floxed wildtype Rpl22 exon 4 followed by an Rpl22 exon 4 which has an inserted HA epitope. When crossed with a mouse line containing a Cre recombinase driver, cells which express the Cre recombinase will also express the Rpl22HA mutant (Figure 3.1). The HA-positive ribosomes can then be isolated by way of immunoprecipitation using an antibody for HA. Figure 3.1. RiboTag genetic approach with Opn4 Cre driver. RiboTag consists of a knocked-in floxed wildtype Rpl22 exon 4 followed by an Rpl22 exon 4 which has an inserted HA epitope. When crossed with a mouse line containing a Cre recombinase driver (such as Opn4 shown in this figure), cells which express the Cre recombinase will also express the Rpl22HA mutant (green cells in bottom row). Cre-negative cells will not have HA-labeled ribosomes (orange cells in bottom row). In the case of ipRGCs, melanopsin mutant Opn4CRE transgenic mice (Ecker et al., 2010) were crossed with the transgenic mouse line Rpl22HA (Sanz et al., 2019), which has an HA epitope tag embedded in the coding region of the large ribosomal subunit protein (illustrated in Figure 3.1). This cross produced a colony of mice that generates HA epitope-tagged ribosomes only in the presence of Cre recombinase. In the case of Opn4CRE x Rpl22HA, the Cre recombinase is expressed in cells that express Opn4, labeling ribosomes in ipRGCs.
Ribosome-bound mRNAs in tissue collected from these mice can then be isolated using an anti-HA antibody, effectively isolating the ipRGC translatome, and the mRNAs can then be sequenced using low-input RNA-seq protocols. 3.5 Contributions New experimental protocols are pushing the bounds of attainable information explaining how cells develop, function, and maintain homeostasis. Experiments using complex, cutting-edge sample collection and isolation protocols like axon-TRAP-RiboTag require careful computational exploration to validate the data. Particularly when experiments require several experimental batches, understanding sources of variation (unwanted and wanted biological and technical contributions) is extremely important to ensure validity of downstream results. A challenge in axon-TRAP-RiboTag experiments is consistency of the isolation of the translating mRNA population from the bulk mRNA population. There is a need for metrics to evaluate the efficiency and consistency of the protocol across experimental and technical variables. Only after potential batch effect contributors are identified and removed by sample filtering or batch correction methods can downstream analyses like differential expression analysis or gene set enrichment analysis be done. In chapter 4, we provide a case study of special considerations for a complex, multi-batch high throughput experiment. We investigate the multi-faceted heterogenic contributions of a study using the axon-TRAP-RiboTag translatomic isolation protocol in a neuronal cell type. Our results highlight that batch-correction methods may reduce signal due to true biological heterogeneity in addition to technical noise. Further, we show axon-TRAP-RiboTag experiments are subject to varying degrees of isolation efficiency, with less-efficient pulldowns having a closer resemblance to bulk RNA-seq. We offer metrics to help identify biological signal-driven batch differences specific to translatomic experiments.
In chapter 5, we employ our understanding of variational contributions from chapter 4 in the intrinsically photosensitive retinal ganglion cell (ipRGC) -omics case study to explore biological transcriptomic and translatomic coordination. Our analysis revealed that ipRGCs participate in subcompartment-specific local protein translation. Genetic perturbations of Opn4 photopigment-driven neuronal activity led to global tissue transcriptomic shifts in both the retina and brain targets, but the ipRGC axonal-specific translatome was unaltered. Chapter 4: Validating the efficacy of Axon-TRAP RNA-Sequencing Results in Multi-Batch Studies 4.1 Background 4.1.1 Multi-batch processing in high-throughput studies Since their introduction, bulk RNA-sequencing studies have helped experimentalists better understand steady-state RNA levels and have become increasingly more accessible. In order to expand the range of biological information that can be obtained beyond RNA steady-state levels, experimental protocols have grown more complex, enabling quantification of spatiotemporally precise expression regulation. To relate the regulation of cell type somatic and axonal compartments in cases where spatial regulation can be studied in cell compartments, such as in long-range neuronal connections, innovative experimental methods are being developed. Scientists in the field of neurobiology have turned to methods such as translating ribosome affinity purification (TRAP) and axon-TRAP-RiboTag to quantify the translatome, a post-transcriptional state, to uncover complex biological phenomena (Reynoso et al., 2015; Shigeoka et al., 2018). These studies frequently involve a variety of experimental conditions, including various tissue locations, genotypic profiles, and developmental ages, requiring multiple biological replicates to be collected.
Careful consideration of the experimental design is needed, since a large number of samples are required to guarantee sufficient statistical power for subsequent analyses (Auer & Doerge, 2010). Also, tissue collection techniques frequently call for pooling of tissue samples due to the limited amount of starting tissue available, particularly in the axonal compartments of the cells, further increasing the number of samples needed for a study. These studies are frequently processed in multiple batches due to limitations on the number of samples that can be collected, processed, and sequenced given staffing and time constraints. To eliminate erroneous results in subsequent analyses, it is necessary to deconvolve the variational contributions of potential batch effects (variation due to non-biological factors in an experiment) in complex experiments with numerous experimental conditions as well as multiple processing or experimental batches (Sheerin et al., 2019). Par