ABSTRACT Title of Dissertation: ANALYTICAL APPROACHES FOR COMPLEX MULTI-BATCH -OMICS DATASETS AND THEIR APPLICATION TO NEURONAL DEVELOPMENT Theresa Ann Alexander, Doctor of Philosophy, 2023 Dissertation directed by: Assistant Professor Colenso M. Speer, Department of Biology Professor Najib M. El-Sayed, Department of Cell Biology and Molecular Genetics High-throughput sequencing methods are extremely powerful tools to quantify gene expression in bulk tissue and individual cells. Experimental designs often aim to quantify expression shifts to characterize developmental trajectories, disease states, or cellular drug responses. Experimental and genetic methods are also rapidly evolving to capture specific aspects of gene expression, such as targeting individual cell types, regulatory stages, and spatially resolved cell subcompartments. Such studies frequently involve a variety of experimental conditions that require many samples to ensure sufficient statistical power for subsequent analyses, and they are often processed in multiple batches due to limitations on the number of samples that can be collected, processed, and sequenced at once. To eliminate erroneous results in subsequent analyses, it is necessary to deconvolve non-biological variation (batch effect) from biological signal. Here, we explored variational contributions in multi-batch high-throughput sequencing experiments by developing new methods, evaluating contributors to heterogeneity in an axon-TRAP-RiboTag protocol case study, and highlighting biological results from this protocol. First, we describe iDA, a novel dimensionality reduction method for high-throughput sequencing data. High-dimensional data in complex, multi-batch experiments often result in discrete clustering of samples or cells. Existing unsupervised linear dimensionality reduction methods like PCA often fail to resolve this discreteness using projections of maximum variance.
We show that iDA can produce better projections for separating discrete clustering that correlates with known experimental biological and batch factors. Second, we provide a case study of special considerations for a complex, multi-batch high-throughput experiment. We investigated the multi-faceted heterogenic contributions of a study using the axon-TRAP-RiboTag translatomic isolation protocol in a neuronal cell type. We show that popular batch-correction methods may reduce signal due to true biological heterogeneity in addition to technical noise. We offer metrics to help identify biological signal-driven batch differences. Lastly, we employ our understanding of variational contributions in the intrinsically photosensitive retinal ganglion cell (ipRGC) -omics case study to explore transcriptomic and translatomic coordination. Our analysis revealed that ipRGCs participate in subcompartment-specific local protein translation. Genetic perturbations of photopigment-driven neuronal activity led to global tissue transcriptomic shifts in both the retina and brain targets, but the ipRGC axonal-specific translatome was unaltered. ANALYTICAL APPROACHES FOR COMPLEX MULTI-BATCH -OMICS DATASETS AND THEIR APPLICATION TO NEURONAL DEVELOPMENT by Theresa Ann Alexander Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2023 Advisory Committee: Dr. Colenso M. Speer, Co-Chair Dr. Najib M. El-Sayed, Co-Chair Dr. Joshua Singer Dr. Michael P. Cummings Dr. Robert Patro, Dean’s Representative © Copyright by Theresa Ann Alexander 2023 Preface The work in this dissertation was conducted across two research labs. The first part of my dissertation work was spent working with Dr. Héctor Corrada Bravo. Chapter 1 introduces the work conducted in the Corrada Bravo group. The algorithm adapted as a result of the work with Dr.
Corrada Bravo is presented in Chapter 2 of this dissertation, which is peer-reviewed published work. After Dr. Corrada Bravo left the University of Maryland, I transitioned to co-mentorship under the advisors Dr. Colenso M. Speer and Dr. Najib M. El-Sayed. The second half of my dissertation work is presented in Chapters 3, 4, and 5. At the time of this writing, Chapter 4 is under preparation for submission, and Chapter 5 is a portion of a paper which will be submitted within the calendar year pending integration with data which are currently being collected. For these two chapters, all the experimental data collection was performed by Rashmi Gupta and Shyrice M. Mitchell, and all the data analysis was performed by Theresa Alexander. I would like to thank all of my interdisciplinary co-authors, whose contributions made this work impactful and far-reaching. Chapter 2: ● Theresa A Alexander, Rafael A Irizarry, Héctor Corrada Bravo, Capturing discrete latent structures: choose LDs over PCs, Biostatistics, Volume 24, Issue 1, January 2023, Pages 1–16, https://doi.org/10.1093/biostatistics/kxab030. My individual contributions include: 1) literature review, 2) algorithm development, 3) experimental design, 4) algorithm assessment, 5) writing the manuscript, and 6) manuscript revisions. Chapter 4: ● Theresa A Alexander, Rashmi Gupta, Ashton T. Belew, Shyrice M. Mitchell, Najib M. El-Sayed, Colenso M. Speer. “Validating the efficacy of Axon-TRAP RNA-Sequencing Results in Multi-Batch Studies”. This work will be submitted to eNeuro in 2022. My individual contributions include: 1) literature review, 2) data analysis, 3) experimental design for “forensic bioinformatic” analysis, and 4) writing the manuscript. Chapter 5: ● Theresa A Alexander*, Rashmi Gupta*, Ashton T. Belew, Shyrice M. Mitchell, Najib M. El-Sayed, Colenso M. Speer. “The effects of melanopsin on ipRGC cell autonomous and cell non-autonomous programming”.
This work will be submitted to a peer-reviewed journal in 2022 in conjunction with ongoing work. My individual contributions include: 1) literature review, 2) data pre-processing, 3) data analysis, and 4) writing this portion of the manuscript. Other publications that I have completed or contributed to during my PhD work include: ● Alexander, T.A., Machiela, M.J. LDpop: an interactive online tool to calculate and visualize geographic LD patterns. BMC Bioinformatics 21, 14 (2020). https://doi.org/10.1186/s12859-020-3340-1. ● Gomez MA, Belew AT, Navas A, Rosales-Chilama M, Murillo J, Dillon LAL, Alexander TA, Martinez-Valencia A and El-Sayed NM (2021). Early Leukocyte Responses in Ex-Vivo Models of Healing and Non-Healing Human Leishmania (Viannia) panamensis Infections. Front. Cell. Infect. Microbiol. 11:687607. doi: 10.3389/fcimb.2021.687607. ● Sam B. Choi*, Theresa A. Alexander*, Tarlan Vatan*, Chenghang Zhang, Shyrice M. Mitchell, Colenso M. Speer, Peter Nemes. Ultrasensitive Capillary Electrophoresis Trapped Ion Mobility Time-of-Flight Mass Spectrometry Reveals a Proteome Transition During Development of the Brain’s Circadian Pacemaker. Submitted to Mol. Cell. Prot. December 2022. ● Li CY, Bowers JM, Alexander TA, Lawrence K, Coyne JM, Li Y, Juntti SA. Olfactory receptor for female reproductive pheromonal signaling in an African cichlid. In Preparation. ● Gómez, M.A.*, Belew, A.T., Vargas, D.A., Rebellon, D., Rosales-Chilama, M., Alexander, T.A., Parra, L.F.G., Saravia, N., El-Sayed, N.M.* (2022). Innate biosignatures of treatment outcome in patients with cutaneous leishmaniasis. In Preparation. ● Fernandez, O., Gómez, M.A.*, Belew, A.T., Rosales-Chilama, M., Alexander, T.A., El-Sayed, N.M., Saravia, N. (2021). Genotypic and transcriptomic profiling of virulent and avirulent strains of L. (V.) panamensis. In Preparation. ● Gómez, M.A.*, Belew, A.T., Fernandez, O., Rosales-Chilama, M., Alexander, T.A., Saravia, N., El-Sayed, N.M.* (2022).
Human macrophage response and intrinsic resistance of L. (V.) panamensis to pentavalent antimony. In Preparation. Dedication For my family and friends, two-legged and four, who have walked this journey with me. Acknowledgements I am grateful for the many mentors, advisors, and friends who have helped and supported me through my doctoral journey. Firstly, I would like to thank my family for all their support over the past 5 years. Louis, Mom, Dad, Colleen, and Ricky: thanks for celebrating my wins and cushioning my losses. I am so fortunate to have gone through this process with such a supportive group behind me. I will always be grateful that my first introduction to my doctoral journey was with Dr. Héctor Corrada Bravo's group. During my initial time at UMD, Héctor gave me the independence and the confidence I needed to find myself as a student and researcher. Along with Héctor, I feel indebted to the Corrada Bravo lab members: Jay, Dom, Mohamed, and Aya, for setting the tone for a work environment I will now forever chase after. Dr. Najib El-Sayed and Dr. Colenso Speer took me in as an orphaned graduate student and made the transition feel seamless even in the midst of a global pandemic. I am so thankful for the mentorship and support each of them has given me over the past 3 years. The skills I have acquired for interdisciplinary research and communication would not exist if not for Najib and Colenso. I am also grateful for the members of my thesis committee, who each offered guidance towards my growth as a researcher as well as in navigating my extremely nonlinear doctoral journey (in alphabetical order): Dr. Michael Cummings, Dr. Rob Patro, and Dr. Josh Singer. Each of you has challenged and supported me while making me feel like a valuable member of the scientific community. I am also greatly appreciative of all the mentorship, guidance, support, and friendship from Dr. Ashton Trey Belew.
I learned so much from getting to work with Trey but most importantly was the importance of fun in research. I would like to thank the members of the Speer and Singer labs for their support, help, and friendship throughout my work. I would specifically like to thank (in alphabetical order): Jackie Minehart, Morgan Musgrove, Andrea Orpia, Kat Pizano, Tarlan Vatan. I would also like to send my love and thanks to members of the BISI and CBCB community, specifically Alexis Boleda, Jason Fan, and Ashley Della Fera. Without the many hours of talks for research and otherwise, I would not be where I am today. ix Table of Contents Preface........................................................................................................................... ii Dedication .................................................................................................................... vi Acknowledgements ..................................................................................................... vii Table of Contents ......................................................................................................... ix List of Tables .............................................................................................................. xii List of Figures ............................................................................................................ xiii List of Abbreviations .................................................................................................. xv Chapter 1: Introduction to investigating sources of variational contributions in RNA- seq data.......................................................................................................................... 1 1.1 Variability in high dimensional data ................................................................... 1 1.1.1 Genome/Transcriptome-wide association studies........................................ 
2 1.1.2 Single cell RNA-sequencing variability ...................................................... 3 1.2 Linear dimensionality reduction methods ........................................................... 3 1.2.1 Unsupervised: Principal Component Analysis (PCA) ................................. 3 1.2.1 Supervised: Linear Discriminant Analysis (LDA) ...................................... 4 1.3 Non-Linear dimensionality reduction methods .................................................. 5 1.4 Contributions....................................................................................................... 6 Chapter 2: A Novel Dimensionality Reduction Method for high-throughput RNA- sequencing data ............................................................................................................. 7 2.1 Introduction ......................................................................................................... 7 2.2 iDA Algorithm .................................................................................................. 11 2.3 Methods for Performance Assessment.............................................................. 13 2.3.1 Convergence Assessment........................................................................... 14 2.3.2 Cumulative F-Statistic ............................................................................... 14 2.3.3 Cumulative R2 ............................................................................................ 16 2.3.4 Clustering Validation ................................................................................. 17 2.4 Results ............................................................................................................... 17 2.4.1 iDA preserves cluster structure better than PCA ....................................... 
17 2.4.2 iDA directions are highly correlated with known PBMC cell type markers ............................................................................................................................. 24 2.4.3 iDA directions are highly correlated with SNPs in eQTL ......................... 27 2.4.4 iDA directions are highly correlated with known technical noise ............. 29 2.4.5 iDA Converges........................................................................................... 33 2.5 Discussion ......................................................................................................... 35 2.6 Software Availability ........................................................................................ 41 x Chapter 3: Introduction to investigating -omics data types in neurons ...................... 42 3.1 Central dogma of biology within neurons ........................................................ 42 3.2 Regulation of local protein translation .............................................................. 43 3.3 ipRGCs to study activity-dependent local protein translation .......................... 44 3.3.1 ipRGCs subcompartments are physically separated .................................. 45 3.3.2 ipRGCs are defined by functional, developmental, and regional types ..... 46 3.3.3 ipRGCs are genetically tractable ............................................................... 47 3.4 Neuronal Translatomes: Axon-TRAP-RiboTag ............................................... 48 3.5 Contributions..................................................................................................... 50 Chapter 4: Validating the efficacy of Axon-TRAP RNA-Sequencing Results in Multi- Batch Studies .............................................................................................................. 52 4.1 Background ....................................................................................................... 
52 4.1.1 Multi-batch processing in high-throughput studies ................................... 52 4.1.2 Assumptions made by batch-correction methods may not often be met ... 53 4.1.3 Case Study of axon-TRAP RNA-seq data from long range neuronal connections from eye-to-brain ............................................................................ 54 4.2 Results ............................................................................................................... 59 4.2.1 Experimental batch contributes to unwanted batch effect ......................... 59 4.2.2 Experimental batch sequencing depth does not correlate with unique genes observed .............................................................................................................. 62 4.2.3 Batch effect can be driven by true biological signal .................................. 65 4.2.4 Ribosomal protein expression is variable between experimental batches . 70 4.3 Discussion ......................................................................................................... 73 4.4 Conclusion ........................................................................................................ 78 4.5 Materials and Methods ...................................................................................... 78 4.5.1 Experimental Design .................................................................................. 79 4.5.2 Preprocessing of RNA-Sequencing Reads................................................. 80 4.5.3 Cumulative F-Statistic with SVA Dimensions .......................................... 80 4.5.4 Variance Partition Analysis ....................................................................... 82 4.5.5 Sequencing Depth Contributional Assessment .......................................... 
82 Chapter 5: The role of melanopsin on ipRGC cell autonomous and cell non- autonomous programming .......................................................................................... 84 5.1 Introduction ....................................................................................................... 84 5.1.1 Intrinsically photosensitive retinal ganglion cells...................................... 84 5.1.2 Melanopsin and its role in development .................................................... 85 5.1.3 Experimental Design .................................................................................. 87 5.2 Results ............................................................................................................... 88 5.2.1 Gene expression is tissue location and Opn4 genotype specific................ 88 5.2.2 Local translation occurs in ipRGC axons at postnatal day 15 ................... 96 5.2.3 Local translation in ipRGC somas is melanopsin-dependent at P15 ....... 102 xi 5.2.4 Local translation in ipRGC axonal projections in the dLGN is melanopsin- independent at P15 ............................................................................................ 103 5.3 Discussion ....................................................................................................... 105 5.4 Conclusions ..................................................................................................... 110 5.5 Future Directions ............................................................................................ 111 5.6 Materials and Methods .................................................................................... 113 5.6.1 Mouse lines .............................................................................................. 113 5.6.2 Tissue Collection ..................................................................................... 
114 5.6.3 Axon- Translation ribosome affinity purification (TRAP) ...................... 115 5.6.4 Axon-TRAP-RiboTag RNA Isolation ..................................................... 116 5.6.5 Axon-TRAP-RiboTag Library preparation and sequencing .................... 117 5.6.6 Axon-TRAP-RiboTag RNA-sequencing read preprocessing .................. 118 5.6.7 Tissue Collection for Bulk RNA sequencing (RNA-seq) ........................ 119 5.6.8 RNA isolation for Bulk RNA-seq ............................................................ 119 5.6.9 Library preparation and sequencing of Bulk RNA-seq ........................... 120 5.6.10 Bulk RNA-sequencing read preprocessing ............................................ 121 5.6.11 Bioinformatic Analyses ......................................................................... 122 5.7 Supplementary Material .................................................................................. 124 Bibliography ............................................................................................................. 129 xii List of Tables Table 2.1 Confusion matrix between iDA and PCA+Louvain Clustering………….. 20 Table 2.2 Comparison of mean expression of canonical PBMC cell type markers… 24 Table 2.3 Minor Allele Frequencies per population of top eQTL SNP/gene pairs…. 27 Table 2.4 Adjusted rand index for clusters from iDA versus PCA embeddings……. 33 Table 4.1 F-statistic for each SV dimension resulting from SVA…………………... 60 Table 5.1 Significant DE genes dysregulated between Opn4 genotypes……………. 
91 Table 5.2 DE results for receptor/ligand pairs in the PDGF family in the retina…… 92 Table 5.3 DE results for receptor/ligand pairs in the Ephrin family in the retina…… 93 Table 5.4 Sample numbers for P15 axon-TRAP experiment……………………… 118 Table 5.5 Sample numbers for P8 bulk RNA-seq experiment……………………...121 Supplementary Table 1 Jaccard index of sample SNPs…………………………… 125 xiii List of Figures Figure 2.1 PCA versus LDA projection for data where data is clustered……………. 8 Figure 2.2 Elbow plots to determine the optimal number of dimensions to use in the PCA embedding……………………………………………………………………... 15 Figure 2.3 Cumulative F-statistic and Cumulative R2 for the PBMC and Geuvadis datasets………………………………………………………………………………. 19 Figure 2.4 tSNE and UMAP visualizations for the PCA and iDA PBMC cell embeddings………………………………………………………………………….. 22 Figure 2.5 Dunn Index calculated for bootstrapped samples for each dataset………. 23 Figure 2.6 Putative cell type clusters found with iDA are separated both by known cell type markers and in the iDA embedding………………………………………...26 Figure 2.7 iDA projection 3 separates populations………………………………….. 28 Figure 2.8 tSNE of iDA embedding for ERCC negative control dataset…………… 30 Figure 2.9 Method comparison on pseudo-cell scRNA-seq dataset………………… 32 Figure 2.10 iDA Convergence………………………………………………………. 34 Figure 2.11 iDA runtime…………………………………………………………….. 35 Figure 3.1 RiboTag genetic approach with Opn4 Cre driver………………………...49 Figure 4.1 Batch effect in ipRGC case study………………………………………... 57 Figure 4.2 SVA vs COMBAT batch-correction…………………………………….. 59 Figure 4.3 SVA versus known experimental factors………………………………... 61 Figure 4.4 Mapped read counts per sample…………………………………………. 63 Figure 4.5 Relationship of mapped reads versus genes observed…………………… 64 Figure 4.6 Bootstrapped downsampling of sequencing counts within each sample…65 Figure 4.7 Non-zero genes quantified per sample by subcompartment location/genotype……………………………………………………………………. 
67 Figure 4.8 Non-zero genes observed between axon-TRAP and bulk RNA-seq……. 69 Figure 4.9 Non-zero genes observed in the original axon-TRAP-RiboTag………… 70 Figure 4.10 Variance partition analysis of axon-TRAP-RiboTag samples…………. 71 Figure 4.11 Hierarchical clustering of the expression profiles for all ipRGC axon- TRAP samples………………………………………………………………………. 72 Figure 5.1 ipRGC subcompartment and Opn4 (melanopsin) genotypic differences in global transcriptomics……………………………………………………………….. 90 Figure 5.2 Differential gene expression (DE) analysis of the transcriptome……….. 91 Figure 5.3 GSEA of ipRGC transcriptome………………………………………….. 96 Figure 5.4 Non-zero genes observed in study samples……………………………… 98 Figure 5.5 ipRGC subcompartment-specific mRNA translation…………………... 100 Figure 5.6 GSEA of axon-enriched mRNAs………………………………………. 101 Figure 5.7 Retina Opn4-dependent translation…………………………………….. 103 Figure 5.8 dLGN Opn4-independent translation………………………………….. 104 xiv Supplementary Figure 5.1 Heterogeneity in Opn4 expression in bulk RNA-seq….. 124 Supplementary Figure 5.2 IGV view of Opn4 aligned reads………………………. 125 Supplementary Figure 5.3 Comparison of SNP sets between bulk RNA-seq samples……………………………………………………………………………... 127 Supplementary Figure 5.4 Dendrogram of clustered bulk RNA-seq samples……... 
128 List of Abbreviations scRNA-seq Single cell RNA-seq RNA-seq Bulk RNA-sequencing GWAS/TWAS Genome/transcriptome-wide association studies SVA Surrogate variable analysis PBMC Peripheral blood mononuclear cells PCA Principal component analysis PCs Principal components t-SNE t-Distributed Stochastic Neighbor Embedding UMAP Uniform Manifold Approximation and Projection iDA iterative Discriminant Analysis LDA Linear discriminant analysis LD Linear discriminant QDA Quadratic discriminant analysis TSS Total sum of squares RSS Residual sum of squares DI Dunn Index NK Natural killer MAF Minor allele frequency EUR European YRI Yoruban SNP Single nucleotide polymorphism eQTL Expression quantitative trait loci ERCC External RNA Controls Consortium GEMs Gel beads in emulsion ARI Adjusted rand index FDR False discovery rate RGCs Retinal ganglion cells cRGCs Conventional retinal ganglion cells ipRGCs intrinsically photosensitive retinal ganglion cells SCN Suprachiasmatic Nucleus dLGN dorsal Lateral Geniculate Nucleus P0, P8, P15, P60 Postnatal day 0, 8, 15, 60 E20 Embryonic day 20 TRAP Translating ribosome affinity purification DE Differential expression GSEA Gene set enrichment analysis Opn4 Melanopsin SC Superior colliculus bp Base pairs BBI-ACGT Brain & Behavior Institute-Advanced Genomic Technologies Core UMIs Unique molecular identifiers Het Heterozygote KO Knock-out CPM Counts per million log2FC Log2 Fold Change Eph Ephrin receptor Efn Ephrin ligand GO Gene ontology KEGG Kyoto Encyclopedia of Genes and Genomes VEGF Vascular endothelial growth factor PDGF Platelet-derived growth factor HIF-1a Hypoxia-inducible factor 1a RTKs Receptor tyrosine kinases DAG Diacylglycerol BCL Binary base call IGV Integrative Genome Viewer Chapter 1: Introduction to investigating sources of variational contributions in RNA-seq data 1.1 Variability in high dimensional data The number of studies that rely on high dimensional biological data collection continues to grow.
This trend has been intensified recently by the advent of single cell RNA-seq (scRNA-seq), which simultaneously measures genome-wide expression profiles for tens of thousands of cells. In these studies, dimensionality reduction techniques are commonly used to capture the underlying discrete structure of the data in a small number of dimensions. This structure can be imposed by sources of variation that could be strategic (experimental), biological but unknown and potentially interesting, or unwanted and unintentionally introduced by laboratory techniques. In high dimensional transcriptomic and translatomic datatypes, accurate dimensionality reduction techniques are essential for visualizing the data structure in exploratory analyses, identifying any potential variation attributed to noise in the data, and cutting computational costs for downstream analysis such as clustering. Low-dimensional representations need to capture the underlying structure of the samples from true separable differences between study groups as well as potential unwanted batch variation. In the case of scRNA-seq, investigators are often interested in discovering previously unknown cell types based on their gene expression profile. In bulk RNA-sequencing (RNA-seq), investigators commonly vary experimental conditions, introducing batch effects that induce unwanted variability which may cause spurious associations in downstream analysis if not corrected for (Leek et al., 2010). Whether heterogeneity across classes arises from unwanted sources of variation, variation of response to an experimental intervention, or different cell types in scRNA-seq experiments, it is vital to be able to identify a low dimensional embedding which describes the separation of classes. This embedding can be used as an interpretable reduction to characterize and investigate features which drive class separation or to correct for unwanted sources of variation in downstream analysis.
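The simplest form of such a correction, when the batch factor is known, is to include it as a covariate in a linear model of expression. The following is a minimal numpy sketch on simulated data; the gene, the effect sizes, and the batch structure are all hypothetical, and this is not surrogate variable analysis (which addresses the harder case where the factor is unobserved):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy example: expression of one simulated gene across 200 samples, with a
# known batch shift partially confounded with the condition of interest.
n = 200
batch = rng.integers(0, 2, n).astype(float)                     # 0/1 batch label
condition = (rng.random(n) < 0.3 + 0.4 * batch).astype(float)   # confounded design
expr = 1.0 * condition + 2.0 * batch + rng.normal(0, 0.5, n)    # true effect = 1.0

# Naive model: expression ~ condition (batch ignored).
X_naive = np.column_stack([np.ones(n), condition])
beta_naive = np.linalg.lstsq(X_naive, expr, rcond=None)[0]

# Adjusted model: expression ~ condition + batch.
X_adj = np.column_stack([np.ones(n), condition, batch])
beta_adj = np.linalg.lstsq(X_adj, expr, rcond=None)[0]

print(beta_naive[1])   # inflated by the batch shift
print(beta_adj[1])     # close to the true effect of 1.0
```

Because condition and batch are correlated here, the naive condition coefficient absorbs part of the batch shift, while the adjusted model recovers an estimate near the simulated effect size.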
Varying experimental conditions commonly introduce batch effects that induce unwanted variability (Leek et al., 2010). For example, the Geuvadis Project expression data (Lappalainen et al., 2013) shows clear evidence of batch effect, where the laboratory in which the samples were processed contributes to a large part of variation in the data. If unaccounted for, this batch effect may cause spurious associations in downstream analysis. This is particularly problematic when the outcome of interest is confounded with the variable responsible for the batch effect (Leek et al., 2010). 1.1.1 Genome/Transcriptome-wide association studies In genome or transcriptome-wide association studies (GWAS/TWAS), where each locus or transcript is tested for association with an outcome, standard statistical tests assume observations are independent. When there is underlying structure in the data, such as genetic population structure, the independence assumption does not hold (Brown et al., 2018). To correct for this latent structure, state-of-the-art methods include population structure as a covariate in the model. In GWAS, self-reported ancestry is not reliable (Mersha & Abebe, 2015), and the latent structure attributed to ancestry must be estimated. Similarly, in TWAS, sources of technical variability are impossible to predict, and methods that estimate these unknown factors, such as surrogate variable analysis (SVA) (Leek & Storey, 2007), have been proposed. 1.1.2 Single cell RNA-sequencing variability A common motivation of studies using scRNA-seq is to detect novel cell types based on the natural grouping of the transcriptomic profiles and identify marker genes that define these cell types (Jain, 2010). The current standard approach is to define cell types using unsupervised clustering, followed by statistical tests used to identify genes with greater expression in one cell type compared to the others.
While this may give researchers valuable information for a particular gene of interest, the method fails to provide a broad summary of a set of genes which define cell type separation. A more useful approach would be to find an embedding space which defines the latent structure of the separation of cell types. Often in scRNA-seq, the boundaries between putative cell types are ill-defined, making it difficult to delineate them definitively (Kiselev et al., 2019). Defining the latent structure that captures the features which best separate cell types from each other in low-dimensional space would be preferred. 1.2 Linear dimensionality reduction methods 1.2.1 Unsupervised: Principal Component Analysis (PCA) Principal Component Analysis (PCA) is commonly used as a dimensionality reduction method in these applications since it finds principal components (PCs) that explain variation in the data. For sources of variation such as batch effects or genetic ancestry, the variation in the data set is largely attributed to factors not associated with the outcome variable, and PCA is used to find an embedding to estimate and account for these latent factors. In other data types like scRNA-seq, PCA is used to reduce the computational cost of downstream analyses like clustering. The application of PCA to detect latent discrete structure assumes that latent classes will in fact account for most of the dataset variance. However, since PCA is unsupervised, it does not necessarily separate classes (Lever et al., 2017). When the clusters formed by latent classes fall in directions of maximum variance, PCA will indeed find an embedding where clusters separate, but if these clusters do not fall in directions of maximum variance, embeddings computed by PCA will fail to separate these clusters. Since PCA seeks to find transformations that explain most variance in the data, it inherently favors separating clusters with large variance.
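This failure mode is easy to demonstrate on simulated data. In the numpy sketch below (hypothetical data, not from any dataset in this dissertation), two classes are well separated along a low-variance axis but share a high-variance, uninformative axis; the top principal component tracks the uninformative axis, while a supervised discriminant direction of the kind described in the next section recovers the separation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two classes separated along a low-variance axis (y), swamped by a shared
# high-variance axis (x) that carries no class signal.
n = 500
x = rng.normal(0.0, 10.0, size=2 * n)
y = np.concatenate([rng.normal(-2.0, 0.5, n), rng.normal(+2.0, 0.5, n)])
X = np.column_stack([x, y])
labels = np.repeat([0, 1], n)

# PCA: top eigenvector of the covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
pc1 = eigvecs[:, np.argmax(eigvals)]

# Two-class LDA direction: w = Sw^{-1} (mu1 - mu0), with Sw the pooled
# within-class scatter matrix.
mu0, mu1 = X[labels == 0].mean(axis=0), X[labels == 1].mean(axis=0)
Sw = np.cov(X[labels == 0], rowvar=False) + np.cov(X[labels == 1], rowvar=False)
w = np.linalg.solve(Sw, mu1 - mu0)
w /= np.linalg.norm(w)

def class_gap(direction):
    # Distance between class means after projecting onto `direction`.
    proj = Xc @ direction
    return abs(proj[labels == 1].mean() - proj[labels == 0].mean())

print(abs(pc1[0]))     # ~1: PC1 aligns with the uninformative x-axis
print(class_gap(pc1))  # small relative to the ~10-unit spread along PC1
print(class_gap(w))    # ~4: the discriminant direction recovers the y-axis
```

The principal component captures nearly all of the variance yet almost none of the class separation, which is exactly the scenario in which a discriminant-based embedding is preferable.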
A well-separated cluster with low variance will likely not be separated by the top embedding directions since it does not contribute much to the total variance of the data set. For data sets with an innate discrete latent structure containing clusters with unbalanced variances, an alternative method for recognizing latent variables is needed. 1.2.2 Supervised: Linear Discriminant Analysis (LDA) Another method used for dimensionality reduction, classification, and data visualization is linear discriminant analysis (LDA). LDA, like PCA, is a linear reduction method, but class labels are used to find optimal separating boundaries between the classes. The class labels are used to maximize the ratio of between-class variation to within-class variation. In instances where the major sources of variation within a dataset are known (such as experimental variables and batches), LDA produces weighted vectors which describe the data separation and variability. Because the discriminants are linear, the contribution of each feature to the separability along each dimension is interpretable. 1.3 Non-Linear dimensionality reduction methods Visualization techniques, such as t-Distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP), use a different approach to retain the local structure of the original high-dimensional data in the low-dimensional embedding (McInnes et al., 2018; van der Maaten & Hinton, 2008). t-SNE first computes joint probabilities from pairwise similarity scores in both the full-dimensional data space and the low-dimensional projection space. Next, the algorithm minimizes the Kullback-Leibler divergence between the joint probabilities and iterates until convergence. This method preserves small pairwise distances/local similarities between samples, but because the cost function is not convex, convergence to the global minimum is not guaranteed.
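The objective just described can be sketched in heavily simplified form. The code below is an illustrative toy, not the full t-SNE algorithm: it uses a single fixed Gaussian bandwidth for the high-dimensional affinities (real t-SNE calibrates a per-point bandwidth to a target perplexity) and plain fixed-step gradient descent on the Kullback-Leibler divergence, with the Student-t kernel in the low-dimensional space:

```python
import numpy as np

rng = np.random.default_rng(4)

# Tiny dataset: two 10-point clusters in 5 dimensions.
X = np.vstack([rng.normal(0, 0.3, (10, 5)),
               rng.normal(2, 0.3, (10, 5))])
n = len(X)

def pairwise_sq_dists(Y):
    diff = Y[:, None, :] - Y[None, :, :]
    return (diff ** 2).sum(-1)

# High-dimensional affinities: Gaussian kernel with a fixed bandwidth.
P = np.exp(-pairwise_sq_dists(X) / 2.0)
np.fill_diagonal(P, 0.0)
P /= P.sum()

def q_matrix(Y):
    # Low-dimensional affinities: Student-t kernel, as in t-SNE.
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl(P, Q):
    mask = P > 0
    return float((P[mask] * np.log(P[mask] / Q[mask])).sum())

# Fixed-step gradient descent on KL(P || Q) over a 2-D embedding Y.
Y = rng.normal(0, 1e-2, (n, 2))
kl_start = kl(P, q_matrix(Y))
for _ in range(300):
    # t-SNE gradient: 4 * sum_j (p_ij - q_ij)(y_i - y_j)/(1 + ||y_i - y_j||^2)
    W = (P - q_matrix(Y)) / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(W, 0.0)
    Y -= 1.0 * 4 * ((np.diag(W.sum(1)) - W) @ Y)
kl_end = kl(P, q_matrix(Y))
print(kl_start, kl_end)   # the KL objective decreases
```

Even this stripped-down version illustrates the two points made in the text: the embedding is obtained by iterative optimization of a non-convex objective, and the resulting coordinates are not a linear function of the input features.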
Similarly to tSNE, UMAP is a neighborhood-preserving non-linear method, but it offers improvements in computational complexity over tSNE. These methods often yield visualizations that capture latent discrete structure in data well. However, they are computationally expensive and in practice are usually applied to data already reduced to a smaller dimension using PCA as a first step, thus inheriting the issues raised about PCA above. More importantly for biological interpretation, the embeddings obtained by t-SNE and UMAP are nonlinear, making it hard to interpret feature weights to determine which features contribute to the signal separating latent discrete structure in the data.

1.4 Contributions

For datasets with an innate discrete latent structure containing clusters with unbalanced variances, an alternative method for recognizing latent variables is needed. If the experimental design is such that one may expect unequal variances between groups of interest, the resulting PCA + tSNE or UMAP visual representation will not accurately separate differences between distinct groups with small variance between them. In particular, when a large portion of the variation between study groups is attributed to one group's large variation, the difference between the groups with smaller distances between them becomes unobservable. In Chapter 2, we develop a novel dimensionality reduction method, iterative discriminant analysis (iDA). iDA is an iterative extension of linear discriminant analysis (LDA). This new method mitigates this bias and preserves the distinct separation between low variance groups.

Chapter 2: A Novel Dimensionality Reduction Method for High-Throughput RNA-Sequencing Data

2.1 Introduction

In this chapter, we propose iterative Discriminant Analysis (iDA), an unsupervised dimensionality reduction technique that addresses the problems of PCA and nonlinear methods like t-SNE and UMAP discussed in Chapter 1.
iDA uses linear transformations to produce an embedding space that contains discriminatory information that optimally separates latent clusters. The method leverages linear discriminant analysis (LDA) to find transformations that separate classes from each other while also minimizing variance within each class. It improves on PCA in that it defines the latent space while maintaining unobserved clustered architecture. Denote the observed data matrix as $X \in \mathbb{R}^{n \times p}$, with $n$ observations and $p$ features. LDA, like PCA, is a dimensionality reduction technique since it finds a linear projection of the data features to best fit the data in a lower dimension, producing an embedding $Y$. A major difference between these two linear projection techniques is that LDA is a supervised method, while PCA is unsupervised. Unsupervised PCA finds projections which maximize the total variance in the data, which may or may not separate classes in these projections (Figure 2.1). If the class assignments are known a priori, LDA can be used to find projections which best separate the classes. In Figure 2.1, the direction which maximizes variation is uninteresting since it does not separate the latent clustering; the LDA projection, however, finds a distinct boundary between the classes. As in this example, a direction with low variation, but potentially valuable class-defining information, is not guaranteed to be preserved in the first few principal components. When determining the number of PCs to use, it is common practice to evaluate the eigenvalues for the top PCs in an elbow plot to determine the best cutoff for which eigenvectors to include in the reduced embedding. Since the eigenvalues indicate the amount of variation attributed to their respective eigenvectors, the top PCs which capture the majority of the variation in the data are commonly used, meaning the lowly variable but well defined clusters will not be preserved in the PCA embedding. Figure 2.1.
PCA versus LDA projection where the data is clustered. The PCA projection fails to separate the classes because they do not fall on the projection which maximizes variation for the data. The supervised LDA projection maximizes the ratio of between to within cluster variation.

PCA maximizes the total variance of the embedded data by maximizing the objective function

$$\max_{W \,:\, W^\top W = I}\; \operatorname{tr}\!\left(\Sigma_Y\right) \qquad (2.1)$$

where $\Sigma_Y$ is the empirical covariance of the embedded data $Y$,

$$Y = XW \qquad (2.2)$$

$$\Sigma_Y = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})(y_i - \bar{y})^\top \qquad (2.3)$$

and $y_i$ is the $i$th row of the embedding $Y$, corresponding to the low-dimensional representation of the $i$th data observation ($y_i = W^\top x_i$). LDA uses observed assignments $z_i$ to one of $C$ classes to find projections which separate classes by maximizing the objective function:

$$\max_{w}\; \frac{w^\top S_B\, w}{w^\top S_W\, w} \qquad (2.4)$$

where $S_B$ is the between class empirical covariance matrix and $S_W$ is the within class empirical covariance matrix:

$$S_W = \sum_{c=1}^{C} \sum_{i\,:\,z_i = c} (x_i - \mu_c)(x_i - \mu_c)^\top, \qquad S_B = \sum_{c=1}^{C} n_c\, (\mu_c - \mu)(\mu_c - \mu)^\top \qquad (2.5)$$

The within class scatter matrix is the covariance of the differences between each data point and the mean vector of its corresponding cluster, across all clusters. The between class scatter matrix is the covariance of the differences between each cluster's mean vector and the overall mean of all data points. While PCA maximizes global variation of the data, LDA maximizes the variation specific to the clustered architecture since class labels are known. Intuitively, by separating the data into clusters and maximizing the ratio of between to within cluster variation, the resulting projection will be one which maximizes the variance between clusters while minimizing the variance within each cluster. LDA assumes that all classes have equal covariance, but the extension to quadratic discriminant analysis (QDA) allows for unequal covariance matrices when calculating the within class scatter matrices, therefore allowing for more flexible, non-linear decision boundaries between classes. LDA is a supervised classification technique which requires class labels in order to maximize this ratio of between class variation to within class variation.
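The LDA objective (2.4) is typically solved as a generalized eigenproblem, $S_B w = \lambda S_W w$, built from the between and within class scatter matrices. A minimal sketch, assuming the standard scatter-matrix construction; the small ridge added to $S_W$ and the toy data are conveniences of this sketch, not part of the method description:

```python
import numpy as np
from scipy.linalg import eigh

def lda_directions(X, labels):
    """Maximize w^T S_B w / w^T S_W w by solving S_B w = lambda * S_W w."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    mu = X.mean(axis=0)
    p = X.shape[1]
    S_W = np.zeros((p, p))   # within-class scatter
    S_B = np.zeros((p, p))   # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        centered = Xc - mu_c
        S_W += centered.T @ centered
        d = (mu_c - mu).reshape(-1, 1)
        S_B += len(Xc) * (d @ d.T)
    # Generalized symmetric eigenproblem; ridge keeps S_W positive definite.
    evals, evecs = eigh(S_B, S_W + 1e-6 * np.eye(p))
    return evecs[:, np.argsort(evals)[::-1]]     # best-separating first

# Illustrative check: clusters offset along a low-variance axis (axis 1).
rng = np.random.default_rng(0)
a = rng.normal([0.0, 0.0], [5.0, 0.3], size=(200, 2))
b = rng.normal([0.0, 2.0], [5.0, 0.3], size=(200, 2))
W = lda_directions(np.vstack([a, b]), np.repeat([0, 1], 200))
ld1 = W[:, 0] / np.linalg.norm(W[:, 0])          # points along axis 1
```

The leading eigenvector recovers the class-separating direction even though it carries little of the total variance, which is the behavior contrasted with PCA above.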
The challenge of bridging the gap between PCA and LDA is defining class labels which capture the latent discrete structure of the data. The core idea of iDA is to use an unsupervised clustering method to define classes and use those in LDA to find a low dimensional embedding. This unsupervised method aims to find a latent clustered structure in the data while being unbiased with respect to factors such as batch, population, or cell type. This agnostic approach avoids using recorded experimental class labels, although the iDA algorithm does allow the user to supply an initial set of class labels if desired. Iteration in this case starts with the embedding step instead of the clustering step, using the given labels, but this will simply find an embedding that separates based on the provided labels. An approach that incorporates labels as soft assignments would be helpful, but we leave it for future work.

2.2 iDA Algorithm

The iDA algorithm determines the number of clusters to use in each iteration by maximizing the clustering's modularity, as used in the Walktrap clustering step. If, however, the user would like to find an embedding with a predetermined number of clusters, this can be achieved by setting a user-defined parameter as input to the iDA function. For batch correction, instead of only providing known batch indicators as input class labels to LDA, we assume variation may come from unmeasured sources, and therefore using a variable like batch would not fully capture groups arising from such variation. Moreover, in the case of scRNA-seq, cell types are unknown, and since known marker genes are not defined for all cell types and subtypes, inferring cell types is difficult.
The iDA algorithm is as follows:

1) Extract informative genes to identify those which may be potentially affected by unmeasured variables.

2) Cluster the samples using community detection (Walktrap by default) to get class labels for each sample.

3) Decompose the informative gene expression matrix via Linear Discriminant Analysis (LDA) or Quadratic Discriminant Analysis (QDA) using the class labels obtained from the clustering.

4) Update the class labels by clustering in the discriminant space. The new optimal class assignments can then be used for decomposition until the class assignments have stabilized in the embedding space.

iDA results in an embedding space of dimension $C - 1$, where $C$ is the number of clusters identified by iDA which maximizes the community structure modularity, and each dimension determines a decision boundary which best separates an individual class from the rest. To find reliable clusters which define the discrete latent structure, we use the Walktrap method for community detection in a nearest neighbor graph, which computes the similarities between neighbors based on random walks in the graph (Pons & Latapy, 2005). To compute the shared nearest neighbors graph, first a k-nearest neighbors graph is built with a default number of neighbors. Then, the pairwise Jaccard similarity is computed (the number of common neighbors each vertex pair shares divided by the number of vertices that are neighbors of at least one of the two vertices). The shared nearest neighbors graph is then pruned such that vertex pairs with similarities less than 1/15 are set to zero, and converted to an undirected graph, which is then the input to the Walktrap clustering algorithm. The Walktrap clustering returns a hierarchical clustering. The point at which to cut the tree is determined by computing the modularity at each level of the tree and cutting it where the modularity is maximized.
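The shared-nearest-neighbor graph construction described above can be sketched as follows. This is an illustrative sketch, not the package implementation: the neighbor count k = 10 is an assumed default (the text's default value is not reproduced here), the 1/15 pruning threshold is taken from the text, and the subsequent Walktrap step (available in, e.g., the igraph library) is omitted:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_graph(X, k=10, prune=1 / 15):
    """kNN graph -> pairwise Jaccard similarity of neighbor sets ->
    pruned, undirected shared-nearest-neighbor adjacency matrix."""
    n = X.shape[0]
    _, idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
    neighbor_sets = [set(row) for row in idx]
    adj = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(neighbor_sets[i] & neighbor_sets[j])
            union = len(neighbor_sets[i] | neighbor_sets[j])
            jaccard = shared / union
            if jaccard >= prune:                 # drop weak edges
                adj[i, j] = adj[j, i] = jaccard  # undirected graph
    return adj

# Two well-separated blobs: within-blob edges survive, cross-blob edges vanish.
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0.0, 0.1, size=(30, 2)),
                    rng.normal(5.0, 0.1, size=(30, 2))])
adj = snn_graph(points, k=10)
```

The resulting adjacency matrix would then be handed to the community detection routine, with the hierarchical tree cut at the modularity-maximizing level.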
For our experiments below, we check for convergence as explained in the Convergence Assessment section. Since iDA remains agnostic to the study design, we do not need to worry about recording errors in phenotype or batch data, or self-identification discrepancies. In our implementation we initialize the embedding by selecting features with the highest marginal variance.

2.3 Methods for Performance Assessment

To assess the performance of iDA, we used two data sets. The first is from the Geuvadis bulk RNA-sequencing project, which has expression data from 462 individuals of European and Yoruba ancestry, sequenced in 7 different laboratories (Brown et al., 2018). This data set exhibits a latent structure with an overwhelming amount of variation attributed to batch effects and some variation attributed to population structure. The second data set is the 10X Genomics peripheral blood mononuclear cell (PBMC) single cell RNA-sequencing (scRNA-seq) data set, which has expression from roughly 3,000 unlabeled immune cells (Zheng et al., 2017). We chose these datasets because of the known structure embedded in each (genetic population structure and batch in Geuvadis, and distinct cell types in the PBMC dataset). Note, however, that these labels are not used as input to the iDA algorithm.

2.3.1 Convergence Assessment

To assess the convergence of iDA, we inspect the stability of the cluster assignments resulting from the Walktrap community detection at each iteration. Since the cluster assignments dictate the scatter matrices calculated for the discriminant analysis, once the cluster assignments are stable, the discriminators will not change. To evaluate when the cluster assignments have stopped changing, we calculate the Adjusted Rand Index as a measure of concordance between the cluster assignments of consecutive iterations. The iterations are stopped when the Adjusted Rand Index between the cluster assignment vectors exceeds 0.98.
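The stopping rule can be sketched with scikit-learn's Adjusted Rand Index. The clustering step is passed in as a callable stub, since only the convergence logic described above is shown; the function names and toy labels are illustrative:

```python
from sklearn.metrics import adjusted_rand_score

def run_until_converged(cluster_step, initial_labels,
                        threshold=0.98, max_iter=20):
    """Iterate a clustering step until consecutive label vectors agree,
    judged by Adjusted Rand Index > threshold."""
    labels = initial_labels
    for _ in range(max_iter):
        new_labels = cluster_step(labels)
        if adjusted_rand_score(labels, new_labels) > threshold:
            return new_labels          # assignments have stabilized
        labels = new_labels
    return labels

# Toy step: the clustering immediately stabilizes on `target`,
# so the loop stops once two consecutive assignments agree.
target = [0, 0, 1, 1, 2, 2] * 10
result = run_until_converged(lambda labels: target, [0, 1] * 30)
```

In the full algorithm, `cluster_step` would be the embed-then-cluster iteration, so stable labels also imply stable discriminant directions.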
2.3.2 Cumulative F-Statistic

To compare the capacity of iDA and PCA to accurately find a subspace which describes the underlying clustering in the data, we adapt the F-statistic, as well as $R^2$, to assess models across dimensions. First, we perform linear regressions which model each dimension of PCA and iDA as a function of the observed variables that describe the known underlying group structure (for the Geuvadis data set, this is the interaction between population and laboratory; for the 10X Genomics PBMC scRNA-seq data, this is cell type). To determine the appropriate number of PCs to use in this comparison for each dataset, we evaluated the elbow plots to determine the eigenvectors accounting for the majority of the variation in each dataset (Figure 2.2).

Figure 2.2. Elbow plots to determine the optimal number of dimensions to use in the PCA embedding. To determine how many top PCs to use for each dataset, the eigenvalues are plotted for each PC dimension, and the cutoff is determined based on the "bend" in the plot.

To ensure an appropriate amount of variation was captured in the PC embedding for comparison to iDA, we use the first 10 PCs for both the Geuvadis and PBMC datasets. For each embedding dimension $d$, we fit

$$y_{id} = \beta_{0d} + \sum_{g} \beta_{gd}\,\mathbb{1}[c_i = g] + \varepsilon_{id} \qquad (2.6)$$

where $y_{id}$ is the embedding value of sample $i$ in dimension $d$ and $c_i$ is the known group of sample $i$. Then, to assess how well these known classes are separated across dimensions, we use an adaptation of the F-statistic. We define this cumulative F-statistic as:

$$F_{\text{cum}}(D) = \frac{\left(\sum_{d=1}^{D} \text{TSS}_d - \sum_{d=1}^{D} \text{RSS}_d\right) / (k-1)}{\left(\sum_{d=1}^{D} \text{RSS}_d\right) / (n-k)} \qquad (2.7)$$

where we compute the total sum of squares (TSS) and residual sum of squares (RSS) per regression model and sum over dimensions 1 through $D$. This statistic measures the ratio of the variance of the group means to the within-group variance, accumulated over dimensions, and indicates how well the dimensionality reductions from PCA and iDA describe group separation for known groups.
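One plausible implementation of this cumulative F-statistic, with a group-means fit standing in for the per-dimension regressions; the pooled degrees of freedom ($k-1$ and $n-k$ for $k$ groups and $n$ samples) and toy data are assumptions of this sketch:

```python
import numpy as np

def cumulative_f(embedding, groups):
    """Cumulative F over dimensions 1..D: pool per-dimension TSS and RSS
    from a group-means model before forming the F ratio."""
    groups = np.asarray(groups)
    n, k = embedding.shape[0], len(np.unique(groups))
    tss_sum = rss_sum = 0.0
    stats = []
    for d in range(embedding.shape[1]):
        y = embedding[:, d]
        tss = np.sum((y - y.mean()) ** 2)
        # residual SS after fitting one mean per known group
        rss = sum(np.sum((y[groups == g] - y[groups == g].mean()) ** 2)
                  for g in np.unique(groups))
        tss_sum += tss
        rss_sum += rss
        stats.append(((tss_sum - rss_sum) / (k - 1)) / (rss_sum / (n - k)))
    return stats

# Strong group separation in dimension 0, pure noise in dimension 1:
# adding the noise dimension dilutes the cumulative statistic.
rng = np.random.default_rng(0)
groups = np.repeat([0, 1], 50)
emb = np.column_stack([groups * 10.0 + rng.normal(0, 1, 100),
                       rng.normal(0, 1, 100)])
f_stats = cumulative_f(emb, groups)
```

Because each added dimension contributes both to the pooled TSS and the pooled RSS, the statistic can rise or fall as dimensions are added, which is why it is not monotonic.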
Because both the TSS and RSS change as dimensions are added, the cumulative F-statistic is not monotonic.

2.3.3 Cumulative $R^2$

We also compute a cumulative $R^2$ for dimensions 1 through $D$ to measure the total variance captured in the sub-spaces from each reduction technique over dimensions:

$$R^2_{\text{cum}}(D) = 1 - \frac{\sum_{d=1}^{D} \text{RSS}_d}{\sum_{d=1}^{D} \text{TSS}_d} \qquad (2.8)$$

This cumulative $R^2$ represents the proportion of variance over dimensions that is explained by the underlying grouping.

2.3.4 Clustering Validation

To validate that the iDA method identifies better latent clustering in the original data than the PCA dimensionality reduction, and is not creating spurious clustering, we assess the compactness and separation of clusters in the original space using the Dunn Index (DI). The DI is defined as the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance:

$$DI = \frac{\min_{i \neq j}\, \delta(C_i, C_j)}{\max_{k}\, \Delta(C_k)} \qquad (2.9)$$

where $\delta(C_i, C_j)$ is the inter-cluster distance between clusters $C_i$ and $C_j$, and $\Delta(C_k)$ is the intra-cluster distance (diameter) of cluster $C_k$. To ensure robustness, the Dunn Index is calculated using a bootstrapping approach: observations are sampled with replacement, and the Dunn Index is calculated over 1,000 bootstrap samples.

2.4 Results

2.4.1 iDA preserves cluster structure better than PCA

The cumulative F-statistic evaluates how well a low dimensional embedding captures the separation of clusters in the data. The iDA algorithm was applied to the 10X Genomics scRNA-seq PBMC data set and compared to the PCA reduction of the same dimension. For comparison of the embeddings from iDA and PCA, we used the cell identities assigned by the Seurat scRNA-seq pipeline as the cell types. Figure 2.3A provides evidence that the decision boundaries from iDA offer an embedding which better separates the putative cell types compared to PCA, as determined by the cumulative F-statistic.
As expected, the first dimension of PCA, which captures the most variation in the data, separates one cluster very well (likely the most variable cluster), but as dimensions are added, the projections do not capture the cluster separation. In the iDA embedding, each dimension equally captures the separation of one cluster, so the cumulative F-statistic does not degrade as we add dimensions. iDA also finds LDs which capture a higher proportion of variance explained over all dimensions than the PCs do when evaluating variance between cell types (Figure 2.3B). Although PCA's objective function is to maximize variance, it fails to find variation attributed to some cell types, likely ones with low variation. There is evidence of this in the acute drop-off in both the cumulative F-statistic and $R^2$ values.

Figure 2.3. Cumulative F-statistic and cumulative $R^2$ for the PBMC and Geuvadis datasets. A) As more dimensions are added for each of the reduction methods, the clustering as determined by the Seurat putative cell type assignments for each of the samples is better separated by the iDA embedding than by PCA. B) The iDA embedding captures more of the total variance attributed to the latent clustering than PCA. C) As more dimensions are added for each of the reduction methods, the clustering from the main effects and interaction between laboratory and population for each of the samples is better separated by the iDA embedding than by PCA. D) The iDA embedding captures more of the total variance attributed to the latent clustering than PCA.

We see a similar trend for the statistics computed using the Geuvadis data set. We evaluate the performance of iDA and PCA using the groups from the main effects and interaction between the population and the laboratory in which the samples were processed.
While these variables may not account for all latent clustering in this data, we know they contribute to a large portion of the clustering, and an embedding should capture these clusters (and potentially others). Here, the first dimension of PCA does a very poor job of accounting for the latent clustering attributed to population and laboratory; more evidence that maximizing the variation does not guarantee the best separation of clusters (Figure 2.3C). iDA again consistently finds better separation for this data, as well as better overall variation attributed to these groups (Figure 2.3D). Many of the clusters produced by both iDA and PCA + Louvain (Seurat clusters) are nearly identical between the two methods (Table 2.1). The main difference is that the boundary between Seurat Clusters 0 and 1 changes such that some Seurat Cluster 0 cells get grouped with cells in Seurat Cluster 1.

Table 2.1. Confusion matrix between iDA and PCA + Louvain clustering.

Based on both the tSNE and UMAP of these projections, the boundary between these two clusters is not well defined (Figure 2.4). To determine which clustering is optimal, we compute the bootstrapped Dunn Index (DI) in the full gene expression space for class assignments resulting from iDA compared to the class assignments from other methods over 1,000 iterations. The PCA and iDA embeddings were compared using multiple clustering techniques, including two community detection algorithms common in single cell clustering, Walktrap clustering (the default for the iDA algorithm) and Louvain community detection (Blondel et al., 2008), as well as K-means clustering, which has been shown to be closely linked to the unsupervised dimension reduction of PCA (Ding & He, 2004).

Figure 2.4. tSNE and UMAP visualizations for the PCA and iDA PBMC cell embeddings. Similarly to PCA, the iDA embedding can be used with both tSNE and UMAP to visualize cluster separation.
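The bootstrapped Dunn Index used in these comparisons can be sketched in Python, assuming Euclidean distances (the distance metric, function names, and toy data are assumptions of this sketch):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    """Smallest between-cluster distance / largest within-cluster diameter."""
    X, labels = np.asarray(X), np.asarray(labels)
    clusters = [X[labels == c] for c in np.unique(labels)]
    inter = min(cdist(a, b).min()
                for i, a in enumerate(clusters) for b in clusters[i + 1:])
    intra = max(pdist(c).max() for c in clusters if len(c) > 1)
    return inter / intra

def bootstrapped_dunn(X, labels, n_boot=1000, seed=0):
    """Resample observations with replacement and recompute the index."""
    X, labels = np.asarray(X), np.asarray(labels)
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))
        if len(np.unique(labels[idx])) < 2:
            continue                  # a valid index needs >= 2 clusters
        out.append(dunn_index(X[idx], labels[idx]))
    return np.array(out)

# Compact, well-separated toy clusters give a Dunn Index above 1.
rng = np.random.default_rng(2)
pts = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
                 rng.normal(5.0, 0.3, size=(20, 2))])
labs = np.repeat([0, 1], 20)
di = dunn_index(pts, labs)
boot = bootstrapped_dunn(pts, labs, n_boot=20)
```

Comparing the bootstrap distributions of this index across competing label sets, as done for the PCA and iDA clusterings, indicates which assignments are more compact and better separated in the original space.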
The clusters found in the PBMC scRNA-seq dataset using the iDA embedding with each of the community detection algorithms -- Walktrap (green) and Louvain clustering (light blue) -- both have Dunn Indices which are significantly increased compared to the clusters found with the PCA embedding and Louvain clustering (i.e., the clusters found with the Seurat pipeline) (yellow) (Figure 2.5A). In the Geuvadis dataset, the Dunn Indices of the clusters found by all clustering methods with the iDA embedding are significantly increased compared to the clusters found by the PCA embedding with Louvain clustering as well as by the recorded batch (laboratory) and population of each sample (Figure 2.5B). This indicates that the class assignments resulting from iDA define clusters in the original space which are compact and better separated than class assignments determined by either the Seurat pipeline for scRNA-seq or the measured covariates in Geuvadis.

Figure 2.5. Dunn Index calculated for bootstrapped samples for each dataset. Dunn Index computed for bootstrapped samples from A) PBMC cells and B) Geuvadis samples using either the PCA embedding or the iDA embedding and several different clustering methods. The sampling was performed by taking n = number of observations in each dataset, with replacement, bootstrapped 1,000 times. A) The iDA algorithm with the community detection algorithms (Walktrap and Louvain) both have significantly higher Dunn Indices than the PCA embedding with Louvain clustering. B) All clustering methods with the iDA algorithm generate significantly better clusters with respect to the Dunn Index when compared to the PCA embedding and the known batch and population.

2.4.2 iDA directions are highly correlated with known PBMC cell type markers

Canonical markers of particular cell types have been characterized in PBMCs and can be used to validate clustering methods (Zheng et al., 2017).
These markers have been experimentally elucidated such that cells with high expression of these markers are likely of the corresponding cell type based on their known function. CD14 and LYZ are known markers for monocytes, MS4A1 is a marker for B-cells, GNLY and NKG7 are markers for Natural Killer (NK) cells, and PPBP is a marker for platelets. If the clustering algorithm captures clusters indicative of cell type, the resulting clusters should reflect this in the marker genes' expression. For example, LD6 is the directional vector which separates cluster 1 (with cells in cluster 1 mapping to the lowest weights of the vector). We see that the cells in cluster 1 highly express the two markers for monocytes, CD14 and LYZ (Figure 2.6A). Quantitatively, both markers for monocytes have much higher mean scaled expression in these cells than in the cells in all other clusters (Table 2.2).

Table 2.2. Comparison of mean expression of canonical PBMC cell type markers.

The LD3 vector separates the cluster of cells (cluster 6) which highly express the markers of NK cells, GNLY and NKG7 (Figure 2.6B). These markers have much higher average scaled expression in this cluster than in cells outside it. Additionally, iDA is sensitive enough to separate the cluster of cells with high expression of both marker genes (as opposed to only one) from cells in cluster 2, which are high in NKG7 expression but not GNLY, indicating that cluster 6 is a cluster of NK cells. LD5 separates the cluster which highly expresses the marker for B-cells, MS4A1 (Figure 2.6C), also with a much higher overall mean scaled expression compared to cells in all other clusters (1.971 versus -0.291). This indicates cluster 5 is likely a population of B-cells. Lastly, a marker for platelets, PPBP, is highly expressed in the cells in cluster 9 identified by the iDA algorithm.
This cluster is separated by LD1 and shows a clear separation, with the platelet cells mapping to high LD weights (Figure 2.6D).

Figure 2.6. Putative cell type clusters found with iDA are separated both by known cell type markers and in the iDA embedding. A) The group of cells which highly express both LYZ and CD14, known markers for monocytes, is separated by the iDA dimension LD6 (salmon colored points). B) The group of cells which highly express GNLY and NKG7, known markers for NK cells, is separated by the iDA dimension LD3 (light blue colored points). C) The group of cells which highly express MS4A1, a known marker for B-cells, is separated by the iDA dimension LD5 (teal colored points). D) PPBP is a known marker for platelets. The group of cells which highly express PPBP is separated by the iDA dimension LD1 (pink colored points). E) FCER1A is a known marker for dendritic cells. The group of cells which highly express FCER1A is separated by the iDA dimension LD4 (purple colored points).

2.4.3 iDA directions are highly correlated with SNPs in eQTL

The Geuvadis data set has known sub-population features attributed to ancestry and the location where the samples were sequenced. iDA identifies an LD which separates the YRI population from the EUR populations (LD3). Interestingly, this LD purely captures the separation in population and groups together all samples sequenced in different laboratories. Brown et al. (2018) show that expression reflects population structure because alleles, which may have different frequencies among different populations, can affect the expression level of nearby genes. When this relationship occurs, these allele-gene pairs form expression quantitative trait loci, or eQTL. The Geuvadis data set has 116 SNP-gene pairs which are in eQTL only in YRI individuals and not in EUR individuals. The average difference between the YRI minor allele frequency (MAF) and the EUR population's MAF is a substantial 0.175.
Since these alleles are in eQTL with a gene pair and the allele frequencies differ between populations, the effect on expression level will also differ between populations. These SNP allele frequencies vary greatly between the YRI and EUR populations (Table 2.3).

Table 2.3. Minor allele frequencies per population of top eQTL SNP/gene pairs.

Figure 2.7. iDA projection 3 separates populations. SNP-gene pairs in eQTL affect expression dependent on the sample population since the SNP MAF is variable across populations. The iDA dimension LD3 separates the YRI population from the European ones. A) RAB5C, which is in eQTL with SNP rs11757158 in the YRI population, has higher average expression than in the European populations. B) SRF, which is in eQTL with SNP rs11413536 in the YRI population, has higher average expression than in the European populations. C) PSKH1, which is in eQTL with SNP rs143415501 in the YRI population, has higher average expression than in the European populations. D) PWAR6, which is in eQTL with SNP rs200846953 in the YRI population, has higher average expression than in the European populations.

The MAFs differ by as much as 0.36 (as seen in the rs11757158-RAB5C pair), and in some cases, the minor allele is not present at all in the EUR population (rs143415501). The genes in Figure 2.7A, B, and C clearly show increased expression of their respective eQTL genes in YRI, while panel D shows a decrease in expression of PWAR6 in YRI as compared to the EUR samples, mirroring the difference in MAF between the YRI population and the EUR populations.

2.4.4 iDA directions are highly correlated with known technical noise

Another point to consider is whether technical noise is being further entrenched in the embeddings found by iDA. To ensure iDA is able to capture variation in datasets with unknown technical noise, we performed iDA on a control dataset designed so that all variation is attributed to technical noise (Zheng et al., 2017).
This dataset consists of droplets which were loaded with the same ratio of 92 External RNA Controls Consortium (ERCC) synthetic RNAs into GEMs. Since there are no differences in cell size, RNA content, or transcriptional activity between these GEMs, the only variation is technical noise introduced by the differing number of UMIs per GEM. The embeddings iDA produces on the raw counts and on the normalized counts confirm that the entrenchment of technical noise can arise from the normalization technique (Townes et al., 2019). In this negative control ERCC dataset, the iDA embedding of the raw counts clearly separates GEMs by the technical variation introduced by the mean UMI count per GEM (Figure 2.8B), but the embedding of the log normalized counts does not produce this separation (Figure 2.8A). Therefore, any entrenchment of technical noise in the iDA embedding is a result of the normalization step, and not of the iDA algorithm itself.

Figure 2.8. tSNE of iDA embedding for ERCC negative control dataset. tSNE of the iDA embedding for the ERCC negative control dataset on both the scaled counts (A) and raw counts (B). iDA captures the variation attributed to the technical noise when present in the data (B). If normalization techniques have entrenched the technical noise (Townes et al., 2019), iDA will not be able to separate the technical variation (A).

To show further evidence of accurate iDA clustering and dimensionality reduction, we applied both iDA and PCA with Walktrap clustering to two single cell benchmarking datasets from the 10X Genomics platform with known cell type labels (Tian et al., 2019). Known populations of three (10X-3cl) and five (10X-5cl) cell lines, respectively, were sequenced using the 10X Genomics single cell RNA sequencing platform for these two datasets.
The tSNE plot of the PCA reduction of log counts for each cell reveals discrete clustering within each cell line, but there is also a moderate amount of heterogeneity in each cell line (particularly in H1975 and A549, which each separate into distinct sub-clusters) (Figure 2.9A). When iDA is applied to each of these datasets without any guidance on how many cell lines are in each dataset, the resulting clustering is sensitive to the cell type sub-clustering. Because of this sensitivity to within-cell-type heterogeneity, when compared to the known cell type labels, the Adjusted Rand Index (ARI) for iDA clustering is .519 on the 3 cell line dataset and .798 on the 5 cell line dataset. Comparatively, when unsupervised clustering is applied to the PCA embedding, the ARI between these clusters and the known cell type labels is .457 for 10X-3cl and .615 for 10X-5cl.

Figure 2.9. Method comparison on pseudo-cell scRNA-seq dataset. A) tSNE of the Tian et al. (2019) 10X Genomics single cell RNA sequencing benchmarking datasets. Distinct clustering is visualized for each of the known cell lines, but there is a substantial amount of heterogeneity within each cell line for both datasets. B) Cumulative F-statistics for the iDA supervised embedding, iDA unsupervised embedding, and PCA embedding. The iDA supervised embedding outperforms the other methods in capturing variance attributed to the known cell line labels. C) Pseudo-cell mixtures with different ratios of RNAs from 3 cell lines. Visually, PCA separates the homogeneous pseudo-cells (red, green, blue), but the mixed populations form a continuous spectrum between the 3 homogeneous ones. The cumulative F-statistic for each dataset shows that iDA unsupervised, iDA supervised, and PCA all perform comparably on these non-discrete datasets.

In the case where the number of cell lines in a sample is known, this parameter can be given to the Walktrap clustering.
In these two datasets, because there is such a great difference between the expression profiles of each cell line, the supervised clustering yields much better results. When iDA is applied with the number of known clusters given, the resulting ARIs are .997 and .986 for 10X-3cl and 10X-5cl, respectively. PCA with supervised clustering also does well, with ARIs of 1.00 for 10X-3cl and .987 for 10X-5cl (Table 2.4).

Table 2.4. Adjusted Rand Index for clusters from iDA versus PCA embeddings.

2.4.5 iDA Converges

When assessing the convergence of iDA, we evaluate the concordance of cluster assignments at each iteration with the Adjusted Rand Index between the two sets of class assignments. We set a concordance threshold of 0.98 as the stopping criterion for iDA. Each data set reaches the threshold within 3 iterations of iDA (Figure 2.10). Since the discriminants in iDA are determined based on the cluster assignments, if the assignments reach stability between iterations, the discriminants will be stable and unchanged between iterations as well.

Figure 2.10. iDA Convergence. Convergence of the iDA algorithm is measured by the concordance of the class assignments of each iteration with those of the previous one. The convergence threshold for iDA is set to concordance = 0.98. Both the Geuvadis and PBMC data sets converge within 3 iterations.

The speed of iDA is heavily dependent on the number of variable features used initially, since each iteration decomposes the gene-by-gene matrix. Although PCA is faster, iDA runs within 6 minutes for datasets with fewer than 3,000 variable genes and 500 samples. iDA identifies 2,829 and 3,000 variable genes in the Geuvadis and PBMC datasets and runs in five and three minutes, respectively (Figure 2.11). In the breakdown of time allocation for the iDA algorithm, the SVD takes the most time for each of the data sets, since a decomposition is computed in each iteration.

Figure 2.11. iDA runtime.
Run time of the iDA algorithm versus the PCA reduction for both the Geuvadis and PBMC data sets. PCA is much faster than iDA, as it only computes one decomposition, while iDA computes a decomposition and clustering in each iteration until convergence. The Geuvadis data set, with 462 samples and 1,802 iDA-identified variable genes, runs within 3 minutes. The PBMC data set, with 2,638 cells and 3,000 iDA-identified variable genes, runs within 4.47 minutes. 2.5 Discussion Discovering the latent structure in high dimensional data is essential in many applications. The latent structure can be attributed to batch, which needs to be corrected for before downstream analysis to avoid spurious associations as well as reduced power in association testing. The latent structure can also define biological clusters which researchers are interested in evaluating, such as in scRNA-seq, where identifying a set of features which define cell types is of main concern. In standard pre-processing using PCA, as well as in methods which use PCA to identify sources of variation in batch-effect cases, classes which are distinctly defined but have low variation will not be maximally separated. Although neither PCA nor iDA is specifically a batch correction method, the resulting low dimensional embedding which captures unwanted variation can in turn be used as covariates in downstream analysis to produce more accurate results. For example, the iDA embedding which captures the separation of batches in the scaled Geuvadis counts clearly separates the samples by the individual laboratories in which they were processed and could be used as covariates in downstream analysis to account for this separation (Figure 2.7C). This approach, namely, adding embeddings to control for unwanted structure in downstream analysis, is preferred to another common approach of removing the first few embedding directions from the original input data to produce corrected datasets for further downstream analysis.
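The covariate-adjustment approach described above can be illustrated with a small simulation. The sketch below uses pure Python with hypothetical variable names (a real analysis would typically use a modeling framework such as lm or limma in R): a latent embedding dimension is regressed out of both the outcome and the condition of interest, and the adjusted condition effect is then estimated on the residuals, which is equivalent to including the embedding as a covariate in a joint linear model.

```python
import random

def simple_slope(x, y):
    """OLS slope of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

def residualize(y, x):
    """Residuals of y after regressing out x (with intercept)."""
    b = simple_slope(x, y)
    n = len(x)
    a = sum(y) / n - b * sum(x) / n
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

random.seed(0)
n = 200
condition = [random.randint(0, 1) for _ in range(n)]   # effect of interest
ld1 = [random.gauss(0, 1) for _ in range(n)]           # hypothetical latent dimension
# Simulated expression: true condition effect = 2.0, latent effect = 3.0
expression = [2.0 * c + 3.0 * l + random.gauss(0, 0.1)
              for c, l in zip(condition, ld1)]

# Regress the latent covariate out of both sides, then estimate the
# adjusted condition effect on the residuals (close to the true 2.0).
effect = simple_slope(residualize(condition, ld1),
                      residualize(expression, ld1))
```

Without the residualization step, the estimate would absorb whatever chance correlation exists between the condition and the latent structure, which is exactly the spurious association the Discussion warns about.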
Accordingly, we have focused our analyses in this paper on the preferred approach. One advantage of PCA embeddings is the interpretability of the variability contribution each dimension adds when included in the embedding. Often, the respective eigenvalues of the top PCs are plotted to determine the "elbow" which determines the ideal number of top PCs to include in the final embedding (Figure 2.2). This offers a natural ordering of PCs based on the amount of variation each PC accounts for in the data. Because iDA also finds embeddings with an SVD, the order of the resulting LDs corresponds to “easy-to-separate” clusters. For example, in the PBMC dataset, LD1 separates the platelet cells, which are best separated from other cell types (Figure 2.5D), while the later LDs are not as well separated (Figure 2.5 A,C,E). Several neighbor-preserving non-linear dimensionality reduction techniques exist for data visualization, such as t-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP). These methods are particularly popular for scRNA-seq dimensionality reduction and visualization. tSNE constructs a probability distribution for all pairs of samples in the data such that samples close in Euclidean distance in the original data space have a higher probability of being placed near each other in the resulting lower dimensional space than those which are far apart in the original data space, therefore preserving the neighborhood structure in a two or three dimensional space (van der Maaten & Hinton, 2008). tSNE has a few shortcomings, particularly when applied to scRNA-seq. tSNE does not scale well, which is problematic with the rapidly growing number of cells being sequenced for scRNA-seq. iDA is more scalable for large numbers of samples since the time-consuming SVD step in iDA depends on the number of variable features and not the number of samples.
Since tSNE computes the k-nearest neighbor graph as an initial step, it is infeasible to compute this in the original high dimensional space, so a preprocessing step like PCA is used to reduce the number of dimensions before applying tSNE, which means it can fall victim to the downfalls of PCA as discussed before. The newer method, UMAP, is also a neighborhood-preserving method with several adaptations to tSNE, such as a cost function which attempts to preserve global structure (McInnes et al., 2018). While t-SNE or UMAP may suffice for data visualization, the embedding only offers visual information on dissimilarity between the samples and loses the interpretability that the iDA embedding offers. The tSNE and UMAP embeddings do not have weights of the features which define the differences between samples, which renders the embeddings unable to explain which features drive cluster separation or to define features that may be used as covariates in downstream analysis. The SVD in iDA yields a weight matrix of genes by linear discriminants. These gene weights allow for the interpretation of specific features which separate latent clusters. With these interpretable embeddings, we can visualize the separation of the clusters in each dimension of the embedding and validate these clusters with known properties of each group. For instance, to validate the putative cell types found in the scRNA-seq data set, we compare the expression of known markers to LDs which separate the identified cell type clusters. The cell type clusters identified by the iDA algorithm show corresponding levels of expression for known markers of these cell types. The iDA embedding and clustering is also sensitive to lowly-abundant cell types. In PBMCs, lymphocytes (T cells, B cells, NK cells) are the majority of the cell population, monocytes are the next most abundant, and dendritic cells and platelets are the rarest (Kleiveland, 2015).
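Because the weight matrix has one row per gene and one column per linear discriminant, the genes driving a given LD can be read off directly. A minimal sketch in Python, where the gene names and weights are purely illustrative rather than values from the actual analysis:

```python
def top_features(weights, feature_names, dim, k=3):
    """Return the k features with largest |weight| in one embedding dimension.

    weights: rows = features (genes), columns = embedding dimensions (LDs).
    """
    scored = sorted(
        ((abs(row[dim]), name) for name, row in zip(feature_names, weights)),
        reverse=True,
    )
    return [name for _, name in scored[:k]]

# Toy 4-gene x 2-LD weight matrix (illustrative values only); PPBP is a
# platelet marker, CD79A and MS4A1 are B cell markers, GNLY an NK marker.
genes = ["PPBP", "CD79A", "MS4A1", "GNLY"]
W = [[0.90, 0.10],
     [0.05, 0.70],
     [0.02, 0.60],
     [0.30, 0.05]]
```

With these toy weights, `top_features(W, genes, 0, k=1)` recovers the platelet marker as the dominant gene on the first dimension, mirroring how marker expression is compared against the LDs that separate the putative cell type clusters.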
In the iDA embedding of the PBMC dataset, the least abundant platelet cell population is the cell cluster which is best separated from the others (in LD1), and the dendritic cell population is captured in LD4 (Figure 2.5). Therefore, iDA will identify rare populations that are sufficiently separated. Expression is known to reflect population structure because of eQTLs (Brown et al., 2018), so it is essential to identify this source of variation as a covariate if it is not the desired response variable. In the Geuvadis data set, of the 116 SNPs which are in eQTL in only the YRI individuals, the average difference between the YRI MAF and the EUR population’s MAF is a substantial 0.175. This means that, on average, SNPs in eQTL in the YRI population but not in the EUR populations have a frequency difference in the population of 17.5%. Since these SNPs are in eQTL in only the YRI population, they are affecting the expression of the gene in eQTL with them much differently than in other populations. The difference in population structure affecting MAFs means that the expression of genes in eQTL will be affected by the population structure as well, as seen in Figure 2.7, causing a latent clustering of expression. The genes in eQTL with population-dependent alleles are recapitulated in the separation of populations by LD as well as gene expression. When comparing the embeddings that iDA produces to those of PCA, the cumulative F-statistic shows the iDA supervised embedding outperforms the other methods in both datasets with respect to the known cell types. The iDA unsupervised embedding outperforms its PCA counterpart in 10X-5cl and does just as well as the PCA embedding in 10X-3cl (Figure 2.9B). In instances where the nature of the latent structure is from discrete sources of variation such as batch effect, population stratification, and discrete cell types, iDA is able to capture these clusters well in the resulting embedding and clustering.
iDA leverages LDA in its algorithm to find the embedding and therefore is not as effective at finding an accurate embedding in scenarios where the latent structure is of a continuous fashion, such as in developmental biology. In such datasets, where cell types are differentiating and form a continuous trajectory, the iDA embedding may find natural or unnatural breaks in trajectories and the resulting embedding will produce a more discrete structure than actually exists. Because of this, the use of iDA is most appropriate in datasets where the latent structure is thought to consist of relatively discrete clusters. The "pseudo cell" mixture datasets from the CellBench experiments, which include 9 cell combinations of the same 3 cell lines yielding 34 unique groups with a continuous structure, show the case where cell lines are not discretely separated from each other (Figure 2.9C). The cumulative F-statistic shows that even in the case of continuous, trajectory-like structure between mixes of different cell type compositions, iDA unsupervised (no number of clusters given as input) and iDA supervised (the estimated number of latent clusters given as input) either outperform or match the performance of a PCA embedding with respect to the ratio of within- to between-cluster separation between cell type groupings. In standard pre-processing using PCA, classes which are distinctly defined but have low variation will not be separated on the axes of maximum variation. We have shown that iDA is sensitive to classes with unequal within-class variation, and that the iDA embedding better separates clusters and identifies the class assignments which determine a more optimal clustering in the original data space, as evidenced by the Dunn Index. The iDA embedding can be evaluated to determine which features contribute the most to the class separation and can be used as covariates in downstream analysis to correct for this latent clustering.
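The Dunn Index used as evidence above rewards clusterings whose clusters are compact and far apart: it is the smallest between-cluster distance divided by the largest within-cluster diameter, with larger values indicating a better partition. A brute-force sketch in Python, assuming Euclidean distance (practical implementations use more efficient pairwise-distance routines):

```python
from math import dist

def dunn_index(points, labels):
    """Dunn Index: min between-cluster distance / max within-cluster diameter."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    # Largest intra-cluster diameter (worst-case compactness)
    diam = max(
        dist(a, b)
        for pts in clusters.values()
        for i, a in enumerate(pts) for b in pts[i + 1:]
    )
    # Smallest distance between points in different clusters (separation)
    keys = list(clusters)
    sep = min(
        dist(a, b)
        for i, ka in enumerate(keys) for kb in keys[i + 1:]
        for a in clusters[ka] for b in clusters[kb]
    )
    return sep / diam
```

For example, two tight clusters of diameter 1 whose closest points are 10 apart score a Dunn Index of 10; a partition that splits a continuous trajectory scores much lower because the separation term collapses.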
2.6 Software Availability R (http://www.r-project.org) code for the iDA package is available at https://github.com/reesyxan/iDA. Chapter 3: Introduction to investigating -omics data types in neurons 3.1 Central dogma of biology within neurons Neurons are a unique classification of cell types in many ways, but one of the most notable differences between them and homogeneous, symmetric cells is their unique structure. Like other cell types, neurons have a centralized cell body housing the nucleus, which contains each cell's DNA. In addition, neurons possess long projections, dendrites and axons, that enable them to connect to and communicate with neighboring cells, as well as cells that are extremely far away. With this unique structure come unique regulatory mechanisms to molecularly support the growth, development, and homeostatic maturity of neurons and their subcompartments (Fernandopulle et al., 2021; Holt & Schuman, 2013). Neurons transport molecular packages and organelles bidirectionally, both from nucleus to axon (anterograde) and axon to nucleus (retrograde), using a series of polarized microtubule tracks with mechanisms known as fast axonal transport and slow axonal transport (Brady et al., 2014; Dalla Costa et al., 2021; Ganguly et al., 2021). The motor proteins kinesin and dynein work together to transport materials in the plus-end direction and minus-end direction, respectively, and offer an energetically efficient means for axonal transport and presynaptic replenishment (Guedes-Dias & Holzbaur, 2019). Fast axonal transport moves packages at a rate of up to 2 micrometers per second. Considering axons can be millimeters to meters long, it can take hours to days to transport vital proteins, organelles, and other molecules needed to support neuronal functions (Maday et al., 2014).
Recent studies have shown that in addition to organelle and protein axonal localization, some neurons localize translationally silenced mRNAs to axonal subcompartments where they can then be translated locally in order to meet the axonal local proteomic requirements during development and maturity (Dougherty, 2017; Glock et al., 2021; Heiman et al., 2008; Ostroff et al., 2019; Sanz et al., 2009; Shigeoka et al., 2016; Tushev et al., 2018). This process of local protein translation greatly reduces the time-sensitive anterograde burden and enables neurons to quickly translate the needed proteins locally, where they will be functionally incorporated. 3.2 Regulation of local protein translation Axonogenesis and synaptogenesis have been shown to be regulated by differential mRNA shuttling to axonal subcompartments where local protein translation takes place (Preußner & Heyd, 2016; Shigeoka et al., 2016). The local protein translation process can be modulated by extrinsic and intrinsic cues (Jung et al., 2012). An example of the coupling of external growth cues with a shift in local protein translation is beta-actin translation in response to the growth signaling cue Netrin-1, where attractive Netrin-1 causes 3’ untranslated region (3’ UTR)-mediated selective localization of beta-actin within axonal growth cones (Leung et al., 2006; Welshhans & Bassell, 2011; Yao et al., 2006). In parallel, past research has shown that local protein translation is an activity-dependent mechanism for rapid protein production to direct axon growth during development (Choe & Cho, 2020). In particular, increased neuronal activity leads to an increase in local translation mediated by microRNAs (miRNAs) and RNA binding proteins (RBPs) in the neuropil (Jung et al., 2014; Narayanan et al., 2007; Sutton et al., 2004). Cis-acting miRNAs and RBPs translationally silence localized mRNAs by binding to UTR regions, creating a physical block for translational machinery binding.
In the case of beta-actin, past studies have shown that neuronal activity “unmasks” beta-actin mRNA molecules by causing cis-acting UTR-bound elements to disassociate from the mRNA, therefore making them available to ribosomal machinery (Buxbaum et al., 2014). This UTR unmasking was positively correlated with the rate of beta-actin translation. Together, this suggests that shifts in local protein translation caused by a disruption of extrinsic or intrinsic cues could lead to developmental defects. 3.3 ipRGCs to study activity-dependent local protein translation Researchers have turned to developing new experimental methods to study the dynamic properties of local protein translation in both developing and mature neural circuits. To study the phenomenon of the dynamic shifts in the translatome populations across development as well as in perturbed activity states, we leveraged the unique properties of intrinsically photosensitive retinal ganglion cells (ipRGCs). ipRGCs are a class of retinal ganglion cells (RGCs), making up just 1%-3% of the total RGC population (Do & Yau, 2010), that express the photopigment melanopsin (Opn4). Unlike conventional RGCs, Opn4 expression in ipRGCs enables them to respond to light cues independent of rods and cones and to control functions such as circadian photoentrainment, the pupillary light reflex, and metabolism (Beier et al., 2020; D. Berson, 2003; Hattar, 2002; Preußner & Heyd, 2016; Renna et al., 2011; Berry et al., 2023). ipRGCs are responsible for relaying environmental photic information from the retina to several brain targets, and they possess three main properties that provide a model system for studying activity-dependent developmental local translation: 1. ipRGCs have physical separation between cell subcompartments, with cell bodies in the retina and long range axonal projections in the brain; 2. ipRGCs have cell type-specific functions and innervation patterns allowing the evaluation of cell type-specific effects; and 3.
ipRGCs are genetically tractable, and their intrinsic photosensitivity can be exploited to perturb neuronal activity levels. The aggregation of these properties provides an ideal system to study dynamic, activity-dependent translatomic shifts in neuronal cell types. 3.3.1 ipRGC subcompartments are physically separated Studying local subcompartment translational regulation is essential for understanding subcompartment-specific changes. In order to study subcompartment-specific local translation, cell subcompartments (somata versus axons) must be physically separable from one another. ipRGC somata are born in the ganglion cell layer during retinal cell type neurogenesis at embryonic days 10-12 (Lucas & Schmidt, 2019; McNeill et al., 2011). They then begin to extend projections during neurite outgrowth to several brain targets, including the suprachiasmatic nucleus (SCN) and the dorsal lateral geniculate nucleus (dLGN), where they form long-range connections from eye to brain (Do, 2019). At maturity, these axonal tracts run ~10 mm from retina to the SCN and ~15 mm from retina to the dLGN (K.-Y. Kim et al., 2019). Because these neurons innervate several brain regions that can be microdissected separately, the translatomes in the somata within the retinas can be harvested separately from axonal compartments in the SCN and dLGN for translatomic comparison. 3.3.2 ipRGCs are defined by functional, developmental, and regional types ipRGC types have been characterized as modulating a wide array of biological processes, such as regulation of sleep/wake cycles, metabolism, hunger patterns, and hormone production (Lazzerini Ospri et al., 2017). ipRGCs form parallel pathways to several brain targets to photoentrain these functional roles.
ipRGCs projecting to the brain’s master circadian pacemaker, the suprachiasmatic nucleus (SCN), play a pivotal role in coordinating the circadian rhythm with photic input, while ipRGCs projecting to the dorsal lateral geniculate nucleus (dLGN) play a role in visual perception. By postnatal day 0 (P0), ipRGC axons have significantly innervated multiple visual areas of the brain, including the dLGN (McNeill et al., 2011). However, this is not the case for the SCN. Although ipRGC axons pass by the SCN, they do not begin innervating the structure for several days (and do not finish doing so for several days more). The ipRGCs that innervate the SCN are known to be M1 ipRGCs, and even more specifically, are a molecularly defined subset of M1 ipRGCs because they lack the expression of Brn3b (Chen et al., 2011; Do, 2019). M1 ipRGCs innervating the SCN are developmentally delayed when compared to non-M1 ipRGCs innervating other areas of the visual brain, including the dLGN. Based on these findings, ipRGC types are developmentally and regionally unique; SCN-projecting M1 ipRGCs have a different developmental timeline than dLGN-projecting non-M1 ipRGCs. 3.3.3 ipRGCs are genetically tractable ipRGCs express the photosensitive protein melanopsin (Opn4). ipRGC axon development and early synaptogenesis in central targets occur during a period when the photosensitivity of the retina is determined solely by the presence of melanopsin (D. M. Berson, 2002; Fu et al., 2005; Hattar, 2002; Provencio et al., 1998; Renna et al., 2011). Melanopsin-driven phototransduction has been shown to be critical in maintaining developmental features including retinal wave properties, retinal vasculature development, and retinofugal projection development (discussed further in chapter 5) (Renna et al., 2011; Tiriac et al., 2018).
In addition, the melanopsin phototransduction pathway also drives behavioral and developmental phenomena such as neonatal photoaversion and synaptogenesis (Caval-Holme et al., 2022; Hu et al., 2022). Particularly, melanopsin-driven phototransduction in ipRGCs directly affects synaptogenesis in various brain targets by way of oxytocin release in the supraoptic nucleus and the paraventricular nucleus (PVN) (Hu et al., 2022). The direct link between melanopsin-driven activity in ipRGCs during development and intracellular translational regulation driving such developmental features has yet to be explored. Because of this evidence of the regulation of local translation by neuronal activity, melanopsin expression dictating action potentials evoked by photic stimuli may play an important role in regulating activity-dependent plasticity of local protein production in the development of ipRGC circuits. Since Opn4 is both a unique cell type marker and controls a whole host of phototransduction-dependent developmental features, it is an ideal candidate to study ipRGC-specific transcriptomics and translatomics. 3.4 Neuronal Translatomes: Axon-TRAP-RiboTag Evaluating the different regulatory processes throughout the central dogma pipeline from DNA to protein is critical for understanding developmental and homeostatic neuronal processes. Tight regulation of transcription and translation allows for not only cell type functional specificity, but also cell compartment functional specificity, especially in systems where transcription and translation occur in distant compartments of the cell, such as neural tissue. Specifically, translational regulation within cell compartments allows for complexity and precision of individual cell compartments.
Studying neuronal compartment translatomes has become vital for investigating axonal regulatory properties by quantifying the local protein translation in axons that enables rapid responses to cell-cell signaling cues during development, as discussed in previous sections. Axon-TRAP-RiboTag is a new experimental method developed by the Holt Lab to quantify the translatome in axons (Shigeoka et al., 2018). RiboTag is a genetic approach which allows for immunoprecipitation of ribosome-bound mRNA in the presence of a knocked-in Cre driver by labeling ribosomes with an HA epitope. RiboTag consists of a knocked-in floxed wildtype Rpl22 exon 4 followed by an Rpl22 exon 4 which has an inserted HA epitope. When crossed with a mouse line containing a Cre recombinase driver, cells which express the Cre recombinase will also express the Rpl22HA mutant (Figure 3.1). The HA-positive ribosomes can then be isolated by way of immunoprecipitation using an antibody for HA. Figure 3.1. RiboTag genetic approach with Opn4 Cre driver. RiboTag consists of a knocked-in floxed wildtype Rpl22 exon 4 followed by an Rpl22 exon 4 which has an inserted HA epitope. When crossed with a mouse line containing a Cre recombinase driver (such as Opn4 shown in this figure), cells which express the Cre recombinase will also express the Rpl22HA mutant (green cells in bottom row). Cre-negative cells will not have HA-labeled ribosomes (orange cells in bottom row). In the case of ipRGCs, melanopsin mutant Opn4CRE transgenic mice (Ecker et al., 2010) were crossed with the transgenic mouse line Rpl22HA (Sanz et al., 2019), which has an HA epitope tag embedded in the coding region of the large ribosomal subunit protein (illustrated in Figure 3.1). This cross produced a colony of mice that generates HA epitope-tagged ribosomes only in the presence of Cre recombinase. In the case of Opn4CRE x Rpl22HA, the Cre recombinase is expressed in cells that express Opn4, labeling ribosomes in ipRGCs.
Ribosome-bound mRNAs in tissue collected from these mice can then be isolated using an anti-HA antibody, effectively isolating the ipRGC translatome, and the mRNAs can then be sequenced using low-input RNA-seq protocols. 3.5 Contributions New experimental protocols are pushing the bounds of attainable information explaining how cells develop, function, and maintain homeostasis. Experiments using complex, cutting-edge sample collection and isolation protocols like axon-TRAP-RiboTag require careful computational exploration to validate the data. Particularly when experiments require several experimental batches, understanding sources of variation (unwanted and wanted biological and technical contributions) is extremely important to ensure validity of downstream results. A challenge in axon-TRAP-RiboTag experiments is consistency of the isolation of the translating mRNA population from the bulk mRNA population. There is a need for metrics to evaluate the efficiency and consistency of the protocol across experimental and technical variables. Only after potential batch effect contributors are identified and removed by sample filtering or batch correction methods can downstream analyses like differential expression analysis or gene set enrichment analysis be done. In chapter 4, we provide a case study of special considerations for a complex, multi-batch high throughput experiment. We investigate the multi-faceted heterogenic contributions of a study using the axon-TRAP-RiboTag translatomic isolation protocol in a neuronal cell type. Our results highlight that batch-correction methods may reduce signal due to true biological heterogeneity in addition to technical noise. Further, we show axon-TRAP-RiboTag experiments are subject to varying degrees of isolation efficiency, with less-efficient pulldowns having a closer resemblance to bulk RNA-seq. We offer metrics to help identify biological signal-driven batch differences specific to translatomic experiments.
In chapter 5, we employ our understanding of variational contributions from chapter 4 in the intrinsically photosensitive retinal ganglion cell (ipRGC) -omics case study to explore biological transcriptomic and translatomic coordination. Our analysis revealed that ipRGCs participate in subcompartment-specific local protein translation. Genetic perturbations of Opn4 photopigment-driven neuronal activity led to global tissue transcriptomic shifts in both the retina and brain targets, but the ipRGC axonal-specific translatome was unaltered. Chapter 4: Validating the efficacy of Axon-TRAP RNA-Sequencing Results in Multi-Batch Studies 4.1 Background 4.1.1 Multi-batch processing in high-throughput studies Since their introduction, bulk RNA-sequencing studies have helped experimentalists better understand steady-state RNA levels and have become increasingly more accessible. In order to expand the range of biological information that can be obtained beyond RNA steady-state levels, experimental protocols have grown more complex, enabling quantification of spatiotemporally precise expression regulation. To relate the regulation of cell type somatic and axonal compartments in cases where spatial regulation can be studied in cell compartments, such as in long-range neuronal connections, innovative experimental methods are being developed. Scientists in the field of neurobiology have turned to methods such as translating ribosome affinity purification (TRAP) and axon-TRAP-RiboTag to quantify the translatome, a post-transcriptional state, to uncover complex biological phenomena (Reynoso et al., 2015; Shigeoka et al., 2018). These studies frequently involve a variety of experimental conditions, including various tissue locations, genotypic profiles, and developmental ages, requiring multiple biological replicates to be collected.
Careful consideration of the experimental design is needed, since a large number of samples are required to guarantee sufficient statistical power for subsequent analyses (Auer & Doerge, 2010). Also, tissue collection techniques frequently call for pooling of tissue samples due to the limited amount of starting tissue available, particularly in the axonal compartments of the cells, further increasing the number of samples needed for a study. These studies are frequently processed in multiple batches due to limitations on the number of samples that can be collected, processed, and sequenced given staffing and time constraints. To eliminate erroneous results in subsequent analyses, it is necessary to deconvolve the variational contributions of potential batch effects (variation due to non-biological factors in an experiment) in complex experiments with numerous experimental conditions as well as multiple processing or experimental batches (Sheerin et al., 2019). Par