ABSTRACT

Title of Dissertation: PENALIZED STATISTICAL MODELS FOR PATHWAY-BASED TRANSCRIPTOME-WIDE ASSOCIATION STUDY AND HIGH-DIMENSIONAL MEDIATION ANALYSIS

Neng Wang
Doctor of Philosophy, 2025

Dissertation Directed by: Professor Charles Ma
Department of Epidemiology and Biostatistics

High-throughput genomics and neuroimaging now generate thousands of correlated molecular and imaging features for each participant, presenting unprecedented opportunities—and methodological challenges—for causal and genetic discovery. This dissertation develops two complementary statistical frameworks that address key limitations of classical tools when confronted with such high-dimensional data. In Chapter 1, I review how genome-wide association studies (GWAS) have pinpointed numerous SNPs linked to human diseases and traits, yet many of these SNPs are in non-coding regions and hard to interpret. Transcriptome-wide association studies (TWAS) integrate GWAS and expression reference panels to identify associations at the gene level with tissue specificity, potentially improving interpretability. However, the list of individual genes identified by univariate TWAS exhibits little unifying biological theme, so the underlying mechanisms remain largely elusive. These limitations motivate a unified framework that not only identifies trait-associated genes through multivariate TWAS, but also traces how their effects propagate through intermediate brain features to influence clinical outcomes—thus naturally extending gene-level association analysis into high-dimensional mediation analysis. Mediation analysis is a fundamental tool for elucidating causal mechanisms in complex systems. However, the emergence of high-throughput biological and neuroimaging technologies has introduced multivariate exposures and mediators of increasing dimensionality, rendering classical approaches inadequate.
In Chapter 2, we propose a novel multivariate TWAS method that Incorporates Pathway or gene Set information, namely TIPS, to identify the genes and pathways most associated with complex polygenic traits. We jointly model the imputation and association steps in TWAS, incorporate a sparse group lasso penalty to induce selection at both the gene and pathway levels, and develop an expectation-maximization algorithm to estimate the parameters of the penalized likelihood. We applied our method to three complex traits in UK Biobank: systolic and diastolic blood pressure, as well as a brain aging biomarker, the white matter brain age gap, and identified critical biologically relevant pathways and genes associated with these traits. These pathways cannot be detected by the traditional approach of univariate TWAS followed by pathway enrichment analysis, demonstrating the power of our model. We also conducted comprehensive simulations with varying heritability levels and genetic architectures and showed that our method outperforms other established TWAS methods in feature selection, statistical power, and prediction. The R package that implements TIPS is available at https://github.com/nwang123/TIPS. In Chapter 3, we propose a novel aggregation-based mediation framework that simultaneously models and selects high-dimensional multivariate exposures and mediators. Our method identifies sparse linear combinations of variables in each domain that jointly maximize the mediated effect, defined as the product of exposure–mediator and mediator–outcome effects. To estimate these low-dimensional aggregators, we formulate a bi-convex objective function integrating residual sum-of-squares penalties from standard mediation submodels, a structured mediation-enhancing term, and ℓ1-penalties that induce sparsity. The resulting optimization problem is solved efficiently using an alternating direction method of multipliers (ADMM) algorithm with block coordinate updates.
Through extensive simulations, we demonstrate that our method achieves superior performance in recovering true mediators and estimating mediation proportions across a wide range of signal strengths, noise levels, and correlation structures. Compared to existing methods—including MMP, Pathway Lasso, sparse PCA mediation, and the Directions of Mediation framework—our approach exhibits higher selection accuracy and reduced bias, particularly in challenging high-correlation regimes. We further illustrate the utility of the method in a real data application involving the mediation of smoking behavior through neuroimaging features, revealing biologically meaningful pathways linking gene expression in the nucleus accumbens to structural and functional brain indices implicated in addiction. These findings highlight the potential of our framework for integrative mediation analysis in high-dimensional biomedical studies.

PENALIZED STATISTICAL MODELS FOR PATHWAY-BASED TWAS AND HIGH-DIMENSIONAL MEDIATION ANALYSIS

by

Neng Wang

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2025

Advisory Committee:
Professor Charles Ma, Chair/Advisor
Professor Eric V. Slud, Co-Advisor
Professor Lizhen Lin
Professor Yan Li
Professor Chixiang Chen

© Copyright by Neng Wang 2025

Acknowledgments

I owe my gratitude to all the people who have made this dissertation possible and whose support has made my Ph.D. journey intellectually fulfilling and personally meaningful. First and foremost, I would like to express my deepest thanks to my advisor, Professor Charles Ma, for his constant support, insightful guidance, and encouragement throughout my Ph.D. studies. His patience, dedication, and enthusiasm for research have greatly shaped my development as a statistician.
I am especially grateful for his belief in my ideas, his timely advice, and his constructive feedback, all of which have been invaluable to this dissertation. I am also deeply grateful to my co-advisor, Professor Eric V. Slud, for his sharp statistical insights, rigorous standards, and detailed guidance. His thoughtful comments and mathematical clarity have elevated the quality of this work and challenged me to think more precisely and thoroughly. I thank the remaining members of my dissertation committee for their valuable feedback and for taking the time to review this work. Their comments have helped improve the scope, presentation, and clarity of this dissertation. I am thankful to the faculty and staff at the Department of Mathematics and the Statistics Program at the University of Maryland for providing a nurturing academic environment. I also thank my fellow graduate students and friends, who have offered not only stimulating academic conversations but also much-needed companionship and emotional support throughout this journey. The shared experiences made even the most challenging moments enjoyable. Most importantly, I owe my deepest thanks to my family — my parents, whose unconditional love, unwavering support, and countless sacrifices have carried me through every step of this journey. Their belief in me has been my greatest source of strength and motivation. Finally, I want to thank myself — for persisting through moments of doubt, for embracing challenges with resilience, and for growing into the person I am today. This dissertation is not only a culmination of academic effort, but also a reflection of personal transformation and perseverance. I am truly grateful to all who have contributed in ways both big and small.

Table of Contents

Acknowledgements
Table of Contents
List of Tables
List of Figures

1 Introduction
1.1 Variable selection and regularization
1.1.1 Background of variable selection problem
1.1.2 Regularization
1.1.3 Optimization
1.2 Causal Inference and Mediation Analysis
1.2.1 Causal Inference
1.2.2 Causal Mediation Analysis
1.3 Transcriptome-Wide Association Studies (TWAS)
1.4 Imaging Genetics
1.5 Organization of the Dissertation

2 TIPS: a novel pathway-guided joint model for transcriptome-wide association studies
2.1 Overview
2.2 Introduction
2.3 Method
2.3.1 The TIPS model
2.3.2 EM algorithm for penalized likelihood
2.3.3 Post-selection gene-level and pathway-level inference
2.3.4 Other related methods
2.4 Application to TWAS analysis in UK Biobank
2.4.1 TWAS analysis on blood pressure
2.4.2 TWAS analysis on white matter brain aging
2.5 Simulation study
2.5.1 Simulation setting
2.5.2 Simulation results
2.6 Discussion

3 High-Dimensional Multivariate Mediation Analysis Integrating Genetic, Imaging, and Phenotypic Data
3.1 Overview
3.2 Introduction
3.3 Method
3.3.1 A multi-exposure-to-multi-mediator model
3.3.2 Optimization using alternating direction method of multipliers (ADMM) algorithm
3.3.3 Convergence Theorem
3.4 Simulation
3.4.1 Simulation settings
3.4.2 Simulation Results
3.5 Application to Real Data
3.6 Discussion

4 Conclusion
4.1 Implications for Statistical Genetics and Neuroimaging Studies
4.2 Limitations and Future Work
4.3 Concluding Remarks

A Additional simulation results for Chapter 2
A.0.1 Review of TWAS method development
A.0.2 Supplementary results
A.0.3 Data simulation results for MAF = 0.5
A.0.4 Results for gene selection by TIPS
B Additional theorem and lemma with detailed proof for Chapter 3
B.0.1 Convergence Theorem of ADMM
B.0.2 Oracle ℓ2-consistency of the sparse pathway estimator

Bibliography

List of Tables

2.1 Main characteristics of different TWAS methods and their comparison to our TIPS method.
2.2 Mean area under the ROC curve (AUC) for simulation Scenarios I–III under additive or heterogeneous genetic architectures, with the standard deviation shown in parentheses.
3.1 Results (n = 200) for complete mediation effect under σ = 1 across four correlation scenarios (ρX, ρM). Each entry is averaged over 1,000 replications and summarises active-mediator selection accuracy and mediation-proportion estimation performance.
3.2 Results (n = 200) for partial mediation effect under σ = 1 across four correlation scenarios (ρX, ρM). Each entry is the average of 1,000 replications, summarising active-mediator selection accuracy and mediation-proportion estimation performance.
3.3 Results (n = 200) for complete and partial mediation effects under σ = 3, across four correlation scenarios (ρX, ρM). Each entry is the average of 1,000 replications, summarising active-mediator selection accuracy and mediation-proportion estimation performance.

List of Figures

2.1 (A) Conceptual framework to motivate the TIPS model; (B) Flowchart of the TIPS method.
2.2 (A) Manhattan plot of −log10(p-value) from pathway-level test results of TIPS sorted by KEGG IDs (each dot represents a pathway) for SBP (upper) and DBP (lower) traits.
(B) Topology plot of two significant KEGG pathways: Arginine and Proline metabolism and Glycerolipid metabolism. Selected genes in the pathway are highlighted in red (SBP) and green (DBP).
2.3 (A) Comparison of major categories of pathways detected in different methods. Number in the cells refers to the number of significant pathways within each category; (B) Topology plot of selected KEGG pathway “Pyrimidine metabolism” by TIPS.
2.4 ROC curve comparison for simulation Scenarios I–III with additive or heterogeneous genetic architecture.
2.5 Power comparison for simulation Scenarios I–III with additive or heterogeneous genetic architecture.
3.1 Conceptual diagram illustrating the mediation framework for the gene–brain–disorder pathway. Gene expression features serve as exposures, MRI features (e.g., functional connectivity) act as mediators, and brain disorder phenotypes are the outcomes.
3.2 (A) Heatmap of pairwise correlations among the 9 TWAS-identified genes used as exposures. (B) Correlation heatmap of ROI-to-ROI functional connectivity among 235 brain regions. The highlighted block (top left) marks 35 subcortical regions and their 630 FCs used as mediators in the analysis.

Chapter 1: Introduction

1.1 Variable selection and regularization

Beyond traditional statistical models, variable selection and regularization methods play a critical role in advanced machine learning models such as deep learning, where they reduce overfitting and improve model interpretability. In this section, we review the foundational principles and methodological advancements in variable selection and regularization, with an emphasis on their relevance to modern high-dimensional data analysis.
We begin by introducing the classical variable selection problem within the low-dimensional paradigm, and subsequently discuss its limitations in the presence of a high-dimensional feature space. To address the challenges posed by large-scale data, we examine regularization techniques that augment the loss function with structured penalty terms to enforce sparsity or other desirable properties. Widely used methods such as the Lasso, Elastic Net, and Adaptive Lasso are introduced alongside nonconvex penalties like SCAD and MCP, each offering trade-offs between computational tractability, variable selection consistency, and bias reduction. We further discuss group-based extensions and structured regularizers that incorporate prior domain knowledge to improve selection stability and biological interpretability. The implementation of regularized methods inherently involves solving complex optimization problems. To this end, we provide an overview of optimization algorithms widely used in high-dimensional statistics, including gradient-based methods, Newton-type algorithms, Expectation-Maximization (EM), coordinate descent, proximal gradient, and alternating direction method of multipliers (ADMM) algorithms. These techniques, with their distinct convergence properties and computational advantages, form the algorithmic backbone for scalable inference in penalized models.

1.1.1 Background of variable selection problem

Variable selection remains a foundational problem in statistical modeling, underpinning the development of parsimonious, interpretable, and predictive models across a wide spectrum of applications. From classical linear regression to modern high-dimensional statistical learning, the central objective is to identify a subset of relevant explanatory variables that adequately explains the variation in a response variable, while excluding irrelevant or redundant features.
In low-dimensional settings, when the number of variables p is small relative to the sample size n, classical model selection criteria such as AIC and BIC serve as primary tools for variable selection. AIC, proposed by [Akaike, 1973], is grounded in information theory and aims to minimize the Kullback–Leibler divergence between the true model and the candidate model. BIC, introduced by [Schwarz, 1978], incorporates a Bayesian perspective and imposes a heavier penalty for model complexity, thereby favoring more parsimonious models under certain asymptotic regimes. Both criteria balance goodness-of-fit and model complexity, and are typically applied in a stepwise manner (forward, backward, or both) or through exhaustive search of the best subset when computationally feasible. With the advent of high-throughput technologies and the proliferation of data-rich environments, modern datasets are characterized by the so-called “large p, small n” paradigm, where p ≫ n. This renders classical selection tools inadequate due to their instability, computational infeasibility, and poor performance in the presence of multicollinearity and overfitting. Consequently, variable selection in high-dimensional contexts necessitates the development of regularization-based methods that incorporate sparsity-inducing penalties. Regularization approaches augment the objective function—often a loss function such as least squares or negative log-likelihood—with a penalty term that enforces structural constraints on the parameter space. The most popular method in this class is the Least Absolute Shrinkage and Selection Operator (Lasso) introduced by [Tibshirani, 1996], which employs an ℓ1-norm penalty to shrink coefficients towards zero and perform automatic variable selection.
The Lasso is particularly appealing due to its convex formulation and interpretability; however, it exhibits limitations in scenarios with high correlations among predictors and may fail to achieve consistent variable selection under certain conditions [Zhao and Yu, 2006, Meinshausen and Bühlmann, 2006]. To address these limitations, several extensions and alternatives have been proposed. The Elastic Net [Zou and Hastie, 2005] combines ℓ1 and ℓ2 penalties to stabilize variable selection in correlated settings. The Adaptive Lasso [Zou, 2006] modifies the Lasso by introducing adaptive weights, enabling oracle properties under suitable conditions. In addition, other non-convex penalties, such as the smoothly clipped absolute deviation (SCAD) penalty [Fan and Li, 2001] and the minimax concave penalty (MCP) [Zhang, 2010], have been proposed, offering improved theoretical properties and better mimicking the ideal behavior of the ℓ0-penalty in variable selection. Beyond penalized likelihood, model selection in high-dimensional settings also embraces methods rooted in Bayesian inference (e.g., spike-and-slab priors), screening procedures (e.g., Sure Independence Screening by [Fan and Lv, 2008]), and stability-based techniques (e.g., stability selection by [Meinshausen and Bühlmann, 2010]). Recent developments in debiased estimation [van de Geer et al., 2014, Javanmard and Montanari, 2014] and post-selection inference further highlight the importance of balancing selection accuracy with valid statistical inference, a critical need in scientific disciplines to ensure reproducibility. As big data continue to expand and diversify, new challenges arise for variable selection. For example, we may need to incorporate structural constraints (e.g., group sparsity, hierarchical models) and leverage prior knowledge (e.g., through graph- or pathway-based penalties) to guide variable selection.
In addition, variables may need to be selected for different purposes, so it becomes natural to integrate variable selection into broader modeling frameworks such as mediation analysis, survival models, and causal inference, beyond the simple linear or generalized linear models for which most variable selection methods were first developed.

1.1.2 Regularization

Regularization techniques have become central to modern statistical methodology, particularly in the context of high-dimensional data analysis where traditional approaches often fail. Regularization refers to the addition of a penalty term to a loss function, with the goal of enforcing structural constraints such as sparsity, smoothness, or low rank, thereby improving model interpretability and generalizability. In the variable selection context, regularization not only stabilizes estimation procedures but also facilitates the identification of relevant predictors by shrinking some coefficients exactly to zero. The most widely adopted regularization method is the Lasso (Least Absolute Shrinkage and Selection Operator), introduced by [Tibshirani, 1996]. In the linear model, the Lasso solves

\[
\hat{\beta}_{\mathrm{lasso}} = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{2n} \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right\},
\]

where λ ≥ 0 is a tuning parameter that controls the degree of penalization. The ℓ1 penalty encourages sparsity in the estimated coefficients, leading to automatic variable selection. Its computational efficiency and convex formulation make the Lasso a popular tool across numerous domains. However, the Lasso also has well-known limitations: it tends to select only one variable from a group of highly correlated predictors and can lead to biased coefficient estimates due to the nature of the ℓ1 shrinkage [Zhao and Yu, 2006, Meinshausen and Bühlmann, 2006]. To mitigate these issues, several extensions of the Lasso have been proposed. The Elastic Net [Zou and Hastie, 2005] combines ℓ1 and ℓ2 penalties:

\[
\hat{\beta}_{\mathrm{EN}} = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{2n} \|Y - X\beta\|_2^2 + \lambda \left[ \alpha \|\beta\|_1 + (1 - \alpha) \|\beta\|_2^2 \right] \right\},
\]
where α interpolates between ridge regression (α = 0) and the lasso (α = 1). The elastic net stabilizes the lasso solution when predictors are highly correlated. The Adaptive Lasso [Zou, 2006] introduces weights to the ℓ1 penalty, enabling the estimator to achieve the oracle property under appropriate conditions:

\[
\hat{\beta}_{\mathrm{AL}} = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{2n} \|Y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} w_j |\beta_j| \right\}, \qquad w_j = \frac{1}{|\hat{\beta}_j^{\mathrm{init}}|^{\gamma}}.
\]

Beyond convex penalties, nonconvex regularization methods have also gained attention for their improved theoretical properties and ability to better approximate the ideal ℓ0 penalty. Two notable examples are the smoothly clipped absolute deviation (SCAD) penalty [Fan and Li, 2001] and the minimax concave penalty (MCP) [Zhang, 2010]. These penalties exhibit continuity and sparsity while reducing estimation bias for large coefficients. Although the resulting optimization problems are nonconvex, local solutions are often found using coordinate descent algorithms or local quadratic approximations. In parallel, group-based regularization techniques have been developed to leverage structural knowledge about the predictors. The Group Lasso [Yuan and Lin, 2006] extends the Lasso by penalizing the ℓ2 norm of groups of coefficients, enabling variable selection at the group level. More recent variants include the Sparse Group Lasso [Simon et al., 2013], which combines group and individual sparsity, and methods for overlapping [Jacob et al., 2009] or hierarchical groups [Jenatton et al., 2011]. The choice of regularization penalty and its tuning remains a critical issue in practice. Model selection consistency, estimation accuracy, and computational tractability often present trade-offs. Recent developments in theoretical statistics provide conditions under which regularized estimators achieve desirable properties such as sparsistency, asymptotic normality, and valid inference [van de Geer et al., 2014, Bühlmann and van de Geer, 2011].
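For orientation, each of these penalties admits a closed-form solution in the one-dimensional least-squares problem, which makes their shrinkage behavior easy to compare. The sketch below is illustrative only (plain NumPy; the function names are ours, not from any package):

```python
import numpy as np

def soft_threshold(z, lam):
    """Lasso rule: argmin_b 0.5*(z - b)**2 + lam*|b|."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def elastic_net_scalar(z, lam, alpha):
    """Elastic-net rule: argmin_b 0.5*(z - b)**2 + lam*(alpha*|b| + (1 - alpha)*b**2)."""
    return soft_threshold(z, lam * alpha) / (1.0 + 2.0 * lam * (1.0 - alpha))

def adaptive_lasso_scalar(z, lam, w):
    """Adaptive-lasso rule with coefficient-specific weight w."""
    return soft_threshold(z, lam * w)
```

Setting α = 1 recovers the lasso rule from the elastic-net rule, while small adaptive weights leave large initial estimates nearly unpenalized, which is the mechanism behind the oracle property.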
These advances pave the way for principled approaches to high-dimensional modeling that balance empirical performance with theoretical guarantees.

1.1.3 Optimization

Optimization lies at the core of modern statistical estimation and learning algorithms. Most statistical methods for high-dimensional variable selection problems ultimately require the minimization of a loss function augmented with a penalty term. These optimization problems often involve non-smooth, non-convex, or high-dimensional objective functions, and their numerical resolution requires careful algorithmic choices to ensure convergence, scalability, and statistical accuracy. Gradient-based methods are the classical workhorses of continuous optimization. Gradient descent updates parameters by moving along the negative gradient of a differentiable loss function, iteratively refining the parameter estimates as $\beta^{(t+1)} = \beta^{(t)} - \eta_t \nabla f(\beta^{(t)})$, where $\eta_t$ denotes the learning rate. Under convexity and smoothness assumptions, gradient descent converges to a global minimum [Nocedal and Wright, 2006]. However, its convergence can be slow, especially for ill-conditioned problems. To accelerate convergence, Nesterov’s method introduces a momentum term that improves the convergence rate to $O(1/t^2)$ for convex functions [Nesterov, 1983]. In large-scale data settings, stochastic gradient descent (SGD) and its adaptive variants, such as AdaGrad [Duchi et al., 2011], RMSProp, and Adam [Kingma and Ba, 2014], are widely used due to their computational efficiency and suitability for streaming or online learning contexts. Newton and quasi-Newton methods make use of second-order curvature information to achieve faster convergence. Newton’s method computes parameter updates using the inverse Hessian matrix of the objective function, thereby enjoying quadratic convergence near the optimum. Nevertheless, the cost of computing and inverting the Hessian scales poorly with dimensionality.
Quasi-Newton algorithms, such as BFGS and L-BFGS [Byrd et al., 1995], approximate the Hessian and offer a practical balance between convergence speed and computational cost, making them effective for medium-scale problems. The Expectation-Maximization (EM) algorithm is particularly valuable for models involving latent variables or missing data. Introduced by Dempster et al. [1977], the EM algorithm iteratively alternates between computing the expected complete-data log-likelihood (E-step) and maximizing it (M-step). Its penalized variant, known as the Penalized EM algorithm, integrates regularization into the M-step, enabling variable selection in models such as sparse Gaussian mixtures, latent class regression, and factor analysis [Green, 1990, Böhning and Lindsay, 1994, Stadler et al., 2010]. In Chapter 2, we develop an EM algorithm for a penalized likelihood with a sparse-group-lasso-type regularizer to identify both pathway groups and genes associated with the phenotype in TWAS studies. Although EM monotonically increases the observed-data likelihood and converges to a stationary point, that point may be only a local optimum—an issue it shares with gradient-based and quasi-Newton methods. Because EM may also converge to saddle points, practitioners typically use random restarts or other multi-start strategies to improve the chance of reaching a globally optimal solution. Coordinate descent methods are especially effective for solving regularized problems with separable penalty functions, such as the Lasso and its extensions. At each iteration, coordinate descent cyclically updates one parameter while keeping all others fixed, yielding closed-form updates in many cases. This strategy proves remarkably efficient for convex objective functions with sparsity-inducing penalties, such as ℓ1 and group-lasso norms [Friedman et al., 2010].
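To make the coordinate update concrete, the following minimal lasso solver (a NumPy sketch for exposition, not an optimized implementation) cycles through soft-thresholding updates on partial residuals, assuming the columns of X are centered and scaled so that each satisfies x_j'x_j/n = 1:

```python
import numpy as np

def soft(z, lam):
    """Soft-thresholding operator, applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1.
    Assumes each column of X is centered and scaled so that x_j' x_j / n = 1."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                       # residual for beta = 0
    for _ in range(n_iter):
        for j in range(p):
            r = r + X[:, j] * beta[j]  # partial residual: add back the j-th fit
            beta[j] = soft(X[:, j] @ r / n, lam)
            r = r - X[:, j] * beta[j]  # subtract the updated j-th fit
    return beta
```

A useful sanity check is that for λ at or above λ_max = max_j |x_j'y|/n the solution is exactly zero, which is the starting point of the regularization path; this cyclic scheme is the strategy underlying glmnet-style solvers [Friedman et al., 2010].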
It also extends to generalized linear models and Cox regression, with additional screening rules to accelerate computation [Tibshirani et al., 2012]. Proximal gradient methods generalize gradient descent to composite optimization problems, where the objective function can be decomposed into a smooth part and a non-differentiable regularizer. For an objective of the form $f(\beta) + g(\beta)$, where $f$ is smooth and convex and $g$ is convex but possibly non-smooth, the proximal gradient method performs updates via $\beta^{(t+1)} = \mathrm{prox}_{\lambda g}\left(\beta^{(t)} - \eta_t \nabla f(\beta^{(t)})\right)$, where $\mathrm{prox}_{\lambda g}(\cdot)$ denotes the proximal operator of $g$. Proximal algorithms are particularly well-suited for ℓ1-regularized regression, fused lasso, and total variation denoising problems [Parikh and Boyd, 2014]. Accelerated variants such as FISTA (Fast Iterative Shrinkage-Thresholding Algorithm) improve convergence from $O(1/t)$ to $O(1/t^2)$ [Beck and Teboulle, 2009]. The Alternating Direction Method of Multipliers (ADMM) is another prominent technique for solving convex problems with decomposable structures. ADMM is particularly useful in distributed and large-scale settings, where it can break complex optimization tasks into simpler subproblems that are solved alternately. For an objective function decomposed as $f(x) + g(z)$ subject to the linear constraint $Ax + Bz = c$, ADMM alternates updates to $x$, $z$, and the dual variable using augmented Lagrangian formulations [Boyd et al., 2011]. In the context of sparse estimation, ADMM has been successfully applied to problems such as sparse logistic regression, inverse covariance estimation, and structured regularization. In high-dimensional settings involving non-convex penalties such as SCAD or MCP, the optimization problem becomes more challenging due to the presence of multiple local minima. Nevertheless, recent theoretical developments show that under certain regularity conditions, local solutions can still yield statistically consistent estimators [Loh and Wainwright, 2015].
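Returning to the convex case, the ADMM splitting takes an especially simple form for the lasso: with f(x) = (1/2n)‖y − Ax‖², g(z) = λ‖z‖₁, and constraint x − z = 0, the x-update is a ridge-type solve, the z-update a soft threshold, and the dual update an ascent step. A minimal NumPy sketch (illustrative; fixed penalty parameter ρ, no stopping rule):

```python
import numpy as np

def admm_lasso(A, y, lam, rho=1.0, n_iter=500):
    """ADMM for (1/2n)||y - Ax||^2 + lam*||z||_1  subject to  x - z = 0."""
    n, p = A.shape
    AtA_n = A.T @ A / n
    Aty_n = A.T @ y / n
    # Factor once: each x-update solves (A'A/n + rho*I) x = A'y/n + rho*(z - u).
    L = np.linalg.cholesky(AtA_n + rho * np.eye(p))
    x = np.zeros(p); z = np.zeros(p); u = np.zeros(p)
    for _ in range(n_iter):
        rhs = Aty_n + rho * (z - u)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))                # ridge-type update
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)  # soft threshold
        u = u + x - z                                                    # dual ascent step
    return z
```

In the orthonormal-design case A'A/n = I, the iterates converge to the closed-form lasso solution soft(A'y/n, λ), which provides a convenient correctness check.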
Techniques such as the concave-convex procedure (CCCP) and difference-of-convex (DC) programming [Gasso et al., 2009] have been developed to solve such non-convex problems with convergence guarantees to critical points. The theoretical properties of optimization algorithms, particularly their convergence behavior, are crucial for statistical validity. For convex problems, convergence guarantees under strong convexity and Lipschitz continuity are well established. For instance, coordinate descent, gradient descent, and proximal methods all enjoy global convergence under mild conditions [Bubeck, 2015]. In non-convex settings, the Kurdyka–Łojasiewicz inequality has been employed to establish convergence of iterative schemes to critical points, even when global optimality is not guaranteed [Attouch et al., 2013, Bolte et al., 2014].

1.2 Causal Inference and Mediation Analysis

Causal inference provides a foundational framework for understanding and quantifying the effects of interventions, exposures, or treatments on outcomes in complex systems. In contrast to associational analysis, causal inference methods aim to disentangle cause-effect relationships through explicit modeling of counterfactual outcomes and structural assumptions. Over the past decades, the development of both potential outcomes theory and graphical causal models has enabled rigorous identification, estimation, and inference of causal effects from both randomized and observational data. These advances have profoundly impacted fields such as epidemiology, social sciences, economics, and genomics. Within the broader scope of causal inference, causal mediation analysis plays a pivotal role in uncovering mechanistic pathways and intermediate processes through which causal effects propagate. By decomposing the total effect of an exposure into direct and indirect components, mediation analysis facilitates a deeper understanding of the underlying mechanisms.
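In the classical single-exposure, single-mediator linear model, this decomposition can be computed directly: the indirect effect is the product of the exposure-to-mediator and mediator-to-outcome coefficients, and for OLS fits on the same sample the total effect equals the direct effect plus this product exactly. A small simulated illustration (all coefficients hypothetical, NumPy only):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
x = rng.normal(size=n)                        # exposure
m = 0.8 * x + rng.normal(size=n)              # mediator model: a = 0.8
y = 0.5 * x + 1.2 * m + rng.normal(size=n)    # outcome model: c' = 0.5, b = 1.2

def ols(Z, resp):
    """OLS slopes with an intercept column prepended."""
    Z1 = np.column_stack([np.ones(len(resp)), Z])
    return np.linalg.lstsq(Z1, resp, rcond=None)[0][1:]

a_hat = ols(x[:, None], m)[0]                    # exposure -> mediator
c_dir, b_hat = ols(np.column_stack([x, m]), y)   # direct effect and mediator -> outcome
c_tot = ols(x[:, None], y)[0]                    # total effect from Y ~ X
indirect = a_hat * b_hat                         # product-of-coefficients indirect effect
# For OLS on the same sample, c_tot = c_dir + a_hat * b_hat holds exactly.
```

The true indirect effect here is 0.8 × 1.2 = 0.96, and the exact additivity of direct and indirect effects is an algebraic property of least squares rather than a sampling phenomenon.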
Early developments in causal mediation analysis focused on single-exposure, single-mediator settings. However, the increasing availability of high-dimensional and structured data—such as gene expression profiles, neuroimaging measurements, and microbiome compositions—has necessitated new methodological innovations that can handle multiple related exposures and mediators.

This chapter presents a unified overview of modern causal inference and mediation analysis, with an emphasis on their application to high-dimensional and biologically structured data. We begin with classical foundations, including the Rubin Causal Model and structural causal models, and review identification strategies such as the back-door criterion and sequential ignorability. We then discuss methodological extensions to accommodate high-dimensional mediators, regularization strategies, and algorithmic advances in optimization for scalable computation. These developments underpin the contributions of this dissertation, which integrates causal reasoning, sparse estimation, and computational techniques to study mediation pathways in complex biological systems.

1.2.1 Causal Inference

Causal inference seeks to draw conclusions about cause-effect relationships, distinguishing itself from traditional association-based statistical methods. One of the most influential frameworks for formalizing causal inference is the potential outcomes model, also known as the Rubin Causal Model (RCM) [Rubin, 1974]. Under this framework, each unit has two potential outcomes: one under treatment and one under control. The fundamental problem of causal inference is that only one potential outcome is observed per unit, which introduces the need for strong assumptions and careful design to identify causal effects. Estimands such as the average treatment effect (ATE) and the average treatment effect on the treated (ATT) are defined as expectations over these counterfactual outcomes.
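A toy simulation (entirely illustrative, not from the dissertation) makes the fundamental problem concrete: both potential outcomes exist inside the simulation, the analyst observes only one per unit, and randomization is what licenses the difference-in-means estimate of the ATE:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
# Simulated potential outcomes: Y(0) and Y(1) both "exist" for every unit,
# with a constant unit-level treatment effect of 2.
y0 = rng.normal(1.0, 1.0, n)
y1 = y0 + 2.0
a = rng.binomial(1, 0.5, n)           # randomized treatment assignment
y_obs = np.where(a == 1, y1, y0)      # only one potential outcome is observed

ate_true = np.mean(y1 - y0)           # needs both counterfactuals (unobservable)
ate_hat = y_obs[a == 1].mean() - y_obs[a == 0].mean()  # identified by randomization
```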
Another foundational approach is Pearl’s structural causal model (SCM) and the use of directed acyclic graphs (DAGs) [Pearl, 2009]. DAGs provide a graphical representation of causal assumptions and facilitate identification of causal effects through graphical criteria such as d-separation and back-door paths. Under this approach, do-calculus enables reasoning about interventions and policy changes, while adjustment sets can be determined to estimate causal effects nonparametrically. Identification of causal effects from observational data requires conditions such as unconfoundedness (also known as conditional ignorability), consistency, and positivity. The back-door criterion in DAGs and the ignorability assumption in the RCM both formalize when simple adjustment (e.g., regression or stratification) can be used to recover causal effects from observed data. In the absence of randomization, methods such as propensity score weighting [Rosenbaum and Rubin, 1983], matching [Rosenbaum, 1989], inverse probability weighting [Robins et al., 1994], and doubly robust estimation [Robins et al., 1995, Bang and Robins, 2005] are frequently employed. Causal inference has evolved to address complexities such as longitudinal treatments, time-varying confounding, mediation, and interference. Marginal structural models (MSMs) [Robins et al., 2000] and g-methods [Robins, 1986] have been developed to handle these settings, particularly in epidemiological studies. More recently, causal mediation analysis has become popular for decomposing total effects into direct and indirect effects, with identification results established under both sequential ignorability and more complex assumptions [Imai et al., 2010a, VanderWeele, 2015]. The advent of high-dimensional data has spurred new developments in causal inference. Traditional methods that rely on parametric modeling or univariate confounder adjustment are insufficient when the number of covariates is large or exceeds the sample size.
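The inverse probability weighting idea cited above can be sketched with simulated data (all effect sizes hypothetical): a confounder drives both treatment and outcome, so the naive group contrast is biased, while weighting each observation by its (here known) propensity score recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
x = rng.standard_normal(n)                      # confounder
e_x = 1.0 / (1.0 + np.exp(-x))                  # true propensity score e(x)
a = rng.binomial(1, e_x)                        # confounded treatment assignment
y = 2.0 * a + 1.5 * x + rng.standard_normal(n)  # true causal effect is 2

# Naive contrast is biased upward: x pushes both treatment and outcome up
naive = y[a == 1].mean() - y[a == 0].mean()

# Horvitz-Thompson style IPW estimator of the ATE
ipw = np.mean(a * y / e_x - (1 - a) * y / (1 - e_x))
```

In practice the propensity score is estimated (e.g., by logistic regression), which is where the doubly robust estimators cited above add protection against misspecification.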
Regularization-based methods, such as Lasso-adjusted treatment effect estimation [Belloni et al., 2014], have been proposed to perform variable selection while maintaining valid inference. Techniques like double machine learning [Chernozhukov et al., 2018], targeted maximum likelihood estimation (TMLE) [van der Laan and Rubin, 2006], and semiparametric efficiency theory are now central tools for high-dimensional causal estimation. In addition to parametric and semiparametric methods, causal discovery—inferring causal structure from data—has gained popularity. Constraint-based methods (e.g., the PC algorithm), score-based methods (e.g., GES), and functional model-based methods (e.g., LiNGAM) aim to learn DAG structures from data under assumptions such as faithfulness and acyclicity [Spirtes et al., 2000, Shimizu et al., 2006]. While causal discovery is more exploratory in nature, it plays an increasingly important role in guiding model specification and hypothesis generation in the absence of data from designed or randomized experiments.

1.2.2 Causal Mediation Analysis

Causal mediation analysis is a critical extension of causal inference that seeks to elucidate the mechanisms through which a treatment or exposure affects an outcome. Rather than focusing solely on estimating total causal effects, mediation analysis decomposes the total effect into two components: the direct effect of the treatment on the outcome and the indirect effect mediated through one or more intermediate variables, known as mediators. The modern development of causal mediation analysis is grounded in the potential outcomes framework pioneered by Rubin [1974] and further formalized in the seminal work of Holland [1988]. Under this framework, the concepts of the average causal mediation effect (ACME) and average direct effect (ADE) are defined, providing a rigorous statistical interpretation of mediation.
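In the linear case, the ACME reduces to the familiar product of coefficients. A simulated sketch (all coefficients hypothetical) with one randomized treatment and one mediator:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
t = rng.binomial(1, 0.5, n).astype(float)       # randomized treatment
m = 0.8 * t + rng.standard_normal(n)            # mediator model: alpha = 0.8
y = 0.5 * t + 1.2 * m + rng.standard_normal(n)  # outcome model: direct 0.5, beta = 1.2

alpha = np.polyfit(t, m, 1)[0]                  # treatment -> mediator slope
X = np.column_stack([np.ones(n), t, m])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
direct, beta = coef[1], coef[2]                 # ADE and mediator -> outcome slope

acme = alpha * beta                             # indirect (mediated) effect, about 0.96
total = direct + acme                           # total effect, about 1.46
```

Under sequential ignorability and linearity, this product-of-coefficients quantity coincides with the ACME; the high-dimensional methods reviewed below generalize exactly this decomposition.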
Subsequent contributions by Robins and Greenland [1992] and Pearl [2001] introduce structural causal models (SCMs) and graphical representations, enabling identification of mediation effects under weaker assumptions. Notably, the identification of natural direct and indirect effects hinges on key assumptions such as sequential ignorability, which asserts the conditional independence of the mediator and the outcome from the treatment assignment given observed covariates. The methodological landscape of causal mediation analysis has since expanded to accommodate diverse data structures and research goals. Imai et al. [2010b] develop a flexible framework for parametric and nonparametric mediation analysis using linear and generalized linear models, along with sensitivity analysis to evaluate violations of the no-unmeasured-confounding assumption. Their work also leads to the development of the widely used mediation R package, facilitating the adoption of mediation techniques in applied research. In parallel, VanderWeele [2015] provides a comprehensive synthesis of identification and inference under various causal models, extending the framework to settings involving multiple mediators, longitudinal data, and interactions. While early mediation studies focus on low-dimensional settings, the rise of high-throughput technologies and multi-omics data shifts attention to high-dimensional mediation analysis. These settings pose substantial methodological challenges due to the curse of dimensionality, multicollinearity among mediators, and the need for variable selection. To address these challenges, recent research integrates regularization and sparse modeling into the mediation framework. For example, Zhou and Pan [2020] and Huang et al. [2020] propose penalized mediation models using Lasso-type penalties to identify sparse sets of mediators while retaining interpretability. Similarly, the debiased estimation approach by Zhao et al.
[2019] extends post-selection inference to the mediation context, offering valid confidence intervals for selected mediators. In high-dimensional omics data, many features are highly correlated; it is therefore common to analyze multiple exposures and multiple mediators jointly in the same model. In Chapter 3, we propose a novel multi-exposure-to-multi-mediator mediation model that integrates genetic, neuroimaging, and phenotypic data to understand the neurogenetic mechanisms of brain disorders. More recent developments include models for group-structured and hierarchical mediators, such as the group-lasso-based methods introduced by Song et al. [2021], and approaches that incorporate prior biological knowledge through structured penalties or graphical priors. These innovations are particularly relevant in genomics and neuroimaging, where mediators often possess inherent spatial or functional structure. For example, in imaging genetics, researchers apply sparse structural equation models and canonical-correlation-based strategies to identify brain regions mediating the effects of genetic variants on behavior and disease traits [Chen et al., 2022]. Moreover, advances in optimization and computational algorithms facilitate scalable implementation of high-dimensional mediation estimators. Methods based on ADMM, coordinate descent, and proximal gradient schemes enable efficient solutions to the nonconvex and non-smooth optimization problems that arise in penalized mediation modeling. These algorithmic improvements, along with theoretical guarantees such as consistency and asymptotic normality, broaden the applicability of mediation analysis in high-dimensional settings. In summary, causal mediation analysis has evolved from a conceptual tool for effect decomposition into a rigorous and flexible framework capable of handling complex, high-dimensional, and structured data.
The integration of causal inference principles with regularization, optimization, and computational statistics provides a powerful foundation for mechanistic understanding in contemporary data science.

1.3 Transcriptome-Wide Association Studies (TWAS)

The past decade has witnessed rapid advances in statistical genetics facilitated by the proliferation of high-throughput platforms and large-scale biobank resources. These developments enable researchers to move beyond genome-wide association studies (GWAS), which merely locate trait-associated variants, toward integrative frameworks that connect molecular-level features (e.g., genes), functions, and mechanisms with complex phenotypes. Transcriptome-wide association studies (TWAS) have emerged as a powerful integrative framework for detecting associations between genetically regulated gene expression (GReX) and complex traits, combining GWAS summary statistics with transcriptomic reference panels to identify genes whose genetically regulated expression is associated with traits of interest. By imputing gene expression levels using models trained on expression quantitative trait loci (eQTLs), TWAS facilitates gene-level association analysis even in the absence of observed expression data. This paradigm advances our ability to prioritize putative causal genes and explore regulatory mechanisms underlying complex traits. Modern TWAS implementations further incorporate statistical tools for mediation analysis, sparse modeling, and probabilistic fine-mapping to address challenges such as linkage disequilibrium and pleiotropy. The TWAS framework was initially formalized by Gusev et al. [2016], who proposed the use of predictive models trained on reference expression data (e.g., GTEx or DGN) to estimate the cis-genetic component of gene expression.
The genetically predicted expression levels are then tested for association with the phenotype using GWAS summary statistics. This two-stage approach enables the identification of genes whose genetically regulated expression levels are associated with the trait, even when direct transcriptomic measurements are unavailable in the GWAS sample. The pioneering PrediXcan method by Gamazon et al. [2015] utilizes elastic net regression to train gene expression models and provides a foundation for both individual-level and summary-based inference. The summary-based TWAS approach (S-PrediXcan) was later refined to improve scalability and accessibility to public data. By avoiding the need for individual-level GWAS data, S-PrediXcan enables broad application across many phenotypes [Barbeira et al., 2018]. Further extensions such as FUSION [Gusev et al., 2016] and UTMOST [Hu et al., 2019] expand TWAS models by incorporating multiple tissues, sharing information across tissues, and modeling expression covariances, which increases power and interpretability in detecting regulatory mechanisms underlying complex traits. Despite the widespread use of TWAS, several limitations have been documented. One key concern is confounding due to linkage disequilibrium (LD), which can induce spurious associations when SNPs used to impute gene expression are in LD with causal variants affecting the phenotype independently of expression. This can result in TWAS signals that do not reflect true mediation. To address this, methods such as conditional and joint analysis [Zhang et al., 2020], fine-mapping approaches (e.g., FOCUS; Mancuso et al., 2019), and probabilistic Mendelian randomization [Yang et al., 2020a] have been developed to improve TWAS interpretability and causal inference. Recent methodological advances further integrate TWAS into the framework of causal mediation analysis, allowing for direct testing of the mediating role of gene expression between genotype and phenotype.
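The two-stage logic described above can be sketched end-to-end with simulated individual-level data (numpy only; ridge regression stands in for the elastic net used by PrediXcan, and all sample sizes and effect sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
p_snp, n_ref, n_gwas = 20, 300, 2000
w_true = np.zeros(p_snp)
w_true[:3] = [0.5, -0.4, 0.3]                     # cis-eQTL effects for one gene

# Stage 1: in the reference panel, learn a genotype -> expression model
G_ref = rng.binomial(2, 0.3, (n_ref, p_snp)).astype(float)
expr = G_ref @ w_true + rng.standard_normal(n_ref)
lam = 10.0                                        # ridge penalty (elastic net stand-in)
w_hat = np.linalg.solve(G_ref.T @ G_ref + lam * np.eye(p_snp), G_ref.T @ expr)

# Stage 2: impute expression in the GWAS cohort, test the gene-trait association
G_gwas = rng.binomial(2, 0.3, (n_gwas, p_snp)).astype(float)
trait = 0.7 * (G_gwas @ w_true) + rng.standard_normal(n_gwas)  # expression-mediated
grex = G_gwas @ w_hat                             # imputed genetically regulated expression
r = np.corrcoef(grex, trait)[0, 1]
z = r * np.sqrt(n_gwas - 2) / np.sqrt(1 - r**2)   # gene-level association z-score
```

In practice the association stage is often run from GWAS summary statistics alone (as in S-PrediXcan), but the estimand is this same correlation between imputed expression and trait.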
Zhou et al. [2020] proposed sparse mediation models that combine regularization with TWAS to account for high-dimensional and correlated mediators, enhancing power in identifying biologically meaningful genes. Other strategies leverage structural equation modeling and instrumental variable frameworks to separate mediated from direct effects, thereby improving the causal interpretation of TWAS findings. In addition to statistical innovations, TWAS has found significant success in empirical applications. It has been employed to elucidate genetic mechanisms of diseases such as schizophrenia [Gusev et al., 2016], type 2 diabetes [Mancuso et al., 2019], and cardiovascular traits [Barbeira et al., 2019]. These studies underscore the potential of TWAS to translate statistical associations into functional hypotheses and therapeutic targets. Overall, TWAS represents a paradigm shift in post-GWAS functional interpretation by connecting regulatory genomics with complex trait biology. As datasets continue to grow in size and resolution, the integration of TWAS with multi-omics and causal inference tools is proving pivotal in dissecting the architecture of complex diseases.

1.4 Imaging Genetics

Imaging genetics is an interdisciplinary research area at the intersection of neuroimaging, genomics, and biostatistics, aimed at elucidating the biological mechanisms linking genetic variation to brain structure and function. By leveraging high-dimensional brain imaging phenotypes (e.g., structural or functional MRI) and genome-wide genotype data, imaging genetics studies seek to uncover intermediate neural phenotypes—often referred to as endophenotypes—that mediate genetic influences on complex behaviors and neuropsychiatric disorders. Early imaging genetics efforts focused on candidate gene approaches, investigating associations between a small number of genetic variants (e.g., COMT, BDNF) and specific imaging-derived features such as hippocampal volume or prefrontal activation.
However, the limitations of low power, bias, and lack of replicability in candidate gene studies led to a paradigm shift toward genome-wide association studies (GWAS), which enable agnostic scans of millions of single-nucleotide polymorphisms (SNPs) across the genome. Large-scale consortia such as ENIGMA [Thompson et al., 2014] and UK Biobank [Alfaro-Almagro et al., 2018] have facilitated such analyses, providing unprecedented sample sizes and harmonized imaging protocols. Despite their success, GWAS of imaging traits suffer from severe multiple testing burdens due to the high dimensionality of both the genetic and the neuroimaging data. To mitigate this, multivariate methods and dimensionality reduction techniques have been proposed. Canonical correlation analysis (CCA) and sparse CCA [Witten et al., 2009] are employed to identify linear combinations of genetic variants and imaging phenotypes that exhibit maximal correlation. Partial least squares (PLS) regression and reduced-rank regression (RRR) have also been adapted to this context [Zhou, 2014]. These approaches, however, often lack interpretability and may not be equipped to handle complex mediation structures. Another prominent approach involves structural equation modeling (SEM), which models causal pathways from SNPs to brain phenotypes and subsequently to behavior or disease outcomes [Liu et al., 2017]. SEMs are particularly appealing due to their ability to encode directed relationships and to simultaneously estimate direct and indirect effects. To address the high-dimensional nature of imaging and genetic data, sparse SEMs have been developed using penalized likelihood estimation with ℓ1 or group lasso penalties [Zhang et al., 2015]. Beyond classical parametric methods, machine learning techniques such as random forests, deep neural networks, and variational autoencoders have been adopted to model nonlinear and hierarchical associations in imaging genetics.
However, the interpretability and inference limitations of these models remain active areas of research. More recent methodological advances have begun to integrate imaging genetics into broader causal inference frameworks. Mediation models with high-dimensional mediators have been applied to identify brain regions that mediate genetic effects on phenotypes such as cognitive function, substance use, or psychiatric risk [Chen et al., 2022]. These approaches leverage regularization and optimization tools (e.g., ADMM, coordinate descent) to achieve sparse estimation while preserving statistical validity under complex causal assumptions. In summary, imaging genetics has evolved into a rich field combining statistical innovation with domain-specific knowledge to bridge the gap between genotype and phenotype. As imaging and genomic data continue to grow in scale and complexity, future advances require scalable, interpretable, and causally informed models that can uncover mechanistic pathways in the human brain.

1.5 Organization of the Dissertation

In Chapter 2, we present the first research project, which develops a novel penalized mediation-based framework for TWAS. This approach integrates genome-wide association study (GWAS) summary statistics with gene expression prediction using regularized models, enabling the identification of genetically regulated transcripts associated with complex traits. The methodology addresses key issues such as linkage disequilibrium confounding and indirect genetic effects, and is validated through both simulation studies and real-data applications. Chapter 3 introduces a high-dimensional causal mediation model tailored to structured mediators, such as those found in neuroimaging and multi-omics data. We propose an estimator based on an alternating direction method of multipliers (ADMM) optimization algorithm that jointly identifies active mediators while controlling for the causal pathway.
The model is supported by theoretical results and evaluated through extensive simulations. Its utility is further demonstrated in an application to imaging genetics data, revealing brain regions that mediate genetic effects on behavioral traits. Chapter 4 concludes the dissertation by summarizing the methodological contributions and empirical findings of the previous chapters. It discusses the limitations of the proposed methods and outlines future research directions.

Chapter 2: TIPS: a novel pathway-guided joint model for transcriptome-wide association studies

2.1 Overview

In the past two decades, genome-wide association studies (GWAS) have pinpointed numerous SNPs linked to human diseases and traits, yet many of these SNPs are in non-coding regions and remain hard to interpret. Transcriptome-wide association studies (TWAS) integrate GWAS and expression reference panels to identify gene-level associations with tissue specificity, potentially improving interpretability. However, the list of individual genes identified from univariate TWAS contains little unifying biological theme, so the underlying mechanisms remain largely elusive. In this chapter, we propose a novel multivariate TWAS method that Incorporates Pathway or gene Set information, namely TIPS, to identify the genes and pathways most associated with complex polygenic traits. We jointly model the imputation and association steps in TWAS, incorporate a sparse group lasso penalty in the model to induce selection at both gene and pathway levels, and develop an expectation–maximization algorithm to estimate the parameters for the penalized likelihood. We apply our method to three different complex traits—systolic and diastolic blood pressure, as well as a brain-aging biomarker, white-matter brain-age gap, in UK Biobank—and identify critical biologically relevant pathways and genes associated with these traits.
These pathways cannot be detected by the traditional univariate TWAS plus pathway-enrichment analysis approach, demonstrating the power of our model. We also conduct comprehensive simulations with varying heritability levels and genetic architectures and show that our method outperforms other established TWAS methods in feature selection, statistical power, and prediction. The R package that implements TIPS is available at https://github.com/nwang123/TIPS.

2.2 Introduction

During the past two decades, large-scale genome-wide association studies (GWAS) have identified thousands of genetic variants (typically SNPs) that influence the risk of numerous human traits and diseases [Abdellaoui et al., 2023]. However, the path from GWAS to biology is still not clear, as a SNP–trait association cannot be directly used to infer the target gene or the mechanism whereby the variant is associated with phenotypic differences. Recent developments in genetic epidemiology have underscored a refined focus on the transcriptome’s role in understanding genetic influences on complex traits. The paradigm has notably shifted from genome-wide association studies (GWAS) to transcriptome-wide association studies (TWAS). TWAS leverages expression reference panels—such as eQTL cohorts like GTEx [Lonsdale et al., 2013], which include both gene expression and genotype data—to discover gene–trait associations using GWAS datasets [Wainberg et al., 2019]. Since the birth of the first TWAS method [Gamazon et al., 2015], many statistical methods have been developed to perform TWAS [Gusev et al., 2016, Barbeira et al., 2018, Hu et al., 2019, Zhu and Zhou, 2021, Xie et al., 2021]. These methods are typically implemented in two stages: an imputation stage that utilizes the reference dataset to ‘impute’ gene expression for each individual, and an association stage that correlates each individual’s ‘imputed’ gene expression with their phenotype.
Alternatively, joint models of TWAS that combine the two stages together are also being developed and have emerged as a more holistic approach that accounts for the imputation uncertainty [Yang et al., 2019, Yuan et al., 2020]. Such TWAS methods typically use likelihood-based inference and have been shown to be more powerful and efficient than the two-stage methods. Most complex traits and diseases are polygenic by nature, with their heritability depending on a large number of genes across the genome. Like GWAS, standard TWAS is essentially a univariate analysis in which each gene is tested separately. However, genes do not function in isolation; rather, they act cooperatively in groups (e.g., biological pathways) to perform biological functions. Such a univariate analysis provides the contribution of each single gene to the trait without providing a unifying biological theme. In addition, univariate analysis ignores the complex correlation between genes, making it difficult to identify the most critical and potentially causal genes for a trait. Development of multivariate TWAS is still in its infancy: existing methods mainly target specific risk regions and are strictly two-stage methods [Mancuso et al., 2019, Knutson et al., 2020, Lin et al., 2022]. To the best of our knowledge, no method has considered gene–gene correlation and gene sets with similar function on a broad transcriptome-wide scale when prioritizing the most critical genes and pathways contributing to a polygenic trait, while at the same time accounting for the imputation uncertainty in TWAS. To fill this gap, we develop a novel multivariate TWAS method that incorporates Pathway or gene Set information, namely TIPS, and utilize a sparse group lasso penalty to select the most important genes and pathways contributing to a polygenic trait. Figure 2.1(A) shows the conceptual framework that motivates our model. As in most TWAS methods, we assume each gene can be regulated by multiple SNPs (i.e.
eQTLs) in the reference dataset. TIPS jointly examines multiple genes, grouped according to predefined biological pathways, to capture their collective impact on the trait. By leveraging prior pathway information, this multivariate, pathway-guided approach enables a more comprehensive understanding of the genetic architecture of complex polygenic traits and helps reveal the biological processes underlying them. Furthermore, TIPS integrates the imputation and association analysis stages to enhance the efficiency of the model, and we develop an Expectation–Maximization (EM) algorithm for the penalized likelihood to estimate the parameters and select both individual-level (gene) and group-level (pathway) features. In addition, we employ a data-splitting scheme to perform post-selection inference at both the gene and pathway levels using likelihood ratio tests. We show the advantage of TIPS over existing TWAS methods in extensive simulations and via TWAS analyses of three different complex traits: systolic blood pressure (SBP), diastolic blood pressure (DBP), and a brain-aging biomarker, white-matter Brain Age Gap (BAG), in UK Biobank (UKB). The chapter is organized as follows. In Section 2.3, we introduce the TIPS model, the penalized likelihood, the EM algorithm to estimate the parameters, and the likelihood ratio test for post-selection inference. In Section 2.4, we apply our method to a TWAS analysis to identify the genes and pathways contributing to SBP variation in UKB. In Section 2.5, we conduct various simulations to show the advantages of our method over other TWAS methods in feature selection and prediction. In Section 2.6, we discuss potential extensions of our method for future studies.
[Figure 2.1: (A) Conceptual framework motivating the TIPS model; (B) flowchart of the TIPS method. The reference dataset provides gene expression Y and genotypes W(R); the GWAS dataset provides the phenotype Z and genotypes W(G). Multiple genes are grouped by pathways from a pathway database, an EM algorithm maximizes the penalized likelihood with a sparse group lasso penalty (encouraging selection of pathways and of genes within pathways), and a likelihood ratio test identifies significant genes and pathways.]

2.3 Method

2.3.1 The TIPS model

Figure 2.1(B) presents the data flow for the proposed TIPS procedure. We work with two observed datasets. The first is an eQTL reference panel, denoted D(R) = {Y, W(R)}, where Y ∈ R^{n1×m} is the gene expression data matrix and W(R) ∈ {0, 1, 2}^{n1×p} is the corresponding genotype matrix, with each entry reporting the allele count for a single-nucleotide polymorphism (SNP). Here, n1 is the sample size of the reference panel, m is the number of genes potentially predictive of the phenotype of interest, and p is the total number of genetic variants that potentially regulate these genes. Denote the GWAS dataset by D(G) = {Z, W(G)}, where Z is an (n2 × 1) vector of the phenotype of interest and W(G) ∈ {0, 1, 2}^{n2×p} is the genotype data matrix from the GWAS; n2 is the sample size of the GWAS dataset. TIPS jointly models Y, W(R), W(G), and Z and performs imputation and association analysis simultaneously using the following model consisting of two regression equations:

Y = W(R)U + e_R;   Z = W(G)Uβ + e_G,   (2.1)

where U is a (p × m) matrix representing the genetic effects on gene expression. In general, U can have arbitrary elements to represent complex SNP–gene regulatory relationships. If we only consider the cis regulatory effect on gene expression (i.e.
cis-eQTL) and assume no common regulatory variants shared among genes, then U simplifies to a sparse block matrix with off-block elements equal to zero. Following Jiang et al. [2016], we assume a matrix-variate normal prior on the SNP–gene effect matrix U, denoted U ∼ MN_{p×m}(0_{p×m}, I_p, Σ_u). This distribution implies that the entries of U are jointly Gaussian, and the vectorized form vec(U) follows a multivariate normal distribution with mean zero and covariance matrix I_p ⊗ Σ_u, where ⊗ denotes the Kronecker product. Here, I_p is the identity matrix of size p, and Σ_u is an m × m diagonal matrix capturing the covariance structure across genes. The error term e_R in the first regression follows MN_{n1×m}(0_{n1×m}, I_{n1}, Σ_1), where Σ_1 is assumed to be an (m × m) diagonal matrix with a common residual variance σ1², i.e., Σ_1 = σ1² I_m, where I_m is the m × m identity matrix. The error term e_G in the second regression follows N(0, σ2² I_{n2}). Crucially, the m × 1 vector β captures the effects of gene expression on the phenotype and is our parameter of interest. We assume both Y and Z are centered, so intercepts are not included. In real data, we further adjust for commonly used covariates (e.g., age, sex, and BMI) and for population stratification by regressing the phenotype on these confounders and using the residuals after adjustment, Z̃, in the second regression to implement the model in (2.1). Viewing U as a matrix of latent variables, we can write out the complete-data log-likelihood and then use a standard expectation–maximization (EM) algorithm to estimate the parameters of interest. For a polygenic phenotype that can be affected by multiple genes across the genome, the genes are naturally grouped into pathways or gene sets that represent different biological processes (e.g.
those pathways in the KEGG pathway database [Kanehisa et al., 2017] or gene sets from MSigDB [Subramanian et al., 2005]). We are particularly interested in learning which biological processes are most related to the phenotype, and which specific genes within those processes are most critical in regulating it. To dissect both gene- and pathway-level contributions to a complex polygenic phenotype, we propose a sparse group lasso penalty [Simon et al., 2013] on β, and the penalized complete-data log-likelihood is given as:

$$
\begin{aligned}
p\ell(\mathbf{Y},\mathbf{W}^{(R)},\mathbf{W}^{(G)},\mathbf{Z},\mathbf{U}\mid\theta)
={}& -\frac{n_1 m}{2}\log\!\left(2\pi\sigma_1^{2}\right)-\frac{1}{2\sigma_1^{2}}\sum_{i=1}^{n_1}\left(\mathbf{Y}_{i\cdot}-\mathbf{W}^{(R)}_{i\cdot}\mathbf{U}\right)\left(\mathbf{Y}_{i\cdot}-\mathbf{W}^{(R)}_{i\cdot}\mathbf{U}\right)^{T}\\
& -\frac{n_2 m}{2}\log\!\left(2\pi\sigma_2^{2}\right)-\frac{1}{2\sigma_2^{2}}\left\|\mathbf{Z}-\mathbf{W}^{(G)}\mathbf{U}\boldsymbol{\beta}\right\|^{2}\\
& -\frac{p}{2}\sum_{j=1}^{m}\log\!\left(2\pi\sigma_{u_j}^{2}\right)-\sum_{j=1}^{m}\frac{\mathbf{u}_j^{T}\mathbf{I}_p\mathbf{u}_j}{2\sigma_{u_j}^{2}}\\
& -\left\{a\lambda\|\boldsymbol{\beta}\|_1+(1-a)\lambda\sum_{g=1}^{G}\sqrt{k_g}\,\|\boldsymbol{\beta}^{(g)}\|_2\right\},
\end{aligned}
\tag{2.2}
$$

where u_j is the jth column of U; Y_{i·}, W(R)_{i·}, and W(G)_{i·} are the ith rows of Y, W(R), and W(G); and ||·||_1 and ||·||_2 are the ℓ1 and ℓ2 norms. θ = {β, σ²_{u_j}, σ1², σ2²} are the parameters to be estimated. In this formulation, the first line on the right-hand side corresponds to the log-likelihood of the reference data (the first regression in equation (2.1)), the second line to the log-likelihood of the GWAS dataset (the second regression in equation (2.1)), the third line to the log-likelihood of the prior on U, and the fourth line to the sparse group lasso penalty imposed on the coefficient vector β. The m genes are classified into G pathway groups, where g = 1, 2, . . . , G indexes the groups and k_g denotes the size of the gth group. Scattered genes not belonging to any group have k_g = 1. Genes may overlap between pathways. We adopt the widely used overlapping group lasso method [Jacob et al., 2009], in which genes may appear in multiple pathway groups.
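For intuition about how this penalty induces two levels of sparsity, the following generic numpy sketch (an illustration, not the TIPS implementation) evaluates the penalty and its proximal operator for non-overlapping groups, which composes elementwise soft-thresholding with groupwise shrinkage:

```python
import numpy as np

def sgl_penalty(beta, groups, lam, a):
    """a*lam*||beta||_1 + (1-a)*lam * sum_g sqrt(k_g) * ||beta_g||_2."""
    l1 = a * lam * np.sum(np.abs(beta))
    l2 = (1 - a) * lam * sum(np.sqrt(len(g)) * np.linalg.norm(beta[g]) for g in groups)
    return l1 + l2

def sgl_prox(v, groups, lam, a, step=1.0):
    """Prox of the sparse group lasso (non-overlapping groups):
    soft-threshold coordinates, then soft-threshold each group's norm."""
    b = np.sign(v) * np.maximum(np.abs(v) - step * a * lam, 0.0)
    out = np.zeros_like(b)
    for g in groups:
        norm = np.linalg.norm(b[g])
        t = step * (1 - a) * lam * np.sqrt(len(g))
        if norm > t:
            out[g] = b[g] * (1.0 - t / norm)   # group survives, uniformly shrunk
    return out                                  # weak groups are zeroed wholesale

groups = [np.array([0, 1, 2]), np.array([3, 4])]
v = np.array([3.0, -2.0, 0.1, 0.05, 0.02])
b_prox = sgl_prox(v, groups, lam=0.5, a=0.5)
# the strong group keeps its large coordinates; the weak group drops out entirely
```

Setting a = 1 recovers the lasso and a = 0 the group lasso, mirroring the convex combination controlled by a in the penalty above.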
To accommodate this, we duplicate the corresponding columns in the design matrix $W^{(G)}U$ for each group in which a gene appears, and assign separate coefficients to each copy during model fitting. After estimation, the final gene-level effect is obtained by averaging the estimated coefficients of the duplicated copies corresponding to the same gene. $\lambda$ is the regularization parameter, and $a \in [0, 1]$ indicates a convex combination of the lasso (i.e., $\|\beta\|_1$) and group lasso penalties (i.e., $\sum_{g=1}^{G}\sqrt{k_g}\|\beta^{(g)}\|_2$), where $\beta^{(g)}$ denotes the subvector of coefficients corresponding to genes in the $g$th pathway group. To select the tuning parameters $\lambda$ and $a$, we perform a grid search over a predefined set of candidate values. Specifically, we define a logarithmic grid for $\lambda$, e.g., $\lambda \in \{10^{-3}, 10^{-2}, \ldots, 10^{0.5}, 10^{1}\}$, and a linear grid for $a$, e.g., $a \in \{0, 0.1, 0.2, \ldots, 1\}$. For each combination of $(\lambda, a)$, we fit the model and compute the prediction mean squared error (MSE) using K-fold cross-validation (CV), typically with K = 5. In addition, we also consider model selection criteria such as the Bayesian Information Criterion (BIC) to validate the stability and parsimony of the selected model. In practice, we also apply the "one standard error rule" to guide model selection when multiple models perform similarly under CV or BIC. Specifically, when the CV error or BIC values are close across different tuning parameter combinations, we prefer the more parsimonious model (i.e., the one with fewer nonzero coefficients) whose CV error or BIC lies within one standard error of the minimum value [Hastie et al., 2009].

2.3.2 EM algorithm for penalized likelihood

In the presence of latent variables in the U matrix, we propose an EM algorithm to estimate the parameters that maximize the penalized likelihood function in equation (2.2).
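Before turning to the estimation details, the one-standard-error rule described above is easy to state precisely. The following is a minimal Python sketch (the helper name `one_se_rule` and all CV numbers are illustrative, not part of the TIPS package):

```python
import numpy as np

def one_se_rule(cv_mean, cv_se, n_nonzero):
    """Pick the sparsest model whose CV error is within one standard
    error of the best (minimum) CV error."""
    cv_mean, cv_se, n_nonzero = map(np.asarray, (cv_mean, cv_se, n_nonzero))
    best = np.argmin(cv_mean)
    threshold = cv_mean[best] + cv_se[best]
    eligible = np.flatnonzero(cv_mean <= threshold)
    # among eligible candidates, prefer the fewest nonzero coefficients
    return eligible[np.argmin(n_nonzero[eligible])]

# toy grid of five (lambda, a) combinations (illustrative numbers)
cv_mean   = [1.00, 0.92, 0.90, 0.93, 1.10]
cv_se     = [0.05, 0.05, 0.04, 0.05, 0.06]
n_nonzero = [40,   25,   18,   9,    3]

print(one_se_rule(cv_mean, cv_se, n_nonzero))  # index 3: sparsest within 1 SE
```

Here the minimum CV error (0.90) plus one standard error gives a threshold of 0.94, and the sparsest model under that threshold (9 nonzero coefficients) is preferred over the strict minimizer.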
From the complete-data penalized log-likelihood and the block matrix structure of U, it is easy to recognize that, given Y, $W^{(R)}$, $W^{(G)}$, Z and $\theta$, the terms involving each $u_j$ can be written as a quadratic form for $j = 1, \ldots, m$:

$$
u_j^T\Big(-\frac{1}{2\sigma_1^2}W^{(R)T}W^{(R)} - \frac{\beta_j^2}{2\sigma_2^2}W^{(G)T}W^{(G)} - \frac{1}{2\sigma_{u_j}^2}I_p\Big)u_j
+ \Big(\frac{1}{\sigma_1^2}y_j^T W^{(R)} + \frac{\beta_j}{\sigma_2^2}Z^T W^{(G)}\Big)u_j + \text{constant},
$$

where $y_j$ is the $j$th column of Y and $\beta_j$ is the corresponding coefficient for the $j$th gene. Thus its posterior distribution can be written in the form of a normal distribution. In the E-step, we take the expectation of the complete-data penalized log-likelihood with respect to the posterior distribution of $u_j$ and derive the Q-function, utilizing the fact that $E(u^T X u) = \mu_u^T X \mu_u + \mathrm{Tr}(X\Sigma_u)$ if $u \sim N(\mu_u, \Sigma_u)$:

$$
\begin{aligned}
Q(\theta \mid \theta^{\text{old}}) = \; & -\frac{n_1 m}{2}\log\!\big(2\pi\sigma_1^2\big) - \frac{p}{2}\sum_{j=1}^{m}\log\!\big(2\pi\sigma_{u_j}^2\big) - \sum_{j=1}^{m}\frac{\mu_{u_j}^T\mu_{u_j}}{2\sigma_{u_j}^2} \\
& - \frac{1}{2\sigma_1^2}\sum_{j=1}^{m}\sum_{i=1}^{n_1}\big(Y_{i\cdot}-W^{(R)}_{i\cdot}\mu_{u_j}\big)\big(Y_{i\cdot}-W^{(R)}_{i\cdot}\mu_{u_j}\big)^T \\
& - \frac{n_2 m}{2}\log\!\big(2\pi\sigma_2^2\big) - \frac{1}{2\sigma_2^2}\sum_{j=1}^{m}\big\|Z - W^{(G)}\mu_{u_j}\beta_j\big\|^2 \\
& - \sum_{j=1}^{m}\mathrm{Tr}\Big(\Big(\frac{1}{2\sigma_1^2}W^{(R)T}W^{(R)} + \frac{\beta_j^2}{2\sigma_2^2}W^{(G)T}W^{(G)} + \frac{1}{2\sigma_{u_j}^2}I_p\Big)\Sigma_{u_j}\Big) \\
& - \Big\{a\lambda\|\beta\|_1 + (1-a)\lambda\sum_{g=1}^{G}\sqrt{k_g}\|\beta^{(g)}\|_2\Big\},
\end{aligned}
\tag{2.3}
$$

where $\Sigma_{u_j} = \big(\frac{1}{\sigma_1^2}W^{(R)T}W^{(R)} + \frac{\beta_j^2}{\sigma_2^2}W^{(G)T}W^{(G)} + \frac{1}{\sigma_{u_j}^2}I_p\big)^{-1}$ and $\mu_{u_j} = \Sigma_{u_j}\big(\frac{1}{\sigma_1^2}W^{(R)T}y_j + \frac{\beta_j}{\sigma_2^2}W^{(G)T}Z\big)$. In the M-step, we obtain the updated model parameters in $\theta$ by setting the derivatives of the Q-function to zero. The updating equations for the parameters of interest are as follows:

$$
\sigma_1^2 = \frac{1}{n_1 m}\sum_{j=1}^{m}\big\|y_j - W^{(R)}\mu_{u_j}\big\|^2, \qquad
\sigma_2^2 = \frac{1}{n_2 m}\sum_{j=1}^{m}\big\|Z - \beta_j W^{(G)}\mu_{u_j}\big\|^2,
$$
$$
\sigma_{u_j}^2 = \frac{1}{p}\big(\mu_{u_j}^T\mu_{u_j} + \mathrm{Tr}(\Sigma_{u_j})\big), \quad \text{for } j = 1, 2, \ldots, m,
$$
$$
\beta_j = \begin{cases}
0, & \text{if } \|s_j\| \le (1-a)\lambda\sqrt{k_g} \ \text{or} \ \|s_j\| \le a\lambda, \\[4pt]
\big(2\mu_{u_j}^T W^{(G)T}W^{(G)}\mu_{u_j} + 2\,\mathrm{Tr}\big(W^{(G)}\Sigma_{u_j}W^{(G)T}\big)\big)^{-1}\, s_j\Big(1 - \dfrac{(1-a)\lambda\sqrt{k_g}}{\|s_j\|}\Big), & \text{otherwise},
\end{cases}
$$

for $j = 1, 2, \ldots, m$, where $s_j = 2Z^T W^{(G)}\mu_{u_j} - a\lambda\,\mathrm{sign}(\beta_j)$. The E-step and M-step are alternated iteratively until convergence. In practice, we monitor the change in the penalized observed-data log-likelihood.
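The $\beta_j$-update in the M-step has the familiar two-level soft-thresholding form of the sparse group lasso. As a simplified, self-contained illustration, the following sketches a generic sparse-group-lasso proximal operator for one group; it is not the exact EM update above, which additionally involves $\mu_{u_j}$ and $\Sigma_{u_j}$:

```python
import numpy as np

def soft(x, t):
    """Elementwise l1 soft-thresholding operator."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sgl_prox(z, lam, a, k_g):
    """Proximal operator of the sparse group lasso penalty
    a*lam*||b||_1 + (1-a)*lam*sqrt(k_g)*||b||_2 for a single group,
    applied to a gradient-step point z (step size absorbed into lam)."""
    u = soft(z, a * lam)              # level 1: within-group l1 shrinkage
    norm = np.linalg.norm(u)
    t = (1 - a) * lam * np.sqrt(k_g)
    if norm <= t:                     # level 2: the whole group is zeroed out
        return np.zeros_like(z)
    return (1 - t / norm) * u         # otherwise, group-level l2 shrinkage

z = np.array([0.05, -0.4, 1.2])
print(sgl_prox(z, lam=0.1, a=0.5, k_g=3))  # first entry shrunk to exactly 0
```

The first level sets small individual coefficients to zero; the second level can eliminate an entire pathway group at once, mirroring the two selection mechanisms described above.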
The algorithm is declared to have converged when the relative change within a single iteration falls below the threshold $10^{-4}$.

2.3.3 Post-selection gene-level and pathway-level inference

The EM algorithm estimates the coefficient vector $\beta$ and selects the most critical genes and pathway groups for the trait with nonzero coefficients (i.e., $\beta \neq 0$), but it cannot directly make inference about the effects. To avoid overfitting, we consider a data-splitting scheme to make post-selection inference on the coefficients [Kuchibhotla et al., 2022]. We first split the GWAS dataset into a GWAS training set and a GWAS testing set of about equal size. The GWAS training set and the reference dataset are used to explore the data and select genes with nonzero coefficients into the model. In the GWAS testing set, we evaluate the significance of the coefficients of the selected genes using a likelihood ratio test (LRT). To test a gene-level coefficient, we formalize the following hypothesis test:

$$H_0: \beta_j = 0 \quad \text{vs.} \quad H_1: \beta_j \neq 0.$$

The LRT statistic is given by

$$\Lambda_j = 2\big(\ell(\mathbf{Y},\mathbf{W}^{(R)},\mathbf{W}^{(G)},\mathbf{Z},\mathbf{U}\mid\hat\theta) - \ell(\mathbf{Y},\mathbf{W}^{(R)},\mathbf{W}^{(G)},\mathbf{Z},\mathbf{U}\mid\hat\theta_{\beta_j=0})\big),$$

where $\hat\theta$ is the vector of coefficient estimates under the full model, $\hat\theta_{\beta_j=0}$ is the vector of coefficient estimates under $H_0$, and $\ell(\cdot)$ represents the likelihood function without penalty. The test statistic $\Lambda_j$ asymptotically follows a chi-squared distribution with 1 degree of freedom under the null hypothesis, according to Wilks' theorem; see Theorem 5.10 on page 213 of van der Vaart [1998] for a formal statement and the conditions under which the result holds. Likewise, we can use a similar LRT to test a pathway-level coefficient vector under the following hypothesis:

$$H_0: \beta^{(g)} = 0 \quad \text{vs.} \quad H_1: \beta^{(g)} \neq 0,$$

where $\beta^{(g)}$ represents the coefficient vector for the $g$th pathway group. The pathway-level LRT statistic is

$$\Lambda_{(g)} = 2\big(\ell(\mathbf{Y},\mathbf{W}^{(R)},\mathbf{W}^{(G)},\mathbf{Z},\mathbf{U}\mid\hat\theta) - \ell(\mathbf{Y},\mathbf{W}^{(R)},\mathbf{W}^{(G)},\mathbf{Z},\mathbf{U}\mid\hat\theta_{\beta^{(g)}=0})\big),$$

where $\hat\theta$ is the vector of coefficient estimates under the full model and $\hat\theta_{\beta^{(g)}=0}$ is the vector of coefficient estimates under $H_0$.
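Converting an LRT statistic into a p-value only requires the chi-squared survival function; for the gene-level test (1 degree of freedom) this has a closed form. A minimal sketch with hypothetical log-likelihood values (the numbers are illustrative, not from the real analysis):

```python
import math

def lrt_pvalue_df1(loglik_full, loglik_null):
    """Wilks likelihood ratio test for a single coefficient:
    Lambda = 2*(l_full - l_null) ~ chi-squared with 1 df under H0.
    For df = 1 the chi-squared survival function has the closed form
    P(X > x) = erfc(sqrt(x/2))."""
    lam = 2.0 * (loglik_full - loglik_null)
    return lam, math.erfc(math.sqrt(lam / 2.0))

# hypothetical unpenalized log-likelihoods for one selected gene
lam, p = lrt_pvalue_df1(loglik_full=-1520.3, loglik_null=-1527.1)
print(round(lam, 1), p < 0.05)  # 13.6 True
```

For the pathway-level test with $k_g > 1$ degrees of freedom, a general chi-squared survival function (e.g., from a scientific computing library) would replace the closed form.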
The test statistic $\Lambda_{(g)}$ asymptotically follows the $\chi^2$ distribution with $k_g$ degrees of freedom under the null.

2.3.4 Other related methods

Many methods have been developed to perform TWAS in the past decade (see a review of TWAS method development in the Supplement). These methods are performed either in two stages or as a joint model, use individual-level or summary-level GWAS data, and conduct single-gene or multi-gene analysis. Table 2.1 summarizes the characteristics of representative TWAS methods in each category (focusing on methods using individual-level GWAS data) and their comparison with our method. PrediXcan [Gamazon et al., 2015] is the very first two-stage univariate TWAS method. CoMM [Yang et al., 2019] is the first joint model that combines the imputation and analysis stages but conducts univariate TWAS only. kTWAS [Cao et al., 2021] and, later, mkTWAS [Cao et al., 2022a] use kernel machines to select genetic variants but are still univariate at the gene level. MV-IWAS [Knutson et al., 2020] and TWAS-LQ [Lin et al., 2022] are multivariate TWAS methods but do not incorporate pathway information or conduct feature selection in their implementations. NeRiT [Jin et al., 2022] incorporates pathway information as network edges and performs network regression without any feature selection, and is thus easily prone to overfitting. In contrast, our method TIPS conducts multivariate TWAS that uses pathway group information from existing pathway databases and incorporates a sparse group lasso penalty function to select the most critical genes and pathways that contribute to a complex trait.

Table 2.1: Main characteristics of different TWAS methods and their comparison to our TIPS method.

Method          | Joint/Two-stage | Univariate/Multivariate | Pathway info. | Gene/pathway selection | Ref
PrediXcan       | Two-stage       | Univariate              | No            | No                     | Gamazon et al. [2015]
CoMM            | Joint           | Univariate              | No            | No                     | Yang et al. [2019]
kTWAS           | Two-stage       | Univariate              | No            | No                     | Cao et al. [2021]
mkTWAS          | Two-stage       | Univariate              | No            | No                     | Cao et al. [2022a]
MV-IWAS         | Two-stage       | Multivariate            | No            | No                     | Knutson et al. [2020]
TWAS-LQ         | Two-stage       | Multivariate            | No            | No                     | Lin et al. [2022]
NeRiT           | Two-stage       | Multivariate            | Yes           | No                     | Jin et al. [2022]
TIPS            | Joint           | Multivariate            | Yes           | Yes                    | Ours

2.4 Application to TWAS analysis in UK Biobank

2.4.1 TWAS analysis on blood pressure

High blood pressure, or hypertension, is one of the leading risk factors for stroke and coronary artery disease, contributing to about 10 million deaths every year worldwide. Blood pressure is a complex heritable (heritability ~50%) and polygenic trait potentially impacted by hundreds to thousands of variants over the whole genome based on large-scale GWAS [Evangelou et al., 2018]. Few TWAS have been conducted to identify the genes that contribute to blood pressure variation and to understand its genetic regulatory mechanism. In this subsection, we apply our method to perform a multivariate TWAS on two blood pressure traits, systolic blood pressure (SBP) and diastolic blood pressure (DBP), to identify the genes and pathways that influence blood pressure using data from the large-scale UK Biobank (UKB) [Bycroft et al., 2018]. We collected the genotype, SBP, DBP and demographic data (e.g., age, sex and BMI) from non-pregnant, family-unrelated individuals of European ancestry in UKB with at least two blood pressure measurements. Both SBP and DBP values were calculated as the mean of two non-null blood pressure measurements using phenotype codes 4080 and 4079 in UKB. After excluding subjects with missing data and those with various types of cardiovascular disease, the complete data we analyzed included n = 16,470 and n = 17,394 participants for SBP and DBP, respectively. We first performed standard QC of the GWAS data and only kept the genetic variants with: minor allele frequency (MAF) ≥ 0.05, Hardy-Weinberg equilibrium p-value ≥ 0.001, and missing genotype rate ≤ 0.1.
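The QC thresholds above amount to a simple variant-level filter. A minimal sketch (the field names and values are illustrative; in practice this step is performed with standard GWAS tooling):

```python
# hypothetical per-SNP QC summaries; all values are illustrative
snps = [
    {"id": "rs1", "maf": 0.30, "hwe_p": 0.50, "missing": 0.01},
    {"id": "rs2", "maf": 0.02, "hwe_p": 0.20, "missing": 0.05},  # fails MAF
    {"id": "rs3", "maf": 0.10, "hwe_p": 1e-5, "missing": 0.02},  # fails HWE
    {"id": "rs4", "maf": 0.25, "hwe_p": 0.10, "missing": 0.20},  # fails missingness
]

def passes_qc(s, maf=0.05, hwe=0.001, miss=0.10):
    """Variant-level QC filter with the thresholds used in the text."""
    return s["maf"] >= maf and s["hwe_p"] >= hwe and s["missing"] <= miss

kept = [s["id"] for s in snps if passes_qc(s)]
print(kept)  # ['rs1']
```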
We then matched the GWAS SNP data with the genotype data in the GTEx reference database (downloaded from the dbGaP website, under phs000424.v9.p2) and mapped SNPs to genes in GTEx by physical proximity (1 Mb upstream and downstream of the transcription start site, i.e., focusing on cis-eQTLs) on the reference genome. As there is no consensus on the physical distance used to define cis-eQTLs, to minimize the impact of these different definitions on the final TWAS results, we followed PrediXcan [Gamazon et al., 2015] and performed elastic net regression as in PrediXcan to pre-select the most predictive SNPs for each gene among the cis-eQTLs (Table S13), to ensure more stable selection and reduce the computational burden. The same idea holds potential for when we include trans-eQTLs in our model in future extensions. For the SBP and DBP traits, we used the genotype and gene expression data from the heart left ventricle tissue in GTEx (GTEx v8; RNA-seq data) to train the imputation model, given the tissue's close relationship with blood pressure. We used the pathway information from large curated pathway databases, including KEGG [Kanehisa et al., 2017], Reactome [Jassal et al., 2020] and Biocarta [Nishimura, 2001], to group the genes. Pathway databases like Gene Ontology are largely based on prediction, so they are not included here. Pathways with size smaller than 5 or larger than 100 were excluded to ensure the best interpretability and biological relevance of the selected pathways. After these preparation steps, our final TWAS analysis included 2089 genes, with 848 genes grouped into 438 pathways and the remaining genes as scattered genes. To mitigate confounding effects from other covariates, we standardized and regressed the SBP and DBP phenotypes on sex, age and BMI and took the residuals as the adjusted SBP and DBP phenotypes for the TWAS analysis. The GWAS data was split into a training set and a testing set.
In the training set, we ran our TIPS method to select the top genes and pathways predictive of SBP and DBP. In the testing set, we performed the post-selection LRT for the genes and pathways selected in the training set and obtained the p-values. We evaluated $M_{SBP}$ pathways for systolic blood pressure (SBP) and $M_{DBP}$ pathways for diastolic blood pressure (DBP), where $M_{SBP} = M_{DBP} = 562$ corresponds to the total number of hypothesis tests, including 438 pathway-level tests and 124 tests for scattered (singleton) genes. For each trait, the raw p-values were multiplied by 562 to obtain Bonferroni-adjusted p-values. Treating hypotheses with adjusted p-values adj-p < 0.05 as significant, we identified 87 significant pathways for SBP and 96 for DBP. Fig 2.2(A) and Fig S1 highlight the significant KEGG and Reactome pathways ordered by pathway IDs, so the closer two pathways are, the more similar their gene contents and functions. Among them, we found several clusters of biological pathways, including nucleotide and amino acid metabolism [Feig et al., 2013, Poggiogalle et al., 2019], lipid and carbohydrate metabolism [Reaven and Hoffman, 1989, Zicha et al., 1999], cytokine signaling [Chae et al., 2001] and immune signaling [Rodriguez-Iturbe et al., 2017], which have been widely reported to be related to blood pressure. In addition, we also found pathways related to neurodegenerative diseases and cellular-level processes such as cell motility and the cell cycle that are worth further investigation in future studies. Interestingly, a large number of similar pathways (e.g., those related to nucleotide, amino acid and lipid metabolism, and to cytokine signaling) were identified for both SBP and DBP, validating the high genetic correlation between the two traits. On the other hand, pathways related to cardiomyopathy were only seen for SBP, while those related to ECM and cell interaction were only seen for DBP, showing the specificity of the two traits.
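The Bonferroni adjustment used here is simply multiplication of each raw p-value by the number of tests, capped at 1. A minimal sketch with hypothetical raw p-values:

```python
def bonferroni(pvals, m=None, alpha=0.05):
    """Bonferroni adjustment: multiply raw p-values by the number of
    tests (capped at 1) and compare against alpha."""
    m = len(pvals) if m is None else m
    adj = [min(p * m, 1.0) for p in pvals]
    return adj, [p_adj < alpha for p_adj in adj]

# hypothetical raw pathway p-values, with M = 562 tests as in the analysis
raw = [1e-6, 5e-5, 2e-4, 0.01]
adj, sig = bonferroni(raw, m=562)
print(sig)  # [True, True, False, False]
```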
We also ran CoMM, PrediXcan and MV-IWAS, applied the same cutoff to detect significant genes, and performed pathway enrichment analysis on these genes for a fair comparison. For SBP, the enrichment analysis identified 8, 57 and 36 significant pathways with Benjamini-Hochberg adjusted p-value < 0.05 for CoMM, PrediXcan and MV-IWAS, respectively. For DBP, the enrichment analysis identified 38, 40 and 163 significant pathways for CoMM, PrediXcan and MV-IWAS, respectively. Important pathways related to lipid and glucose metabolism had large p-values and were thus missing from their top lists (see Fig S1 and Fig S2 for the Manhattan plot and Venn diagram in Appendix A.0.2). This showed the power of performing multivariate TWAS and highlighted the importance of incorporating pathway group information early in the model to encourage pathway selection, as compared to the traditional univariate TWAS + pathway enrichment analysis approach, which lacks unifying biological themes. Due to the lack of feature selection, the multivariate MV-IWAS method was subject to a serious collinearity issue that made it unstable, resulting in very different sets of detected pathways for SBP and DBP, and was thus less reliable. As a sensitivity analysis of possible false positives generated by our method, we also performed random grouping of the genes into groups of comparable sizes (repeated five times), applied our method to each randomly grouped dataset, and compared the significance of the top selected pathways to our current results. The top pathways based on real pathway group information from the pathway databases are far more significant than those with random grouping, validating our current results (Fig S3 in Appendix A.0.2). The pathway-level comparison was not entirely fair for multivariate vs. univariate TWAS methods, so we also compared the different methods at the gene level.
For SBP, our method selected 285 significant genes while CoMM, PrediXcan and MV-IWAS selected 140, 184 and 208 significant genes, respectively, with Bonferroni-adjusted p-value < 0.05; for DBP, we selected 312 genes while CoMM, PrediXcan and MV-IWAS selected 165, 215 and 356 significant genes, respectively (see Fig S4 for the Venn diagram). Multivariate TWAS methods were in general more powerful than univariate TWAS methods, and the gain in power for our method mainly came from the group-based selection that potentially aggregates signals within the same group(s). We picked two example KEGG pathways and highlighted the genes selected by our method within the selected pathways in the KEGG topology plots (Fig 2.2(B)). Significant genes, including the aldehyde dehydrogenase (ALDH) genes in the arginine and proline metabolism pathway and the AKR1B1 gene in the glycerolipid metabolism pathway, had been found to be related to blood pressure in previous studies [Kato et al., 2011, Wang et al., 2016]. These findings facilitate our understanding of the genetic basis of blood pressure and the potential genetic regulatory mechanisms that might be related to antihypertensive drug targets.

Figure 2.2: (A) Manhattan plot of -log10(p-value) from the pathway-level test results of TIPS sorted by KEGG IDs (each dot represents a pathway) for the SBP (upper) and DBP (lower) traits. (B) Topology plots of two significant KEGG pathways, arginine and proline metabolism (highlighted genes include ALDH9A1, ALDH4A1, PRODH and LAP3) and glycerolipid metabolism (highlighted genes include ALDH9A1, AKR1B1, DGKG and DGKQ). Selected genes in the pathways are highlighted in red (SBP) and green (DBP).

2.4.2 TWAS analysis on white matter brain aging

Brain aging involves the gradual loss of structure and function of neurons and their connections, leading to cognitive decline and increased vulnerability to neurodegenerative diseases [Gold et al., 2012, Kantarci et al., 2014].
Previous studies employed a machine learning algorithm that leverages multiple fractional anisotropy (FA) tract measurements obtained from diffusion tensor imaging (DTI) data to predict the white matter (WM) brain age gap (BAG) as a marker of brain aging [Cole and Franke, 2017, Cole et al., 2019]. The BAG is highly heritable [Lee et al., 2020, Wen et al., 2024], but its genetic underpinnings are not completely understood. In this subsection, we applied our method to identify genes and pathways that influence the BAG outcome using data from UKB. We collected the DTI data from the UKB imaging visit and preprocessed the data using the ENIGMA structural and DTI pipelines [Jahanshad et al., 2013]. The FA images were aligned onto a standard-space white matter skeleton, and FA measurements were derived from 39 white matter tracts covering various brain regions [Smith et al., 2006]. Following previous papers [Mo et al., 2023, Feng et al., 2023], we applied a random forest regression model to predict brain age from these 39 regional FA measures and derived our outcome of interest, the BAG, as the difference between brain age and chronological age for n = 11,299 participants. The larger the BAG, the faster one's brain ages compared to their chronological age. The GWAS data were preprocessed as in the blood pressure example, and the GTEx reference panel data for the brain cortex tissue were used to train the imputation model. As before, the curated KEGG, Reactome and Biocarta pathway databases were used to group the genes. We identified 136 significant pathways for BAG with Bonferroni-adjusted p-value < 0.05. As a comparison, CoMM, PrediXcan and MV-IWAS detected 35, 43 and 0 significant pathways, respectively (Appendix), following the TWAS + pathway enrichment analysis procedure with the same cutoffs. Leonardsen et al. [2023] listed a few major categories of pathways related to the genetics of BAG, including neural system, DNA repair, DNA metabolism, protein metabolism and immune defense.
Among them, only TIPS detected all five categories of pathways (Fig 2.3(A)). At the gene level, we had a large overlapping proportion with the other two methods but also detected more significant genes that were otherwise missed by the other methods (Fig S5 in Appendix A.0.2). The NME4 and NT5E genes selected by TIPS in the "Pyrimidine metabolism" pathway (Fig 2.3(B)) were previously found to be related to brain health and neurodegenerative diseases [Rahman, 2009, Grimm and Eckert, 2017]. These findings helped identify the critical genetic risk factors and the impacted biological pathways and functions of brain aging and improved our understanding of the aging process.

2.5 Simulation study

2.5.1 Simulation setting

In this section, we conducted various simulation studies to demonstrate the performance of our TIPS method. For benchmarking, we mainly evaluated the feature selection and statistical power performance of our method compared to other representative TWAS methods, including CoMM [Yang et al., 2019], PrediXcan-EN (PrediXcan + elastic net) and PrediXcan-Lasso (PrediXcan + lasso) [Gamazon et al., 2015], the multivariate MV-IWAS method [Knutson et al., 2020], the network-regression-based NeRiT method [Jin et al., 2022], the kernel-machine-based kTWAS method [Cao et al., 2021] and its marginal + kernel version, mkTWAS [Cao et al., 2022a].

Figure 2.3: (A) Comparison of major categories of pathways detected by the different methods; the number in each cell refers to the number of significant pathways within each category. (B) Topology plot of the selected KEGG pathway "Pyrimidine metabolism" by TIPS; highlighted genes include NME4, NT5E, UCK1 and UCKL1. The contents of panel (A), recovered from the figure:

Category of pathways | PrediXcan | CoMM | TIPS
Neural system        | 0         | 0    | 4
DNA repair           | 0         | 0    | 3
DNA metabolism       | 0         | 0    | 2
Protein metabolism   | 3         | 3    | 6
Immune defense       | 5         | 4    | 5

For the reference dataset, we assumed a sample size $n_1 = 300$ and $p = 900$ SNPs potentially regulating the expression of $m = 180$ genes.
We assumed each gene was independently regulated by $\bar p_m = 5$ different SNPs without cross-talk, and each of these SNPs was in a different locus (i.e., independent lead eQTLs). The m = 180 genes were further classified into five pathway groups with 25 genes in each group without overlap, and 55 scattered genes not belonging to any pathway. To mimic real genotype patterns, we first defined a linkage disequilibrium (LD) matrix for each gene with diagonal values fixed at 1 and off-diagonal values equal to 0.5, representing the pairwise LD between SNPs within the same gene. We then followed Han and Pan [2012] to simulate the genotype data (i.e., the number of minor alleles for all SNPs) for each sample in the reference dataset using a latent Gaussian model with MAF ∈ (0.2, 0.5) and a block-diagonal correlation matrix in which each block was defined by the aforementioned LD matrix. For a complete evaluation, we followed Cao et al. [2021] and considered different genetic architectures: additive (2 vs. 1 vs. 0 minor alleles), heterogeneous (any vs. no minor alleles), recessive (two alternative alleles vs. others) and compensatory (one alternative allele vs. others). The matrix of genetic effects was drawn as $U \sim \mathcal{MN}(0, I_p, \mathrm{Diag}(\sigma_u^2))$, assuming a common $\sigma_u^2$ for all genes. The gene expression matrix Y of the reference dataset was simulated based on the first regression equation in (2.1), where $\sigma_1^2$ varied depending on the scenario (see below). For the GWAS dataset, we assumed a sample size $n_2 = 600$; the genotype data were simulated in the same way as in the reference dataset.
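The latent Gaussian genotype simulation can be sketched as follows. This is a simplified illustration in the spirit of Han and Pan [2012]; the function name and dimensions are ours, and the additive 0/1/2 coding is obtained by cutting correlated standard normals at the quantiles implied by Hardy-Weinberg genotype frequencies:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(42)
ndist = NormalDist()

def simulate_genotypes(n, snps_per_gene, n_genes, ld=0.5, maf_lo=0.2, maf_hi=0.5):
    """Simulate minor-allele counts with block-diagonal (within-gene) LD:
    draw correlated standard normals, then threshold each SNP at the
    quantiles implied by its MAF under Hardy-Weinberg equilibrium."""
    block = np.full((snps_per_gene, snps_per_gene), ld)
    np.fill_diagonal(block, 1.0)
    L = np.linalg.cholesky(block)
    cols = []
    for _ in range(n_genes):
        z = rng.standard_normal((n, snps_per_gene)) @ L.T   # within-gene LD
        maf = rng.uniform(maf_lo, maf_hi, size=snps_per_gene)
        # HWE genotype frequencies: P(0) = (1-f)^2, P(1) = 2f(1-f), P(2) = f^2
        t0 = np.array([ndist.inv_cdf((1 - f) ** 2) for f in maf])
        t1 = np.array([ndist.inv_cdf((1 - f) ** 2 + 2 * f * (1 - f)) for f in maf])
        cols.append((z >= t0).astype(int) + (z >= t1).astype(int))
    return np.hstack(cols)

G = simulate_genotypes(n=300, snps_per_gene=5, n_genes=4)
print(G.shape)  # (300, 20), entries in {0, 1, 2}
```

The block-diagonal Cholesky factor induces the pairwise LD of 0.5 within each gene while leaving SNPs in different genes independent.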
The coefficient vector $\beta$ was constructed to reflect both pathway-group-level and individual-gene-level contributions to the trait: Group 1 included non-associated genes (all $\beta = 0$); Group 2 included genes with homogeneous moderate effects (all $\beta = 0.2$); Group 3 included genes with homogeneous weak effects (all $\beta = 0.02$); and Groups 4 and 5 were more heterogeneous, including genes with moderate effects, weak effects and non-associated genes. The 55 scattered genes also included a mixture of moderate-effect, weak-effect and non-associated genes. The phenotypic data were simulated based on the second regression equation in (2.1). Following Yang et al. [2019], we considered two heritability measures in simulating the data. The first, the cellular-level heritability

$$\hat h_C^2 = \frac{\bar p_m \hat\sigma_u^2}{\bar p_m \hat\sigma_u^2 + \hat\sigma_1^2},$$

quantified the proportion of variance of gene expression explained by genetic variants, where $\bar p_m = 5$ in our simulations. The second, the organism-level heritability

$$\hat h_T^2 = \frac{\hat\beta^T\hat\beta\,\hat\sigma_u^2}{\hat\beta^T\hat\beta\,\hat\sigma_u^2 + \hat\sigma_2^2},$$

quantified the gene effect on the trait. We simulated the following three scenarios with different values of $\hat\sigma_1^2$, $\hat\sigma_2^2$ and $\hat\sigma_u^2$ to reflect varying cellular- and organism-level heritability. By varying the heritability parameters, these scenarios demonstrate the model's robustness in diverse biological contexts and its applicability in understanding gene-environment interactions.

1. Scenario I: $\hat\sigma_1^2 = 0.2$, $\hat\sigma_2^2 = 0.2$, $\hat\sigma_u^2 = 0.02$; $\hat h_C^2 = 0.33$ and $\hat h_T^2 = 0.161$.
2. Scenario II: $\hat\sigma_1^2 = 0.2$, $\hat\sigma_2^2 = 0.2$, $\hat\sigma_u^2 = 0.01$; $\hat h_C^2 = 0.2$ and $\hat h_T^2 = 0.091$.
3. Scenario III: $\hat\sigma_1^2 = 0.2$, $\hat\sigma_2^2 = 0.45$, $\hat\sigma_u^2 = 0.01$; $\hat h_C^2 = 0.2$ and $\hat h_T^2 = 0.043$.

We also considered additional simulation scenarios in which the number of genes exceeds the sample size of the GWAS dataset (the high-dimensional or high-D case), in which the genes are correlated, and in which the MAF and heritability levels change. Each simulation was conducted B = 1000 times.
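The two heritability definitions can be checked directly against the scenario values. A minimal sketch (the value of $\hat\beta^T\hat\beta$ used for $\hat h_T^2$ is a hypothetical back-calculation, not a number reported in the text):

```python
def cellular_h2(p_bar, sigma_u2, sigma1_2):
    """h2_C = p_bar*sigma_u^2 / (p_bar*sigma_u^2 + sigma_1^2)."""
    return p_bar * sigma_u2 / (p_bar * sigma_u2 + sigma1_2)

def trait_h2(btb, sigma_u2, sigma2_2):
    """h2_T = (beta'beta)*sigma_u^2 / ((beta'beta)*sigma_u^2 + sigma_2^2)."""
    return btb * sigma_u2 / (btb * sigma_u2 + sigma2_2)

# Scenarios I and II from the text (p_bar = 5)
print(round(cellular_h2(5, 0.02, 0.2), 2))   # 0.33
print(round(cellular_h2(5, 0.01, 0.2), 2))   # 0.2
# a hypothetical beta'beta = 1.92 reproduces Scenario I's h2_T = 0.161
print(round(trait_h2(1.92, 0.02, 0.2), 3))   # 0.161
```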
For all scenarios under investigation, we quantified feature selection and power by plotting receiver operating characteristic (ROC) curves (with area under the curve (AUC) values) and bar graphs. Five-fold cross-validation was used to find the optimal value of the regularization parameter for our method. AUCs for all univariate TWAS methods (CoMM, PrediXcan, kTWAS and mkTWAS) were calculated based on the ranking of genes by the p-values they provided. The true model was a multi-gene model, so we also evaluated the predictive accuracy of the multivariate methods (e.g., MV-IWAS and TIPS) using the mean squared error (MSE). MV-IWAS did not include a penalty function, so it could not work in the high-D scenario.

Table 2.2: Mean area under the ROC curve (AUC) for simulation Scenarios I-III under additive or heterogeneous genetic architectures, with the standard deviation shown in parentheses.

Additive:
Method          | Scenario I    | Scenario II   | Scenario III
TIPS            | 0.953 (0.041) | 0.903 (0.064) | 0.833 (0.047)
CoMM            | 0.753 (0.042) | 0.696 (0.049) | 0.668 (0.034)
PrediXcan-EN    | 0.731 (0.041) | 0.667 (0.029) | 0.624 (0.035)
PrediXcan-Lasso | 0.728 (0.044) | 0.669 (0.028) | 0.621 (0.032)
MV-IWAS         | 0.631 (0.044) | 0.610 (0.040) | 0.604 (0.043)
NeRiT           | 0.571 (0.042) | 0.541 (0.043) | 0.533 (0.045)
kTWAS           | 0.501 (0.055) | 0.504 (0.048) | 0.499 (0.048)
mkTWAS          | 0.501 (0.046) | 0.506 (0.050) | 0.502 (0.049)

Heterogeneous:
Method          | Scenario I    | Scenario II   | Scenario III
TIPS            | 0.966 (0.051) | 0.923 (0.043) | 0.845 (0.033)
CoMM            | 0.763 (0.043) | 0.728 (0.026) | 0.655 (0.039)
PrediXcan-EN    | 0.737 (0.032) | 0.691 (0.055) | 0.645 (0.039)
PrediXcan-Lasso | 0.743 (0.027) | 0.678 (0.051) | 0.647 (0.047)
MV-IWAS         | 0.612 (0.048) | 0.591 (0.047) | 0.583 (0.038)
NeRiT           | 0.577 (0.041) | 0.554 (0.043) | 0.517 (0.040)
kTWAS           | 0.494 (0.048) | 0.491 (0.035) | 0.493 (0.044)
mkTWAS          | 0.495 (0.048) | 0.496 (0.031) | 0.498 (0.031)

2.5.2 Simulation results

Figures 2.4 and 2.5 show the ROC curve and power comparisons, and Table 2.2 shows the corresponding AUC values for all TWAS methods in the different simulation scenarios. TIPS has
TIPS has 45 Additive Heterogeneous I TIPS CoMM PrediXcan -EN PrediXcan -Lasso MV-IWAS NeRiT kTWAS mkTWAS II III Figure 2.4: ROC curve comparison for the simulation scenarios I-III with additive or heterogeneous genetic architecture. 46 Additive Heterogeneous I TIPS CoMM PrediXcan -EN PrediXcan -Lasso MV-IWAS NeRiT kTWAS mkTWAS II III Figure 2.5: Power comparison for the simulation scenarios I-III with additive or heterogeneous genetic architecture. 47 higher AUC and power than other methods in all scenarios for different genetic architectures (additive or heterogeneous in Fig 2.4 and 2.5, recessive and compensatory in Fig S6-S7). As the heritability level drops from Scenario I to Scenario II and III, all methods have decreased AUC and power but TIPS remains the best performer. Overall, the joint models (TIPS and CoMM) that accounts for the imputation uncertainty outperform two-stage methods (PrediXcan- EN, PrediXcan-Lasso, kTWAS and mkTWAS). The multivariate TIPS model outperforms univariate models (CoMM, PrediXcan-EN, PrediXcan-Lasso) in feature selection and power. MV-IWAS is a multivariate model but does not include any penalty function for feature selection thus performs worse, and has higher MSE than TIPS (Table S14). NeRiT incorporates group information as network edges and runs network regression, but it is easily subject to overfitting issue thus also performs worse. The results look consistent when we consider gene-gene correlation (Fig S8), for high dimensional scenario (Table S15), and with other heritability levels and MAF values (Table S16 and S17). 2.6 Discussion In this paper, we propose a novel pathway-guided “TIPS” method to perform multivariate TWAS analysis and identify genes and pathways that contribute to complex polygenic traits. 
A distinguishing feature of the method is its multi-gene, group-based approach, which examines hundreds to thousands of genes simultaneously in the context of biological pathways, capturing their collective impact on the trait rather than their individual effects. In addition, the method integrates the imputation and association stages of TWAS to improve efficiency and employs a sparse group lasso penalty for both gene-level and pathway-level feature selection. We also develop a fast EM algorithm on the penalized likelihood for parameter estimation, and a data-splitting scheme coupled with a likelihood ratio test for post-selection inference at both the gene and pathway levels. In its current implementation, TIPS only considers cis-eQTLs that regulate gene expression in the reference database. Incorporating trans-eQTLs could improve the predictive performance of genetically regulated expression, but it would also incur a heavy computational burden due to the vast number of SNP-gene pairs that need to be tested. We have proposed one potential solution for when too many eQTLs are included: pre-selecting the SNPs most predictive of the genes in GTEx using an elastic net penalty, as in PrediXcan. The applicability of such a solution for our method when incorporating all trans-eQTLs needs to be evaluated in future studies. In addition, though we consider gene-gene correlation and grouping in the association part, the independent gene regulation assumption in the imputation part could be strong in real applications. For example, it is not rare to see common genetic regulators and regulatory mechanisms shared by different genes that are in proximity or functionally related. Future extensions could consider such complex multi-SNP-to-multi-gene regulatory relationships.
The current method, tailored for individual-level GWAS data and continuous phenotypes, holds potential for adaptation to accommodate GWAS summary statistics and mixed types of phenotypes, broadening its applicability as in other TWAS methods [Barbeira et al., 2018, Yang et al., 2020b]. In addition, the current TWAS method is trained in only one tissue and performs single-population analysis. Future work can focus on extending the current TWAS method toward a joint multi-tissue model [Hu et al., 2019] and joint multi-ancestry analysis [Chen et al., 2023]. As a wide variety of phenotypes are being collected nowadays, testing multiple phenotypes simultaneously tends to be more powerful than testing each phenotype one at a time, by exploiting the correlation across phenotypes. The current TIPS method could be readily extended to address multivariate phenotypes, allowing for the investigation of a number of complex traits all together. This will likely lead to a high-dimensional-to-high-dimensional problem (a large number of genes mapped to a large number of phenotypes) that may require special care in modeling for feature screening and selection [Ke et al., 2022, Canida et al., 2022]. There are a few existing online TWAS resources of disease susceptibility genes, such as TWAS-hub [Mancuso et al., 2017] and webTWAS [Cao et al., 2022b]. Our proposed TIPS method, on the other hand, is the first method that incorporates curated pathway databases in a multivariate TWAS model and provides pathway-level information associated with a disease or trait. It could potentially be a valuable tool offering more reliable and interpretable results for TWAS analyses. An efficient R package called TIPS was developed (https://github.com/nwang123/TIPS) to implement the method.
Chapter 3: High-Dimensional Multivariate Mediation Analysis Integrating Genetic, Imaging, and Phenotypic Data

3.1 Overview

Mediation analysis is a fundamental tool for elucidating causal mechanisms in complex systems. The emergence of high-throughput biological and neuroimaging technologies has introduced multivariate exposures and mediators of increasing dimensionality, rendering univariate mediation analysis approaches inadequate. In this work, we propose a novel aggregation-based mediation framework that simultaneously models high-dimensional multivariate exposures and mediators. Our method jointly reduces the dimensions of e