ABSTRACT

Title of Dissertation: PENALIZED STATISTICAL MODELS FOR PATHWAY-BASED TRANSCRIPTOME-WIDE ASSOCIATION STUDY AND HIGH-DIMENSIONAL MEDIATION ANALYSIS

Neng Wang
Doctor of Philosophy, 2025

Dissertation Directed by: Professor Charles Ma
Department of Epidemiology and Biostatistics

High-throughput genomics and neuroimaging now generate thousands of correlated molecular and imaging features for each participant, presenting unprecedented opportunities—and methodological challenges—for causal and genetic discovery. This dissertation develops two complementary statistical frameworks that address key limitations of classical tools when confronted with such high-dimensional data. In Chapter 1, I review how genome-wide association studies (GWAS) have pinpointed numerous SNPs linked to human diseases and traits, yet many of these SNPs are in non-coding regions and hard to interpret. Transcriptome-wide association studies (TWAS) integrate GWAS and expression reference panels to identify associations at the gene level with tissue specificity, potentially improving interpretability. However, the list of individual genes identified by univariate TWAS exhibits little unifying biological theme, so the underlying mechanisms remain largely elusive. These limitations motivate a unified framework that not only identifies trait-associated genes through multivariate TWAS, but also traces how their effects propagate through intermediate brain features to influence clinical outcomes—thus naturally extending gene-level association analysis into high-dimensional mediation analysis. Mediation analysis is a fundamental tool for elucidating causal mechanisms in complex systems. However, the emergence of high-throughput biological and neuroimaging technologies has introduced multivariate exposures and mediators of increasing dimensionality, rendering classical approaches inadequate.
In Chapter 2, we propose a novel multivariate TWAS method that Incorporates Pathway or gene Set information, namely TIPS, to identify the genes and pathways most associated with complex polygenic traits. We jointly model the imputation and association steps in TWAS, incorporate a sparse group lasso penalty to induce selection at both the gene and pathway levels, and develop an expectation-maximization algorithm to estimate the parameters of the penalized likelihood. We applied our method to three complex traits in UK Biobank: systolic and diastolic blood pressure, as well as a brain aging biomarker, the white matter brain age gap, and identified critical biologically relevant pathways and genes associated with these traits. These pathways cannot be detected by the traditional approach of univariate TWAS followed by pathway enrichment analysis, demonstrating the power of our model. We also conducted comprehensive simulations with varying heritability levels and genetic architectures and showed that our method outperforms other established TWAS methods in feature selection, statistical power, and prediction. The R package that implements TIPS is available at https://github.com/nwang123/TIPS. In Chapter 3, we propose a novel aggregation-based mediation framework that simultaneously models and selects high-dimensional multivariate exposures and mediators. Our method identifies sparse linear combinations of variables in each domain that jointly maximize the mediated effect, defined as the product of exposure–mediator and mediator–outcome effects. To estimate these low-dimensional aggregators, we formulate a bi-convex objective function integrating residual sum-of-squares penalties from standard mediation submodels, a structured mediation-enhancing term, and ℓ1-penalties that induce sparsity. The resulting optimization problem is solved efficiently using an alternating direction method of multipliers (ADMM) algorithm with block coordinate updates.
Through extensive simulations, we demonstrate that our method achieves superior performance in recovering true mediators and estimating mediation proportions across a wide range of signal strengths, noise levels, and correlation structures. Compared to existing methods—including MMP, Pathway Lasso, sparse PCA mediation, and the Directions of Mediation framework—our approach exhibits higher selection accuracy and reduced bias, particularly in challenging high-correlation regimes. We further illustrate the utility of the method in a real data application involving the mediation of smoking behavior through neuroimaging features, revealing biologically meaningful pathways linking gene expression in the nucleus accumbens to structural and functional brain indices implicated in addiction. These findings highlight the potential of our framework for integrative mediation analysis in high-dimensional biomedical studies.

PENALIZED STATISTICAL MODELS FOR PATHWAY-BASED TWAS AND HIGH-DIMENSIONAL MEDIATION ANALYSIS

by

Neng Wang

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2025

Advisory Committee:
Professor Charles Ma, Chair/Advisor
Professor Eric V. Slud, Co-Advisor
Professor Lizhen Lin
Professor Yan Li
Professor Chixiang Chen

© Copyright by Neng Wang 2025

Acknowledgments

I owe my gratitude to all the people who have made this dissertation possible and whose support has made my Ph.D. journey intellectually fulfilling and personally meaningful. First and foremost, I would like to express my deepest thanks to my advisor, Professor Charles Ma, for his constant support, insightful guidance, and encouragement throughout my Ph.D. studies. His patience, dedication, and enthusiasm for research have greatly shaped my development as a statistician.
I am especially grateful for his belief in my ideas, his timely advice, and his constructive feedback, all of which have been invaluable to this dissertation. I am also deeply grateful to my co-advisor, Professor Eric V. Slud, for his sharp statistical insights, rigorous standards, and detailed guidance. His thoughtful comments and mathematical clarity have elevated the quality of this work and challenged me to think more precisely and thoroughly. I thank the remaining members of my dissertation committee for their valuable feedback and for taking the time to review this work. Their comments have helped improve the scope, presentation, and clarity of this dissertation. I am thankful to the faculty and staff at the Department of Mathematics and the Statistics Program at the University of Maryland for providing a nurturing academic environment. I also thank my fellow graduate students and friends, who have offered not only stimulating academic conversations but also much-needed companionship and emotional support throughout this journey. The shared experiences made even the most challenging moments enjoyable. Most importantly, I owe my deepest thanks to my family — my parents, whose unconditional love, unwavering support, and countless sacrifices have carried me through every step of this journey. Their belief in me has been my greatest source of strength and motivation. Finally, I want to thank myself — for persisting through moments of doubt, for embracing challenges with resilience, and for growing into the person I am today. This dissertation is not only a culmination of academic effort, but also a reflection of personal transformation and perseverance. I am truly grateful to all who have contributed in ways both big and small.

Table of Contents

Acknowledgements
Table of Contents
List of Tables
List of Figures

1 Introduction
1.1 Variable selection and regularization
1.1.1 Background of variable selection problem
1.1.2 Regularization
1.1.3 Optimization
1.2 Causal Inference and Mediation Analysis
1.2.1 Causal Inference
1.2.2 Causal Mediation Analysis
1.3 Transcriptome-Wide Association Studies (TWAS)
1.4 Imaging Genetics
1.5 Organization of the Dissertation

2 TIPS: a novel pathway-guided joint model for transcriptome-wide association studies
2.1 Overview
2.2 Introduction
2.3 Method
2.3.1 The TIPS model
2.3.2 EM algorithm for penalized likelihood
2.3.3 Post-selection gene-level and pathway-level inference
2.3.4 Other related methods
2.4 Application to TWAS analysis in UK Biobank
2.4.1 TWAS analysis on blood pressure
2.4.2 TWAS analysis on white matter brain aging
2.5 Simulation study
2.5.1 Simulation setting
2.5.2 Simulation results
2.6 Discussion

3 High-Dimensional Multivariate Mediation Analysis Integrating Genetic, Imaging, and Phenotypic Data
3.1 Overview
3.2 Introduction
3.3 Method
3.3.1 A multi-exposure-to-multi-mediator model
3.3.2 Optimization using alternating direction method of multipliers (ADMM) algorithm
3.3.3 Convergence Theorem
3.4 Simulation
3.4.1 Simulation settings
3.4.2 Simulation Results
3.5 Application to Real Data
3.6 Discussion

4 Conclusion
4.1 Implications for Statistical Genetics and Neuroimaging Studies
4.2 Limitations and Future Work
4.3 Concluding Remarks

A Additional simulation results for Chapter 2
A.0.1 Review of TWAS method development
A.0.2 Supplementary results
A.0.3 Data simulation results for MAF = 0.5
A.0.4 Results for gene selection by TIPS
B Additional theorem and lemma with detailed proof for Chapter 3
B.0.1 Convergence Theorem of ADMM
B.0.2 Oracle ℓ2-consistency of the sparse pathway estimator

Bibliography

List of Tables

2.1 Main characteristics of different TWAS methods and their comparison to our TIPS method.
2.2 Mean area under the ROC curve (AUC) for simulation Scenarios I–III under additive or heterogeneous genetic architectures, with the standard deviation shown in parentheses.
3.1 Results (n = 200) for complete mediation effect under σ = 1 across four correlation scenarios (ρX, ρM). Each entry is averaged over 1,000 replications and summarises active-mediator selection accuracy and mediation-proportion estimation performance.
3.2 Results (n = 200) for partial mediation effect under σ = 1 across four correlation scenarios (ρX, ρM). Each entry is the average of 1,000 replications, summarising active-mediator selection accuracy and mediation-proportion estimation performance.
3.3 Results (n = 200) for complete and partial mediation effects under σ = 3, across four correlation scenarios (ρX, ρM). Each entry is the average of 1,000 replications, summarising active-mediator selection accuracy and mediation-proportion estimation performance.

List of Figures

2.1 (A) Conceptual framework to motivate the TIPS model; (B) Flowchart of the TIPS method.
2.2 (A) Manhattan plot of −log10(p-value) from pathway-level test results of TIPS sorted by KEGG IDs (each dot represents a pathway) for SBP (upper) and DBP (lower) traits.
(B) Topology plot of two significant KEGG pathways: Arginine and Proline metabolism and Glycerolipid metabolism. Selected genes in the pathway are highlighted in red (SBP) and green (DBP).
2.3 (A) Comparison of major categories of pathways detected in different methods. Number in the cells refers to the number of significant pathways within each category; (B) Topology plot of selected KEGG pathway “Pyrimidine metabolism” by TIPS.
2.4 ROC curve comparison for simulation Scenarios I–III with additive or heterogeneous genetic architecture.
2.5 Power comparison for simulation Scenarios I–III with additive or heterogeneous genetic architecture.
3.1 Conceptual diagram illustrating the mediation framework for the gene–brain–disorder pathway. Gene expression features serve as exposures, MRI features (e.g., functional connectivity) act as mediators, and brain disorder phenotypes are the outcomes.
3.2 (A) Heatmap of pairwise correlations among the 9 TWAS-identified genes used as exposures. (B) Correlation heatmap of ROI-to-ROI functional connectivity among 235 brain regions. The highlighted block (top left) marks 35 subcortical regions and their 630 FCs used as mediators in the analysis.

Chapter 1: Introduction

1.1 Variable selection and regularization

Beyond traditional statistical models, variable selection and regularization methods play a critical role in advanced machine learning models such as deep learning, where they reduce overfitting and improve model interpretability. In this section, we review the foundational principles and methodological advancements in variable selection and regularization, with an emphasis on their relevance to modern high-dimensional data analysis.
We begin by introducing the classical variable selection problem within the low-dimensional paradigm, and subsequently discuss its limitations in the presence of a high-dimensional feature space. To address the challenges posed by large-scale data, we examine regularization techniques that augment the loss function with structured penalty terms to enforce sparsity or other desirable properties. Widely used methods such as the Lasso, Elastic Net, and Adaptive Lasso are introduced alongside nonconvex penalties like SCAD and MCP, each offering trade-offs between computational tractability, variable selection consistency, and bias reduction. We further discuss group-based extensions and structured regularizers that incorporate prior domain knowledge to improve selection stability and biological interpretability. The implementation of regularized methods inherently involves solving complex optimization problems. To this end, we provide an overview of optimization algorithms widely used in high-dimensional statistics, including gradient-based methods, Newton-type algorithms, Expectation-Maximization (EM), coordinate descent, proximal gradient, and alternating direction method of multipliers (ADMM) algorithms. These techniques, with their distinct convergence properties and computational advantages, form the algorithmic backbone for scalable inference in penalized models.

1.1.1 Background of variable selection problem

Variable selection remains a foundational problem in statistical modeling, underpinning the development of parsimonious, interpretable, and predictive models across a wide spectrum of applications. From classical linear regression to modern high-dimensional statistical learning, the central objective is to identify a subset of relevant explanatory variables that adequately explains the variation in a response variable, while excluding irrelevant or redundant features.
In low-dimensional settings, when the number of variables p is small relative to the sample size n, classical model selection criteria such as AIC and BIC serve as primary tools for variable selection. AIC, proposed by [Akaike, 1973], is grounded in information theory and aims to minimize the Kullback–Leibler divergence between the true model and the candidate model. BIC, introduced by [Schwarz, 1978], incorporates a Bayesian perspective and imposes a heavier penalty for model complexity, thereby favoring more parsimonious models under certain asymptotic regimes. Both criteria balance goodness-of-fit and model complexity, and are typically applied in a stepwise manner (forward, backward, or both) or through exhaustive search of the best subset when computationally feasible. With the advent of high-throughput technologies and the proliferation of data-rich environments, modern datasets are characterized by the so-called “large p, small n” paradigm, where p ≫ n. This renders classical selection tools inadequate due to their instability, computational infeasibility, and poor performance in the presence of multicollinearity and overfitting. Consequently, variable selection in high-dimensional contexts necessitates the development of regularization-based methods that incorporate sparsity-inducing penalties. Regularization approaches augment the objective function—often a loss function such as least squares or negative log-likelihood—with a penalty term that enforces structural constraints on the parameter space. The most popular method in this class is the Least Absolute Shrinkage and Selection Operator (Lasso) introduced by [Tibshirani, 1996], which employs an ℓ1-norm penalty to shrink coefficients towards zero and perform automatic variable selection.
The Lasso is particularly appealing due to its convex formulation and interpretability; however, it exhibits limitations in scenarios with high correlations among predictors and may fail to achieve consistent variable selection under certain conditions [Zhao and Yu, 2006, Meinshausen and Bühlmann, 2006]. To address these limitations, several extensions and alternatives have been proposed. The Elastic Net [Zou and Hastie, 2005] combines ℓ1 and ℓ2 penalties to stabilize variable selection in correlated settings. The Adaptive Lasso [Zou, 2006] modifies the Lasso by introducing adaptive weights, enabling oracle properties under suitable conditions. In addition, other non-convex penalties, such as the smoothly clipped absolute deviation (SCAD) penalty [Fan and Li, 2001] and the minimax concave penalty (MCP) [Zhang, 2010], have been proposed, offering improved theoretical properties and better mimicking the ideal behavior of the ℓ0-penalty in variable selection. Beyond penalized likelihood, model selection in high-dimensional settings also embraces methods rooted in Bayesian inference (e.g., spike-and-slab priors), screening procedures (e.g., Sure Independence Screening by [Fan and Lv, 2008]), and stability-based techniques (e.g., stability selection by [Meinshausen and Bühlmann, 2010]). Recent developments in debiased estimation [van de Geer et al., 2014, Javanmard and Montanari, 2014] and post-selection inference further highlight the importance of balancing selection accuracy with valid statistical inference, a critical need in scientific disciplines to ensure reproducibility. As big data continue to expand and diversify, new challenges arise for variable selection. For example, we may need to incorporate structural constraints (e.g., group sparsity, hierarchical models) and leverage prior knowledge (e.g., through graph- or pathway-based penalties) to guide variable selection.
In addition, variables may need to be selected for different purposes, so it becomes natural to integrate variable selection into broader modeling frameworks such as mediation analysis, survival models, and causal inference, beyond the simple linear or generalized linear models for which most variable selection methods were first developed.

1.1.2 Regularization

Regularization techniques have become central to modern statistical methodology, particularly in the context of high-dimensional data analysis where traditional approaches often fail. Regularization refers to the addition of a penalty term to a loss function, with the goal of enforcing structural constraints such as sparsity, smoothness, or low rank, thereby improving model interpretability and generalizability. In the variable selection context, regularization not only stabilizes estimation procedures but also facilitates the identification of relevant predictors by shrinking some coefficients exactly to zero. The most widely adopted regularization method is the Lasso (Least Absolute Shrinkage and Selection Operator), introduced by [Tibshirani, 1996]. In the linear model, the Lasso solves

\[
\hat{\beta}_{\mathrm{lasso}} = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{2n} \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right\},
\]

where λ ≥ 0 is a tuning parameter that controls the degree of penalization. The ℓ1 penalty encourages sparsity in the estimated coefficients, leading to automatic variable selection. Its computational efficiency and convex formulation make the Lasso a popular tool across numerous domains. However, the Lasso also has well-known limitations: it tends to select only one variable from a group of highly correlated predictors and can lead to biased coefficient estimates due to the nature of the ℓ1 shrinkage [Zhao and Yu, 2006, Meinshausen and Bühlmann, 2006]. To mitigate these issues, several extensions of the Lasso have been proposed. The Elastic Net [Zou and Hastie, 2005] combines ℓ1 and ℓ2 penalties:

\[
\hat{\beta}_{\mathrm{EN}} = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{2n} \|Y - X\beta\|_2^2 + \lambda \left[ \alpha \|\beta\|_1 + (1 - \alpha) \|\beta\|_2^2 \right] \right\},
\]
where α interpolates between ridge regression (α = 0) and the lasso (α = 1). The elastic net stabilizes the lasso solution when predictors are highly correlated. The Adaptive Lasso [Zou, 2006] introduces weights to the ℓ1 penalty, enabling the estimator to achieve the oracle property under appropriate conditions:

\[
\hat{\beta}_{\mathrm{AL}} = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{2n} \|Y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} w_j |\beta_j| \right\}, \qquad w_j = \frac{1}{|\hat{\beta}_j^{\mathrm{init}}|^{\gamma}}.
\]

Beyond convex penalties, nonconvex regularization methods have also gained attention for their improved theoretical properties and ability to better approximate the ideal ℓ0 penalty. Two notable examples are the smoothly clipped absolute deviation (SCAD) penalty [Fan and Li, 2001] and the minimax concave penalty (MCP) [Zhang, 2010]. These penalties exhibit continuity and sparsity while reducing estimation bias for large coefficients. Although the resulting optimization problems are nonconvex, local solutions are often found using coordinate descent algorithms or local quadratic approximations. In parallel, group-based regularization techniques have been developed to leverage structural knowledge about the predictors. The Group Lasso [Yuan and Lin, 2006] extends the Lasso by penalizing the ℓ2 norm of groups of coefficients, enabling variable selection at the group level. More recent variants include the Sparse Group Lasso [Simon et al., 2013], which combines group and individual sparsity, and methods for overlapping [Jacob et al., 2009] or hierarchical groups [Jenatton et al., 2011]. The choice of regularization penalty and its tuning remains a critical issue in practice. Model selection consistency, estimation accuracy, and computational tractability often present trade-offs. Recent developments in theoretical statistics provide conditions under which regularized estimators achieve desirable properties such as sparsistency, asymptotic normality, and valid inference [van de Geer et al., 2014, Bühlmann and van de Geer, 2011].
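For orientation, each of these penalties admits a closed-form solution in the one-dimensional least-squares problem, which makes their shrinkage behavior easy to compare. The sketch below is illustrative only (plain NumPy; the function names are ours, not from any package):

```python
import numpy as np

def soft_threshold(z, lam):
    """Lasso rule: argmin_b 0.5*(z - b)**2 + lam*|b|."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def elastic_net_scalar(z, lam, alpha):
    """Elastic-net rule: argmin_b 0.5*(z - b)**2 + lam*(alpha*|b| + (1 - alpha)*b**2)."""
    return soft_threshold(z, lam * alpha) / (1.0 + 2.0 * lam * (1.0 - alpha))

def adaptive_lasso_scalar(z, lam, w):
    """Adaptive-lasso rule with coefficient-specific weight w."""
    return soft_threshold(z, lam * w)
```

Setting α = 1 recovers the lasso rule from the elastic-net rule, while small adaptive weights leave large initial estimates nearly unpenalized, which is the mechanism behind the oracle property.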
These advances pave the way for principled approaches to high-dimensional modeling that balance empirical performance with theoretical guarantees.

1.1.3 Optimization

Optimization lies at the core of modern statistical estimation and learning algorithms. Most statistical methods for high-dimensional variable selection problems ultimately require the minimization of a loss function augmented with a penalty term. These optimization problems often involve non-smooth, non-convex, or high-dimensional objective functions, and their numerical resolution requires careful algorithmic choices to ensure convergence, scalability, and statistical accuracy. Gradient-based methods are the classical workhorses of continuous optimization. Gradient descent updates parameters by moving along the negative gradient of a differentiable loss function, iteratively refining the parameter estimates as $\beta^{(t+1)} = \beta^{(t)} - \eta_t \nabla f(\beta^{(t)})$, where $\eta_t$ denotes the learning rate. Under convexity and smoothness assumptions, gradient descent converges to a global minimum [Nocedal and Wright, 2006]. However, its convergence can be slow, especially for ill-conditioned problems. To accelerate convergence, Nesterov’s method introduces a momentum term that improves the convergence rate to $O(1/t^2)$ for convex functions [Nesterov, 1983]. In large-scale data settings, stochastic gradient descent (SGD) and its adaptive variants, such as AdaGrad [Duchi et al., 2011], RMSProp, and Adam [Kingma and Ba, 2014], are widely used due to their computational efficiency and suitability for streaming or online learning contexts. Newton and quasi-Newton methods make use of second-order curvature information to achieve faster convergence. Newton’s method computes parameter updates using the inverse Hessian matrix of the objective function, thereby enjoying quadratic convergence near the optimum. Nevertheless, the cost of computing and inverting the Hessian scales poorly with dimensionality.
Quasi-Newton algorithms, such as BFGS and L-BFGS [Byrd et al., 1995], approximate the Hessian and offer a practical balance between convergence speed and computational cost, making them effective for medium-scale problems. The Expectation-Maximization (EM) algorithm is particularly valuable for models involving latent variables or missing data. Introduced by Dempster et al. [1977], the EM algorithm iteratively alternates between computing the expected complete-data log-likelihood (E-step) and maximizing it (M-step). Its penalized variant, known as the Penalized EM algorithm, integrates regularization into the M-step, enabling variable selection in models such as sparse Gaussian mixtures, latent class regression, and factor analysis [Green, 1990, Böhning and Lindsay, 1994, Stadler et al., 2010]. In Chapter 2, we develop an EM algorithm for a penalized likelihood with a sparse-group-lasso-type regularizer to identify both pathway groups and genes associated with the phenotype in TWAS studies. Although EM monotonically increases the observed-data likelihood and converges to a stationary point, that point may be only a local optimum—an issue it shares with gradient-based and quasi-Newton methods. Because EM may also converge to saddle points, practitioners typically use random restarts or other multi-start strategies to improve the chance of reaching a globally optimal solution. Coordinate descent methods are especially effective for solving regularized problems with separable penalty functions, such as the Lasso and its extensions. At each iteration, coordinate descent cyclically updates one parameter while keeping all others fixed, yielding closed-form updates in many cases. This strategy proves remarkably efficient for convex objective functions with sparsity-inducing penalties, such as ℓ1 and group-lasso norms [Friedman et al., 2010].
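To make the coordinate update concrete, the following minimal lasso solver (a NumPy sketch for exposition, not an optimized implementation) cycles through soft-thresholding updates on partial residuals, assuming the columns of X are centered and scaled so that each satisfies x_j'x_j/n = 1:

```python
import numpy as np

def soft(z, lam):
    """Soft-thresholding operator, applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1.
    Assumes each column of X is centered and scaled so that x_j' x_j / n = 1."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()                       # residual for beta = 0
    for _ in range(n_iter):
        for j in range(p):
            r = r + X[:, j] * beta[j]  # partial residual: add back the j-th fit
            beta[j] = soft(X[:, j] @ r / n, lam)
            r = r - X[:, j] * beta[j]  # subtract the updated j-th fit
    return beta
```

A useful sanity check is that for λ at or above λ_max = max_j |x_j'y|/n the solution is exactly zero, which is the starting point of the regularization path; this cyclic scheme is the strategy underlying glmnet-style solvers [Friedman et al., 2010].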
It also extends to generalized linear models and Cox regression, with additional screening rules to accelerate computation [Tibshirani et al., 2012]. Proximal gradient methods generalize gradient descent to composite optimization problems, where the objective function can be decomposed into a smooth part and a non-differentiable regularizer. For an objective of the form $f(\beta) + g(\beta)$, where $f$ is smooth and convex and $g$ is convex but possibly non-smooth, the proximal gradient method performs updates via $\beta^{(t+1)} = \mathrm{prox}_{\lambda g}\left(\beta^{(t)} - \eta_t \nabla f(\beta^{(t)})\right)$, where $\mathrm{prox}_{\lambda g}(\cdot)$ denotes the proximal operator of $g$. Proximal algorithms are particularly well-suited for ℓ1-regularized regression, fused lasso, and total variation denoising problems [Parikh and Boyd, 2014]. Accelerated variants such as FISTA (Fast Iterative Shrinkage-Thresholding Algorithm) improve convergence from $O(1/t)$ to $O(1/t^2)$ [Beck and Teboulle, 2009]. The Alternating Direction Method of Multipliers (ADMM) is another prominent technique for solving convex problems with decomposable structures. ADMM is particularly useful in distributed and large-scale settings, where it can break complex optimization tasks into simpler subproblems that are solved alternately. For an objective function decomposed as $f(x) + g(z)$ subject to the linear constraint $Ax + Bz = c$, ADMM alternates updates to $x$, $z$, and the dual variable using augmented Lagrangian formulations [Boyd et al., 2011]. In the context of sparse estimation, ADMM has been successfully applied to problems such as sparse logistic regression, inverse covariance estimation, and structured regularization. In high-dimensional settings involving non-convex penalties such as SCAD or MCP, the optimization problem becomes more challenging due to the presence of multiple local minima. Nevertheless, recent theoretical developments show that under certain regularity conditions, local solutions can still yield statistically consistent estimators [Loh and Wainwright, 2015].
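Returning to the convex case, the ADMM splitting takes an especially simple form for the lasso: with f(x) = (1/2n)‖y − Ax‖², g(z) = λ‖z‖₁, and constraint x − z = 0, the x-update is a ridge-type solve, the z-update a soft threshold, and the dual update an ascent step. A minimal NumPy sketch (illustrative; fixed penalty parameter ρ, no stopping rule):

```python
import numpy as np

def admm_lasso(A, y, lam, rho=1.0, n_iter=500):
    """ADMM for (1/2n)||y - Ax||^2 + lam*||z||_1  subject to  x - z = 0."""
    n, p = A.shape
    AtA_n = A.T @ A / n
    Aty_n = A.T @ y / n
    # Factor once: each x-update solves (A'A/n + rho*I) x = A'y/n + rho*(z - u).
    L = np.linalg.cholesky(AtA_n + rho * np.eye(p))
    x = np.zeros(p); z = np.zeros(p); u = np.zeros(p)
    for _ in range(n_iter):
        rhs = Aty_n + rho * (z - u)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))                # ridge-type update
        z = np.sign(x + u) * np.maximum(np.abs(x + u) - lam / rho, 0.0)  # soft threshold
        u = u + x - z                                                    # dual ascent step
    return z
```

In the orthonormal-design case A'A/n = I, the iterates converge to the closed-form lasso solution soft(A'y/n, λ), which provides a convenient correctness check.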
Techniques such as the concave-convex procedure (CCCP) and difference-of-convex (DC) programming [Gasso et al., 2009] have been developed to solve such non-convex problems with convergence guarantees to critical points. The theoretical properties of optimization algorithms, particularly their convergence behavior, are crucial for statistical validity. For convex problems, convergence guarantees under strong convexity and Lipschitz continuity are well established. For instance, coordinate descent, gradient descent, and proximal methods all enjoy global convergence under mild conditions [Bubeck, 2015]. In non-convex settings, the Kurdyka–Łojasiewicz inequality has been employed to establish convergence of iterative schemes to critical points, even when global optimality is not guaranteed [Attouch et al., 2013, Bolte et al., 2014].

1.2 Causal Inference and Mediation Analysis

Causal inference provides a foundational framework for understanding and quantifying the effects of interventions, exposures, or treatments on outcomes in complex systems. In contrast to associational analysis, causal inference methods aim to disentangle cause-effect relationships through explicit modeling of counterfactual outcomes and structural assumptions. Over the past decades, the development of both potential outcomes theory and graphical causal models has enabled rigorous identification, estimation, and inference of causal effects from both randomized and observational data. These advances have profoundly impacted fields such as epidemiology, social sciences, economics, and genomics. Within the broader scope of causal inference, causal mediation analysis plays a pivotal role in uncovering mechanistic pathways and intermediate processes through which causal effects propagate. By decomposing the total effect of an exposure into direct and indirect components, mediation analysis facilitates a deeper understanding of the underlying mechanisms.
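In the classical single-exposure, single-mediator linear model, this decomposition can be computed directly: the indirect effect is the product of the exposure-to-mediator and mediator-to-outcome coefficients, and for OLS fits on the same sample the total effect equals the direct effect plus this product exactly. A small simulated illustration (all coefficients hypothetical, NumPy only):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
x = rng.normal(size=n)                        # exposure
m = 0.8 * x + rng.normal(size=n)              # mediator model: a = 0.8
y = 0.5 * x + 1.2 * m + rng.normal(size=n)    # outcome model: c' = 0.5, b = 1.2

def ols(Z, resp):
    """OLS slopes with an intercept column prepended."""
    Z1 = np.column_stack([np.ones(len(resp)), Z])
    return np.linalg.lstsq(Z1, resp, rcond=None)[0][1:]

a_hat = ols(x[:, None], m)[0]                    # exposure -> mediator
c_dir, b_hat = ols(np.column_stack([x, m]), y)   # direct effect and mediator -> outcome
c_tot = ols(x[:, None], y)[0]                    # total effect from Y ~ X
indirect = a_hat * b_hat                         # product-of-coefficients indirect effect
# For OLS on the same sample, c_tot = c_dir + a_hat * b_hat holds exactly.
```

The true indirect effect here is 0.8 × 1.2 = 0.96, and the exact additivity of direct and indirect effects is an algebraic property of least squares rather than a sampling phenomenon.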
Early developments in causal mediation analysis focused on single-exposure, single-mediator settings. However, the increasing availability of high-dimensional and structured data—such as gene expression profiles, neuroimaging measurements, and microbiome compositions—has necessitated new methodological innovations that can handle multiple related exposures and mediators.

This chapter presents a unified overview of modern causal inference and mediation analysis, with an emphasis on their application to high-dimensional and biologically structured data. We begin with classical foundations, including the Rubin Causal Model and structural causal models, and review identification strategies such as the back-door criterion and sequential ignorability. We then discuss methodological extensions to accommodate high-dimensional mediators, regularization strategies, and algorithmic advances in optimization for scalable computation. These developments underpin the contributions of this dissertation, which integrates causal reasoning, sparse estimation, and computational techniques to study mediation pathways in complex biological systems.

1.2.1 Causal Inference

Causal inference seeks to draw conclusions about cause-effect relationships, distinguishing itself from traditional association-based statistical methods. One of the most influential frameworks for formalizing causal inference is the potential outcomes model, also known as the Rubin Causal Model (RCM) [Rubin, 1974]. Under this framework, each unit has two potential outcomes: one under treatment and one under control. The fundamental problem of causal inference is that only one potential outcome is observed per unit, which introduces the need for strong assumptions and careful design to identify causal effects. Estimands such as the average treatment effect (ATE) and the average treatment effect on the treated (ATT) are defined as expectations over these counterfactual outcomes.
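A toy simulation (entirely illustrative, not from the dissertation) makes the fundamental problem concrete: both potential outcomes exist inside the simulation, the analyst observes only one per unit, and randomization is what licenses the difference-in-means estimate of the ATE:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
# Simulated potential outcomes: Y(0) and Y(1) both "exist" for every unit,
# with a constant unit-level treatment effect of 2.
y0 = rng.normal(1.0, 1.0, n)
y1 = y0 + 2.0
a = rng.binomial(1, 0.5, n)           # randomized treatment assignment
y_obs = np.where(a == 1, y1, y0)      # only one potential outcome is observed

ate_true = np.mean(y1 - y0)           # needs both counterfactuals (unobservable)
ate_hat = y_obs[a == 1].mean() - y_obs[a == 0].mean()  # identified by randomization
```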
Another foundational approach is Pearl’s structural causal model (SCM) and the use of directed acyclic graphs (DAGs) [Pearl, 2009]. DAGs provide a graphical representation of causal assumptions and facilitate identification of causal effects through graphical criteria such as d-separation and back-door paths. Under this approach, do-calculus enables reasoning about interventions and policy changes, while adjustment sets can be determined to estimate causal effects nonparametrically. Identification of causal effects from observational data requires conditions such as unconfoundedness (also known as conditional ignorability), consistency, and positivity. The back-door criterion in DAGs and the ignorability assumption in the RCM both formalize when simple adjustment (e.g., regression or stratification) can be used to recover causal effects from observed data. In the absence of randomization, methods such as propensity score weighting [Rosenbaum and Rubin, 1983], matching [Rosenbaum, 1989], inverse probability weighting [Robins et al., 1994], and doubly robust estimation [Robins et al., 1995, Bang and Robins, 2005] are frequently employed. Causal inference has evolved to address complexities such as longitudinal treatments, time-varying confounding, mediation, and interference. Marginal structural models (MSMs) [Robins et al., 2000] and g-methods [Robins, 1986] have been developed to handle these settings, particularly in epidemiological studies. More recently, causal mediation analysis has become popular for decomposing total effects into direct and indirect effects, with identification results established under both sequential ignorability and more complex assumptions [Imai et al., 2010a, VanderWeele, 2015]. The advent of high-dimensional data has spurred new developments in causal inference. Traditional methods that rely on parametric modeling or univariate confounder adjustment are insufficient when the number of covariates is large or exceeds the sample size.
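The inverse probability weighting idea cited above can be sketched with simulated data (all effect sizes hypothetical): a confounder drives both treatment and outcome, so the naive group contrast is biased, while weighting each observation by its (here known) propensity score recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
x = rng.standard_normal(n)                      # confounder
e_x = 1.0 / (1.0 + np.exp(-x))                  # true propensity score e(x)
a = rng.binomial(1, e_x)                        # confounded treatment assignment
y = 2.0 * a + 1.5 * x + rng.standard_normal(n)  # true causal effect is 2

# Naive contrast is biased upward: x pushes both treatment and outcome up
naive = y[a == 1].mean() - y[a == 0].mean()

# Horvitz-Thompson style IPW estimator of the ATE
ipw = np.mean(a * y / e_x - (1 - a) * y / (1 - e_x))
```

In practice the propensity score is estimated (e.g., by logistic regression), which is where the doubly robust estimators cited above add protection against misspecification.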
Regularization-based methods, such as Lasso-adjusted treatment effect estimation [Belloni et al., 2014], have been proposed to perform variable selection while maintaining valid inference. Techniques like double machine learning [Chernozhukov et al., 2018], targeted maximum likelihood estimation (TMLE) [van der Laan and Rubin, 2006], and semiparametric efficiency theory are now central tools for high-dimensional causal estimation. In addition to parametric and semiparametric methods, causal discovery—inferring causal structure from data—has gained popularity. Constraint-based methods (e.g., the PC algorithm), score-based methods (e.g., GES), and functional model-based methods (e.g., LiNGAM) aim to learn DAG structures from data under assumptions such as faithfulness and acyclicity [Spirtes et al., 2000, Shimizu et al., 2006]. While causal discovery is more exploratory in nature, it plays an increasingly important role in guiding model specification and hypothesis generation in the absence of data from designed or randomized experiments.

1.2.2 Causal Mediation Analysis

Causal mediation analysis is a critical extension of causal inference that seeks to elucidate the mechanisms through which a treatment or exposure affects an outcome. Rather than focusing solely on estimating total causal effects, mediation analysis decomposes the total effect into two components: the direct effect of the treatment on the outcome and the indirect effect mediated through one or more intermediate variables, known as mediators. The modern development of causal mediation analysis is grounded in the potential outcomes framework pioneered by Rubin [1974] and further formalized in the seminal work of Holland [1988]. Under this framework, the concepts of the average causal mediation effect (ACME) and average direct effect (ADE) are defined, providing a rigorous statistical interpretation of mediation.
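In the linear case, the ACME reduces to the familiar product of coefficients. A simulated sketch (all coefficients hypothetical) with one randomized treatment and one mediator:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
t = rng.binomial(1, 0.5, n).astype(float)       # randomized treatment
m = 0.8 * t + rng.standard_normal(n)            # mediator model: alpha = 0.8
y = 0.5 * t + 1.2 * m + rng.standard_normal(n)  # outcome model: direct 0.5, beta = 1.2

alpha = np.polyfit(t, m, 1)[0]                  # treatment -> mediator slope
X = np.column_stack([np.ones(n), t, m])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
direct, beta = coef[1], coef[2]                 # ADE and mediator -> outcome slope

acme = alpha * beta                             # indirect (mediated) effect, about 0.96
total = direct + acme                           # total effect, about 1.46
```

Under sequential ignorability and linearity, this product-of-coefficients quantity coincides with the ACME; the high-dimensional methods reviewed below generalize exactly this decomposition.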
Subsequent contributions by Robins and Greenland [1992] and Pearl [2001] introduce structural causal models (SCMs) and graphical representations, enabling identification of mediation effects under weaker assumptions. Notably, the identification of natural direct and indirect effects hinges on key assumptions such as sequential ignorability, which asserts the conditional independence of the mediator and the outcome from the treatment assignment given observed covariates. The methodological landscape of causal mediation analysis has since expanded to accommodate diverse data structures and research goals. Imai et al. [2010b] develop a flexible framework for parametric and nonparametric mediation analysis using linear and generalized linear models, along with sensitivity analysis to evaluate violations of the no-unmeasured-confounding assumption. Their work also leads to the development of the widely used mediation R package, facilitating the adoption of mediation techniques in applied research. In parallel, VanderWeele [2015] provides a comprehensive synthesis of identification and inference under various causal models, extending the framework to settings involving multiple mediators, longitudinal data, and interactions. While early mediation studies focus on low-dimensional settings, the rise of high-throughput technologies and multi-omics data shifts attention to high-dimensional mediation analysis. These settings pose substantial methodological challenges due to the curse of dimensionality, multicollinearity among mediators, and the need for variable selection. To address these challenges, recent research integrates regularization and sparse modeling into the mediation framework. For example, Zhou and Pan [2020] and Huang et al. [2020] propose penalized mediation models using Lasso-type penalties to identify sparse sets of mediators while retaining interpretability. Similarly, the debiased estimation approach by Zhao et al.
[2019] extends post-selection inference to the mediation context, offering valid confidence intervals for selected mediators. In high-dimensional omics data, many features are highly correlated; it is therefore common to analyze multiple exposures and multiple mediators jointly in the same model. In Chapter 3, we propose a novel multi-exposure-to-multi-mediator mediation model that integrates genetic, neuroimaging, and phenotypic data to understand the neurogenetic mechanisms of brain disorders. More recent developments include models for group-structured and hierarchical mediators, such as the group-lasso-based methods introduced by Song et al. [2021], and approaches that incorporate prior biological knowledge through structured penalties or graphical priors. These innovations are particularly relevant in genomics and neuroimaging, where mediators often possess inherent spatial or functional structure. For example, in imaging genetics, researchers apply sparse structural equation models and canonical-correlation-based strategies to identify brain regions mediating the effects of genetic variants on behavior and disease traits [Chen et al., 2022]. Moreover, advances in optimization and computational algorithms facilitate scalable implementation of high-dimensional mediation estimators. Methods based on ADMM, coordinate descent, and proximal gradient schemes enable efficient solutions to the nonconvex and non-smooth optimization problems that arise in penalized mediation modeling. These algorithmic improvements, along with theoretical guarantees such as consistency and asymptotic normality, broaden the applicability of mediation analysis in high-dimensional settings. In summary, causal mediation analysis has evolved from a conceptual tool for effect decomposition into a rigorous and flexible framework capable of handling complex, high-dimensional, and structured data.
The integration of causal inference principles with regularization, optimization, and computational statistics provides a powerful foundation for mechanistic understanding in contemporary data science.

1.3 Transcriptome-Wide Association Studies (TWAS)

The past decade has witnessed rapid advances in statistical genetics facilitated by the proliferation of high-throughput platforms and large-scale biobank resources. These developments enable researchers to move beyond genome-wide association studies (GWAS), which merely locate trait-associated variants, toward integrative frameworks that connect molecular-level features (e.g., genes), functions, and mechanisms with complex phenotypes. Transcriptome-wide association studies (TWAS) have emerged as a powerful integrative framework for detecting associations between genetically regulated gene expression (GReX) and complex traits, combining GWAS summary statistics with transcriptomic reference panels to identify genes whose genetically regulated expression is associated with traits of interest. By imputing gene expression levels using models trained on expression quantitative trait loci (eQTLs), TWAS facilitates gene-level association analysis even in the absence of observed expression data. This paradigm advances our ability to prioritize putative causal genes and explore regulatory mechanisms underlying complex traits. Modern TWAS implementations further incorporate statistical tools for mediation analysis, sparse modeling, and probabilistic fine-mapping to address challenges such as linkage disequilibrium and pleiotropy. The TWAS framework was initially formalized by Gusev et al. [2016], who proposed the use of predictive models trained on reference expression data (e.g., GTEx or DGN) to estimate the cis-genetic component of gene expression.
The genetically predicted expression levels are then tested for association with the phenotype using GWAS summary statistics. This two-stage approach enables the identification of genes whose genetically regulated expression levels are associated with the trait, even when direct transcriptomic measurements are unavailable in the GWAS sample. The pioneering PrediXcan method by Gamazon et al. [2015] utilizes elastic net regression to train gene expression models and provides a foundation for both individual-level and summary-based inference. The summary-based TWAS approach (S-PrediXcan) was later refined to improve scalability and accessibility to public data. By avoiding the need for individual-level GWAS data, S-PrediXcan enables broad application across many phenotypes [Barbeira et al., 2018]. Further extensions such as FUSION [Gusev et al., 2016] and UTMOST [Hu et al., 2019] expand TWAS models by incorporating multiple tissues, sharing information across tissues, and modeling expression covariances, which increases power and interpretability in detecting regulatory mechanisms underlying complex traits. Despite the widespread use of TWAS, several limitations have been documented. One key concern is confounding due to linkage disequilibrium (LD), which can induce spurious associations when SNPs used to impute gene expression are in LD with causal variants affecting the phenotype independently of expression. This can result in TWAS signals that do not reflect true mediation. To address this, methods such as conditional and joint analysis [Zhang et al., 2020], fine-mapping approaches (e.g., FOCUS; Mancuso et al., 2019), and probabilistic Mendelian randomization [Yang et al., 2020a] have been developed to improve TWAS interpretability and causal inference. Recent methodological advances further integrate TWAS into the framework of causal mediation analysis, allowing for direct testing of the mediating role of gene expression between genotype and phenotype.
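The two-stage logic described above can be sketched end-to-end with simulated individual-level data (numpy only; ridge regression stands in for the elastic net used by PrediXcan, and all sample sizes and effect sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
p_snp, n_ref, n_gwas = 20, 300, 2000
w_true = np.zeros(p_snp)
w_true[:3] = [0.5, -0.4, 0.3]                     # cis-eQTL effects for one gene

# Stage 1: in the reference panel, learn a genotype -> expression model
G_ref = rng.binomial(2, 0.3, (n_ref, p_snp)).astype(float)
expr = G_ref @ w_true + rng.standard_normal(n_ref)
lam = 10.0                                        # ridge penalty (elastic net stand-in)
w_hat = np.linalg.solve(G_ref.T @ G_ref + lam * np.eye(p_snp), G_ref.T @ expr)

# Stage 2: impute expression in the GWAS cohort, test the gene-trait association
G_gwas = rng.binomial(2, 0.3, (n_gwas, p_snp)).astype(float)
trait = 0.7 * (G_gwas @ w_true) + rng.standard_normal(n_gwas)  # expression-mediated
grex = G_gwas @ w_hat                             # imputed genetically regulated expression
r = np.corrcoef(grex, trait)[0, 1]
z = r * np.sqrt(n_gwas - 2) / np.sqrt(1 - r**2)   # gene-level association z-score
```

In practice the association stage is often run from GWAS summary statistics alone (as in S-PrediXcan), but the estimand is this same correlation between imputed expression and trait.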
Zhou et al. [2020] proposed sparse mediation models that combine regularization with TWAS to account for high-dimensional and correlated mediators, enhancing power in identifying biologically meaningful genes. Other strategies leverage structural equation modeling and instrumental variable frameworks to separate mediated from direct effects, thereby improving the causal interpretation of TWAS findings. In addition to statistical innovations, TWAS has found significant success in empirical applications. It has been employed to elucidate genetic mechanisms of diseases such as schizophrenia [Gusev et al., 2016], type 2 diabetes [Mancuso et al., 2019], and cardiovascular traits [Barbeira et al., 2019]. These studies underscore the potential of TWAS to translate statistical associations into functional hypotheses and therapeutic targets. Overall, TWAS represents a paradigm shift in post-GWAS functional interpretation by connecting regulatory genomics with complex trait biology. As datasets continue to grow in size and resolution, the integration of TWAS with multi-omics and causal inference tools is proving pivotal in dissecting the architecture of complex diseases.

1.4 Imaging Genetics

Imaging genetics is an interdisciplinary research area at the intersection of neuroimaging, genomics, and biostatistics, aimed at elucidating the biological mechanisms linking genetic variation to brain structure and function. By leveraging high-dimensional brain imaging phenotypes (e.g., structural or functional MRI) and genome-wide genotype data, imaging genetics studies seek to uncover intermediate neural phenotypes—often referred to as endophenotypes—that mediate genetic influences on complex behaviors and neuropsychiatric disorders. Early imaging genetics efforts focused on candidate gene approaches, investigating associations between a small number of genetic variants (e.g., COMT, BDNF) and specific imaging-derived features such as hippocampal volume or prefrontal activation.
However, the limitations of low power, bias, and lack of replicability in candidate gene studies led to a paradigm shift toward genome-wide association studies (GWAS), which enable agnostic scans of millions of single-nucleotide polymorphisms (SNPs) across the genome. Large-scale consortia such as ENIGMA [Thompson et al., 2014] and UK Biobank [Alfaro-Almagro et al., 2018] have facilitated such analyses, providing unprecedented sample sizes and harmonized imaging protocols. Despite their success, GWAS of imaging traits suffer from severe multiple testing burdens due to the high dimensionality of both the genetic and the neuroimaging data. To mitigate this, multivariate methods and dimensionality reduction techniques have been proposed. Canonical correlation analysis (CCA) and sparse CCA [Witten et al., 2009] are employed to identify linear combinations of genetic variants and imaging phenotypes that exhibit maximal correlation. Partial least squares (PLS) regression and reduced-rank regression (RRR) have also been adapted to this context [Zhou, 2014]. These approaches, however, often lack interpretability and may not be equipped to handle complex mediation structures. Another prominent approach involves structural equation modeling (SEM), which models causal pathways from SNPs to brain phenotypes and subsequently to behavior or disease outcomes [Liu et al., 2017]. SEMs are particularly appealing due to their ability to encode directed relationships and to simultaneously estimate direct and indirect effects. To address the high-dimensional nature of imaging and genetic data, sparse SEMs have been developed using penalized likelihood estimation with ℓ1 or group lasso penalties [Zhang et al., 2015]. Beyond classical parametric methods, machine learning techniques such as random forests, deep neural networks, and variational autoencoders have been adopted to model nonlinear and hierarchical associations in imaging genetics.
However, the interpretability and inference limitations of these models remain active areas of research. More recent methodological advances have begun to integrate imaging genetics into broader causal inference frameworks. Mediation models with high-dimensional mediators have been applied to identify brain regions that mediate genetic effects on phenotypes such as cognitive function, substance use, or psychiatric risk [Chen et al., 2022]. These approaches leverage regularization and optimization tools (e.g., ADMM, coordinate descent) to achieve sparse estimation while preserving statistical validity under complex causal assumptions. In summary, imaging genetics has evolved into a rich field combining statistical innovation with domain-specific knowledge to bridge the gap between genotype and phenotype. As imaging and genomic data continue to grow in scale and complexity, future advances require scalable, interpretable, and causally informed models that can uncover mechanistic pathways in the human brain.

1.5 Organization of the Dissertation

In Chapter 2, we present the first research project, which develops a novel penalized mediation-based framework for TWAS. This approach integrates genome-wide association study (GWAS) summary statistics with gene expression prediction using regularized models, enabling the identification of genetically regulated transcripts associated with complex traits. The methodology addresses key issues such as linkage disequilibrium confounding and indirect genetic effects, and is validated through both simulation studies and real-data applications. Chapter 3 introduces a high-dimensional causal mediation model tailored to structured mediators, such as those found in neuroimaging and multi-omics data. We propose an estimator based on an alternating direction method of multipliers (ADMM) optimization algorithm that jointly identifies active mediators while controlling for the causal pathway.
The model is supported by theoretical results and evaluated through extensive simulations. Its utility is further demonstrated in an application to imaging genetics data, revealing brain regions that mediate genetic effects on behavioral traits. Chapter 4 concludes the dissertation by summarizing the methodological contributions and empirical findings of the previous chapters. It discusses the limitations of the proposed methods and outlines future research directions.

Chapter 2: TIPS: a novel pathway-guided joint model for transcriptome-wide association studies

2.1 Overview

In the past two decades, genome-wide association studies (GWAS) have pinpointed numerous SNPs linked to human diseases and traits, yet many of these SNPs are in non-coding regions and remain hard to interpret. Transcriptome-wide association studies (TWAS) integrate GWAS and expression reference panels to identify gene-level associations with tissue specificity, potentially improving interpretability. However, the list of individual genes identified from univariate TWAS contains little unifying biological theme, so the underlying mechanisms remain largely elusive. In this chapter, we propose a novel multivariate TWAS method that Incorporates Pathway or gene Set information, namely TIPS, to identify the genes and pathways most associated with complex polygenic traits. We jointly model the imputation and association steps in TWAS, incorporate a sparse group lasso penalty in the model to induce selection at both gene and pathway levels, and develop an expectation–maximization algorithm to estimate the parameters for the penalized likelihood. We apply our method to three different complex traits—systolic and diastolic blood pressure, as well as a brain-aging biomarker, white-matter brain-age gap, in UK Biobank—and identify critical biologically relevant pathways and genes associated with these traits.
These pathways cannot be detected by the traditional univariate TWAS plus pathway-enrichment analysis approach, demonstrating the power of our model. We also conduct comprehensive simulations with varying heritability levels and genetic architectures and show that our method outperforms other established TWAS methods in feature selection, statistical power, and prediction. The R package that implements TIPS is available at https://github.com/nwang123/TIPS.

2.2 Introduction

During the past two decades, large-scale genome-wide association studies (GWAS) have identified thousands of genetic variants (typically SNPs) that influence the risk of numerous human traits and diseases [Abdellaoui et al., 2023]. However, the path from GWAS to biology is still not clear, as a SNP–trait association cannot be directly used to infer the target gene or the mechanism whereby the variant is associated with phenotypic differences. Recent developments in genetic epidemiology have underscored a refined focus on the transcriptome’s role in understanding genetic influences on complex traits. The paradigm has notably shifted from genome-wide association studies (GWAS) to transcriptome-wide association studies (TWAS). TWAS leverages expression reference panels—such as eQTL cohorts like GTEx [Lonsdale et al., 2013], which include both gene expression and genotype data—to discover gene–trait associations using GWAS datasets [Wainberg et al., 2019]. Since the birth of the first TWAS method [Gamazon et al., 2015], many statistical methods have been developed to perform TWAS [Gusev et al., 2016, Barbeira et al., 2018, Hu et al., 2019, Zhu and Zhou, 2021, Xie et al., 2021]. These methods are typically implemented in two stages: an imputation stage that utilizes the reference dataset to ‘impute’ gene expression for each individual, and an association stage that correlates each individual’s ‘imputed’ gene expression with their phenotype.
Alternatively, joint models of TWAS that combine the two stages together are also being developed and have emerged as a more holistic approach that accounts for the imputation uncertainty [Yang et al., 2019, Yuan et al., 2020]. Such TWAS methods typically use likelihood-based inference and have been shown to be more powerful and efficient than the two-stage methods. Most complex traits and diseases are polygenic by nature, with their heritability depending on a large number of genes across the genome. Like GWAS, standard TWAS is essentially a univariate analysis in which each gene is tested separately. However, genes do not function in isolation; rather, they act cooperatively in groups (e.g., biological pathways) to perform biological functions. Such a univariate analysis provides the contribution of each single gene to the trait without providing a unifying biological theme. In addition, univariate analysis ignores the complex correlation between genes, making it difficult to identify the most critical and potentially causal genes for a trait. Development of multivariate TWAS is still in its infancy: existing methods mainly target specific risk regions and are strictly two-stage methods [Mancuso et al., 2019, Knutson et al., 2020, Lin et al., 2022]. To the best of our knowledge, no method has considered gene–gene correlation and gene sets with similar function on a broad transcriptome-wide scale when prioritizing the most critical genes and pathways contributing to a polygenic trait, while at the same time accounting for the imputation uncertainty in TWAS. To fill this gap, we develop a novel multivariate TWAS method that incorporates Pathway or gene Set information, namely TIPS, and utilize a sparse group lasso penalty to select the most important genes and pathways contributing to a polygenic trait. Figure 2.1(A) shows the conceptual framework that motivates our model. As in most TWAS methods, we assume each gene can be regulated by multiple SNPs (i.e.
eQTLs) in the reference dataset. TIPS jointly examines multiple genes, grouped according to predefined biological pathways, to capture their collective impact on the trait. By leveraging prior pathway information, this multivariate, pathway-guided approach enables a more comprehensive understanding of the genetic architecture of complex polygenic traits and helps reveal the biological processes underlying them. Furthermore, TIPS integrates the imputation and association analysis stages to enhance the efficiency of the model, and we develop an Expectation–Maximization (EM) algorithm for the penalized likelihood to estimate the parameters and select both individual-level (gene) and group-level (pathway) features. In addition, we employ a data-splitting scheme to perform post-selection inference at both the gene and pathway levels using likelihood ratio tests. We show the advantage of TIPS over existing TWAS methods in extensive simulations and via TWAS analyses of three different complex traits: systolic blood pressure (SBP), diastolic blood pressure (DBP), and a brain-aging biomarker, white-matter Brain Age Gap (BAG), in UK Biobank (UKB). The chapter is organized as follows. In Section 2.3, we introduce the TIPS model, the penalized likelihood, the EM algorithm to estimate the parameters, and the likelihood ratio test for post-selection inference. In Section 2.4, we apply our method to a TWAS analysis to identify the genes and pathways contributing to SBP variation in UKB. In Section 2.5, we conduct various simulations to show the advantages of our method over other TWAS methods in feature selection and prediction. In Section 2.6, we discuss potential extensions of our method for future studies.
[Figure 2.1: (A) Conceptual framework motivating the TIPS model; (B) flowchart of the TIPS method. The reference dataset provides gene expression Y and genotypes W(R); the GWAS dataset provides the phenotype Z and genotypes W(G). Multiple genes are grouped by pathways from a pathway database, an EM algorithm maximizes the penalized likelihood with a sparse group lasso penalty (encouraging selection of pathways and of genes within pathways), and a likelihood ratio test identifies significant genes and pathways.]

2.3 Method

2.3.1 The TIPS model

Figure 2.1(B) presents the data flow for the proposed TIPS procedure. We work with two observed datasets. The first is an eQTL reference panel, denoted D(R) = {Y, W(R)}, where Y ∈ R^{n1×m} is the gene expression data matrix and W(R) ∈ {0, 1, 2}^{n1×p} is the corresponding genotype matrix, with each entry reporting the allele count for a single-nucleotide polymorphism (SNP). Here, n1 is the sample size of the reference panel, m is the number of genes potentially predictive of the phenotype of interest, and p is the total number of genetic variants that potentially regulate these genes. Denote the GWAS dataset by D(G) = {Z, W(G)}, where Z is an (n2 × 1) vector of the phenotype of interest and W(G) ∈ {0, 1, 2}^{n2×p} is the genotype data matrix from the GWAS; n2 is the sample size of the GWAS dataset. TIPS jointly models Y, W(R), W(G), and Z and performs imputation and association analysis simultaneously using the following model consisting of two regression equations:

Y = W(R)U + e_R;   Z = W(G)Uβ + e_G,   (2.1)

where U is a (p × m) matrix representing the genetic effects on gene expression. In general, U can have arbitrary elements to represent complex SNP–gene regulatory relationships. If we only consider the cis regulatory effect on gene expression (i.e.
cis-eQTL) and assume no common regulatory variants shared among genes, then U simplifies to a sparse block matrix with off-block elements equal to zero. Following Jiang et al. [2016], we assume a matrix-variate normal prior on the SNP–gene effect matrix U, denoted U ∼ MN_{p×m}(0_{p×m}, I_p, Σ_u). This distribution implies that the entries of U are jointly Gaussian, and the vectorized form vec(U) follows a multivariate normal distribution with mean zero and covariance matrix I_p ⊗ Σ_u, where ⊗ denotes the Kronecker product. Here, I_p is the identity matrix of size p, and Σ_u is an m × m diagonal matrix capturing the covariance structure across genes. The error term e_R in the first regression follows MN_{n1×m}(0_{n1×m}, I_{n1}, Σ_1), where Σ_1 is assumed to be an (m × m) diagonal matrix with a common residual variance σ1², i.e., Σ_1 = σ1² I_m, where I_m is the m × m identity matrix. The error term e_G in the second regression follows N(0, σ2² I_{n2}). Crucially, the m × 1 vector β captures the effects of gene expression on the phenotype and is our parameter of interest. We assume both Y and Z are centered, so intercepts are not included. In real data, we further adjust for commonly used covariates (e.g., age, sex, and BMI) and for population stratification by regressing the phenotype on these confounders and using the residuals after adjustment, Z̃, in the second regression to implement the model in (2.1). Viewing U as a matrix of latent variables, we can write out the complete-data log-likelihood and then use a standard expectation–maximization (EM) algorithm to estimate the parameters of interest. For a polygenic phenotype that can be affected by multiple genes across the genome, the genes are naturally grouped into pathways or gene sets that represent different biological processes (e.g.
those pathways in the KEGG pathway database [Kanehisa et al., 2017] or gene sets from MSigDB [Subramanian et al., 2005]). We are particularly interested in learning which biological processes are most related to the phenotype, and which specific genes within those processes are most critical in regulating it. To dissect both gene- and pathway-level contributions to a complex polygenic phenotype, we propose a sparse group lasso penalty [Simon et al., 2013] on β, and the penalized complete-data log-likelihood is given as:

$$
\begin{aligned}
p\ell(\mathbf{Y},\mathbf{W}^{(R)},\mathbf{W}^{(G)},\mathbf{Z},\mathbf{U}\mid\theta)
={}& -\frac{n_1 m}{2}\log\!\left(2\pi\sigma_1^{2}\right)-\frac{1}{2\sigma_1^{2}}\sum_{i=1}^{n_1}\left(\mathbf{Y}_{i\cdot}-\mathbf{W}^{(R)}_{i\cdot}\mathbf{U}\right)\left(\mathbf{Y}_{i\cdot}-\mathbf{W}^{(R)}_{i\cdot}\mathbf{U}\right)^{T}\\
& -\frac{n_2 m}{2}\log\!\left(2\pi\sigma_2^{2}\right)-\frac{1}{2\sigma_2^{2}}\left\|\mathbf{Z}-\mathbf{W}^{(G)}\mathbf{U}\boldsymbol{\beta}\right\|^{2}\\
& -\frac{p}{2}\sum_{j=1}^{m}\log\!\left(2\pi\sigma_{u_j}^{2}\right)-\sum_{j=1}^{m}\frac{\mathbf{u}_j^{T}\mathbf{I}_p\mathbf{u}_j}{2\sigma_{u_j}^{2}}\\
& -\left\{a\lambda\|\boldsymbol{\beta}\|_1+(1-a)\lambda\sum_{g=1}^{G}\sqrt{k_g}\,\|\boldsymbol{\beta}^{(g)}\|_2\right\},
\end{aligned}
\tag{2.2}
$$

where u_j is the jth column of U; Y_{i·}, W(R)_{i·}, and W(G)_{i·} are the ith rows of Y, W(R), and W(G); and ||·||_1 and ||·||_2 are the ℓ1 and ℓ2 norms. θ = {β, σ²_{u_j}, σ1², σ2²} are the parameters to be estimated. In this formulation, the first line on the right-hand side corresponds to the log-likelihood of the reference data (the first regression in equation (2.1)), the second line to the log-likelihood of the GWAS dataset (the second regression in equation (2.1)), the third line to the log-likelihood of the prior on U, and the fourth line to the sparse group lasso penalty imposed on the coefficient vector β. The m genes are classified into G pathway groups, where g = 1, 2, . . . , G indexes the groups and k_g denotes the size of the gth group. Scattered genes not belonging to any group have k_g = 1. Genes may overlap between pathways. We adopt the widely used overlapping group lasso method [Jacob et al., 2009], in which genes may appear in multiple pathway groups.
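For intuition about how this penalty induces two levels of sparsity, the following generic numpy sketch (an illustration, not the TIPS implementation) evaluates the penalty and its proximal operator for non-overlapping groups, which composes elementwise soft-thresholding with groupwise shrinkage:

```python
import numpy as np

def sgl_penalty(beta, groups, lam, a):
    """a*lam*||beta||_1 + (1-a)*lam * sum_g sqrt(k_g) * ||beta_g||_2."""
    l1 = a * lam * np.sum(np.abs(beta))
    l2 = (1 - a) * lam * sum(np.sqrt(len(g)) * np.linalg.norm(beta[g]) for g in groups)
    return l1 + l2

def sgl_prox(v, groups, lam, a, step=1.0):
    """Prox of the sparse group lasso (non-overlapping groups):
    soft-threshold coordinates, then soft-threshold each group's norm."""
    b = np.sign(v) * np.maximum(np.abs(v) - step * a * lam, 0.0)
    out = np.zeros_like(b)
    for g in groups:
        norm = np.linalg.norm(b[g])
        t = step * (1 - a) * lam * np.sqrt(len(g))
        if norm > t:
            out[g] = b[g] * (1.0 - t / norm)   # group survives, uniformly shrunk
    return out                                  # weak groups are zeroed wholesale

groups = [np.array([0, 1, 2]), np.array([3, 4])]
v = np.array([3.0, -2.0, 0.1, 0.05, 0.02])
b_prox = sgl_prox(v, groups, lam=0.5, a=0.5)
# the strong group keeps its large coordinates; the weak group drops out entirely
```

Setting a = 1 recovers the lasso and a = 0 the group lasso, mirroring the convex combination controlled by a in the penalty above.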
To accommodate this, we duplicate the corresponding columns in the design matrix $W^{(G)}U$ for each group in which a gene appears, and assign separate coefficients to each copy during model fitting. After estimation, the final gene-level effect is obtained by averaging the estimated coefficients of the duplicated copies corresponding to the same gene. $\lambda$ is the regularization parameter, and $a \in [0, 1]$ indicates a convex combination of the lasso (i.e., $\|\beta\|_1$) and group lasso penalties (i.e., $\sum_{g=1}^{G}\sqrt{k_g}\|\beta^{(g)}\|_2$), where $\beta^{(g)}$ denotes the subvector of coefficients corresponding to genes in the $g$th pathway group. To select the tuning parameters $\lambda$ and $a$, we perform a grid search over a predefined set of candidate values. Specifically, we define a logarithmic grid for $\lambda$, e.g., $\lambda \in \{10^{-3}, 10^{-2}, \ldots, 10^{0.5}, 10^{1}\}$, and a linear grid for $a$, e.g., $a \in \{0, 0.1, 0.2, \ldots, 1\}$. For each combination of $(\lambda, a)$, we fit the model and compute the prediction mean squared error (MSE) using K-fold cross-validation (CV), typically with K = 5. In addition, we also consider model selection criteria such as the Bayesian Information Criterion (BIC) to validate the stability and parsimony of the selected model. In practice, we also apply the "one standard error rule" to guide model selection when multiple models perform similarly under CV or BIC. Specifically, when the CV error or BIC values are close across different tuning parameter combinations, we prefer the more parsimonious model (i.e., the one with fewer nonzero coefficients) whose CV error or BIC lies within one standard error of the minimum value [Hastie et al., 2009].

2.3.2 EM algorithm for penalized likelihood

In the presence of latent variables in the U matrix, we propose an EM algorithm to estimate the parameters that maximize the penalized likelihood function in equation (2.2).
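Before turning to the estimation details, the one-standard-error rule described above is easy to state precisely. The following is a minimal Python sketch (the helper name `one_se_rule` and all CV numbers are illustrative, not part of the TIPS package):

```python
import numpy as np

def one_se_rule(cv_mean, cv_se, n_nonzero):
    """Pick the sparsest model whose CV error is within one standard
    error of the best (minimum) CV error."""
    cv_mean, cv_se, n_nonzero = map(np.asarray, (cv_mean, cv_se, n_nonzero))
    best = np.argmin(cv_mean)
    threshold = cv_mean[best] + cv_se[best]
    eligible = np.flatnonzero(cv_mean <= threshold)
    # among eligible candidates, prefer the fewest nonzero coefficients
    return eligible[np.argmin(n_nonzero[eligible])]

# toy grid of five (lambda, a) combinations (illustrative numbers)
cv_mean   = [1.00, 0.92, 0.90, 0.93, 1.10]
cv_se     = [0.05, 0.05, 0.04, 0.05, 0.06]
n_nonzero = [40,   25,   18,   9,    3]

print(one_se_rule(cv_mean, cv_se, n_nonzero))  # index 3: sparsest within 1 SE
```

Here the minimum CV error (0.90) plus one standard error gives a threshold of 0.94, and the sparsest model under that threshold (9 nonzero coefficients) is preferred over the strict minimizer.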
From the complete-data penalized log-likelihood and the block matrix structure of U, it is easy to recognize that, given Y, $W^{(R)}$, $W^{(G)}$, Z and $\theta$, the terms involving each $u_j$ can be written as a quadratic form for $j = 1, \ldots, m$:

$$
u_j^T\Big(-\frac{1}{2\sigma_1^2}W^{(R)T}W^{(R)} - \frac{\beta_j^2}{2\sigma_2^2}W^{(G)T}W^{(G)} - \frac{1}{2\sigma_{u_j}^2}I_p\Big)u_j
+ \Big(\frac{1}{\sigma_1^2}y_j^T W^{(R)} + \frac{\beta_j}{\sigma_2^2}Z^T W^{(G)}\Big)u_j + \text{constant},
$$

where $y_j$ is the $j$th column of Y and $\beta_j$ is the corresponding coefficient for the $j$th gene. Thus its posterior distribution can be written in the form of a normal distribution. In the E-step, we take the expectation of the complete-data penalized log-likelihood with respect to the posterior distribution of $u_j$ and derive the Q-function, utilizing the fact that $E(u^T X u) = \mu_u^T X \mu_u + \mathrm{Tr}(X\Sigma_u)$ if $u \sim N(\mu_u, \Sigma_u)$:

$$
\begin{aligned}
Q(\theta \mid \theta^{\text{old}}) = \; & -\frac{n_1 m}{2}\log\!\big(2\pi\sigma_1^2\big) - \frac{p}{2}\sum_{j=1}^{m}\log\!\big(2\pi\sigma_{u_j}^2\big) - \sum_{j=1}^{m}\frac{\mu_{u_j}^T\mu_{u_j}}{2\sigma_{u_j}^2} \\
& - \frac{1}{2\sigma_1^2}\sum_{j=1}^{m}\sum_{i=1}^{n_1}\big(Y_{i\cdot}-W^{(R)}_{i\cdot}\mu_{u_j}\big)\big(Y_{i\cdot}-W^{(R)}_{i\cdot}\mu_{u_j}\big)^T \\
& - \frac{n_2 m}{2}\log\!\big(2\pi\sigma_2^2\big) - \frac{1}{2\sigma_2^2}\sum_{j=1}^{m}\big\|Z - W^{(G)}\mu_{u_j}\beta_j\big\|^2 \\
& - \sum_{j=1}^{m}\mathrm{Tr}\Big(\Big(\frac{1}{2\sigma_1^2}W^{(R)T}W^{(R)} + \frac{\beta_j^2}{2\sigma_2^2}W^{(G)T}W^{(G)} + \frac{1}{2\sigma_{u_j}^2}I_p\Big)\Sigma_{u_j}\Big) \\
& - \Big\{a\lambda\|\beta\|_1 + (1-a)\lambda\sum_{g=1}^{G}\sqrt{k_g}\|\beta^{(g)}\|_2\Big\},
\end{aligned}
\tag{2.3}
$$

where $\Sigma_{u_j} = \big(\frac{1}{\sigma_1^2}W^{(R)T}W^{(R)} + \frac{\beta_j^2}{\sigma_2^2}W^{(G)T}W^{(G)} + \frac{1}{\sigma_{u_j}^2}I_p\big)^{-1}$ and $\mu_{u_j} = \Sigma_{u_j}\big(\frac{1}{\sigma_1^2}W^{(R)T}y_j + \frac{\beta_j}{\sigma_2^2}W^{(G)T}Z\big)$. In the M-step, we obtain the updated model parameters in $\theta$ by setting the derivatives of the Q-function to zero. The updating equations for the parameters of interest are as follows:

$$
\sigma_1^2 = \frac{1}{n_1 m}\sum_{j=1}^{m}\big\|y_j - W^{(R)}\mu_{u_j}\big\|^2, \qquad
\sigma_2^2 = \frac{1}{n_2 m}\sum_{j=1}^{m}\big\|Z - \beta_j W^{(G)}\mu_{u_j}\big\|^2,
$$
$$
\sigma_{u_j}^2 = \frac{1}{p}\big(\mu_{u_j}^T\mu_{u_j} + \mathrm{Tr}(\Sigma_{u_j})\big), \quad \text{for } j = 1, 2, \ldots, m,
$$
$$
\beta_j = \begin{cases}
0, & \text{if } \|s_j\| \le (1-a)\lambda\sqrt{k_g} \ \text{or} \ \|s_j\| \le a\lambda, \\[4pt]
\big(2\mu_{u_j}^T W^{(G)T}W^{(G)}\mu_{u_j} + 2\,\mathrm{Tr}\big(W^{(G)}\Sigma_{u_j}W^{(G)T}\big)\big)^{-1}\, s_j\Big(1 - \dfrac{(1-a)\lambda\sqrt{k_g}}{\|s_j\|}\Big), & \text{otherwise},
\end{cases}
$$

for $j = 1, 2, \ldots, m$, where $s_j = 2Z^T W^{(G)}\mu_{u_j} - a\lambda\,\mathrm{sign}(\beta_j)$. The E-step and M-step are alternated iteratively until convergence. In practice, we monitor the change in the penalized observed-data log-likelihood.
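The $\beta_j$-update in the M-step has the familiar two-level soft-thresholding form of the sparse group lasso. As a simplified, self-contained illustration, the following sketches a generic sparse-group-lasso proximal operator for one group; it is not the exact EM update above, which additionally involves $\mu_{u_j}$ and $\Sigma_{u_j}$:

```python
import numpy as np

def soft(x, t):
    """Elementwise l1 soft-thresholding operator."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sgl_prox(z, lam, a, k_g):
    """Proximal operator of the sparse group lasso penalty
    a*lam*||b||_1 + (1-a)*lam*sqrt(k_g)*||b||_2 for a single group,
    applied to a gradient-step point z (step size absorbed into lam)."""
    u = soft(z, a * lam)              # level 1: within-group l1 shrinkage
    norm = np.linalg.norm(u)
    t = (1 - a) * lam * np.sqrt(k_g)
    if norm <= t:                     # level 2: the whole group is zeroed out
        return np.zeros_like(z)
    return (1 - t / norm) * u         # otherwise, group-level l2 shrinkage

z = np.array([0.05, -0.4, 1.2])
print(sgl_prox(z, lam=0.1, a=0.5, k_g=3))  # first entry shrunk to exactly 0
```

The first level sets small individual coefficients to zero; the second level can eliminate an entire pathway group at once, mirroring the two selection mechanisms described above.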
The algorithm is declared to have converged when the relative change within a single iteration falls below the threshold $10^{-4}$.

2.3.3 Post-selection gene-level and pathway-level inference

The EM algorithm estimates the coefficient vector $\beta$ and selects the most critical genes and pathway groups for the trait with nonzero coefficients (i.e., $\beta \neq 0$), but it cannot directly make inference about the effects. To avoid overfitting, we consider a data-splitting scheme to make post-selection inference on the coefficients [Kuchibhotla et al., 2022]. We first split the GWAS dataset into a GWAS training set and a GWAS testing set of about equal size. The GWAS training set and the reference dataset are used to explore the data and select genes with nonzero coefficients into the model. In the GWAS testing set, we evaluate the significance of the coefficients of the selected genes using a likelihood ratio test (LRT). To test a gene-level coefficient, we formalize the following hypothesis test:

$$H_0: \beta_j = 0 \quad \text{vs.} \quad H_1: \beta_j \neq 0.$$

The LRT statistic is given by

$$\Lambda_j = 2\big(\ell(\mathbf{Y},\mathbf{W}^{(R)},\mathbf{W}^{(G)},\mathbf{Z},\mathbf{U}\mid\hat\theta) - \ell(\mathbf{Y},\mathbf{W}^{(R)},\mathbf{W}^{(G)},\mathbf{Z},\mathbf{U}\mid\hat\theta_{\beta_j=0})\big),$$

where $\hat\theta$ is the vector of coefficient estimates under the full model, $\hat\theta_{\beta_j=0}$ is the vector of coefficient estimates under $H_0$, and $\ell(\cdot)$ represents the likelihood function without penalty. The test statistic $\Lambda_j$ asymptotically follows a chi-squared distribution with 1 degree of freedom under the null hypothesis, according to Wilks' theorem; see Theorem 5.10 on page 213 of van der Vaart [1998] for a formal statement and the conditions under which the result holds. Likewise, we can use a similar LRT to test a pathway-level coefficient vector under the following hypothesis:

$$H_0: \beta^{(g)} = 0 \quad \text{vs.} \quad H_1: \beta^{(g)} \neq 0,$$

where $\beta^{(g)}$ represents the coefficient vector for the $g$th pathway group. The pathway-level LRT statistic is

$$\Lambda_{(g)} = 2\big(\ell(\mathbf{Y},\mathbf{W}^{(R)},\mathbf{W}^{(G)},\mathbf{Z},\mathbf{U}\mid\hat\theta) - \ell(\mathbf{Y},\mathbf{W}^{(R)},\mathbf{W}^{(G)},\mathbf{Z},\mathbf{U}\mid\hat\theta_{\beta^{(g)}=0})\big),$$

where $\hat\theta$ is the vector of coefficient estimates under the full model and $\hat\theta_{\beta^{(g)}=0}$ is the vector of coefficient estimates under $H_0$.
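Converting an LRT statistic into a p-value only requires the chi-squared survival function; for the gene-level test (1 degree of freedom) this has a closed form. A minimal sketch with hypothetical log-likelihood values (the numbers are illustrative, not from the real analysis):

```python
import math

def lrt_pvalue_df1(loglik_full, loglik_null):
    """Wilks likelihood ratio test for a single coefficient:
    Lambda = 2*(l_full - l_null) ~ chi-squared with 1 df under H0.
    For df = 1 the chi-squared survival function has the closed form
    P(X > x) = erfc(sqrt(x/2))."""
    lam = 2.0 * (loglik_full - loglik_null)
    return lam, math.erfc(math.sqrt(lam / 2.0))

# hypothetical unpenalized log-likelihoods for one selected gene
lam, p = lrt_pvalue_df1(loglik_full=-1520.3, loglik_null=-1527.1)
print(round(lam, 1), p < 0.05)  # 13.6 True
```

For the pathway-level test with $k_g > 1$ degrees of freedom, a general chi-squared survival function (e.g., from a scientific computing library) would replace the closed form.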
The test statistic $\Lambda_{(g)}$ asymptotically follows the $\chi^2$ distribution with $k_g$ degrees of freedom under the null.

2.3.4 Other related methods

Many methods have been developed to perform TWAS in the past decade (see a review of TWAS method development in the Supplement). These methods are performed either in two stages or as a joint model, use individual-level or summary-level GWAS data, and conduct single-gene or multi-gene analysis. Table 2.1 summarizes the characteristics of representative TWAS methods in each category (focusing on methods using individual-level GWAS data) and their comparison with our method. PrediXcan [Gamazon et al., 2015] is the very first two-stage univariate TWAS method. CoMM [Yang et al., 2019] is the first joint model that combines the imputation and analysis stages but conducts univariate TWAS only. kTWAS [Cao et al., 2021] and, later, mkTWAS [Cao et al., 2022a] use kernel machines to select genetic variants but are still univariate at the gene level. MV-IWAS [Knutson et al., 2020] and TWAS-LQ [Lin et al., 2022] are multivariate TWAS methods but do not incorporate pathway information or conduct feature selection in their implementations. NeRiT [Jin et al., 2022] incorporates pathway information as network edges and performs network regression without any feature selection, and is thus easily prone to overfitting. In contrast, our method TIPS conducts multivariate TWAS that uses pathway group information from existing pathway databases and incorporates a sparse group lasso penalty function to select the most critical genes and pathways that contribute to a complex trait.

Table 2.1: Main characteristics of different TWAS methods and their comparison to our TIPS method.

Method          | Joint/Two-stage | Univariate/Multivariate | Pathway info. | Gene/pathway selection | Ref
PrediXcan       | Two-stage       | Univariate              | No            | No                     | Gamazon et al. [2015]
CoMM            | Joint           | Univariate              | No            | No                     | Yang et al. [2019]
kTWAS           | Two-stage       | Univariate              | No            | No                     | Cao et al. [2021]
mkTWAS          | Two-stage       | Univariate              | No            | No                     | Cao et al. [2022a]
MV-IWAS         | Two-stage       | Multivariate            | No            | No                     | Knutson et al. [2020]
TWAS-LQ         | Two-stage       | Multivariate            | No            | No                     | Lin et al. [2022]
NeRiT           | Two-stage       | Multivariate            | Yes           | No                     | Jin et al. [2022]
TIPS            | Joint           | Multivariate            | Yes           | Yes                    | Ours

2.4 Application to TWAS analysis in UK Biobank

2.4.1 TWAS analysis on blood pressure

High blood pressure, or hypertension, is one of the leading risk factors for stroke and coronary artery disease, contributing to about 10 million deaths every year worldwide. Blood pressure is a complex heritable (heritability ~50%) and polygenic trait potentially impacted by hundreds to thousands of variants over the whole genome based on large-scale GWAS [Evangelou et al., 2018]. Few TWAS have been conducted to identify the genes that contribute to blood pressure variation and to understand its genetic regulatory mechanism. In this subsection, we apply our method to perform a multivariate TWAS on two blood pressure traits, systolic blood pressure (SBP) and diastolic blood pressure (DBP), to identify the genes and pathways that influence blood pressure using data from the large-scale UK Biobank (UKB) [Bycroft et al., 2018]. We collected the genotype, SBP, DBP and demographic data (e.g., age, sex and BMI) from non-pregnant, family-unrelated individuals of European ancestry in UKB with at least two blood pressure measurements. Both SBP and DBP values were calculated as the mean of two non-null blood pressure measurements using phenotype codes 4080 and 4079 in UKB. After excluding subjects with missing data and those with various types of cardiovascular disease, the complete data we analyzed included n = 16,470 and n = 17,394 participants for SBP and DBP, respectively. We first performed standard QC of the GWAS data and only kept the genetic variants with: minor allele frequency (MAF) ≥ 0.05, Hardy-Weinberg equilibrium p-value ≥ 0.001, and missing genotype rate ≤ 0.1.
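The QC thresholds above amount to a simple variant-level filter. A minimal sketch (the field names and values are illustrative; in practice this step is performed with standard GWAS tooling):

```python
# hypothetical per-SNP QC summaries; all values are illustrative
snps = [
    {"id": "rs1", "maf": 0.30, "hwe_p": 0.50, "missing": 0.01},
    {"id": "rs2", "maf": 0.02, "hwe_p": 0.20, "missing": 0.05},  # fails MAF
    {"id": "rs3", "maf": 0.10, "hwe_p": 1e-5, "missing": 0.02},  # fails HWE
    {"id": "rs4", "maf": 0.25, "hwe_p": 0.10, "missing": 0.20},  # fails missingness
]

def passes_qc(s, maf=0.05, hwe=0.001, miss=0.10):
    """Variant-level QC filter with the thresholds used in the text."""
    return s["maf"] >= maf and s["hwe_p"] >= hwe and s["missing"] <= miss

kept = [s["id"] for s in snps if passes_qc(s)]
print(kept)  # ['rs1']
```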
We then matched the GWAS SNP data with the genotype data in the GTEx reference database (downloaded from the dbGaP website, under phs000424.v9.p2) and mapped SNPs to genes in GTEx by physical proximity (1 Mb upstream and downstream of the transcription start site, i.e., focusing on cis-eQTLs) on the reference genome. As there is no consensus on the physical distance used to define cis-eQTLs, to minimize the impact of these different definitions on the final TWAS results, we followed PrediXcan [Gamazon et al., 2015] and performed elastic net regression as in PrediXcan to pre-select the most predictive SNPs for each gene among the cis-eQTLs (Table S13), to ensure more stable selection and reduce the computational burden. The same idea holds potential for when we include trans-eQTLs in our model in future extensions. For the SBP and DBP traits, we used the genotype and gene expression data from the heart left ventricle tissue in GTEx (GTEx v8; RNA-seq data) to train the imputation model, given the tissue's close relationship with blood pressure. We used the pathway information from large curated pathway databases, including KEGG [Kanehisa et al., 2017], Reactome [Jassal et al., 2020] and Biocarta [Nishimura, 2001], to group the genes. Pathway databases like Gene Ontology are largely based on prediction, so they are not included here. Pathways with size smaller than 5 or larger than 100 were excluded to ensure the best interpretability and biological relevance of the selected pathways. After these preparation steps, our final TWAS analysis included 2089 genes, with 848 genes grouped into 438 pathways and the remaining genes as scattered genes. To mitigate confounding effects from other covariates, we standardized and regressed the SBP and DBP phenotypes on sex, age and BMI and took the residuals as the adjusted SBP and DBP phenotypes for the TWAS analysis. The GWAS data was split into a training set and a testing set.
In the training set, we ran our TIPS method to select the top genes and pathways predictive of SBP and DBP. In the testing set, we performed the post-selection LRT for the genes and pathways selected in the training set and obtained the p-values. We evaluated $M_{SBP}$ pathways for systolic blood pressure (SBP) and $M_{DBP}$ pathways for diastolic blood pressure (DBP), where $M_{SBP} = M_{DBP} = 562$ corresponds to the total number of hypothesis tests, including 438 pathway-level tests and 124 tests for scattered (singleton) genes. For each trait, the raw p-values were multiplied by 562 to obtain Bonferroni-adjusted p-values. Treating hypotheses with adjusted p-values adj-p < 0.05 as significant, we identified 87 significant pathways for SBP and 96 for DBP. Fig 2.2(A) and Fig S1 highlight the significant KEGG and Reactome pathways ordered by pathway IDs, so the closer two pathways are, the more similar their gene contents and functions. Among them, we found several clusters of biological pathways, including nucleotide and amino acid metabolism [Feig et al., 2013, Poggiogalle et al., 2019], lipid and carbohydrate metabolism [Reaven and Hoffman, 1989, Zicha et al., 1999], cytokine signaling [Chae et al., 2001] and immune signaling [Rodriguez-Iturbe et al., 2017], which have been widely reported to be related to blood pressure. In addition, we also found pathways related to neurodegenerative diseases and cellular-level processes such as cell motility and the cell cycle that are worth further investigation in future studies. Interestingly, a large number of similar pathways (e.g., those related to nucleotide, amino acid and lipid metabolism, and to cytokine signaling) were identified for both SBP and DBP, validating the high genetic correlation between the two traits. On the other hand, pathways related to cardiomyopathy were only seen for SBP, while those related to ECM and cell interaction were only seen for DBP, showing the specificity of the two traits.
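The Bonferroni adjustment used here is simply multiplication of each raw p-value by the number of tests, capped at 1. A minimal sketch with hypothetical raw p-values:

```python
def bonferroni(pvals, m=None, alpha=0.05):
    """Bonferroni adjustment: multiply raw p-values by the number of
    tests (capped at 1) and compare against alpha."""
    m = len(pvals) if m is None else m
    adj = [min(p * m, 1.0) for p in pvals]
    return adj, [p_adj < alpha for p_adj in adj]

# hypothetical raw pathway p-values, with M = 562 tests as in the analysis
raw = [1e-6, 5e-5, 2e-4, 0.01]
adj, sig = bonferroni(raw, m=562)
print(sig)  # [True, True, False, False]
```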
We also ran CoMM, PrediXcan and MV-IWAS, applied the same cutoff to detect significant genes, and performed pathway enrichment analysis on these genes for a fair comparison. For SBP, the enrichment analysis identified 8, 57 and 36 significant pathways with Benjamini-Hochberg adjusted p-value < 0.05 for CoMM, PrediXcan and MV-IWAS, respectively. For DBP, the enrichment analysis identified 38, 40 and 163 significant pathways for CoMM, PrediXcan and MV-IWAS, respectively. Important pathways related to lipid and glucose metabolism had large p-values and were thus missing from their top lists (see Fig S1 and Fig S2 for the Manhattan plot and Venn diagram in Appendix A.0.2). This showed the power of performing multivariate TWAS and highlighted the importance of incorporating pathway group information early in the model to encourage pathway selection, as compared to the traditional univariate TWAS + pathway enrichment analysis approach, which lacks unifying biological themes. Due to the lack of feature selection, the multivariate MV-IWAS method was subject to a serious collinearity issue that made it unstable, resulting in very different sets of detected pathways for SBP and DBP, and was thus less reliable. As a sensitivity analysis of possible false positives generated by our method, we also performed random grouping of the genes into groups of comparable sizes (repeated five times), applied our method to each randomly grouped dataset, and compared the significance of the top selected pathways to our current results. The top pathways based on real pathway group information from the pathway databases are far more significant than those with random grouping, validating our current results (Fig S3 in Appendix A.0.2). The pathway-level comparison was not entirely fair for multivariate vs. univariate TWAS methods, so we also compared the different methods at the gene level.
For SBP, our method selected 285 significant genes while CoMM, PrediXcan and MV-IWAS selected 140, 184 and 208 significant genes, respectively, with Bonferroni-adjusted p-value < 0.05; for DBP, we selected 312 genes while CoMM, PrediXcan and MV-IWAS selected 165, 215 and 356 significant genes, respectively (see Fig S4 for the Venn diagram). Multivariate TWAS methods were in general more powerful than univariate TWAS methods, and the gain in power for our method mainly came from the group-based selection that potentially aggregates signals within the same group(s). We picked two example KEGG pathways and highlighted the genes selected by our method within the selected pathways in the KEGG topology plots (Fig 2.2(B)). Significant genes, including the aldehyde dehydrogenase (ALDH) genes in the arginine and proline metabolism pathway and the AKR1B1 gene in the glycerolipid metabolism pathway, had been found to be related to blood pressure in previous studies [Kato et al., 2011, Wang et al., 2016]. These findings facilitate our understanding of the genetic basis of blood pressure and the potential genetic regulatory mechanisms that might be related to antihypertensive drug targets.

Figure 2.2: (A) Manhattan plot of -log10(p-value) from the pathway-level test results of TIPS sorted by KEGG IDs (each dot represents a pathway) for the SBP (upper) and DBP (lower) traits. (B) Topology plots of two significant KEGG pathways, arginine and proline metabolism (highlighted genes include ALDH9A1, ALDH4A1, PRODH and LAP3) and glycerolipid metabolism (highlighted genes include ALDH9A1, AKR1B1, DGKG and DGKQ). Selected genes in the pathways are highlighted in red (SBP) and green (DBP).

2.4.2 TWAS analysis on white matter brain aging

Brain aging involves the gradual loss of structure and function of neurons and their connections, leading to cognitive decline and increased vulnerability to neurodegenerative diseases [Gold et al., 2012, Kantarci et al., 2014].
Previous studies employed a machine learning algorithm that leverages multiple fractional anisotropy (FA) tract measurements obtained from diffusion tensor imaging (DTI) data to predict the white matter (WM) brain age gap (BAG) as a marker of brain aging [Cole and Franke, 2017, Cole et al., 2019]. The BAG is highly heritable [Lee et al., 2020, Wen et al., 2024], but its genetic underpinnings are not completely understood. In this subsection, we applied our method to identify genes and pathways that influence the BAG outcome using data from UKB. We collected the DTI data from the UKB imaging visit and preprocessed the data using the ENIGMA structural and DTI pipelines [Jahanshad et al., 2013]. The FA images were aligned onto a standard-space white matter skeleton, and FA measurements were derived from 39 white matter tracts covering various brain regions [Smith et al., 2006]. Following previous papers [Mo et al., 2023, Feng et al., 2023], we applied a random forest regression model to predict brain age from these 39 regional FA measures and derived our outcome of interest, the BAG, as the difference between brain age and chronological age for n = 11,299 participants. The larger the BAG, the faster one's brain ages compared to their chronological age. The GWAS data were preprocessed as in the blood pressure example, and the GTEx reference panel data for the brain cortex tissue were used to train the imputation model. As before, the curated KEGG, Reactome and Biocarta pathway databases were used to group the genes. We identified 136 significant pathways for BAG with Bonferroni-adjusted p-value < 0.05. As a comparison, CoMM, PrediXcan and MV-IWAS detected 35, 43 and 0 significant pathways, respectively (Appendix), following the TWAS + pathway enrichment analysis procedure with the same cutoffs. Leonardsen et al. [2023] listed a few major categories of pathways related to the genetics of BAG, including neural system, DNA repair, DNA metabolism, protein metabolism and immune defense.
Among them, only TIPS detected all five categories of pathways (Fig 2.3(A)). At the gene level, we had a large overlapping proportion with the other two methods but also detected more significant genes that were otherwise missed by the other methods (Fig S5 in Appendix A.0.2). The NME4 and NT5E genes selected by TIPS in the "Pyrimidine metabolism" pathway (Fig 2.3(B)) were previously found to be related to brain health and neurodegenerative diseases [Rahman, 2009, Grimm and Eckert, 2017]. These findings helped identify the critical genetic risk factors and the impacted biological pathways and functions of brain aging and improved our understanding of the aging process.

2.5 Simulation study

2.5.1 Simulation setting

In this section, we conducted various simulation studies to demonstrate the performance of our TIPS method. For benchmarking, we mainly evaluated the feature selection and statistical power performance of our method compared to other representative TWAS methods, including CoMM [Yang et al., 2019], PrediXcan-EN (PrediXcan + elastic net) and PrediXcan-Lasso (PrediXcan + lasso) [Gamazon et al., 2015], the multivariate MV-IWAS method [Knutson et al., 2020], the network-regression-based NeRiT method [Jin et al., 2022], the kernel-machine-based kTWAS method [Cao et al., 2021] and its marginal + kernel version, mkTWAS [Cao et al., 2022a].

Figure 2.3: (A) Comparison of major categories of pathways detected by the different methods; the number in each cell refers to the number of significant pathways within each category. (B) Topology plot of the selected KEGG pathway "Pyrimidine metabolism" by TIPS; highlighted genes include NME4, NT5E, UCK1 and UCKL1. The contents of panel (A), recovered from the figure:

Category of pathways | PrediXcan | CoMM | TIPS
Neural system        | 0         | 0    | 4
DNA repair           | 0         | 0    | 3
DNA metabolism       | 0         | 0    | 2
Protein metabolism   | 3         | 3    | 6
Immune defense       | 5         | 4    | 5

For the reference dataset, we assumed a sample size $n_1 = 300$ and $p = 900$ SNPs potentially regulating the expression of $m = 180$ genes.
We assumed each gene was independently regulated by $\bar p_m = 5$ different SNPs without cross-talk, and each of these SNPs was in a different locus (i.e., independent lead eQTLs). The m = 180 genes were further classified into five pathway groups with 25 genes in each group without overlap, and 55 scattered genes not belonging to any pathway. To mimic real genotype patterns, we first defined a linkage disequilibrium (LD) matrix for each gene with diagonal values fixed at 1 and off-diagonal values equal to 0.5, representing the pairwise LD between SNPs within the same gene. We then followed Han and Pan [2012] to simulate the genotype data (i.e., the number of minor alleles for all SNPs) for each sample in the reference dataset using a latent Gaussian model with MAF ∈ (0.2, 0.5) and a block-diagonal correlation matrix in which each block was defined by the aforementioned LD matrix. For a complete evaluation, we followed Cao et al. [2021] and considered different genetic architectures: additive (2 vs. 1 vs. 0 minor alleles), heterogeneous (any vs. no minor alleles), recessive (two alternative alleles vs. others) and compensatory (one alternative allele vs. others). The matrix of genetic effects was drawn as $U \sim \mathcal{MN}(0, I_p, \mathrm{Diag}(\sigma_u^2))$, assuming a common $\sigma_u^2$ for all genes. The gene expression matrix Y of the reference dataset was simulated based on the first regression equation in (2.1), where $\sigma_1^2$ varied depending on the scenario (see below). For the GWAS dataset, we assumed a sample size $n_2 = 600$; the genotype data were simulated in the same way as in the reference dataset.
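The latent Gaussian genotype simulation can be sketched as follows. This is a simplified illustration in the spirit of Han and Pan [2012]; the function name and dimensions are ours, and the additive 0/1/2 coding is obtained by cutting correlated standard normals at the quantiles implied by Hardy-Weinberg genotype frequencies:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(42)
ndist = NormalDist()

def simulate_genotypes(n, snps_per_gene, n_genes, ld=0.5, maf_lo=0.2, maf_hi=0.5):
    """Simulate minor-allele counts with block-diagonal (within-gene) LD:
    draw correlated standard normals, then threshold each SNP at the
    quantiles implied by its MAF under Hardy-Weinberg equilibrium."""
    block = np.full((snps_per_gene, snps_per_gene), ld)
    np.fill_diagonal(block, 1.0)
    L = np.linalg.cholesky(block)
    cols = []
    for _ in range(n_genes):
        z = rng.standard_normal((n, snps_per_gene)) @ L.T   # within-gene LD
        maf = rng.uniform(maf_lo, maf_hi, size=snps_per_gene)
        # HWE genotype frequencies: P(0) = (1-f)^2, P(1) = 2f(1-f), P(2) = f^2
        t0 = np.array([ndist.inv_cdf((1 - f) ** 2) for f in maf])
        t1 = np.array([ndist.inv_cdf((1 - f) ** 2 + 2 * f * (1 - f)) for f in maf])
        cols.append((z >= t0).astype(int) + (z >= t1).astype(int))
    return np.hstack(cols)

G = simulate_genotypes(n=300, snps_per_gene=5, n_genes=4)
print(G.shape)  # (300, 20), entries in {0, 1, 2}
```

The block-diagonal Cholesky factor induces the pairwise LD of 0.5 within each gene while leaving SNPs in different genes independent.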
The coefficient vector $\beta$ was constructed to reflect both pathway-group-level and individual-gene-level contributions to the trait: Group 1 included non-associated genes (all $\beta = 0$); Group 2 included genes with homogeneous moderate effects (all $\beta = 0.2$); Group 3 included genes with homogeneous weak effects (all $\beta = 0.02$); and Groups 4 and 5 were more heterogeneous, including genes with moderate effects, weak effects and non-associated genes. The 55 scattered genes also included a mixture of moderate-effect, weak-effect and non-associated genes. The phenotypic data were simulated based on the second regression equation in (2.1). Following Yang et al. [2019], we considered two heritability measures in simulating the data. The first, the cellular-level heritability

$$\hat h_C^2 = \frac{\bar p_m \hat\sigma_u^2}{\bar p_m \hat\sigma_u^2 + \hat\sigma_1^2},$$

quantified the proportion of variance of gene expression explained by genetic variants, where $\bar p_m = 5$ in our simulations. The second, the organism-level heritability

$$\hat h_T^2 = \frac{\hat\beta^T\hat\beta\,\hat\sigma_u^2}{\hat\beta^T\hat\beta\,\hat\sigma_u^2 + \hat\sigma_2^2},$$

quantified the gene effect on the trait. We simulated the following three scenarios with different values of $\hat\sigma_1^2$, $\hat\sigma_2^2$ and $\hat\sigma_u^2$ to reflect varying cellular- and organism-level heritability. By varying the heritability parameters, these scenarios demonstrate the model's robustness in diverse biological contexts and its applicability in understanding gene-environment interactions.

1. Scenario I: $\hat\sigma_1^2 = 0.2$, $\hat\sigma_2^2 = 0.2$, $\hat\sigma_u^2 = 0.02$; $\hat h_C^2 = 0.33$ and $\hat h_T^2 = 0.161$.
2. Scenario II: $\hat\sigma_1^2 = 0.2$, $\hat\sigma_2^2 = 0.2$, $\hat\sigma_u^2 = 0.01$; $\hat h_C^2 = 0.2$ and $\hat h_T^2 = 0.091$.
3. Scenario III: $\hat\sigma_1^2 = 0.2$, $\hat\sigma_2^2 = 0.45$, $\hat\sigma_u^2 = 0.01$; $\hat h_C^2 = 0.2$ and $\hat h_T^2 = 0.043$.

We also considered additional simulation scenarios in which the number of genes exceeds the sample size of the GWAS dataset (the high-dimensional or high-D case), in which the genes are correlated, and in which the MAF and heritability levels change. Each simulation was conducted B = 1000 times.
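The two heritability definitions can be checked directly against the scenario values. A minimal sketch (the value of $\hat\beta^T\hat\beta$ used for $\hat h_T^2$ is a hypothetical back-calculation, not a number reported in the text):

```python
def cellular_h2(p_bar, sigma_u2, sigma1_2):
    """h2_C = p_bar*sigma_u^2 / (p_bar*sigma_u^2 + sigma_1^2)."""
    return p_bar * sigma_u2 / (p_bar * sigma_u2 + sigma1_2)

def trait_h2(btb, sigma_u2, sigma2_2):
    """h2_T = (beta'beta)*sigma_u^2 / ((beta'beta)*sigma_u^2 + sigma_2^2)."""
    return btb * sigma_u2 / (btb * sigma_u2 + sigma2_2)

# Scenarios I and II from the text (p_bar = 5)
print(round(cellular_h2(5, 0.02, 0.2), 2))   # 0.33
print(round(cellular_h2(5, 0.01, 0.2), 2))   # 0.2
# a hypothetical beta'beta = 1.92 reproduces Scenario I's h2_T = 0.161
print(round(trait_h2(1.92, 0.02, 0.2), 3))   # 0.161
```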
For all scenarios under investigation, we quantified feature selection and power by plotting receiver operating characteristic (ROC) curves (with area under the curve (AUC) values) and bar graphs. Five-fold cross-validation was used to find the optimal value of the regularization parameter for our method. AUCs for all univariate TWAS methods (CoMM, PrediXcan, kTWAS and mkTWAS) were calculated based on the ranking of genes by the p-values they provided. The true model was a multi-gene model, so we also evaluated the predictive accuracy of the multivariate methods (e.g., MV-IWAS and TIPS) using the mean squared error (MSE). MV-IWAS did not include a penalty function, so it could not work in the high-D scenario.

Table 2.2: Mean area under the ROC curve (AUC) for simulation Scenarios I-III under additive or heterogeneous genetic architectures, with the standard deviation shown in parentheses.

Additive:
Method          | Scenario I    | Scenario II   | Scenario III
TIPS            | 0.953 (0.041) | 0.903 (0.064) | 0.833 (0.047)
CoMM            | 0.753 (0.042) | 0.696 (0.049) | 0.668 (0.034)
PrediXcan-EN    | 0.731 (0.041) | 0.667 (0.029) | 0.624 (0.035)
PrediXcan-Lasso | 0.728 (0.044) | 0.669 (0.028) | 0.621 (0.032)
MV-IWAS         | 0.631 (0.044) | 0.610 (0.040) | 0.604 (0.043)
NeRiT           | 0.571 (0.042) | 0.541 (0.043) | 0.533 (0.045)
kTWAS           | 0.501 (0.055) | 0.504 (0.048) | 0.499 (0.048)
mkTWAS          | 0.501 (0.046) | 0.506 (0.050) | 0.502 (0.049)

Heterogeneous:
Method          | Scenario I    | Scenario II   | Scenario III
TIPS            | 0.966 (0.051) | 0.923 (0.043) | 0.845 (0.033)
CoMM            | 0.763 (0.043) | 0.728 (0.026) | 0.655 (0.039)
PrediXcan-EN    | 0.737 (0.032) | 0.691 (0.055) | 0.645 (0.039)
PrediXcan-Lasso | 0.743 (0.027) | 0.678 (0.051) | 0.647 (0.047)
MV-IWAS         | 0.612 (0.048) | 0.591 (0.047) | 0.583 (0.038)
NeRiT           | 0.577 (0.041) | 0.554 (0.043) | 0.517 (0.040)
kTWAS           | 0.494 (0.048) | 0.491 (0.035) | 0.493 (0.044)
mkTWAS          | 0.495 (0.048) | 0.496 (0.031) | 0.498 (0.031)

2.5.2 Simulation results

Figures 2.4 and 2.5 show the ROC curve and power comparisons, and Table 2.2 shows the corresponding AUC values for all TWAS methods in the different simulation scenarios. TIPS has
TIPS has 45 Additive Heterogeneous I TIPS CoMM PrediXcan -EN PrediXcan -Lasso MV-IWAS NeRiT kTWAS mkTWAS II III Figure 2.4: ROC curve comparison for the simulation scenarios I-III with additive or heterogeneous genetic architecture. 46 Additive Heterogeneous I TIPS CoMM PrediXcan -EN PrediXcan -Lasso MV-IWAS NeRiT kTWAS mkTWAS II III Figure 2.5: Power comparison for the simulation scenarios I-III with additive or heterogeneous genetic architecture. 47 higher AUC and power than other methods in all scenarios for different genetic architectures (additive or heterogeneous in Fig 2.4 and 2.5, recessive and compensatory in Fig S6-S7). As the heritability level drops from Scenario I to Scenario II and III, all methods have decreased AUC and power but TIPS remains the best performer. Overall, the joint models (TIPS and CoMM) that accounts for the imputation uncertainty outperform two-stage methods (PrediXcan- EN, PrediXcan-Lasso, kTWAS and mkTWAS). The multivariate TIPS model outperforms univariate models (CoMM, PrediXcan-EN, PrediXcan-Lasso) in feature selection and power. MV-IWAS is a multivariate model but does not include any penalty function for feature selection thus performs worse, and has higher MSE than TIPS (Table S14). NeRiT incorporates group information as network edges and runs network regression, but it is easily subject to overfitting issue thus also performs worse. The results look consistent when we consider gene-gene correlation (Fig S8), for high dimensional scenario (Table S15), and with other heritability levels and MAF values (Table S16 and S17). 2.6 Discussion In this paper, we propose a novel pathway-guided “TIPS” method to perform multivariate TWAS analysis and identify genes and pathways that contribute to complex polygenic traits. 
A distinguishing feature of the method is its multi-gene, group-based approach, which examines hundreds to thousands of genes simultaneously in the context of biological pathways, capturing their collective impact on the trait rather than their individual effects. In addition, the method integrates the imputation and association stages of TWAS to improve efficiency and employs a sparse group lasso penalty for both gene-level and pathway-level feature selection. We also develop a fast EM algorithm on the penalized likelihood for parameter estimation, and a data-splitting scheme coupled with a likelihood ratio test for post-selection inference at both the gene and pathway levels. In its current implementation, TIPS only considers cis-eQTLs that regulate gene expression in the reference database. Incorporating trans-eQTLs could improve the predictive performance of genetically regulated expression, but it would also incur a heavy computational burden due to the vast number of SNP-gene pairs that need to be tested. We have proposed one potential solution for when too many eQTLs are included: pre-selecting the SNPs most predictive of the genes in GTEx using an elastic net penalty, as in PrediXcan. The applicability of such a solution for our method when incorporating all trans-eQTLs needs to be evaluated in future studies. In addition, though we consider gene-gene correlation and grouping in the association part, the independent gene regulation assumption in the imputation part could be strong in real applications. For example, it is not rare to see common genetic regulators and regulatory mechanisms shared by different genes that are in proximity or functionally related. Future extensions could consider such complex multi-SNP-to-multi-gene regulatory relationships.
The current method, tailored for individual-level GWAS data and continuous phenotypes, holds potential for adaptation to accommodate GWAS summary statistics and mixed types of phenotypes, broadening its applicability as in other TWAS methods [Barbeira et al., 2018, Yang et al., 2020b]. In addition, the current TWAS method is trained in only one tissue and performs single-population analysis. Future work can focus on extending the current TWAS method toward a joint multi-tissue model [Hu et al., 2019] and joint multi-ancestry analysis [Chen et al., 2023]. As a wide variety of phenotypes are being collected nowadays, testing multiple phenotypes simultaneously tends to be more powerful than testing each phenotype one at a time, by exploiting the correlation across phenotypes. The current TIPS method could be readily extended to address multivariate phenotypes, allowing for the investigation of a number of complex traits all together. This will likely lead to a high-dimensional-to-high-dimensional problem (a large number of genes mapped to a large number of phenotypes) that may require special care in modeling for feature screening and selection [Ke et al., 2022, Canida et al., 2022]. There are a few existing online TWAS resources of disease susceptibility genes, such as TWAS-hub [Mancuso et al., 2017] and webTWAS [Cao et al., 2022b]. Our proposed TIPS method, on the other hand, is the first method that incorporates curated pathway databases in a multivariate TWAS model and provides pathway-level information associated with a disease or trait. It could potentially be a valuable tool offering more reliable and interpretable results for TWAS analyses. An efficient R package called TIPS was developed (https://github.com/nwang123/TIPS) to implement the method.
Chapter 3: High-Dimensional Multivariate Mediation Analysis Integrating Genetic, Imaging, and Phenotypic Data

3.1 Overview

Mediation analysis is a fundamental tool for elucidating causal mechanisms in complex systems. The emergence of high-throughput biological and neuroimaging technologies has introduced multivariate exposures and mediators of increasing dimensionality, rendering univariate mediation analysis approaches inadequate. In this work, we propose a novel aggregation-based mediation framework that simultaneously models high-dimensional multivariate exposures and mediators. Our method jointly reduces the dimensions of e