ABSTRACT Title of Dissertation: Application of Causal Inference in Large-Scale Biomedical Data Zhiwei Zhao Doctor of Philosophy, 2024 Dissertation Directed by: Professor Shuo Chen and Chixiang Chen Department of Mathematics This dissertation contains three projects that tackle the challenges in the application of causal inference on large-scale biomedical data. Project 1 proposes a novel mediation analysis framework with the existence of multiple mediators and outcomes. It can extract the mediation pathway efficiently and estimate the mediation effect from multiple mediators simultaneously. The effectiveness of the proposed method is validated through extensive simulation and a real data application focusing on human connectome study. Project 2 introduces a doubly machine learning based method, assisted by algorithm ensemble, for estimating longitudinal causal effects. This approach reduces estimation bias and accommodates high-dimensional covariates. The va- lidity of the proposed method is justified by simulation studies and an application to adolescent brain cognitive development data, specifically evaluating the impact from sleep insufficiency on youth cognitive development. Project 3 develops a new bias-reduction estimation that addresses unmeasured confounding by leveraging proximal learning and negative control outcome tech- niques. This method can handle a moderate number of exposures and multivariate outcomes in the presence of unmeasured confounders. Both numerical experiment and data application using UK Biobank demonstrate that the proposed method effectively reduces estimation bias caused by unmeasured confounding. Collectively, these three projects introduce innovative methodologies for causal inference in neuroimaging, advancing mediation analysis in neuroimaging, improving longitudinal causal effect estimation, and reducing estimation bias in the presence of unmeasured confounding. APPLICATION OF CAUSAL INFERENCE IN LARGE-SCALE BIOMEDICAL DATA by Zhiwei Zhao Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2024 Advisory Committee: Professor Shuo Chen, Chair/Advisor Professor Chixiang Cheng, Co-Advisor Professor Vince Lyzinski Professor Tianzhou Ma Professor Yan Li © Copyright by Zhiwei Zhao 2024 Acknowledgments I owe my heartful gratitude to all the people who have made this thesis possible and because of whom my graduate experience has been one that I will cherish forever. First and foremost I’d like to thank my advisor, Professor Shuo Chen for giving me an invaluable opportunity to work and explore on challenging and extremely interesting projects in a rising area over the past years. He has always made himself available for help and advice for all sorts of things and there has never been an occasion when I’ve knocked on his door and he hasn’t given me time. It has been a pleasure to work with and learn from such an extraordinary individual. I would like to thank my co-advisor, Professor Chixiang Chen. Without his extraordi- nary theoretical ideas and methodological expertise, this thesis would have been a distant dream. Thanks are due to Professor Vince Lyzinski, Professor Tianzhou Ma, and Professor Yan Li for agreeing to serve on my thesis committee and for sparing their invaluable time reviewing the manuscript. My cohort and also friends at Pittsburgh, Maryland, and Emory have enriched my graduate life in many ways and deserve a special mention. Their fruitful insights on my projects, uncon- ditional supports, and tireless encouragements are the most important motivation throughout this journey. I owe my deepest thanks to my family - my mother and father who have always stood by ii me and unconditionally supported me through my study. I would also like to thank my cousin Hannah, and to other families and friends. Without them, none of my work would have been possible. Lastly, I would like to acknowledge financial support from Dr. Emerson Wickwire during the past two years of my PhD study. It was a pleasure to collaborate with him. iii Table of Contents Acknowledgements ii Table of Contents iv List of Tables vii List of Figures ix List of Abbreviations xii Chapter 1: Introduction 1 1.1 Introduction of Causal Inference . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.1.1 Causal Graphical Models and Identifications . . . . . . . . . . . . . . . . 2 1.1.2 Mediation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.1.3 Causal Inference with Time-varying Covariates . . . . . . . . . . . . . . 6 1.1.4 Unmeasured Confounding . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.2 Large-Scale Biomedical Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2.1 Human Connectome Project . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2.2 Adolescent Brain Cognitive Development . . . . . . . . . . . . . . . . . 9 1.2.3 UK Biobank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 1.3.1 multivariate-mediator and multivariate-outcome mediation model (MMO) 11 1.3.2 Marginal Structure Ensemble Learning Model (MASE) . . . . . . . . . 12 1.3.3 Unmeasured Confounding with Multiple Outcomes and Exposures . . . . 12 1.4 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . 13 Chapter 2: multivariate-mediator and multivariate-outcome mediation model (MMO) 14 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2.1 Multivariate mediation model . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2.2 Model estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.2.3 Inference for dense bi-clusters . . . . . . . . . . . . . . . . . . . . . . . 25 2.3 Data Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 2.4 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.4.1 Parameters and settings . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.4.2 Comparable methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 iv 2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 Chapter 3: Marginal Structure Ensemble Learning Model (MASE) 38 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.2.1 Marginal Structural Model with Time-varying Exposure . . . . . . . . . 41 3.2.2 Doubly Robust Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.2.3 Ensemble Learning For time-varying PS and ICEs . . . . . . . . . . . . . 45 3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 3.4 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Chapter 4: Unmeasured Confounding with Multiple Outcomes and Exposures 59 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.2.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.2.2 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.2.3 Comparison of Existing Methods . . . . . . . . . . . . . . . . . . . . . . 66 4.3 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.3.1 Estimation of S . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.3.2 Estimation of βsl with S . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.3.3 Estimation of βsl with Ŝ . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.4 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.4.1 Metabolic Dysfunction Effects on Brain White Matters . . . . . . . . . . 76 4.4.2 Subgroup Analysis on APOE4 Carriers . . . . . . . . . . . . . . . . . . 78 4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 Appendix A: MMO 82 A.1 Mediation Bi-luster and Effect Estimation . . . . . . . . . . . . . . . . . . . . . 82 A.1.1 Dense Bi-Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 A.1.2 Proof of Theorem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 A.1.3 Choice of Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 85 A.1.4 Causal Mediation Estimation . . . . . . . . . . . . . . . . . . . . . . . . 86 A.1.5 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 A.1.6 Mediation Effect Estimation . . . . . . . . . . . . . . . . . . . . . . . . 88 A.2 Data Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 A.2.1 Data Acquisition and Processing . . . . . . . . . . . . . . . . . . . . . . 90 A.2.2 Mediation Pathways . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 A.2.3 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 A.2.4 Selected Regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 A.3 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 A.3.1 Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 A.3.2 Additional Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 A.3.3 Low-rank mediators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 v Appendix B: MASE 99 B.1 Marginal Structure Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 B.1.1 DAG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 B.1.2 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 B.1.3 Proof of Lemma 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 B.2 Data Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 B.2.1 Covariates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 B.2.2 Other Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 B.2.3 Positivity Assumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 B.2.4 Other Summary Information . . . . . . . . . . . . . . . . . . . . . . . . 112 Appendix C: Unmeasured Confounding with Multiple Outcomes and Exposures 118 C.1 Proof of Lemma 1 and Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . . . 118 C.2 Extra Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 C.2.1 Numerical results for Simulation . . . . . . . . . . . . . . . . . . . . . . 119 C.2.2 Influence of Estimated Ŝ . . . . . . . . . . . . . . . . . . . . . . . . . . 119 C.2.3 Other De-bias Shrinkage Method . . . . . . . . . . . . . . . . . . . . . . 119 C.3 Application Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 C.3.1 Data Preprocessing and Region Names . . . . . . . . . . . . . . . . . . . 121 C.3.2 Estimation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 Bibliography 127 vi List of Tables 2.1 Edge-wise results of extracting mediation pattern Uc ⊗ Vd by all methods. We demonstrate the false discovery rate (FDR), sensitivity (sens) and specificity (spec) across different settings for all comparable methods. ∗ represents a rounded number. 34 2.2 Simulation results for mediation effect estimation (step 2). We compare the es- timated mediation effects by Low Rank Model (medLRM) and MMO, with ref- erence to the estimated mediation effect based on the oracle model (with known mediating imaging factors). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.1 Causal effect estimation (insufficient sleep vs. sufficient sleep) and 95% confi- dence interval (in parentheses) for age-corrected outcomes in ABCD study. * represents the significant CIs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.2 Results for 500 Monte Carlo simulations. We compared the performance of all methods with sample size n = 1000, different dimensions of confoudners (p), and various effect sizes. The estimated SE’s are the averages of standard error estimates for the 500 simulations. . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.1 Sensitivity and False Discovery Rate (FDR) of the greedy L0 optimization. . . . . 71 4.2 The coverage probability of the 95% bootstrap confidence interval based on 500 simulations and 500 bootstrap samples. . . . . . . . . . . . . . . . . . . . . . . . 72 A.1 Performance of Different Function in order to select λ . . . . . . . . . . . . . . . 87 A.2 Simulation results (Step 1) for simulated datasets with non-normal mediators and outcomes: the accuracy of mediation pattern Uc⊗Vd extraction. We demonstrate the edge-wise false discovery rate (FDR), sensitivity (sens), and specificity (spec). The standard Cauchy distribution and Laplace distribution are used to generate the mediators and outcomes of the synthetic datasets with the sample size is n = 200. ∗ represents a rounded number. . . . . . . . . . . . . . . . . . . . . . . . . 95 A.3 Simulation results (Step 2) for simulated datasets with non-normal mediators and outcomes: mediation effect estimation. We perform Step 2 analysis based on the above results (Step 1 analysis results based on datasets generated by the Cauchy distribution or Laplace distribution) and compare the estimated mediation effects by Low Rank Model (medLRM) and MMO, with reference to the estimated me- diation effect based on the oracle model (with known mediating imaging factors). 96 vii A.4 Simulation results (Step 1) for simulated datasets with non-orthogonal mediators: the accuracy of mediation pattern Uc ⊗ Vd extraction. We demonstrate the edge- wise false discovery rate (FDR), sensitivity (sens) and specificity (spec). The correlations of non-orthogonal factors vary from 0.5 to 0.8 with sample size is n = 200. ∗ represents a rounded number. . . . . . . . . . . . . . . . . . . . . . . 96 A.5 Simulation results (Step 2) for simulated datasets with non-orthogonal mediators: the estimated mediation effects. We compare the estimated mediation effects by Low Rank Model (medLRM) and MMO, with reference to the estimated me- diation effect based on the oracle model (with known mediating imaging fac- tors). The correlations of simulated non-orthogonal factors vary from 0.5 to 0.8 withn = 200. ∗ represents rounded numbers. . . . . . . . . . . . . . . . . . . . . 97 B.1 Causal effect estimation (insufficient sleep vs. sufficient sleep) and 95% confi- dence interval (in parentheses) for age-uncorrected outcomes. The results are for the primary analysis (upper), and complete cases only data (lower). * represents the significant CIs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 B.2 Causal effect estimation (insufficient sleep vs. sufficient sleep) and 95% confi- dence interval (in parentheses) for age-corrected outcomes in ABCD study. The results are for the analysis using complete cases only data. * represents the sig- nificant CIs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 B.3 Causal effect estimation (insufficient sleep vs. sufficient sleep) and 95% con- fidence interval (in parentheses) for age-corrected (upper) and age-uncorrected (lower) outcomes. * represents the significant CIs. . . . . . . . . . . . . . . . . . 113 B.4 Baseline demographic statistics. Summarized by the group of sufficient/insufficient (nine hours sleep) at the first visit measure for data used in the two models. The number of participants is counted for categorical variables. . . . . . . . . . . . . 116 B.5 Proportions of sufficient versus insufficient sleep at each visit. We provided such values for the original dataset, the dataset we used for Model 1 and that of Model 2.117 C.1 The mean absolute errors ×100 and infinity norm ×100 of estimates bias from different methods across scenarios. . . . . . . . . . . . . . . . . . . . . . . . . . 120 C.2 The mean absolute value (MAE), infinity norm (|| · ||∞) of estimates of βsl when Ŝ is mis-specified. We considered strong effect of βsl = 5, r = 3 with normal residuals for all scenarios. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 C.3 Comparing the mean absolute errors ×100 and infinity norm ×100 of estimating bias using SCAD, adLASSO, LASSO, Direct estimates across scenarios. . . . . . 122 C.4 The names of regions of interests for the 39 white matter regions. . . . . . . . . . 124 C.5 The effect estimations of LDL and VLDL on the seven regions with the 95% confidence intervals. ‘-’ represents the effect estimate is not significant for that exposure factor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 C.6 The effect estimations of LDL and VLDL on the seven regions with the 95% confidence intervals. ‘-’ represents the effect estimate is not significant for that exposure factor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 viii List of Figures 1.1 Three major types of relationship in causal DAGs. . . . . . . . . . . . . . . . . . 3 1.2 The illustration of the relationships, effects, and sources of confounders in me- diation analysis. Z∗, ∗ = 1, 2, 3 represent subsets of confounders observed for corresponding paths (see corresponding assumptions below). . . . . . . . . . . . 4 1.3 The illustration time varying exposures and covariates causal DAGs. . . . . . . . 6 1.4 Three approaches resolving unmeasured confounding problem. . . . . . . . . . . 7 2.1 We illustrate the multivariate mediator multivariate outcome mediation analysis: (a) is the mediation model based on observed data with massive and complicated mediation patterns; (b) is the mediation model with recognized systematic medi- ation patterns, showing X affecting a subset of outcomes (Y1) through a corre- sponding set of mediators (M1), the bold arrows represent relations between sets; and (c) is the mediation model with factorized mediators for a specific mediator set M̃c and outcome set Yd. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2 Selected Regions for CBF (mediators) and ReHo (outcomes): (a) is the six re- gions for CBF (mediators) including the six region names which are close to the bilateral primary cerebral arteries, (b)-(f) demonstrate the selected regions of ReHo (outcomes) in the areas of frontal, corona, temporal, parietal and midbrain. 28 2.3 The mediation analysis results for the data example. The mediation variable is systolic blood pressure (SBP). (a) is the heatmap of− log(p)-matrix based on the input data. (b) is the heatmap with detected systematic mediation patterns. (c) is the mediation results based on the mediating factor. β and θ effects are averaged across related outcome ROIs. Overall, the mediation proportion is high. . . . . . 29 2.4 Edge-wise bi-cluster simulation results for all methods: the scenarios on x axis correspond to the scenarios in Table 2.1 from left to right, top to bottom. . . . . 33 3.1 ABCD tabulate data structure. The solid lines represent the impact of variables to the others, the dashed line represents possible impact of previous measurements on the later ones. Notations A,Z, and Y are defined in section 3.2.1. . . . . . . . 42 3.2 Overview of MASE: the left top box is the procedure for estimating propensity scores (PS), the right box is the procedure for estimating the outcome regres- sions (ICE); The left bottom box it the illustration of estimating the ATE (θ̂). We illustrate the stratification estimate for ICE within the right box (orange box). . . 50 3.3 Forest plots for long-term effects estimates of insufficient sleep on cognitive de- velopment on ABCD data. The reference level is sufficient sleep. Estimates from different methods are represented using different types of points. . . . . . . . . . 54 ix 4.1 Illustration graphs of negative control bridge method [1] (Left) and the parallel outcome method [2] (Right). The lines without arrows represent the effect can be either way. The illustration is based on different line relationship colors. . . . . . 68 4.2 Box-plots for the estimates biases for the five methods. The results for p = 100 and p = 50 are very similar so we only visualized the p = 100 results. The metric used here is MAE in red boxes (right of each method) and || · ||∞ in blue boxes (left of each method). Each row of the plots represents scenarios for different effect sizes. They are (from top to bottom): Weak Effect, Strong Effect, Mixed Effect, and Uniform Residuals. . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.3 Box-plot for the estimates biases for the five methods with estimated Ŝ. The metric used here is MAE for the red boxes (left of each method) and infinity norm for the blue boxes (right of each method). The above row is for n = 100 and the bottom row is for n = 500. The FDR are correspondingly in Table 4.1. . . 75 4.4 Visualization of the brain regions that are significantly influenced by HDL or LDL. The left three columns are for effects estimates of HDL with axial (left), coronal (middle), and sagittal (right) views. The last column is the results for effects from LDL with axial view. The top row is for whole dataset and the second row is for APOE4 non-carriers only. Blue regions represent the effects are negative and red regions represent the effects are positive. . . . . . . . . . . 78 4.5 Visualization of the regions in brain axial (left) and coronal (right) views from the subgroup analysis for APOE4 carriers. The regions are affected by VLDL and LDL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 A.1 Demonstration of a dense bi-cluster. The left figure is a − log(p)-matrix (weight matrix) before extraction, while the right figure is a p-matrix with re-ordered region indices to highlight the extracted dense bi-cluster. A dense pattern refers to that within an extracted bi-cluster the proportion of edges with suprathreshold p-values is much higher than the edges in the rest of the graph. . . . . . . . . . . 83 A.2 The Mediation Discovery procedure for determining the mediation pathway. (a) is the marginal correlation between SBP, CBF, ReHo. (b) is the conditional cor- relation between the three variables, conditional on the boxed variables. The conditional correlation between SBP and ReHo given CBF is not significant. (c) is all the possible DAGs based on the results in (b). The green box is the most possible DAG for our data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 A.3 Sensitivity analysis to assess the impact of the violated assumptions for our data example. The sensitivity analysis is performed based on the procedure described in [3]. The sequential ignorability holds when ρ = 0 (ρ is the correlation of resid- uals of mediators and outcomes conditional on the exposure). The dashed hori- zontal line is the estimated mediation effect with the valid assumption of sequen- tial ignorability. When ρ deviates further from 0, the bias is larger. The results suggested that our mediation results were generally robust to (mild to moderate) violation of this assumption (e.g., ρ = 0–0.375 ). . . . . . . . . . . . . . . . . . 94 x A.4 Results of mediation effect estimation based on mediators by orthogonal factor- ization method and a composite mediator based on the weighted sum method. Based on 1000 simulations, both methods perform well, while the factorization method slightly outperforms the weighted sum component method. . . . . . . . 98 B.1 We illustrate the Directed Acyclic Graph for longitudinal causal relationship for the first two visits. The DAG can be extended to more repeated measures with similar relationship between the treatment history and covariates history. . . . . . 99 B.2 Forest plots for long-term effects estimates of insufficient sleep on age-uncorrected cognitive development on (a) primary data analysis for ABCD study (left); and (b) ABCD data with complete cases only (right). The reference level is sufficient sleep. Estimates from different methods are represented using different types of points. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 B.3 Forest plots for long-term effects estimates of insufficient sleep on cognitive de- velopment on ABCD data with complete cases only. The reference level is suf- ficient sleep. Estimates from different methods are represented using different types of points. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 B.4 Forest plots for long-term effects estimates of insufficient sleep on cognitive de- velopment on ABCD data with three time points measurements. (a) is the plot for age-corrected variables (left); and (b) is the plot for age-uncorrected variables (right). The reference level is sufficient sleep. Estimates from different methods are represented using different types of points. . . . . . . . . . . . . . . . . . . 114 B.5 Positivity assumption check with propensity score density check with insufficient (red) versus sufficient sleep (blue) groups. . . . . . . . . . . . . . . . . . . . . . 115 C.1 Correlation maps for all the metabolic biomarkers grouped into two clusters using kmeans. The first dense graph in the left bottom is for the first group (HDL) and the dense group on the top right is for the second (LDL&IDL). The darker the color, the stronger the correlation is. . . . . . . . . . . . . . . . . . . . . . . . . 125 xi List of Abbreviations ABCD Adolescent Brain Cognitive Development ACP Amish Connectome Project APOE4 Apolipoprotein ε4 gene ASL Arterial-spin Labeling ATE Average Treatment Effect BH-FDR Benjamini-Hochberg False Discovery Rate BOLD Blood Oxygenation Level-Dependent CBF Cerebral Blood Flow DAG Directed Acyclic Graphs DR Doubly Robust DTI Diffusion Tensor Imaging FA Fractrional Anisotropy FDR False Discovery Rate fMRI funcitonal Magnetic Resonance Imaging FWE Family-Wise-Error HCP Human Connectome Project HDL High-density Lipoprotein ICE Iterative Conditional Mean Estimation IDL Intermediate-density Lipoprotein IV Instrumental Variable KL Kullback-Leibler LASSO Least Absolute Shrinkage and Selection Operator LDL Low-density Lipoprotein LSEM Linear Structural Equation Model lTMLE longitudinal Targeted Maximum Likelihood MAE Mean Absolute Error MASE Marginal Structure Ensemble Learning Model MLS Multivariate Longitudinal Sequential MMO Multivariate-Mediator and Multivariate-Outcome mediation model MRI Magnetic Resonance Imaging MSM Marginal Structural Model (N)DE (Nature) Direct Effect (N)IE (Nature) Indirect Effect NCO Negative Control Outcome xii NCE Negative Ctonrol Exposure PS Propensity Score ReHo Regional Homogeneity ROI Regions of Interest SCAD Smoothly Clipped Absolute Deviation SBP Systolic Blood Pressure SURVA Stable Unit Treatment Value Assumption TMLE Targeted Maximum Likelihood UKB United Kingdom Biobank VLDL Very-low-denisty Lipoprotein WM White Matter xiii Chapter 1: Introduction Causal Inference is a widely used technique in fields such as economics, biomedical stud- ies, behavior science, and political science. Its goal is to determine the actual effect of or more causal variables (often referred to as treatments) on outcomes of interest. The primary challenge in causal inference is identifying the true causal effect in real-world applications, where other factors—such as confounding—can complicate the relationship. Researchers from various dis- ciplines, including statistics [4], epidemiology [5], econometrics [6], and computer science [7], have developed different methods to address these challenges. In statistics, causal inference has been a significant area of research over the past half-century (see a review in 8). To date, the research of causal inference have been focusing on three major areas: causal discovery based on graphical models [9, 10, 11, 12, 13], identification problem answering the question of if observational data can identify the causal effect (the g-formula, 14, the back door criteria 15, etc.), and estimation problem (the g-computation 16, the inverse probability treatment weighting 17, the marginal structural models 18, the targeted maximum likelihood estimation 19, and etc.). 1 1.1 Introduction of Causal Inference The main goal of causal inference is to understand the effect of a certain causal variable(s) on an outcome of interest. In real-world applications, identifying causal relationships can be far more challenging than detecting associations, as demonstrated by phenomena like Simpson’s paradox [20]). One of the most widely used frameworks is the so-called “counterfactual” out- come framework, first proposed by [21] and later developed by [22]. In this framework, let A represents the exposure of interest (the treatment variable) and Y be the outcome. The counter- factual outcome is defined as Y (A = a) representing the outcome we would have observed if the exposure A were set to A = a. This differs from the observed outcome Y |A = a, which repre- sents the outcome we observe in a real experiment when A = a. The key distinction is that we can only observe one outcome given a specific A = a, and we can never know the counterfactual outcome under a different value of A = a′. Causal inference is such a technique that seeks to understand the underlying data generating mechanism, bridge the gap between the counterfactual and real word, and find models to estimate the effects. 1.1.1 Causal Graphical Models and Identifications In causal inference, the causal graphical models are a powerful tool that offer both visual and mathematical approaches for studying the underlying causal relationships in data. Central to these models are Directed Acyclic Graphs (DAGs) [23, 24]. A DAG consists of nodes (rep- resenting variables) and directed edges (arrows that depict causal relationships between these variables). A key requirement for DAGs is that they must be ”acyclic,” meaning they cannot contain any loops, ensuring a one-way causal influence from one variable to another. Common 2 relationships depicted in DAGs include confounders (Figure 1.1 (a)), mediator (Figure 1.1 (b)), and collider (Figure 1.1 (c)). The arrowed lines represent causal relationships, while dashed lines represent underlying relationships when conditioning on a variable Z (a property of colliders, also known as Berkson’s paradox 25). In this dissertation, I will focus on the estimation problem, with a particular emphasis on applying causal inference to large-scale biomedical data. A Z Y A M Y A Z Y (a) Confounder (b) Mediator (c) Collider Figure 1.1: Three major types of relationship in causal DAGs. To bridge the gap between the counterfactual world and real world, we focus on the iden- tifiability of causal effects. Several key assumptions help achieve this identifiability. First, we assume consistency, meaning that the counterfactual outcome Y (A = a) is equal to the observed outcome with A = a (i.e., Y (A = a) = Y |A = a). Second, we assume conditional ignorability, which states that there is no unmeasured confounders between A and Y given the observed co- variates. Let Z represents the observed covariates, this assumption requires that A |= Y |Z. Third, we assume positivity, meaning that the probability of being exposed to any level of A = a is greater than zero given values of covariates, i.e., P (A = a|Z) > 0. Other assumptions, such as the Stable Unit Treatment Value Assumption (SUTVA, 26), which requires no interference be- tween subjects, and assumptions specific to Directed Acyclic Graphs (DAGs), like the the causal Markov condition [27] and faithfulness [24, 28], also play a role. However, these assumptions do not directly influence the common identification of the average treatment effect (ATE), which is the primary causal effect estimated in this dissertation. Therefore, we will focus on the three 3 major assumptions: consistency, conditional ignorability, and positivity. Here, we define the ATE as following (for binary exposure A as an example): ATE := E(Y (A = 1)− Y (A = 0)). (1.1) 1.1.2 Mediation Analysis One of the most important branches of causal inference is mediation analysis, an approach used to understand and explain the mechanisms through which an independent variable influences a dependent variable via one or more intermediate variables [29]. These intermediate variables, known as mediators (M , Figure 1.1 (b)), help elucidate the pathways through which causal effects operate. Mediation analysis is crucial for moving beyond simple causal relationships, enabling researchers to explore the underlying processes and mechanisms driving these effects. A M Y Z3 Z2 Z1 4 γ α β Figure 1.2: The illustration of the relationships, effects, and sources of confounders in mediation analysis. Z∗, ∗ = 1, 2, 3 represent subsets of confounders observed for corresponding paths (see corresponding assumptions below). Mediation analysis relies on several common assumptions, particularly regarding condi- tional ignorability. Using the same counterfactual outcome framework and defining the observed 4 confounders as Z, we assume the following: (I) Y (A = a,M = m) |= A | Z (II) Y (A = a,M = m) |= M | Z,A = a (III) M(A = a) |= A | Z (IV) Y (A = a,M = m) |= M(A = a′) | Z. (1.2) We illustrated such confounding relationships in Figure 1.2. It is required that there should be no unmeasured confounders between: (I) treatment and outcome (Z1); (II) mediator and outcome (Z2); (III) treatment and mediator (Z3); and (IV) mediator and outcome, considering the path- way affected by the treatment (path 4 in the figure). These conditions are known as sequential ignorability assumptions [30]. The positivity assumption is adjusted as: (I) treatment positivity: P (A = a|Z = z) > 0ifp(Z = z) > 0; (II) mediator positivity: P (M = m|A = a, Z = z) > 0ifP (Z = z) > 0. The consistency assumption remains the same as previously defined. The causal effect of A on Y can be decomposed into two components: the direct effect and the indirect effect. Specifically, we consider the natrural direct (NDE), which is the effect of A on Y that occurs directly, bypassing the mediator (fixed A, γ in Figure 1.2). The indirect effect (NIE), on the other hand, captures the effect of A on Y through the mediator M (fixed M , αβ in Figure 1.2). For binary A, we have the following decomposition: ATE = [E(Y (1,M(1)))− E(Y (1,M(0)))]︸ ︷︷ ︸ NIE + [E(Y (1,M(0)))− E(Y (0,M(0)))]︸ ︷︷ ︸ NDE (1.3) 5 1.1.3 Causal Inference with Time-varying Covariates Another challenging aspect of causal inference is estimating causal effects in longitudinal data. In this context, the effect of A on Y is dynamic, as it can be influenced by or interact with covariates that change over time. This evolving nature of both treatment and covariates adds complexity to the estimation process, requiring methods that can account for time-varying confounders and interactions. Z0 A0 Z1 A1 Y U0 U1 (a) Sequential randomization Z0 A0 Z1 A1 Y U0 U1 Z0 A0 Z1 A1 Y U0 U1 (b) Sequential conditional randomization (c) Unmeasured confounding Figure 1.3: The illustration time varying exposures and covariates causal DAGs. We illustrate three scenarios of causal relationships with time-varying confounders in Fig- ure 1.3. In these scenarios, U represents unmeasured confounding, and the graphs are as follows: 6 (a) sequential randomization trials with no covariates influencing the treatments A0, A1; (b) se- quential conditional randomization, where only the observed confounders (Z0, Z1) influence the treatments; and (c) unmeasured confounding, with U0, U1 influence the treatments. In practice, scenario (b) is the most commonly studied, under the assumption of sequential exchangeability. With Āk = (A0, A1, . . . , Ak), Z̄k = (Z0, Z1, . . . , Zk) being the treatment and covariates history, the assumption states that : Y (Āk = āk) |= Ak | Āk−1, Z̄k. The positivity assumption is adjusted as: P (Ak = ak|Z̄k, Āk−1 > 0, k = 1, . . . , K. Commonly used techniques in estimating the causal effect under this scenario include G-computation [31], structural nested model [32], and marginal structural model [18, 33, 34]. 1.1.4 Unmeasured Confounding Throughout the discussion, the identification and estimation of causal effects rely on the conditional ignorability assumption, which requires that there be no unmeasured confounding between the variables of interest. However, this assumption is often violated in observational studies [35], making it challenging to accurately estimate causal effects. A U M Y A IV U Y A U Y W (a) Front-Door Model (b) Instrumental Variable (IV ) (c) Negative Control (W ) Figure 1.4: Three approaches resolving unmeasured confounding problem. In recent years, several techniques have been developed to address the issue of unmeasured confounding, denoted as U . Three commonly used methods, illustrated in Figure 1.4, are: the front-door model [23, 36], which adjusts for unmeasured confounding through a mediator that is 7 the sole path through which A affects the outcome (i.e., complete mediation). The instrumental variable method [37, 38], where an IV influences A but is independent of U . An important appli- cation of the IV method is Mendelian randomization, which uses genes as IVs to estimate causal effects [39, 40]. And lastly, the negative control variables [1, 41, 42], where the NC variable is influenced by U and acts as a proxy for adjusting the effect of U on the causal relationship [43, 44]. However, these methods come with their own sets of assumptions, which may be more or less reasonable depending on the application. For example, the front door model requires no unmeasured confounding between A and M or between M and Y . The instrumental variable requires exclusion restriction assumption, meaning the IV influences the outcome only through its effect on A [45]. The negative control variables, depending on whether they are NC exposures or NC outcomes, require different exclusion relationships between A and Y (see 1 for details). All NC variables must also satisfy the condition of U-comparability, meaning they share a similar confounding structure [41]. 1.2 Large-Scale Biomedical Data 1.2.1 Human Connectome Project The Human Connectome Project (HCP) is a collaborative research effort designed to com- prehensively study how the brain’s structural and functional network connections contribute to cognitive and behavioral capabilities, ultimately influencing personality and mental health [46, 47, 48]. Specifically, we focus on a subset of the HCP known as the Amish Connectome Project (ACP, 49). It includes 450 participants aged 18-85 from the Amish population, which is characterized by greater genetic homogeneity, a consistent lifestyle, lower exposure to envi- 8 ronmental variation, and a reduced risk of substance misuse. Data collection follows the broader guidelines of the HCP and includes standard demographic information, structural, functional, and diffusion imaging scans, clinical data with genomic information, and behavioral data such as NIH Toolbox assessments, diagnostic details, and genotypes. We provide a detailed description of the cohort used in our study, along with the measurements and pre-processing procedures, later in Chapter 2 and Appendix A. 1.2.2 Adolescent Brain Cognitive Development The Adolescent Brain Cognitive Development (ABCD) study is the largest long-term study of brain development and child health in the United States. It recruits over 11, 000 children cross the country, following them from ages 9-10 into young adulthood. The primary goal of the study is to understand how various factors, including genetics, environment, physical activity, and sub- stance use, influence brain development and overall health during adolescence [50, 51]. The study collects a wide range of data encompassing neuroimaing, genetic, behavioral and cognitive assessments, environmental and lifestyle, and substance use and mental health data. Participants first enrolled at ages 9-10 with a collection of comprehensive set of baseline data. This is followed by annual behavioral and cognitive assessments, surveys, physical measurements, and environ- mental data, with neuroimaging data acquired every other year. In this dissertation, we focus primarily on the survey and questionnaire data, with further details provided in Chapter 3 and Appendix B. 9 1.2.3 UK Biobank The UK Biobank (UKB) is a large-scale biomedical database designed to enhance our understanding of the genetic and environmental factors that contribute to disease development. It is a prospective cohort study of approximately 500, 000 individuals, aged 40 to 69 at recruitment, from across United Kingdom [52, 53]. The UKB provides a comprehensive resource that allows researchers to explore the interplay between genetics, environment, and lifestyle factors in the development of diseases such as cancer, cardiovascular disease, diabetes, mental health disorders, and neurodegenerative diseases. The study collects data from various sources, including genetic information, lifestyle and environmental data, health records, biological samples, and imaging data from a subcohort of around 100, 000 participants. It also provides some repeated assessments on, for example, imaging, and questionnaire surveys. It also includes repeated assessments, such as imaging and questionnaire surveys. In our study, we focus on structural brain imaging data along with biological testing variables and demographic information. Further details on the data and analysis are provided in Chapter 4 and Appendix C. 1.3 Overview In this dissertation, I aim to address the application of causal inference in the context of large-scale biomedical data, with a particular focus on the estimation of causal effects. Three key research questions are explored: (i) the mediation analysis with the existence of multivariate mediators and multivariate outcomes; (ii) the longitudinal causal effect estimation with high- dimensional covariates; and (iii) the unmeasured confounding problem with multiple outcomes and moderate exposures. The following subsections provide the summaries of the motivation, 10 current challenges, proposed solutions, and brief conclusion for each project. 1.3.1 multivariate-mediator and multivariate-outcome mediation model (MMO) Multimodal neuroimaging data have attracted increasing attention for brain research. An integrated analysis of multimodal neuroimaging data and behavioral or clinical measurements provides a promising approach for comprehensively and systematically investigating the under- lying neural mechanisms of different phenotypes. However, such an integrated data analysis is intrinsically challenging due to the complex interactive relationships between the multimodal multivariate imaging variables. To address this challenge, a novel multivariate-mediator and multivariate-outcome mediation model (MMO) is proposed to simultaneously extract the latent systematic mediation patterns and estimate the mediation effects based on a dense bi-cluster graph approach. A computationally efficient algorithm is developed for dense bi-cluster struc- ture estimation and inference to identify the mediation patterns with multiple testing correction. The performance of the proposed method is evaluated by an extensive simulation analysis with comparison to the existing methods. The results show that MMO performs better in terms of both the false discovery rate and sensitivity compared to existing models. The MMO is applied to a multimodal imaging dataset from the Human Connectome Project to investigate the effect of systolic blood pressure on whole-brain imaging measures for the regional homogeneity of the blood oxygenation level-dependent signal through the cerebral blood flow. 11 1.3.2 Marginal Structure Ensemble Learning Model (MASE) Evaluating the effects of time-varying exposures is essential for longitudinal studies. The effect estimation becomes increasingly challenging when dealing with hundreds of longitudinally measured confounders. We propose a Marginal Structure Ensemble Learning Model (MASE) to provide a marginal structure model (MSM)-based robust estimator under such longitudinal set- ting. The proposed model integrates multiple machine learning algorithms to model propensity scores and a sequence of conditional outcome means such that it allows many confounders with potential non-linear confounding effects to reduce the risk of inconsistent estimation. Extensive simulation analysis demonstrates the superiority of MASE over benchmark methods (e.g., MSM, G-computation), yielding smaller estimation bias and improved inference accuracy. We further apply MASE to the adolescent cognitive development study to investigate the time-varying ef- fects of sleep insufficiency on cognitive performance. The results reveal an aggregated negative impact of insufficient sleep on cognitive development among youth. 1.3.3 Unmeasured Confounding with Multiple Outcomes and Exposures A key challenge in applying causal inference to real-world data is the presence of unmea- sured confounding. To address this issue, we propose a novel approach using the idea of proximal learning. The proposed model leverage the information across multiple outcomes with the ex- istence of multiple exposures based on linear structural equation models. Extensive simulation studies demonstrates the superiority of our method. We apply the proposed method to the UK Biobank data to study the influence of plasma metabolic dysfunciton on brain white matter. The results identified six brain regions influenced by metabolic biomarkers in the overall data, and 12 four distinct regions in individuals who are non-carriers of the apolipoprotein ε4 gene. A fur- ther subgroup analysis on the spolipoprotein ε4 carriers validate the effectiveness of our method. Additionally, we provide a review of existing debiasing methods for handling unmeasured con- founding in both methodology and comprehensive numerical experiments. 1.4 Organization of the Dissertation In the remainder of this dissertation, we address the three key questions surrounding the application of causal inference in large-scale biomedical data. In Chapter 2, we introduce the model, estimation, and application of the multivariate-mediator and multivariate-outcome medi- ation model, along with a thorough evaluation of its numerical performance. Chapter 3 presents the framework for the marginal structure ensemble learning model, including application results and numerical experiments. In Chapter 4, we present a proximal learning-based method for han- dling unmeasured confounding with multiple outcomes and exposures. This chapter includes a comprehensive numerical study, comparing our method to existing approaches in terms of both experimental performance and methodological differences, alongside two illustrative data appli- cations. In general, this dissertation explores the application of causal inference in large-scale biomedical data from three distinct but interrelated perspectives. The methods developed and applied in these studies pave the way for future advancements in causal inference applications and support clinical investigators in disease diagnosis and treatment selection. 13 Chapter 2: multivariate-mediator and multivariate-outcome mediation model (MMO) 2.1 Introduction The joint analysis of multiple types of neuroimaging data (i.e., multimodal imaging) has garnered increasing interest for brain research, as these data can characterize brain functions and structure from different yet complementary angles [54]. For example, functional magnetic res- onance imaging (fMRI) can capture neural responses to external stimuli. Resting-state fMRI is commonly used to assess functional connectivity between neural populations from different brain areas. Diffusion-weighted MRI evaluates the molecular function and micro-architecture. Recently, regional homogeneity (ReHo) has been widely discussed in the resting-state literature [55, 56]. ReHo is a measure of brain activity that evaluates the summarized functional connec- tivity between a voxel and its nearest neighbors. It is based on the blood oxygenation level- dependent (BOLD) signal. These imaging variables from multimodal datasets can be influenced by peripheral and behavioral conditions simultaneously and, at the same time, interact with each other. The current research is a multimodal imaging study investigating cardiovascular disease risks, such as the effects of systolic blood pressure (SBP) on ReHo via the influence of cerebral 14 blood flow (CBF). We develop the hypothesis based on previous findings of the physiological effects of SBP on CBF [57], and CBF on ReHo [58, 59]. This study uses two brain imaging modalities and thus, has two sets of spatially dependent multivariate variables. One way to sys- tematically study the underlying mechanism of SBP → CBF → ReHo is to use a multivariate- mediator and multivariate-outcome mediation model (MMO). We use multivariate mediation methods in the integrated analysis of SBP through multimodal imaging mediation. Statistical mediation analysis has been widely applied in neuroimaging studies [60, 61, 62]. Certain models have been adopted for image-based mediators. For example, [63] proposed an orthogonal mediator decomposition method and [64] used a sparse principal component analysis in a multiple-mediator analysis. Advanced models have also been developed to handle mediation analysis with multivariate exposures and multivariate mediators [65, 66]. However, there is still a methodological gap in mediation analysis where the mediator and outcome are both multivariate. To fill this gap, we propose a novel mediation model to handle multivariate-mediators and multivariate-outcomes. In our application, extracting the underlying mediation pathways is nat- urally challenging because the mediation pathways are tangled between imaging modalities with complex data structures. In our motivation data example, a small set of CBF mediators related to the primary brain arteries can mediate most of the effects of peripheral blood pressure (exposure) on localized ReHo measures (outcomes). Here, we mimic this neurobiological property (i.e., sys- tematic mediation patterns) by introducing the concept of dense bi-clusters. We use the following example to illustrate the concept of dense bi-clusters while referring the readers to Section 2.2.2.1 for the detailed introduction. In a dataset with 100 mediators and 100 outcomes, there exist 10000 potential pathways, among which 200 (2%) are positive pathways, and 9800 are negative. If pos- itive pathways are evenly distributed, a bi-cluter consisting of 10 mediators and 20 outcomes is 15 expected to cover around four positive pathways (10 × 20 × 2%). The 10-mediator-20-outcome bi-cluster is dense if it covers a large number (say 120) of positive pathways suggesting unevenly distributed positive pathways. The dense bi-clusters is particularly useful for our application because they can reveal the underlying complex mediation pathways, which is required for me- diation effect estimation. We develop a computationally efficient algorithm that is tailored to extract the latent mediation pathways among multivariate mediators and multivariate outcomes with guaranteed optimum convergence. Based on the extracted mediation dense bi-clusters, we estimate mediation effects by performing an orthogonal transformation for the selected media- tors. A new form of statistical inference is adopted to assess the significance of the mediation dense bi-clusters. We apply this approach to the Amish Connectome Project, which is a subcohort of the Hu- man Connectome Project, to investigate the influence of SBP on ReHo through CBF. Six CBF regions are identified as momentous mediators, and 59 ReHo regions are affected by SBP through them. These CBF regions are analogous to those found in previous studies, which proved that they can trigger changes to metabolism and higher-order information processing. Our findings are aligned well with previous research in the literature, since a negative influence is observed from SBP to the CBF latent factor, and there are positive effects between the CBF factor and ReHo regions. Our method is among the first to use a mediation analysis with high-dimensional mediators and outcomes. Unlike conventional multivariate three-step mediation inference (e.g., 61) or a marginal mediation model with Benjamini–Hochberg false discovery rate (BH-FDR) correction, MMO can substantially improve the statistical power and prohibit the finding of false positives by fully leveraging the latent organized mediation patterns. While the proposed frame- work is applicable to causal mediation analysis for datasets with manipulable causal variables and 16 mediators, we can also use it for pathway analysis in observational studies (i.e., causal variables and mediators are not manipulable). In the neuroimaging statistics literature, pathway analysis has been widely used to investigate the potential pathways among exposures - imaging variables - behavior outcomes [66, 67]. We performed the pathway analysis as our data example is collected from an observational study. This chapter is organized as follows. Section 2.2 describes MMO and the steps for dense bi-cluster extraction, mediation effect estimation, and inference. Section 2.3 applies our frame- work to the Amish Connectome data and explains the identified CBF regions and mediation effects. Section 2.4 implements simulations to demonstrate the empirical performance of our method. We conclude the chapter with a discussion in Section 2.5. 2.2 Methods 2.2.1 Multivariate mediation model We consider a setup with a single-exposure, multivariate-mediator, and multivariate-outcome mediation model, as illustrated in Figure 2.1. Our model includes p mediating imaging variables from the first imaging modality and q outcome imaging variables from the second imaging modal- ity. To perform the mediation analysis on a sample of n participants, let X = (X1, . . . , Xn) ′ ∈ Rn be the observed exposure vector for n subjects. Let M = {M(i)}pi=1 denote the set of mediat- ing imaging variables, where M(i) ∈ Rn is the i-th mediator. Similarly, define Y = {Y(j)}qj=1 to be the set of imgaing outcomes, where Y(j) ∈ Rn is the j-th outcome. Then, the univariate mediation analysis is denoted by X→M(i) → Y(j). However, this may not reveal the systematic mediation pattern nor lead to an accurate inference. Therefore, the goal of the current research is 17 to develop MMO to identify and estimate the underlying systematic mediation effects. (a) Mediation model with observed data (b) Mediation model with identified (c) Mediation model with systematic patterns mediating factors Figure 2.1: We illustrate the multivariate mediator multivariate outcome mediation analysis: (a) is the mediation model based on observed data with massive and complicated mediation patterns; (b) is the mediation model with recognized systematic mediation patterns, showing X affect- ing a subset of outcomes (Y1) through a corresponding set of mediators (M1), the bold arrows represent relations between sets; and (c) is the mediation model with factorized mediators for a specific mediator set M̃c and outcome set Yd. The systematic mediation effects may involve only a subset of the p mediating imaging variables and a subset of the q outcomes. We further assume that the systematic mediation pattern is in an organized structure. That is, the exposure X influences a subset of outcomes Yd = {Y(j) d } Jd j=1 (d = 1, . . . , D) only through the subset of mediatorsMc = {M(i) c }Ici=1 (c = 1, . . . , C), where Ic and Jd are the number of mediators and outcomes in Mc and Yd respectively. For example, in our real data analysis, we have n = 204, p = q = 107 as the input data, and 18 D = C = 1, I1 = 6, J1 = 59 based on Step 1 analysis (see Section 2.3). We denote a systematic mediator-outcome-set pair by {Mc,Yd} [Figure 2.1(b)], ⋃C c=1Mc ⊂ M and ⋃D d=1 Yd ⊂ Y . In practice, {Mc,Yd} pairs are unknown. We provide a graph model-based procedure for estimating {Mc,Yd} in Section 2.2.2. Here, we first present the mediation model with known mediation patterns ⋃ {Mc,Yd}. Given a systematic mediation set pair {Mc,Yd}, we can model the mediation X→Mc → Yd. In practice, the mediators in imaging data are highly correlated and thus obscure the identifi- cation of the mediation effect. To ensure the identifiability, we calculate a set of orthogonal latent factors M̃c = {M̃ (l) c }Lc l=1 ofMc by factorization model (see section 2.2.2.2), where M̃c is a vari- able set consisting of Lc orthogonal factors corresponding toMc. For example, M̃(l) c , M̃(l′) c ∈ M̃c (1 ≤ l < l′ ≤ Lc) are two orthogonal factors [Figure 2.1(c)]. Specifically, we present the MMO mediation model for each {M̃c,Yd} by a linear struc- tural equation model (LSEM) as: E(M̃(l) c |X = x) = γl0 + αlx, E(Y(j) d |X = x,M̃c = {m̃(1) c , . . . , m̃(Lc) c }) = υj0 + θjx + Lc∑ l=1 βljm̃(l) c , (2.1) where j = 1, . . . , Jd and l = 1, . . . , Lc. Lower-case variables (e.g., x and m̃(l)) denote manipu- lated values. Our method can be applied for the causal mediation analysis when the causal variable (or exposure) X and mediators M can be manipulated in experiments (see the specification and assumptions of causal mediation model in A.1.4). In most observational studies where X and M cannot be manipulated, we apply the MMO to identify the potential mediation pathways and 19 estimate the mediation effects. In the neuroimaging statistics literature, this approach is often referred to as pathway analysis [66, 67]. We consider mediation analysis and pathway analysis as interchangeable terms for the rest of the chapter. 2.2.2 Model estimation In this section, we estimate the parameters in MMO using a two-step procedure. In step 1, we aim to extract potential mediation pairs {Mc,Yd} from X→M→ Y. Then, we estimate the latent factors M̃c and the mediation effects in step 2. 2.2.2.1 Step 1: Mediation bi-cluster estimation We estimate ⋃ {Mc,Yd} by leveraging a graph model. Let bipartite graph G(U, V,E) denote all potential marginal mediation pathways of X → M → Y and let W be the weighted matrix of the edges in G. The node set U represents the mediator imaging variables with |U | = p, whereas the node set V denotes the outcome imaging variables with |V | = q. The edge set E, where |E| = p · q, refers to all potential mediation pathways. The weighted edge set W is calculated with the sample data. For example, wij = − log pij , where pij is the p-value for the mediation model X→M(i) → Y(j) [68]. The analysis of X→M(i) → Y(j) can be considered as a sure independent screening procedure for multivariate mediation analysis by [69]. In our model, this procedure aims to extract the latent mediation patterns instead of estimating the mediation effect. When systematic mediation patterns exist, we define a dense bi-cluster Uc⊗ Vd ⊂ G as the Cartesian product of Uc ⊂ U and Vd ⊂ V , and P (wij > wi′j′) = 1 for i and j from {Uc⊗Vd} and 20 i′ and j′ not in {Uc ⊗ Vd} (see formal definition of dense bi-cluster in the A.1.1). Our goal is to extract all dense bi-clusters {Uc ⊗ Vd} from G based on W from the sample data. The node sets of an extracted Uc⊗Vd become the mediation pair {Mc,Yd} for the above-mentioned mediation analysis in section 2.2.1 Since the combinatorial of {Uc⊗ Vd} from G is infinite, a sound heuristic is required to re- cover {Uc⊗Vd}. Therefore, for the sample matrix W, we tend to assign wij with high values into {Uc ⊗ Vd} while maximizing the proportion of high-value edges in each dense bi-cluster. This heuristic can be translated into an objective function that maximally includes informative edges with a set of subgraphs under minimal size. The maximal weight inclusion ensures high sensi- tivity, whereas the subgraph size penalty prohibits the finding of false positives. The objective function is given by: argmax {Uc⊗Vd} dλ(Uc, Vd) = argmax {Uc⊗Vd} ∑ c,d ∑ i∈Uc,j∈Vd [log(|wij|)− 1 2 λ log(|Uc||Vd|)], (2.2) where λ is a tuning parameter for the size penalty term. When λ is large, our approach tends to extract more parsimonious {Uc ⊗ Vd} with a reduced size and increased density. Estimating Eq. (2.2) is an NP hard problem [70]. Thus, we translate the objective func- tion into an adaptive dense directed subgraph detection problem using greedy algorithms [71]. Multiple dense bi-clusters {Ûc⊗V̂d} can be captured by an iterative procedure [70]. When imple- menting the greedy algorithms, the choice of λ is critical for extracting {Uc⊗ Vd} [72]. To avoid an ad hoc parameter selection and ensure maximal reproducibility, we develop a data-driven ap- proach to automatically determine λ based on the Kullback–Leibler (KL) divergence criterion [73]. 21 Recall that the KL divergence is given by: D(P ||Q) = ∑ i,j P (δij) log P (δij) Q(δij) , (2.3) which measures the discrepancy between two distributions. Specifically, for a systematic dense bi-cluster, let δij be an indicator variable for the edge set E. δij = 1 if i ∈ U and j ∈ V are connected, and δij = 0 otherwise. P and Q are two distributions of δij , which follows a mixture Bernoulli distribution P (δij). If there are dense bi-clusters in G: P (δij) =  Bern(π1), if {i, j} ∈ {Uc ⊗ Vd}, Bern(π0), otherwise, (2.4) where π1 > π0 are the parameters of the two Bernoulli distributions. In contrast, when {Uc ⊗ Vd} = ∅, δij follows a Bernoulli distribution Q(δij), where Q(δij) = Bern(π), (2.5) and π is uniform in G. The KL divergence measures the discrepancy between the distributions of the systematic dense bi-cluster mediation patterns (2.4) and the null model (2.5) [74]. The divergence reaches a maximum when the estimated {Ûc ⊗ V̂d} bi-cluster structures are the same as the underlying true patterns {Uc ⊗ Vd} (Theorem 1). This can guide the tuning. As we assume that there are systematic mediation patterns, a larger KL divergence indicates that the underlying systematic patterns {Uc⊗Vd} are better recovered from W [75]. Therefore, we maximize the KL divergence 22 in (2.3) to objectively select the tuning parameter λ by: argmax λ D(P{Ûc⊗V̂d(λ)}∥Q) = argmax λ ∑ c,d  ∑ i,j∈Ûc⊗V̂d(λ) ( δijπ1 log π1 π + (1− δij)(1− π1) log (1− π1) (1− π) ) + ∑ i,j /∈Ûc⊗V̂d(λ) ( δijπ0 log π0 π + (1− δij)(1− π0) log (1− π0) (1− π) ) . (2.6) In practice, δij is unknown and can be approximated by δ̃ij = I(wij > r) [76]. The choice of the threshold r depends on a prior distribution h0(r) (such as the mixture binomial model, as we assumed) and can be integrated out by ∫ D(P{Uc⊗Vd(λ)}||Q)h0(r)dr. Further, π1, π0, and π are also unknown and can be estimated by maximum likelihood estimation based on the estimated dense bi-cluster structures (2.4) (details are in the A.1.3 of Appendix). Last, we optimize λ using a grid search to maximize (2.6). We integrate the objective function optimization and objective tuning parameter selection in Algorithm 3 in the A.1.5. The output of this algorithm is {Ûc ⊗ V̂d}. By Theorem 1, we show that the optimum can be attained under mild regularity conditions. Theorem 1. Let {Uc ⊗ Vd} be the true underlying mediation dense bi-cluster pattern and {U ′ c ⊗ V ′ d} be the dense bi-cluster pattern different from {Uc ⊗ Vd}. Then, we have D(P{Uc⊗Vd}||Q) > D(P{U ′ c⊗V ′ d}||Q). (2.7) The proof of this Theorem is in the A.1.2 of the Appendix. 23 2.2.2.2 Step 2: Mediation effect estimation For each estimated {Mc,Yd} (i.e., Uc⊗Vd), we next estimate the mediation effects. We first calculate the orthogonal mediating factors from a set of correlated mediating imaging variables Mc to ensure mediation identifiability. As demonstrated in Figure 2.1(c), let Mc ∈ Rn×Lc be the observed matrix ofMc, c = 1, . . . , C. For the orthogonal factor matrix M̃c of M̃c, we apply the following factorization model: M̃c = Mcηc+ϵc. Here, η(c) ∈ RIc×Lc represents the corresponding loading, and ϵc ∈ Rn is the error term. The factorization model above can be solved with either classical low-rank methods or using the direction of mediation method designed for multiple mediators [63]. Although low-rank models are commonly used to estimate the aggregating effect of correlated multiple mediators (e.g., orthogonal factorization and sparse princinple component analysis by 64), the alternative approach (e.g., the weighted sum method) can also be applied (see A.3.3). Given the orthogonal mediating factors, we follow the commonly used imaging mediation procedure to estimate indirect effect (IE) and direct effect (DE). Specifically, with estimated {Mc,Yd}, the estimated indirect and total effects are: ÎEd = 1 Jd Jd∑ j=1 Lc∑ l=1 β̂ljα̂l, T̂Ed = 1 Jd Jd∑ j=1 { Lc∑ l=1 β̂ljα̂l + θ̂j}, (2.8) where β̂lj and α̂l are estimated using equation (2.1) based on the orthogonal factors. The media- tion proportion then will be: ÎEd T̂Ed . In practice, the numeric ranges of the mediating imaging variables and imaging outcomes 24 can vary, the directly estimated mediation effects are less informative about the level of the me- diation effect. Therefore, we adopt the commonly used partial correlation as the standardized mediation effect size (77, see details in the A.1.6). 2.2.3 Inference for dense bi-clusters The goal of our statistical inference is to test the systematic mediation pattern for each estimated dense bi-cluster Ûc ⊗ V̂d. Since Uc ⊗ Vd are not specified before the data analysis, the classical statistical inference methods for the fixed parameter(s) are not applicable. We consider Ûc⊗V̂d as a data-driven “cluster” object. In the large body of the neuroimaging statistics literature, permutation test-based family-wise-error (FWE) methods are widely used to test the statistical significance of clusters [78, 79]. In our application, we implement the FWE-based inference procedure as follows. First, we permute the labels of mediators and outcomes to generate B permutation datasets Db = (Xb,Mb,Yb), b = 1, . . . , B, where, for example, B = 10000. Next, we apply the MMO Algo- rithm 3 to each permuted dataset and the observed dataset. For each permutation dataset Db, the test statistic is calculated by: Tb = max {Ûc,V̂d} ∑ i∈Ûc,j∈V̂d − log(pbij)√ |Ûc||V̂d| , (2.9) where pbij is calculated from the mediation analysis Xb → Mb(i) → Yb(j). The test statistic integrates the mediation effect size, and the extent and the density of the mediation dense bi- cluster, and thus, it is superior than the commonly used cluster extent [80]. Last, we calculate the FWE q-values for the observed mediation dense bi-clusters based on the percentile ranks of their 25 test statistics among {Tb} from the permuted datasets. 2.3 Data Example We applied the proposed method to the multimodal brain imaging data from the Amish Connectome Project (ACP), which is a subcohort of the Human Connectome Project [49]. In this study, there were 94 male and 110 female participants (i.e. n = 204) with ages 39.3± 16.9. One goal of the Amish Connectome Project is to investigate the association between the risk factors for cardiovascular disease and brain structures and functions (i.e., heart and brain). Multiple clinical measurements for the risk factors for cardiovascular disease were collected, including SBP and total cholesterol. Multimodal imaging data were also acquired, including CBF, which was calculated from arterial-spin labeling imaging data [81], and ReHo, which was calculated from resting-state fMRI data [55]. Specifically, ReHo was derived from the BOLD signal using Kendall’s coefficient of concordance. In our data example, both mediators (CBF) and outcomes (ReHo) are region-level brain imaging measures based on a commonly used brain atlas of 107 regions (JHU-MNI, 82). Previous studies have revealed significant correlations between SBP and CBF and between SBP and ReHo [83, 84]. Covariates (e.g., age and sex) were adjusted in the analysis. However, it remains unclear how SBP can affect ReHo by influencing CBF, and this naturally requires a complex mediation analysis with high-dimensional mediators and outcomes. The details of the multimodal imaging data acquisition and preprocessing, and the validity of the studied pathways are provided in the A.2.1 of the Appendix. The two-step procedure (MMO) was applied to the multimodal imaging data with n = 204 and p = q = 107. We first performed mediation analyses to obtain the W matrix and then 26 extracted mediation dense bi-clusters using Algorithm 3. The KL criterion selected λ = 1.8. The results show one significant dense bi-cluster with six CBF ROIs and 59 ReHo ROIs (i.e., C = D = 1, I1 = 6, J1 = 59) with an FWE rate adjusted q-value of 0.019. Figures 2.3(a) and 2.3(b) illustrate the input matrix and outcome matrix for step 1, and Figure 2.2 shows CBF and ReHo ROIs overlaid onto a 3D brain imaging template. Next, we applied the factoriza- tion method to estimate the orthogonal vectors from the six CBF ROIs. Interestingly, only one vector was extracted. It explains 88% of the variance of the six mediators. Last, we estimated the systematic mediation effect. The mean of the mediation partial correlation coefficients for SBP→ CBF component→ 59 ReHo is −0.094 (a medium effect size, 85) with a standard devi- ation of 0.015. The negative mediation effect suggests that SBP can decrease ReHo in multiple brain areas by first reducing CBF in the six areas where CBF regions are positively associated with ReHo. The partial correlations between pairs of SBP, CBF, and ReHo are shown in Fig- ure 2.3(c). The average mediation proportion across all outcomes is 88.30%. The systematic mediation pattern discovered by MMO reveals the critical pathways for CBF: cingulate gyrus left (CingG L), cingulate gyrus right (CingG R), supramarginal gyrus left (SMG L), supramarginal gyrus right (SMG R), cingulum (cingulate gyrus) left (CGC L), and inferior occipital gyrus right (IOG R) [Figure 2.2(a)]. Specifically, for the six CBF regions, CingG L, CingG R, and CGC L are centred around the middle cerebral, while SMG L, SMG R, and IOG R are close to posterior cerebral arteries, which play a central role in affecting ReHo. Regional-specific changes partially influence ReHo scores related to neuropsychiatric illness and CBF, where CBF is itself influenced by SBP. The 59 ReHo regions cover a large proportion of the brain, including superior frontal gyrus left (SFG L), middle frontal gyrus right (MFG R), etc. in the frontal lobe [Figure 2.2(b)]; superior corona radiata left (SCR L), CGC R, etc. in the 27 (a) CBF Regions (b) ReHo Frontal (c) ReHo Corona (d) ReHo Temporal (e) ReHo Parietal (f) ReHo midbrain Figure 2.2: Selected Regions for CBF (mediators) and ReHo (outcomes): (a) is the six regions for CBF (mediators) including the six region names which are close to the bilateral primary cerebral arteries, (b)-(f) demonstrate the selected regions of ReHo (outcomes) in the areas of frontal, corona, temporal, parietal and midbrain. 28 (a) Mediation p-matrix (b) Mediation dense bi-cluster (c) Mediation effect by partial correlation Figure 2.3: The mediation analysis results for the data example. The mediation variable is systolic blood pressure (SBP). (a) is the heatmap of − log(p)-matrix based on the input data. (b) is the heatmap with detected systematic mediation patterns. (c) is the mediation results based on the mediating factor. β and θ effects are averaged across related outcome ROIs. Overall, the mediation proportion is high. 29 corona radiata [Figure 2.2(c)]; parahippocampal gyrus left (PHG L), superior temporal gyrus right (STG R), etc. in the temporal lobe [Figure 2.2(d)]; superior parietal lobule left (SPG L), SMG R, etc. in the cuneus [Figure 2.2(e)]; and insular left (Ins L) and putamen right (Put R) in the midbrain [Figure 2.2(f)]. These brain areas are involved in multiple cognitive, language, and motor functions. For example, CingG is related to metabolic reduction for higher executive functions [86]. The SMG, located in the parietal area, is highly correlated with information processing and metabolism [87]. The CBF from six regions supports ReHo in the above regions. Our findings suggest that a reduced CBF can lead to a lower level of ReHo. These explain, in part, how the risk factors for cardiovascular disease can adversely affect daily functions by influencing the CBF and, accordingly, ReHo regions. The details of the selected CBF and ReHo regions and the corresponding full names of the regions are provided in the A.2.4 of the Ap- pendix. We further performed a sensitivity analysis by following [3] to assess the effects of the potential violation of the mediation assumptions. The results suggest that our mediation results are generally robust to moderate violation of assumption (details are in the A.2.3). For comparison, we also applied the classical BH-FDR methods and three-step regression methods in [61]. However, no regions were identified under either method, and the systematic mediation pattern was not discovered. The BH-FDR is conservative when there is enough noise [88]. We also applied the Pathway Lasso method [89]. Although this method allows us to select only the CBF regions, we applied it to each outcome and took the union of the results. Conse- quently, 32 CBF regions were selected. They contain the regions selected by our method. 30 2.4 Simulation We simulated the multimodal imaging data for the mediation analysis, including a scalar exposure variable, multivariate mediating imaging variables, and multivariate imaging outcomes. Specifically, we simulated the exposure X, orthogonal mediator factors M̃, and outcomes Y with (X, M̃,Y) ∼ N(µ,Σ), where the dimension of the multinormal distribution was the sum of the number of latent factors and outcomes plus the exposure, and µ = (µX , µM̃, µY) ′ and Σ is covariance matrix (see details in A.3.1 in the Appendix). We then generated observed mediator variables M, as described in step 2. 2.4.1 Parameters and settings For simplicity, we fixed µ = 0. The parameters in the data simulation were the sample size n, effect size of the mediation, dimension of mediating and outcome variables, and the cardinality of the mediation structure. We let p = q = 100 and 300, n = 80 and 200, set the mediation effect as±0.24 and 0.16, and included two cardinalities |Uc⊗Vd| = 10×10 and |Uc⊗Vd| = 20×20. We assessed the performance of our method for combinations of different sample sizes, mediator and outcome dimensions, mediation dense bi-cluster sizes (cardinalities), and effect sizes. We also evaluated the performance of our method under the settings of non-normally distributed mediators and outcomes. Specifically, we considered the Cauchy distribution and Laplace distribution. Moreover, we tested our method under the settings with mediators generated from non-orthogonal factors (factor correlations varying from 0.5–0.8). The details of data generation and analysis results of these two settings are provided in A.3.2 in the Appendix. 31 2.4.2 Comparable methods We first assessed the accuracy of the multivariate mediator and multivariate outcome me- diation analysis using the results of step 1. The goal of the step 1 analysis is to recover the systematic mediation pattern. We compared the results for our method with those of existing methods, including univariate mass methods BH-FDR and a three-step LSEM from [61] and a pathway least absolute shrinkage and selection operator (Lasso) method (PathL) from [89]. Since pathway Lasso by default handles a single outcome, we aggregated selected mediators for each outcome. We calculated the sensitivity and FDR of the mediation patterns as the evaluation cri- teria by comparing the estimated Ûc ⊗ V̂d with the true Uc ⊗ Vd. Specifically, the sensitivity is ∑ c,d |(Ûc ⊗ V̂d) ∩ (Uc ⊗ Vd)|∑ c,d |Uc ⊗ Vd| , and the FDR is ∑ c,d(|Ûc ⊗ V̂d| − |(Ûc ⊗ V̂d) ∩ (Uc ⊗ Vd)|)∑ c,d |Ûc ⊗ V̂d| , where |(Ûc ⊗ V̂d) ∩ (Uc ⊗ Vd)| is the number of edges intersecting between the estimated set and the true set. We evaluated the accuracy of the mediation effect estimation for step 2, given the estimated mediation patterns. We compared our model with the ‘oracle model’, which directly estimates the mediation effects based on true latent mediating factors and a mediation low-rank factoriza- tion model (medLRM) that decomposes all mediating variables without leveraging the estimated mediation pattern. We calculated the estimation bias of the mediation effects of MMO and medLRM in contrast to the oracle model. We also calculated the coverage probabilities with 32 asymptotic 95% confidence intervals [90]. 2.4.3 Results We summarized the edge-wise results for mediation pattern detection in Figure 2.4 and Ta- ble 2.1. Our approach MMO achieved high sensitivity across all settings with a well-controlled FDR. These results indicated that MMO can accurately reveal the mediation pattern. In compari- son, the sensitivity rates of the competing methods were lower when the sample size or effect size was relatively small, although the FDRs were similar. The reason for this difference is mainly because MMO can lend strength to mediators and outcomes, and thus, it joins the weak signals in organized structures. When the sample size and effect size were larger, all methods had excel- lent performance. Moreover, the performance of MMO was more robust for both negative and positive mediation effects. (a) FDR Comparison (b) Sensitivity Comparison Figure 2.4: Edge-wise bi-cluster simulation results for all methods: the scenarios on x axis cor- respond to the scenarios in Table 2.1 from left to right, top to bottom. 33 Mediation effect size and Sample size Method Cluster Size = 10× 10 Cluster Size = 20× 20 FDR sens spec FDR sens spec BH 0 0.021(0.135) 1 0 0.155(0.359) 1 Effect=0.24 3-step 0.118(0.247) 0.371(0.459) 1∗ 0.014(0.101) 0.296(0.433) 1∗ n=80 PathL 0.222(0.245) 0.490(0.179) 0.973(0.042) 0.098(0.018) 0.365(0.127) 0.979(0.042) MMO 0.106(0.188) 0.955(0.129) 0.998(0.004) 0.057(0.116) 0.950(0.131) 0.997(0.001) BH 0.017(0.048) 0.728(0.434) 1∗ 0.001(0.013) 0.895(0.299) 1∗ Effect=0.24 3-step 0.149(0.162) 0.952(0.161) 0.998(0.003) 0.022(0.24) 0.974(0.131) 1∗ n = 200 PathL 0.129(0.102) 0.740(0.182) 0.984(0.017) 0.106(0.125) 0.629(0.086) 0.980(0.023) MMO 0.040(0.084) 0.994(0.028) 1∗ 0.004(0.020) 1 1∗ BH 0 0 1 0 0.017(0.128) 1 Effect=0.16 3-step 0.082(0.215) 0.204(0.368) 1∗ 0.009(0.025) 0.112(0.294) 1∗ n = 80 PathL 0.208(0.248) 0.530(0.142) 0.971(0.046) 0.106(0.181) 0.405(0.128) 0.976(0.043) MMO 0.136(0.208) 0.950(0.154) 0.997(0.005) 0.124(0.129) 0.937(0.152) 0.993(0∗) BH 0 0.607(0.481) 1 0.004(0.019) 0.758(0.409) 1∗ Effect=0.16 3-step 0.114(0.160) 0.899(0.207) 0.998(0.003) 0.037(0.082) 0.946(0.189) 0.998(0.005) n = 200 PathL 0.086(0.193) 0.460(0.135) 0.993(0.015) 0.036(0.095) 0.379(0.141) 0.996(0.009) MMO 0.041(0.076) 1 1∗ 0.047(0.078) 1 0.998(0.004) BH 0∗ 0.763(0.420) 1 0.007(0.027) 0.863(0.341) 1∗ Effect=-0.24 3-step 0.003(0.019) 0.150(0.359) 0.999(0.003) 0.0002(0.002) 0.111(0.313) 1∗ n = 200 PathL 0.167(0.214) 0.630(0.134) 0.981(0.027) 0.057(0.121) 0.445(0.107) 0.993(0.016) MMO 0.030(0.099) 0.999(0.009) 0.999(0.002) 0.023(0.070) 0.993(0.057) 0.999(0.004) Sample size = 200 Sample size = 80 p, q = 300 BH 0.017(0.060) 0.504(0.492) 1∗ 0 0 1 Effect=0.16 3-step 0.131(0.155) 0.712(0.437) 0.999(.0002) 0.025(0.033) 0.096(0.287) 1∗ Cluster Size = 30× 30 PathL 0.087(0.084) 0.667(0.096) 0.993(0.007) 0.238(0.149) 0.393(0.113) 0.985(0.011) MMO 0.076(0.129) 0.966(0.106) 0.999(0.003) 0.089(0.126) 0.867(0.219) 0.999(0.002) Table 2.1: Edge-wise results of extracting mediation pattern Uc⊗ Vd by all methods. We demon- strate the false discovery rate (FDR), sensitivity (sens) and specificity (spec) across different settings for all comparable methods. ∗ represents a rounded number. 34 We listed the estimates of the mediation effect size (step 2 of MMO) in Table 2.2. Since the true mediators and corresponding outcomes were implicit as the input to each method, we used the results from the oracle model analysis, which used true mediating factors M̃, as a reference. We then performed a low-rank mediation analysis with MMO and medLRM and compared the estimated mediation effect size with that of the oracle model. We calculated the average medi- ation effect size for all outcomes for each simulated dataset. In general, MMO outperformed medLRM across all settings because it can more accurately capture the true mediators. In addi- tion, our model is generally robust to the non-normally distributed mediators and outcomes (See Table A.2 and A.3 in the A.3.2 ). When non-orthogonal mediating factors present, the mediation pathway extraction by the Step 1 of our method remain accurate. However, the performance of the mediation effect estimation can vary due to non-orthogonal factors (See Table A.4 and A.5 in the A.3.2). Effect size and Sample size Method signal region = 10× 10 signal region = 20× 20 Mean Bias Coverage Prob Mean Bias Coverage Prob effect=0.24 n=80 ‘Oracle Model’ 0.295(0.055) 0.292(0.058) medLRM 0.073(0.093) 0.222(0.093) 15.9% 0.142(0.111) 0.149(0.105) 47.8% MMO 0.259(0.095) 0.036(0.078) 92.5% 0.285(0.058) 0.007(0.027) 98.9% effect=0.24 n=200 ‘Oracle Model’ 0.279(0.047) 0.285(0.039) medLRM 0.071(0.086) 0.208(0.081) 9.7% 0.229(0.086) 0.056(0.072) 74.6% MMO 0.266(0.052) 0.012(0.021) 98.8% 0.282(0.040) 0.003(0.005) 99.29% effect=0.16 n=80 ‘Oracle Model’ 0.218(0.047) 0.193(0.049) medLRM 0.008(0.021) 0.210(0.047) 0% 0.034(0.057) 0.160(0.061) 14.7% MMO 0.192(0.073) 0.026(0.052) 93.4% 0.174(0.069) 0.019(0.043) 94.2% effect=0.16 n=200 ‘Oracle Model’ 0.195(0.035) 0.193(0.035) medLRM 0.018(0.033) 0.178(0.044) 0.9% 0.028(0.041) 0.165(0.049) 5.0% MMO 0.186(0.041) 0.010(0.017) 83.9% 0.182(0.039) 0.011(0.017) 93.8% Effect=-0.24 n=200 ‘Oracle Model’ -0.250(0.044) -0.249(0.047) medLRM -0.015(0.032) 0.235(0.050) 0.7% -0.094(0.094) 0.156(0.092) 29.5% MMO -0.233(0.062) 0.017(0.044) 96.7% -0.233(0.068) 0.016(0.053) 96.5% Sample Size = 200 Sample Size = 80 Mean Bias Coverage Prob Mean Bias Coverage Prob p, q = 300 effect=0.16 Cluster Size = 30× 30 ‘Oracle Model’ 0.191(0.037) 0.194(0.051) medLRM 0.010(0.016) 0.181(0.040) 0.7% 0.025(0.047) 0.170(0.056) 8.2% MMO 0.174(0.038) 0.017(0.033) 95.7% 0.179(0.056) 0.016(0.031) 96.0% Table 2.2: Simulation results for mediation effect estimation (step 2). We compare the estimated mediation effects by Low Rank Model (medLRM) and MMO, with reference to the estimated mediation effect based on the oracle model (with known mediating imaging factors). 35 2.5 Discussion This chapter attempts to solve the mediation problem with high-dimensional-mediators and high-dimensional outcomes. This dual high-dimensionality can lead to a massive number of mediation pathways, which, thus, gives rise to numerical challenges for mediation analysis. We proposed a mediation dense bi-cluster structure to identify systematic mediation patterns under the bipartite graph framework, which reflects that the exposure affects multiple outcomes via influencing a corresponding set of mediators. We developed cluster-wise inference for set-to-set mediation patterns and statistical methods for mediation estimation. This chapter makes several methodological contributions to neuroimaging statistics. First, our method extracts multivariate-mediator and multivariate-outcome mediation patterns using a graph-based bi-cluster model. The model effectively reveals the underlying systematic mediation pathways, which can be recovered with the guaranteed optimality of pattern extraction. Second, our method provides an approach for systematic mediation effect estimation. Within a dense bi-cluster structure, we factorize the mediators and take the product of partial correlations as the standardized mediation effects. Our model can also handle dense bi-clusters sharing the identical outcome set by concatenating mediator sets to estimate mediating factors. Finally, we provide a computationally efficient algorithm for an integrated multimodal high-dimensional data analysis and an open-source toolkit for the mediation analysis. Our method has limitations. MMO cannot identify the sequential mediation model if the previous mediators (or outcomes) can influence later ones in the proper order [89]. In addition, estimating mediation effects on orthogonalized factors can lead to an unstraightforward inter- pretation. Last, we could extend the existing model to handle longitudinal multimodal imaging 36 data. In our analysis of application data, we first identify the systematic mediation pattern SBP→ CBF → ReHo. The CBF regions and ReHo regions can better reveal how cardiovascular risk factors affect brain-related functions through neural pathways related to CBF around vast areas of ReHo. The joint analysis of multimodal imaging data via a mediation model provides neu- rological insights into the influence of SBP on ReHo via CBF. These findings may lead to more effective treatments for brain diseases related to blood oxygen levels. In contrast, the traditional methods, including BH-FDR, may miss the mediation pattern. Therefore, our method may be an alternative tool for analyzing integrated multimodal imaging data. In summary, we developed MMO, a method for the joint analysis of multimodal neu- roimaging data in a mediation framework. The multivariate-mediator and multivariate-outcome mediation tool can also be applied to other multimodal high-dimensional data, for example, omics-imaging data and omics-omics data [91]. Our algorithm is scalable since the computa- tional cost for the above application is moderate. The software package for the proposed method is available at https://github.com/zhuivv/MMO.git. 37 https://github.com/zhuivv/MMO.git Chapter 3: Marginal Structure Ensemble Learning Model (MASE) 3.1 Introduction Recent large-scale longitudinal observational studies have collected extensive data, includ- ing hundreds of repeated measurements of exposures, covariates, and outcomes. For example, the Adolescent Brain Cognitive Development (ABCD) study, which investigates youth cognitive development, provides numerous variables encompassing cultural environment, physical, mental, neurocognitive, and genetic factors. To understand the influence of these exposures, it is imper- ative to uncover the underlying causal effects. The current literature offers methods to address limited number of confounders with linear confounding effects, including inverse probability weighting [92, 93], conditional mean imputation [32], and matching [94]. When confronted with high-dimensional confounders, machine learning methods have proven effective in handling the massive covariates with intricate and non-linear effects in estimating causal effects [95, 96, 97]. However, these approaches have typically focused on studies with a single visit and are inade- quate for longitudinal data. The implementation of machine learning methods in studies with time-varying exposure and confounders requires more evaluations and non-trivial efforts due to the complex treatment-confounder feedback, where time-dependent confounders might be influ- enced by past exposure. To estimate the effects of time-varying exposures in the presence of time-varying con- 38 founders, the Marginal Structural Model (MSM) has become a popular tool in longitudinal causal inference. This method constructs a pseudo-population and incorporates both confounders and time-varying exposures into weights. However, correctly modeling propensity scores is crucial and can be challenging when dealing with high-dimensional confounders or non-linear effects. To address these challenges, Doubly Robust (DR) methods have been introduced [34, 98, 99]. DR estimators are derived by subtracting the projection onto Hilbert space and generally involve two nuisance parameters: treatment estimator (e.g., propensity scores) and outcome estimator (e.g., iterative conditional mean) [100]. They are ‘doubly robust’ in the sense that if either the treatment model or the outcome model is correct, the final estimate will be consistent. More importantly, a doubly robust estimator will result in a faster convergence rate when nuisance parameters are all correctly estimated. This property enables the use of machine learning in estimating nuisance parameters, as the estimation uncertainty from machine learning becomes negligible under mild conditions [95, 101]. As a result, the combination of machine learning methods with doubly robust estimators offers a promising solution to tackle this challenge [102, 103]. Particularly, ensemble learning has gained increased popularity due to its robust and superior performance, achieved by integrating multiple machine learning models [97, 104]. However, there remains a gap in the application of causal machine and ensemble learning to longitudinal data, particularly for machine learning-assisted doubly robust MSM. Motivating Study: Our current research is inspired by an investigation into the influence of insufficient sleep on adolescent cognitive development, utilizing data from the ABCD study [105]. The ABCD study collects data from 11, 868 adolescent participants at ages 10, 12, and 14. During each visit, various measures are recorded, including sleep patterns, cognitive per- formance, and a range of psychographic variables. Insufficient sleep is a modifiable and health- 39 related factor that can adversely affect the cognitive development of young individuals. However, sleep insufficiency can be time-varying due to the increasing use of electronic devices along ag- ing. In addition, hundreds of time-varying covariates (e.g., environmental factors, nutrition, and socioeconomic status) can act as confounders. Therefore, it is challenging to achieve a consistent estimate of the effect of sleep insufficiency in a large-scale longitudinal study with numerous repeatedly measured confounders and potential non-linear confounding effects. To bridge the gap, we propose a Marginal Structure Ensemble Learning Model (MASE). MASE is developed based on the MSM solved by an efficient score, with the estimation of propensity scores and iterative conditional means as nuisance parameters. We develop a compu- tationally efficient ensemble model to integrate multiple machine learning algorithms to model these nuisance parameters. We also study the theory and provide the analytical solution for the inference of causal parameter estimations. In numerical studies, we observe that our method effectively handles high-dimensional confounders, appropriately accounts for the longitudinal data structure, and outperforms existing methods. More importantly, the application of MASE contributes significantly to bridging gaps in our understanding of the long-term relationship be- tween insufficient sleep and cognitive outcomes using ABCD data. Our method aligns with the asymptotic theory of doubly robust machine learning estimation in longitudinal setting developed by [106]. The main difference is that our method combined the ensemble learning method and considered the non-compatible problem for outcome regression [34]. The rest of the chapter is organized as follows. Section 3.2 describes the proposed model, including the estimation process and inference method. Section 3.3 showcases the application of our method to the ABCD study, outlining the effect estimation and presenting the inference results. Section 3.4 implements simulations to demonstrate the empirical performance of our 40 method with finite sample sizes. We conclude the chapter with a discussion in Section 3.5. Technical details and extra application results can be found in the Appendix. 3.2 Methods 3.2.1 Marginal Structural Model with Time-varying Exposure We consider a longitudinal study featuring a single binary exposure, multiple continuous outcomes, and high-dimensional confounders encompassing various data types. For i = 1, . . . , n, let Di = (Ait,Zit,Yit) represent the independent identically distributed data for subject i across visit t = 1, . . . , T . Specifically, Ait = ait ∈ {0, 1} denotes a univariate binary exposure vari- able, Zit = (Z1 it, . . . , Z pt it ) T is a pt-dimensional vector of observed confounders, with the number of covariates not necessarily being uniform at each visit. Further, Yit = (Y 1 it , . . . , Y q it) T repre- sents q outcomes, which can be either continuous or categorical. For any variable {vt}1≤t≤T , we denote v̄t = (v1, . . . , vt), t = 1, . . . , T as the historical measurements of variable v up to the t-th visit. We define the counterfactual outcome [107] for the j-th outcome at time T as Y j i (ĀiT = {ai1, . . . , aiT}) for subject i. The population mean of the counterfactual outcome is given by E(Yj(ĀT = {a1, . . . , aT})|Z0), where Z0 ∈ Rn×p0 is the baseline covariates matrix, and At ∈ Rn is the exposure vector. Figure 3.1 illustrates the directed acyclic graph (DAG) of our model, with all variables re- peatedly measured at two time points (using a single outcome for demonstration purposes). Be- cause both high-dimensional confounders and exposures are longitudinally measured and time- varying, estimating the aggregated causal effect is challenging. Classic methods, such as MSM, estimate the causal effect by leveraging propensity score weighting. Inspired by our data analysis, 41 which suggests that previous cognitive outcomes can affect subsequent results (e.g., the crystal- lized composite score), we propose a novel propensity score model that incorporates the effects of all time-varying confounders along with past outcomes: πi1(ai1, zi1) = P (Ai1 = ai1|Zi1 = zi1) πit(āit, z̄it, ȳi(t−1)) = P (Ait = ait|Āi(t−1) = āi(t−1), Z̄it = z̄it, Ȳi(t−1) = ȳi(t−1)), 2 ≤ t ≤ T. (3.1) To ease the notation, we redefine Z̄it = (Z̄it, Ȳi(t−1), t = 2, . . . , T . Then, the second PS function in (3.1) can be simplified as πit(āit, z̄it). Figure 3.1: ABCD tabulate data structure. The solid lines represent the impact of variables to the others, the dashed line represents possible impact of previous measurements on the later ones. Notations A,Z, and Y are defined in section 3.2.1. To ensure the identifiability of average causal effect τ(āt), we adopt the commonly used assumptions consistent with MSM. Specifically, in addition to the Stable Unit Treatment Value 42 Assumption (SUTVA, 108), we assume that, for each individual i and each outcome j, the fol- lowing hold: (1) consistency: Y j i (āiT ) = Y j iT if ĀiT = āiT ; (2) no unmeasured confounding (NUC): for all āiT and t, Y j i (āiT ) |= Ait|Zit, Āi(t−1) = āi(t−1); and (3) positivity: for all t and āit, πit(āit, z̄it) > 0. Throughout the chapter, we refer to these assumptions as the Multivariate Longitudinal Sequential (MLS) assumptions. We then provide the MSM for the j-th outcome as: E(Yj(āT )|Z0) = τ(āT ,Z0; θ j),∀j, (3.2) where τ(·, ·; ·) is a known function that models the j-th outcome based on exposures and co- variates at baseline, and θj is the vector of estimands. For example, we have τ(ā2,Z0; θ j) = βj 0 + βj 1a1 + βj 2a2 as a linear function with θj = (βj 0, β j 1, β j 2) ′, where a1, a2 correspond to the ado- lescent sleep time measured at visits one and two. Or moreover, with effect modification, we can have τ(ā2,Z0; θ j) = βj 0+βj z0z0+βj 1a1+βj 2a2+βj z1z0∗a1+βj z2z0∗a2. We will focus on the previous model with a simple linear form, and we are interested in estimating the average treatment effect (ATE) for population getting sufficient sleep all the time (treatment regime (0, 0)) versus popula- tion not having sufficient sleep at all (treatment regime (1, 1)) as: ATE := θjATE = βj 1 + βj 2,∀j. 3.2.2 Doubly Robust Estimator Propensity score (PS) weighting is a common approach for estimating the MSM equa- tion (3.2). The weights can be calculated as wit = 1/ΠT t=1πit(āit, z̄it). However, when massive time-varying confounders are present, the model specified for propensity score can become in- correct due to numerical challenges. These incorrect modeling of propensity scores can lead to inconsistent estimates of causal effects within the MSM framework. Therefore, we are moti- 43 vated from the doubly robust estimator [34] and aim to develop a causal machine and ensemble learning-based estimator mitigating the issue of model mis-specification. The doubly robust estimator combines the estimates of conditional means and propensity score weights, ensuring consistent estimates of causal effects within the MSM framework when either the estimate of conditional means or propensity scores is corretly specified. We first intro- duce an Iterative Conditional Mean Estimation (ICE) procedure: ηjT (āT , z̄T ) = E(Yj T |ĀT = āT , Z̄T = z̄T ) ηjt (āT , z̄t) = E(ηjt+1(āT , Z̄t+1)|Āt = āt, Z̄t = z̄t), t = T − 1, T − 2, . . . , 1 ηj0(āT , z0) = E(ηj1(āT ,Z1)|Z0 = z0), (3.3) for any j = 1, . . . , q. For simplicity, in the rest of this chapter, we will eliminate the superscript j for η’s. The doubly robust estimator for MSM is derived from the following estimating equation. Without loss of generality, we demonstrate the estimator with two timepoints. The estimating equation integrates the information from both conditional means and propensity scores [34]: E(S0 + S1 + S2) = 0, (3.4) 44 where: S2 = d(Ā2,Z0) π1(A1,Z1)π2(Ā2, Z̄2) {Y2 − η2(Ā2, Z̄2)} S1 = ∑ a2∈A2 d(A1, a2,Z0) π1(A1,Z1) {η2(A1, a2, Z̄2)− η1(A1, a2,Z1)} (3.5) S0 = ∑ a1∈A1 ∑ a2∈A2 d(ā2,Z0){η1(ā2,Z1)− τ(ā2,Z0; θ j)}. Here, τ(ā2,Z0; θ j) is given in (3.2), At, t = 1, 2 represents the exposure set, for example, A2 = {a1 = {0,1}, a2 = {0,1}}. Additionally, d(·) is any function with the same dimension as the parameter of interest θj . A simple example is to take d(·) = ∂τ(ā2,Z0; θ j)/∂θj . When τ(·, ·; θj) is a linear function, a closed-form solution for the estimating equation is available (see Appendix B.1) . For a non-linear τ(·, ·; θj), we can solve equation (3.4) using the Newton- Raph