ABSTRACT

Title of Dissertation: INVESTIGATING BIOMOLECULAR RARE EVENTS WITH ARTIFICIAL INTELLIGENCE AUGMENTED MOLECULAR DYNAMICS

Yihang Wang
Doctor of Philosophy, 2022

Dissertation Directed by: Professor Pratyush Tiwary
Department of Chemistry and Biochemistry
Institute for Physical Science and Technology

Deciphering biomolecular interaction mechanisms plays an important role in understanding the mystery of life. Understanding the mechanism of receptor-ligand interactions is of great importance both in the context of fundamental biology and in medical applications. Limited by their spatial or temporal resolution, wet-lab experiments may not capture enough detail to unravel complicated interaction mechanisms. Molecular dynamics (MD) simulation, a computational tool for studying many-body systems with atomic resolution, has emerged as a powerful tool to investigate the physical and biochemical properties of biomolecules. Though MD simulation has an advantage over experimental methods in its high spatial and temporal resolution, a big gap still remains between the time scales that MD simulations can reach and the time scales of the biological processes we want to study. In this thesis, I explore two frameworks that combine the power of statistical mechanics, molecular dynamics, and machine learning to bridge this gap, and thus enable simulation studies of ligand-receptor interactions without overwhelming demands on computational resources.

Firstly, I propose the reweighted autoencoded variational Bayes for enhanced sampling (RAVE) method, a new iterative scheme that uses the deep learning framework of the information bottleneck to enhance sampling in molecular simulations. RAVE iterates between molecular simulations and deep learning to produce an increasingly accurate probability distribution along a low-dimensional latent space that captures the key features of the molecular simulation trajectory. RAVE determines an optimal, yet physically interpretable, reaction coordinate and an optimal probability distribution. Both then directly serve as the biasing protocol for a new biased simulation, which is once again fed into the deep learning module with appropriate weights accounting for the bias; the procedure continues until estimates of the desired thermodynamic observables converge. The usefulness and reliability of RAVE are demonstrated by applying it to two test-pieces, studying processes slower than milliseconds and calculating free energies, kinetics, and critical mutations. I also systematically study the following questions: (a) the choice of a predictive time delay in RAVE, or how far into the future the machine learning model should try to predict the state of a given system output from MD, and (b) for short time delays, how much error is made in approximating the biased propagator for the dynamics as the unbiased propagator. Through a master equation framework, I demonstrate why the exact choice of time delay is irrelevant as long as a small non-zero value is adopted. I also derive a correction to the objective function by reweighting the biased propagator, which better approximates the unbiased objective function without incurring extra computational overhead. To advance our understanding of RNA-ligand interactions at the molecular level, I use RAVE and collaborate with experimentalists to study the interplay between two ligands and PreQ1, which is a widely studied model for RNA-small molecule recognition.
I show that site-specific flexibility profiles from our simulations are in excellent agreement with in vitro measurements of flexibility using Selective 2′-Hydroxyl Acylation analyzed by Primer Extension and Mutational Profiling (SHAPE-MaP). With the orders-of-magnitude simulation speedup attained by RAVE, I can directly observe ligand dissociation for cognate and synthetic ligands from a PreQ1 riboswitch system. The artificial intelligence-augmented simulations reproduce known binding affinity profiles for the cognate and synthetic ligands, and pinpoint how both ligands make use of different aspects of riboswitch flexibility. On the basis of the dissociation trajectories, I also make and validate predictions of pairs of mutations for both ligand systems that show differing binding affinities. These mutations are distal to the binding site and could not have been predicted solely on the basis of structure.

Secondly, I develop a framework based on statistical mechanics and generative artificial intelligence that uses simulations or experiments performed at some set of temperatures to learn about the physics or chemistry at some other arbitrary temperature. Specifically, I use denoising diffusion probabilistic models and show how these models, in combination with replica exchange molecular dynamics, achieve superior sampling of the biomolecular energy landscape at temperatures that were never even simulated, without assuming any particular slow degrees of freedom. The key idea is to treat the temperature as a fluctuating random variable and not as a control parameter, as is usually done. This allows us to directly sample from the joint probability distribution in configuration and temperature space. The results are demonstrated for a chirally symmetric peptide and a single-strand ribonucleic acid undergoing conformational transitions in all-atom water. I demonstrate how we can discover transition states and metastable states that were previously unseen at the temperature of interest, and even bypass the need to perform further simulations over a wide range of temperatures. At the same time, any unphysical states are easily identifiable through very low Boltzmann weights. The procedure, while shown here for a class of molecular simulations, should be more generally applicable to mixing information across simulations and experiments with varying control parameters.

INVESTIGATING BIOMOLECULAR RARE EVENTS WITH ARTIFICIAL INTELLIGENCE AUGMENTED MOLECULAR DYNAMICS

by Yihang Wang

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2022

Advisory Committee:
Professor Pratyush Tiwary, Chair/Advisor
Professor John D. Weeks
Professor Silvina R. Matysiak
Professor Garegin Papoian
Professor Jeffery B. Klauda

Dedication

Dedicated to my dearest mom and dad, who have always been standing behind me.

Acknowledgments

I had never imagined that I could have such fortune in being loved and accompanied by so many great folks at this stage of my life. It is my honor to express my gratitude to them in this thesis.

First, I'd like to thank Prof. Pratyush Tiwary, my dear advisor. Before joining our lab, I had no idea what an MD simulation was. The past few years' experiences with Prof. Tiwary gave me the confidence and knowledge to decide what type of career to pursue. What I learned from him also goes way beyond just how to do science.
It has been a privilege to witness and learn how he set up the whole lab, facing and solving all the great challenges along the way. As a graduate student, I have been able to focus on and enjoy learning and doing research in an undisturbed environment, which is undoubtedly due to his tireless efforts. I know his office door is always open whenever I have any questions about science or life.

I want to thank Dr. Jay Schneekloth and the members of his lab. In my view, implementing experiments is one of the toughest and most impressive parts of the procedure of science. Being able to collaborate with such talented experimentalists and learn from them has been a unique experience for me.

One important component of my graduate studies is my gifted colleagues. Thanks for the intense discussions, neat ideas, funny jokes, proofreading my drafts, and making our office such a wonderful place to explore science. In particular, I would like to thank Dedi Wang for his brilliant ideas in improving RAVE to a new level. Also, I want to thank Sun-Ting Tsai, Zack Smith, Ziyue Zou, and Shams Mehdi, the people I bounce ideas off.

I would also like to express my gratitude for the help and support from Prof. John Weeks, Prof. Silvina Matysiak, Prof. Garegin Papoian, Prof. Jeffery Klauda, and Prof. Chris Jarzynski. Thanks for being great teachers and serving on my committees, as well as giving me suggestions. All of them have been and will always be the model scientists that I want to be and aspire to be.

I will never forget the fantastic time I spent with my friends at College Park. Special thanks to Zhiyu, Mingshu, Yixu, Yalun, and Xiangyu for being my closest friends. Without these lovely people, like mirrors reflecting each beautiful moment of life, I might have gotten lost in the anxiety and pressure of facing an unforeseen future. Also, I would like to thank Zhongyuan, Zhiyuan, Lingqi, Jiameng, Sangyu, and Jiahui for being willing to share their stories and listen to mine. Although we are far from each other in terms of physical distance, their optimism and enterprise keep pushing me forward.

There is a long list of teachers and mentors I would also like to thank, for educating me with the basic tools to understand this world and nourishing my early interest in knowledge. In particular, I want to thank Mr. Shaobo Zheng, Mr. Dong Xiang, and Mr. Wuhui Zhang for their influence on my personality. Thanks to Prof. Jiansheng Wu and Prof. Lang Chen, who gave me my first glimpse of the colorful physics world.

Finally, I owe my deepest thanks to my family. I want to express my gratitude to my parents and grandparents, who never hesitate to support my decisions. Their unconditional love is the main reason I can chase my dream with great courage. As I have always known, there is a place to shelter me from the storm outside. To my dear maternal grandfather, sorry for not being by your side in the last days of your life. I really hope to share my happiness on this fantastic journey with you!

Table of Contents

Dedication . . . ii
Acknowledgements . . . iii
Table of Contents . . . vi
List of Tables . . . ix
List of Figures . . . x
List of Abbreviations . . . xii

Chapter 1: Introduction . . . 1
1.1 Understanding ligand-receptor interaction mechanisms . . . 1
  1.1.1 Biological meaning of ligand-receptor interaction . . . 1
  1.1.2 A two-state model of ligand-receptor interaction . . . 3
1.2 Molecular dynamics simulation . . . 5
  1.2.1 Introduction to molecular dynamics simulation . . . 5
  1.2.2 Simulations of the thermodynamic and kinetic properties of biomolecules . . . 6
  1.2.3 Reaction coordinate . . . 8
  1.2.4 Challenges in MD simulations . . . 9
1.3 Solving rare event problem with machine learning . . . 9
  1.3.1 Rare event problems . . . 9
  1.3.2 Enhanced sampling methods . . . 10
  1.3.3 Machine learning approaches for enhancing MD simulations . . . 15
1.4 Overview of the thesis . . . 22

Chapter 2: Reweighted autoencoded variational Bayes for enhanced sampling (RAVE) . . . 25
2.1 Introduction . . . 25
2.2 Using latent variables of VAE as RCs . . . 27
2.3 Principle of Past-Future Information Bottleneck . . . 28
  2.3.1 Variational inference and neural network architecture . . . 30
  2.3.2 Variational inference on unbiased and biased trajectories . . . 34
  2.3.3 Patching it all . . . 35
2.4 Application of RAVE to two benchmark systems . . . 38
  2.4.1 Conformation transitions in a model peptide . . . 38
  2.4.2 Benzene dissociation from T4-L99A lysozyme . . . 40
  2.4.3 Predicting critical residues . . . 44
2.5 Discussion . . . 49

Chapter 3: Understanding the role of predictive time delay and biased propagator in RAVE . . . 51
3.1 Introduction . . . 51
3.2 Dependence of predictive information bottleneck on Δt . . . 52
3.3 Correcting the propagator and the information bottleneck when using biased trajectory . . . 56
3.4 Results . . . 59
  3.4.1 System set-up . . . 60
  3.4.2 RAVE set-up . . . 61
  3.4.3 Calculated RC under different time-delays and approximation schemes . . . 62
3.5 Conclusion . . . 64

Chapter 4: Interrogating RNA-small molecule interactions with RAVE . . . 66
4.1 Introduction . . . 66
4.2 Ligand binding induces flexibility change in riboswitch aptamer . . . 68
4.3 Ligand binding affinity with riboswitch aptamer . . . 70
4.4 Predicting critical nucleotides for mutagenesis experiments . . . 75
4.5 Validating predicted critical nucleotides . . . 78
4.6 Discussion . . . 83

Chapter 5: Mixing physics across temperatures with generative AI . . . 86
5.1 Introduction . . . 86
5.2 Temperature as a random variable . . . 88
5.3 Denoising diffusion probabilistic model . . . 90
5.4 Results . . . 93
5.5 Discussion . . . 104
Chapter 6: Summary and future directions . . . 107
6.1 AI-augmented MD simulations . . . 107
6.2 Study of ligand dissociation from RNA . . . 108
6.3 Future work . . . 109

Appendix A: . . . 112
A.1 MD setups . . . 112
  A.1.1 Alanine dipeptide . . . 112
  A.1.2 L99A T4L-benzene . . . 113
  A.1.3 AIB9 . . . 113
  A.1.4 GACC . . . 114
  A.1.5 Architecture of RAVE . . . 115
A.2 Definitions of quantities in information theory and variational lower bound . . . 116
  A.2.1 Mutual information . . . 117
  A.2.2 Shannon entropy . . . 117
  A.2.3 Cross entropy . . . 117
  A.2.4 Kullback-Leibler Divergence . . . 117
  A.2.5 Exact decoder definition . . . 118
  A.2.6 Gibbs's inequality and variational lower bound . . . 118
A.3 PIB objective L for unbiased trajectory . . . 119
A.4 PIB objective L for biased trajectory . . . 120

List of Publications . . . 121
Bibliography . . . 122

List of Tables

2.1 Order parameters for T4-L99A lysozyme . . . 42
2.2 List of order parameters used to construct RC and their weights in χ1 and χ2 . . . 48
5.1 Definition of dihedral angles . . . 101

List of Figures

1.1 An illustration of a free energy profile . . . 7
1.2 Workflow of methods that use machine learning to analyze and enhance MD simulation . . . 16
2.1 Network architecture used for learning predictive information bottleneck . . . 31
2.2 RAVE on alanine dipeptide . . . 39
2.3 Definition of order parameters for benzene-lysozyme dissociation . . . 41
2.4 Order parameter weights as functions of time delay for benzene-lysozyme dissociation . . . 43
2.5 The 2-component Predictive Information Bottleneck for benzene-lysozyme dissociation . . . 44
2.6 Estimating benzene-lysozyme dissociation constant by Poisson fit . . . 45
2.7 Free energy surfaces for benzene-lysozyme dissociation . . . 45
2.8 Critical residue analysis for benzene-lysozyme complex . . . 47
2.9 Robustness test for RAVE on studying benzene-lysozyme dissociation . . . 48
3.1 Free energy profile along each degree of freedom of the model system . . . 61
3.2 Reaction coordinates learned by RAVE as functions of predictive time delay . . . 63
4.1 Ligand-PreQ1 riboswitch interaction . . . 68
4.2 SHAPE reactivities for the Tt PreQ1 riboswitch aptamer . . . 71
4.3 Weights for different order parameters for PreQ1 riboswitch . . . 72
4.4 Free energy profiles of riboswitch with different ligands . . . 73
4.5 Dissociation pathways of cognate and synthetic ligand of PreQ1 riboswitch . . . 75
4.6 Critical nucleotides prediction . . . 76
4.7 The change of RMSD of each nucleotide during the dissociation process . . . 79
4.8 Affinity measurement of PreQ1 cognate and synthetic ligands . . . 80
4.9 Free energy profile from 2 μs unbiased simulation for wild type and mutated PreQ1 riboswitch . . . 82
4.10 Hydrogen bond occupancy for wild type and mutated systems . . . 83
5.1 DDPM network architecture . . . 91
5.2 AIB9 structure and free energy surface . . . 92
5.3 Samples of AIB9 from REMD and DDPM . . . 95
5.4 ΔG between ground state and excited state in AIB9 . . . 97
5.5 ΔG between ground state and transition state in AIB9 . . . 97
5.6 Interpolation or extrapolation in DDPM . . . 98
5.7 Projection of samples and free energy profile on two dihedral angles of A2 in GACC at 325 K . . . 99
5.8 DDPM outperforms direct reweighting . . . 100
5.9 REMD samples and free energy profile of the C2 in GACC . . . 102
5.10 REMD samples and free energy profile of the C3 in GACC . . . 103
5.11 Free energy profiles of all dihedral angles in GACC . . . 104

List of Abbreviations

AI Artificial intelligence
ANN Artificial neural networks
DDPM Denoising diffusion probabilistic model
FRET Fluorescence resonance energy transfer
FIA Fluorescence intensity assay
MC Monte Carlo
MD Molecular dynamics
MESA Molecular enhanced sampling with autoencoders
MST Microscale thermophoresis
PIB Predictive information bottleneck
RAVE Reweighted autoencoded variational Bayes for enhanced sampling
REAP Reinforcement learning based adaptive sampling
RC Reaction coordinate
SHAPE-MaP Selective 2′-Hydroxyl Acylation analyzed by Primer Extension and Mutational Profiling
SRV State-free reversible VAMPnets
TAE Time-lagged autoencoders
VAC Variational approach to conformation dynamics
VAMP Variational approach for Markov processes
VDE Variational dynamics encoder
VAE Variational autoencoder
WT Wild type

Chapter 1: Introduction

1.1 Understanding ligand-receptor interaction mechanisms

1.1.1 Biological meaning of ligand-receptor interaction

One of the defining characteristics of living systems is their ability to respond to and interact with the environment. For example, cells can sense changes in ion concentration and open a channel through the membrane that allows specific ions to pass through. The molecular basis of this ability is often the interaction between biomolecular receptors and their cognate ligands[1, 2]. Usually, the binding of a ligand changes the biochemical properties of the receptor, such as stabilizing or destabilizing its structure. These changes can either promote or inhibit downstream processes and thus regulate the whole biological system. Similarly for nucleic acids, recent studies have shown that structured RNA can react to the binding of small molecules to regulate downstream gene expression processes[3, 4]. Different small molecules can trigger different cell responses by activating different signaling pathways and thus ensure that the whole living system can function properly.
Moreover, different ligands can bind to the same receptor but trigger different pathways. Such a mechanism, usually referred to as functional selectivity, is both common and important for cell signaling. For example, receptor tyrosine kinases (RTKs), a family of cross-membrane proteins, can trigger different responses, such as phosphorylation of themselves or of other proteins, when binding to different ligands. Such different reactions further initiate a chain of downstream biomolecular reactions and finally pass the information to gene-regulating biomolecules such as transcription factors, thus influencing cell fates such as differentiation and proliferation[5]. However, we still lack a detailed understanding of the underlying mechanism. In many cases, it remains elusive why the receptor can recognize its cognate ligand and how the binding of a specific ligand makes the receptor tend to adopt one structure over others. Another interesting phenomenon related to RTKs is the allosteric effect, in which the binding of ligands happens at the receptor domain that lies outside the membrane, while the conformational changes occur at the functional domain that lies in the intracellular part of the membrane. Such a property of transferring the effect of ligand binding from one end of a molecule to the other, though very common in ligand-receptor interactions, is not well understood and is often referred to as the second secret of life[6, 7].

Besides its importance for answering fundamental biological questions, a better understanding of ligand-receptor mechanisms is also crucial for practical purposes. Dysfunction of the signaling process, which is usually associated with gain- or loss-of-function of mutated or misfolded biomolecules, can lead to various diseases, including cancers[8, 9]. Therefore, small molecules that can selectively bind to their target biomolecules and modulate their functions have the potential to be used as drugs to cure diseases. However, the lack of understanding of the ligand-receptor interaction mechanism restricts us to brute-force screening methods, leading to substantial financial costs in the search for small-molecule drugs[10]. With the emergence of experimental, computational, and theoretical tools, much progress has been made toward understanding the mystery of ligand-receptor interactions. At the same time, we have to admit that challenges still remain. In the following sections, I will first introduce basic concepts and methods for studying ligand-receptor interactions. In particular, I will mainly discuss computational methods and related unresolved problems.

1.1.2 A two-state model of ligand-receptor interaction

Understanding the behavior of systems consisting of thousands or millions of atoms jiggling around is very challenging. One way to simplify this problem is to coarse-grain the system into two states: the bound state, where the ligand and receptor form a complex, and the unbound state, where there is little or no interaction between ligand and receptor. Though it is just a simplified picture, it still permits a very profound understanding and has great explanatory power for what we observe in reality. One example is the binding affinity, which is defined to characterize how likely a ligand is to bind to the receptor. Macroscopically, we can describe the ligand binding and unbinding process by the reaction equation:

$R + L \underset{k_{off}}{\overset{k_{on}}{\rightleftharpoons}} RL$    (1.1)
where R, L, and RL represent the receptor, the ligand, and the ligand-receptor complex, respectively, and $k_{on}$ and $k_{off}$ are the binding and dissociation rates. The dissociation constant $K_D = k_{off}/k_{on}$ also determines the ratio of the concentrations of the species through $K_D = [R][L]/[RL]$, where concentrations are denoted by square brackets. Also, according to the law of mass action, we can write the dissociation constant as $K_D = e^{\Delta G / k_B T} = P_{unbind}/P_{bind}$, where $P_{bind}$ and $P_{unbind}$ are the equilibrium probabilities of the bound and unbound states and $\Delta G$ is the free energy difference between the bound and unbound states.

These quantities play an important role in deciphering ligand-receptor interaction mechanisms and in drug development. For example, a low $K_D$ value indicates that the ligand can bind to the receptor at relatively low concentration and is thus associated with high binding affinity. An ideal drug should have a low $K_D$ with its target receptor but a high $K_D$ with other biomolecules, so that it selectively binds to its target. Moreover, a previous study of drug efficacy revealed that the residence time of a drug, which is the inverse of $k_{off}$, is more relevant to percent survival than its binding affinity[11].

All these quantities can be measured experimentally. For example, isothermal titration calorimetry (ITC)[12] measures the heat absorbed or released during titration of the ligand to estimate the dissociation constant $K_D$. Surface plasmon resonance (SPR)[13] monitors the refractive-index change on a surface caused by binding events to measure binding and unbinding rates. Fluorescence polarization (FP) labels the ligand with a fluorophore and measures the polarization of the emitted light[14]; a higher degree of polarization indicates that the ligand is bound to the receptor, as the complex rotates much more slowly than the free ligand. However, limited by their spatial or temporal resolution, wet-lab experiments may not capture enough detail to unravel complicated interaction mechanisms[15]. Sometimes we might want to go beyond these quantities and look at the system in more detail. For example, we might want to ask the following questions: 1. Which types of interactions make a certain ligand bind tightly to a receptor? 2. How do the interactions induce structural and functional changes in the receptor? To answer these questions, we need methodologies that monitor ligand-receptor systems at very high spatial and temporal resolution, which unfortunately might not be achievable with current experimental methods. But with the rapid development of computational power over the decades, molecular dynamics (MD) simulations provide an opportunity to overcome such limitations[16]. In many studies, MD simulations not only directly measure important quantities such as binding affinities and dissociation rates but also provide an atomic-level description of the binding/unbinding processes[17, 18, 19, 20]. In the next section, I will briefly introduce how we can use simulation methods to study biomolecular systems and discuss challenges in this field.
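To make the relations above concrete, the short sketch below converts a dissociation constant into a binding free energy and a dissociation rate into a residence time. This is a minimal illustration of the textbook formulas only, not code from this thesis, and the numerical values are hypothetical:

```python
import numpy as np

k_B = 1.380649e-23   # Boltzmann constant, J/K
N_A = 6.02214076e23  # Avogadro's number, 1/mol

def binding_free_energy(K_D_molar, T=300.0, c0=1.0):
    """Standard binding free energy in kJ/mol from K_D in mol/L,
    using Delta G = k_B T ln(K_D / c0) with a 1 M reference state c0."""
    RT = k_B * N_A * T / 1000.0  # kJ/mol
    return RT * np.log(K_D_molar / c0)

def residence_time(k_off):
    """Mean residence time (s) of the bound complex, 1/k_off."""
    return 1.0 / k_off

# Example: a hypothetical nanomolar binder with k_off = 0.01 /s
print(binding_free_energy(1e-9))   # ~ -51.7 kJ/mol at 300 K
print(residence_time(1e-2))        # 100 s residence time
```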
1.2 Molecular dynamics simulation

1.2.1 Introduction to molecular dynamics simulation

Molecular dynamics (MD) simulation is a computational method for studying the time evolution of molecular systems consisting of many atoms. In MD simulations, we consider a generic atomic system comprising N atoms, where the 3N position coordinates are denoted by X and the interactions between atoms are described by the potential U(X). According to Newton's laws of motion, the time evolution of the atomic coordinates follows the differential equation:

$\frac{d^2 X}{dt^2} = -\frac{1}{m} \nabla U(X)$    (1.2)

However, a system that follows Eq. 1.2 has conserved total energy and is therefore not a good model of what happens in reality. In reality, a molecular system is usually not isolated from its environment: it can interact with the environment by transferring heat, doing work, or even exchanging particles. To mimic such effects, algorithms called barostats and thermostats are introduced, which control the pressure and temperature in an MD simulation, respectively[21, 22, 23, 24]. Though they differ in implementation details, the various barostatting and thermostatting algorithms all introduce stochasticity into the simulation and make it possible to simulate different ensembles, such as the canonical (NVT) and isothermal-isobaric (NPT) ensembles, which are widely used in biomolecular simulations.

1.2.2 Simulations of the thermodynamic and kinetic properties of biomolecules

When studying biological systems with MD simulations, we are typically interested in cases relevant to biophysical systems characterized by the existence of metastable states. For example, as shown in Fig. 1.1, if we project the system along a collective variable (CV), we can see low free energy basins, which are usually referred to as metastable states. During a simulation, a system usually spends extended amounts of time in each metastable state and only infrequently moves among those states. A typical objective in performing MD is to evaluate the following two broad types of quantities for some generic low-dimensional CVs s(X): 1. equilibrium properties, such as the free energy F(s); 2. dynamic properties, such as the mean first passage time for escaping any metastable state.

Figure 1.1: An illustration of a biomolecular system's free energy profile after projecting along one collective variable (CV).

The free energy is related to the probability of the CV by $P(s) \propto e^{-\beta F(s)} = \int dX\, \delta[s - s(X)]\, P_0(X)$, where $P_0(X) \propto e^{-\beta U(X)}$ is the equilibrium probability of a microstate X and $\beta = 1/k_B T$ is the inverse temperature. From a set of N samples $\{x_0, x_1, x_2, \cdots, x_N\}$ from an MD simulation, the probability of the CV can be estimated by:

$P(s) = \langle \delta[s - s(X)] \rangle \approx \frac{1}{N} \sum_i \delta(s - s(x_i))$    (1.3)

In the case of ligand-receptor interactions, if we use A and B to denote the bound and unbound states, which are two subsets of the state space, we can calculate the dissociation constant discussed in Sec. 1.1.2 by $K_D = \exp(-\beta(F_B - F_A)) = \frac{\int_{s \in \{s(x) | x \in B\}} P(s)\, ds}{\int_{s \in \{s(x) | x \in A\}} P(s)\, ds}$.

To estimate the transition rate between metastable states, we can first define $\tau_{A \to B}$ to be the time it takes for a trajectory started in A to first reach B:

$\tau_{A \to B} = \min \{ t \geq 0 \mid x(0) \in A,\; x(t) \in B \}$    (1.4)

The first passage time from A to B is the expectation value of $\tau_{A \to B}$ over trajectories. Again, if A is the bound state and B is the unbound state, $k_{off} = 1/\langle \tau_{A \to B} \rangle$ and $k_{on} = 1/\langle \tau_{B \to A} \rangle$.
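In practice, Eq. 1.3 becomes a histogram, and the $K_D$ expression above becomes a ratio of summed bin populations. The sketch below is a minimal illustration under my own assumptions: synthetic samples stand in for a real CV time series, and the bound/unbound cutoff is arbitrary:

```python
import numpy as np

kT = 2.494  # k_B T in kJ/mol at 300 K

def profile_and_KD(s, s_cut, bins=60):
    """Histogram estimate of P(s) (Eq. 1.3), the free energy
    F(s) = -kT ln P(s), and K_D as the population ratio of the
    unbound (s > s_cut) to bound (s < s_cut) regions."""
    P, edges = np.histogram(s, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    F = np.full_like(P, np.inf)
    F[P > 0] = -kT * np.log(P[P > 0])   # avoid log(0) in empty bins
    F -= F[P > 0].min()                 # shift so the global minimum is 0
    KD = np.mean(s > s_cut) / np.mean(s < s_cut)
    return centers, F, KD

# Synthetic CV samples: a deep bound basin plus a shallow unbound one
rng = np.random.default_rng(0)
s = np.concatenate([rng.normal(0.3, 0.1, 95000), rng.normal(1.2, 0.2, 5000)])
centers, F, KD = profile_and_KD(s, s_cut=0.8)
print(KD)  # ~0.05 for this synthetic mixture
```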
1.2.3 Reaction coordinate

One important ingredient of the calculations mentioned above is the CV. In principle, we can arbitrarily project the system onto some variables and consider them as CVs. However, most such projections are not very useful. They cannot serve as good variables for describing the relevant reaction processes and thus cannot be used to define the metastable states or to calculate the thermodynamic and kinetic quantities. Good CVs representing the reaction process are usually called reaction coordinates (RCs). Many studies have focused on defining what a good RC for molecular systems is[25]. But even though a definition exists, calculating RCs can still be challenging. For example, we can define the isocommittor[26] $q_{A \to B}(x) = P(\tau_B > \tau_A \mid x_0 = x)$, which is the probability that a trajectory initiated at x will first commit to metastable state A instead of B. $q_{A \to B}(x)$ can be calculated by initiating simulations at each point x in state space and collecting statistics of which state they first commit to. This calculation is usually too expensive to be applied to molecular systems.
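To see why the brute-force committor estimate is expensive, consider the toy sketch below. It is entirely my own illustration, with a hypothetical `propagate` function (overdamped Langevin dynamics on a 1D double well) standing in for a real MD engine; even this trivial system needs thousands of trajectories for a handful of points:

```python
import numpy as np

rng = np.random.default_rng(1)

def propagate(x, n_steps=5000, dt=1e-3, kT=1.0):
    """Hypothetical stand-in for an MD engine: overdamped Langevin
    dynamics on the double-well potential U(x) = (x^2 - 1)^2."""
    for _ in range(n_steps):
        force = -4.0 * x * (x**2 - 1.0)
        x += force * dt + np.sqrt(2.0 * kT * dt) * rng.normal()
        if x < -1.0:   # committed to basin A
            return "A"
        if x > 1.0:    # committed to basin B
            return "B"
    return None        # did not commit within the time budget

def committor(x0, n_shots=200):
    """Fraction of committed trajectories from x0 that reach B first."""
    hits = [o for o in (propagate(x0) for _ in range(n_shots)) if o]
    return sum(o == "B" for o in hits) / max(len(hits), 1)

# Even three points need hundreds of trajectories; doing this on a grid
# in the 3N-dimensional space of a biomolecule is hopeless.
print([committor(x) for x in (-0.5, 0.0, 0.5)])
```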
1.2.4 Challenges in MD simulations

Besides the challenge of finding good RCs, to accurately simulate a biomolecular system and use the computational results to reveal what happens in reality, we need to reduce both statistical and systematic errors[27]. The systematic error is caused by the deviation of classical molecular models from the real quantum world. How to adjust the force field to better approximate reality will not be the focus of this thesis. Instead, I will focus on how to reduce the statistical error in MD simulations, which is caused by insufficient sampling of the processes of interest. In Sec. 1.3.1, I will discuss the sampling problem in detail and show how it connects to the problem of finding good RCs.

1.3 Solving rare event problem with machine learning

1.3.1 Rare event problems

In applying MD simulations to study ligand-receptor interactions, one of the main challenges is to overcome the gap between the time scale that can be reached by MD simulations and the time scale of the events we want to study. More specifically, the dissociation of a ligand from its receptor usually happens on a time scale of microseconds to seconds[28]. In contrast, MD simulation, which has to numerically solve Newton's laws of motion with femtosecond ($10^{-15}$ s) time steps, might need years, even with very powerful computational resources, to simulate a single unbinding event. Compared with less than two decades ago, when the longest MD simulation performed was only 2 μs[29, 30], MD simulation has become more efficient with the development of special-purpose supercomputers like ANTON2 (100 μs per day for systems with a million atoms)[31] and better combinations of MD software and hardware such as graphics processing units (GPUs)[32]. Nevertheless, it is still associated with high energy and time costs and cannot support high-throughput studies of large numbers of systems. Moreover, if we want to simulate the system more accurately by considering some electronic degrees of freedom[33, 34, 35, 36] or even fully accounting for quantum effects[37], the gap in time scales becomes much bigger. To overcome these limitations and make MD applicable to generic systems of arbitrary complexity and timescales, algorithms need to be developed to enhance sampling under limited computational resources. In the following section, I will introduce some widely used enhanced sampling methods and discuss their limitations.

1.3.2 Enhanced sampling methods

Different enhanced sampling methods have been proposed to solve the time scale problem. We can group these methods into three broad categories:

1. Raising the temperature of the system: By running the system at higher temperatures, we can enhance its fluctuations and thus accelerate the sampling of states. As an example, replica-exchange molecular dynamics (REMD)[38] runs many replicas of the system at different temperatures. Every given number of simulation steps, configurations in different replicas are exchanged according to probabilities that satisfy detailed balance (a sketch of this exchange criterion in code follows this list):

$p(\text{accept}) = \min\left(1,\; e^{(E_i - E_j)\left(\frac{1}{k T_i} - \frac{1}{k T_j}\right)}\right)$    (1.5)

Here E is the total energy of the system and T is the temperature. Systems in the high-temperature replicas are more likely to explore high free energy regions or cross energy barriers, while the replica at the lowest temperature still gives the right ensemble of states. One significant drawback of REMD is its poor scaling with system size. According to Eq. 1.5, the acceptance probability becomes very small if the energy difference is much bigger than the thermal fluctuation. As a consequence, for a system of N particles, the temperature difference between neighboring replicas should be small enough that the energy difference, which is proportional to N, remains comparable to the thermal fluctuation, which is proportional to $\sqrt{N}$. Therefore, it becomes impractical to apply REMD to large systems, as it might require a huge number of replicas. To overcome this problem, methods like replica exchange with solute tempering (REST)[39, 40] have been developed. In REST, instead of directly running many replicas at different temperatures, replicas are run with different Hamiltonians scaled by a control parameter $\lambda_m$:

$E^{REST}_m(X) = \frac{\lambda_m}{\lambda_0} E_{pp}(X) + \sqrt{\frac{\lambda_m}{\lambda_0}}\, E_{pw}(X) + E_{ww}(X)$    (1.6)

Here the subscripts pp, pw, and ww denote the protein intramolecular energy, the interaction energy between the protein and water, and the self-interaction energy between water molecules, respectively. Such a scaling operation can be thought of as heating only part of the system; it is almost equivalent to reducing N and thus improves the acceptance rate for exchanging configurations between replicas. Other attempts to improve REMD include introducing nonequilibrium switches between replicas to increase phase-space overlap[41]. Besides the need for many replicas, REMD has rarely been applied to unbinding problems, because raising the temperature might induce conformational changes in the system, strongly perturbing the binding or unbinding pathways.

2. Taking advantage of parallel computing: Instead of running one very long trajectory, methods based on parallel computing extract useful information from multiple short trajectories. To guarantee unbiased and efficient sampling of the ensemble, these methods need smart strategies for initiating, terminating, and analyzing the trajectories. As an example, weighted ensemble (WE) sampling[42, 43] uses the idea of resampling to calculate free energies and transition rates. In WE, many trajectories are run in parallel. Each trajectory can be split into multiple trajectories or terminated according to the number of trajectories in each pre-defined bin. The trajectories carry different weights to account for the effects of these splitting and termination operations. By controlling the number of trajectories in each bin, WE guarantees that low-probability regions can still be well sampled. Another example is milestoning[44], which is used to obtain transition rates between metastable states.
In the milestoning method, a sequence of hypersurfaces perpendicular to an RC is defined as milestones. Trajectories are initiated from each hypersurface, and the distributions of first passage times for arriving at the neighboring hypersurfaces are calculated. With these distributions, the system's long-time kinetic properties can be estimated. One disadvantage of methods of this type is that they are still relatively expensive in terms of computational resources. Also, the choice of RCs is critical for their success: in WE, bins are usually defined by a partition of a space spanned by a few RCs; in milestoning, the assumption that the dynamics along the RC is much slower than the dynamics along the perpendicular directions must be satisfied[44].

3. Modifying the Hamiltonian of the system: Transitions between metastable states become exponentially rare as the free energy of the transition states increases[45]. By adding external potentials that reduce the free energy differences between transition states and metastable states, these methods make it easier for the system to cross energy barriers during a simulation. Metadynamics[46, 47] is one example of this type: during the simulation, biasing potentials of Gaussian form are added to the system every given number of MD steps. As the biasing potential accumulates in the regions that have been frequently visited, it becomes easier for the system to escape from free energy basins and sample more metastable states. Based on this framework, several strategies have been proposed to improve the performance of metadynamics. One widely used variant is well-tempered metadynamics, where instead of adding Gaussian biasing potentials of constant height, each new Gaussian is scaled by a factor $\exp\left[-\frac{1}{\gamma - 1}\beta V_{n-1}(s_n)\right]$, where $V_{n-1}(s_n)$ is the external potential added in previous steps and $s_n$ denotes the center of the new Gaussian. Therefore, the biasing potential added to the system becomes smaller and smaller in places where a large external potential has already been deposited. This strategy gives better convergence properties and allows a simple way to estimate the unbiased average of observables:

$\langle O(R) \rangle = \left\langle O(R)\, e^{\beta [V(s(R),t) - c(t)]} \right\rangle_V, \qquad c(t) = \frac{1}{\beta} \log \frac{\int ds\, \exp\left[\frac{\gamma}{\gamma - 1} \beta V(s,t)\right]}{\int ds\, \exp\left[\frac{1}{\gamma - 1} \beta V(s,t)\right]}$    (1.7)

In addition, as shown in reference [48], if the biasing potential is added at a relatively low frequency, so that the bias deposited on the transition states is negligible, the transition time between metastable states can also be estimated by:

$\tau_{transition} = \sum_i^{N} dt\; e^{\beta V(s(t_i), t_i)}$    (1.8)

where N is the total number of MD steps. I would like to point out that for metadynamics, and for similar methods like umbrella sampling[49], the biasing potentials are defined as functions of RCs, which were discussed in Sec. 1.2.3. Only when the RCs capture the slow modes of the system will these methods be able to accelerate the sampling of rare events.
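As promised in category (1) above, here is a minimal sketch of the REMD exchange step of Eq. 1.5. It is my own illustration with hypothetical energies and temperatures; a real implementation would swap configurations inside an MD engine rather than bare numbers:

```python
import numpy as np

k_B = 0.0083145  # Boltzmann constant in kJ/(mol K)
rng = np.random.default_rng(2)

def exchange_accepted(E_i, E_j, T_i, T_j):
    """Metropolis criterion of Eq. 1.5 for swapping the configurations
    of replicas i and j running at temperatures T_i and T_j."""
    delta = (E_i - E_j) * (1.0 / (k_B * T_i) - 1.0 / (k_B * T_j))
    return rng.random() < min(1.0, np.exp(delta))

# Hypothetical total energies for two neighboring replicas
T = [300.0, 310.0]
E = [-5000.0, -4990.0]
if exchange_accepted(E[0], E[1], T[0], T[1]):
    E[0], E[1] = E[1], E[0]  # swap configurations (here, just energies)
```

Note how the acceptance hinges on the product of the energy gap and the inverse-temperature gap, which is exactly why closely spaced temperature ladders, and hence many replicas, are needed for large systems.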
As we can see, in most cases the success of methods in categories (2) and (3) relies on the choice of RCs. However, in many studies the RCs were selected in an intuitive, ad hoc way. This is mainly due to the challenge of learning the RCs of a system with many degrees of freedom, as discussed in Sec. 1.2.3. Many other methods have been developed to solve problems of this kind. Their central theme is finding a simpler description of the system to better understand its physical properties, such as free energy differences or the characteristic time scales of different modes. However, most of these methods still suffer from the drawback that they need well-sampled unbiased MD trajectories as input. In summary, to use MD simulations to study long-timescale processes, we first need to know RCs that capture the system's slow modes; yet the methods developed to learn the slow modes require trajectories from simulations that have already sampled the process well. The main question that I will address in this thesis is how to sample rare events in MD simulations when neither long trajectories nor an optimal RC is available.

1.3.3 Machine learning approaches for enhancing MD simulations

The rapid development of artificial intelligence (AI) and machine learning (ML) techniques has shed light on solving challenging problems both in daily life and in scientific research. Compared with traditional methods, ML has an unprecedented ability to learn and find relationships in complex datasets. This makes it an ideal tool for studying biomolecular systems. By ML, here I mean methods incorporating artificial neural networks (ANNs). A large category of ML methods developed for MD simulations follows some variant of the scheme shown in Fig. 1.2: ML is used to find a projection from the high-dimensional structure space to a low-dimensional feature space.

Figure 1.2: A schematic illustrating the typical workflow of some of the methods that use machine learning to analyze and enhance MD simulation. High-dimensional data describing the time evolution of the system in configuration space is used as the input of an ANN. The ANN is trained to project the input to a low-dimensional space. Depending on the structure of the network and the objective function, the low-dimensional representation captures different features that are considered important, such as slow modes. In some methods, the features learned are used to further enhance the sampling of the MD simulation, as shown by the arrow at the bottom.

In these methods, ML has been used in various forms to analyze long MD trajectories and learn the relevant slow modes. Ma and Dinner in 2005 used ANNs for constructing an RC[50], the idea being to sample configurations along transition pathways and tabulate an ensemble of committor values, with ANNs trained to fit CVs to these values. Recent work has revived and somewhat generalized this framework[51]. While the committor might be the most rigorous description of a slow mode, there are other reasonably accurate and computationally cheaper principles defining what constitutes slow modes. One such principle is based on the eigendecomposition of the time propagation of the system. The dynamics in the full high-dimensional X-space can safely be assumed to be Markovian, and thus the time propagation of the system in X-space can be described, for a given lag time $\tau$, using a transfer operator $\mathcal{K}$ with eigenvalues $1 = \lambda_0 > \lambda_1 \geq \ldots$ and corresponding eigenvectors $\psi_0(X) = 1, \psi_1(X), \psi_2(X), \ldots$[52]. The eigenvectors with indices $i \geq 1$ correspond to the slow modes of the system, with corresponding timescales $t_i = -\tau / \log \lambda_i$. For a molecular system with very high-dimensional dynamics, there is no simple way to solve for the eigenvalues and eigenvectors of the transfer operator.
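For intuition, the relation $t_i = -\tau / \log \lambda_i$ is easy to verify on a small discrete-state model, where the transfer operator reduces to a transition matrix. The sketch below is my own toy example with a made-up three-state matrix, not a result from this thesis:

```python
import numpy as np

tau = 10.0  # lag time, arbitrary units

# Hypothetical 3-state row-stochastic transition matrix at lag tau
K = np.array([[0.97, 0.02, 0.01],
              [0.02, 0.95, 0.03],
              [0.01, 0.03, 0.96]])

eigvals = np.sort(np.linalg.eigvals(K).real)[::-1]
assert np.isclose(eigvals[0], 1.0)  # stationary eigenvalue lambda_0 = 1

# Implied timescales t_i = -tau / log(lambda_i) of the slow modes i >= 1
timescales = -tau / np.log(eigvals[1:])
print(timescales)  # the closer lambda_i is to 1, the slower the mode
```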
1.3.3.1 Variational approach to conformation dynamics

One formalism to solve the eigendecomposition problem is the variational approach to conformation dynamics (VAC)[53]. VAC calculates the eigenvalues and eigenvectors of the transfer operator, starting with $\psi_1$, by solving a variational principle which states that any trial eigenvector $\tilde{\psi}_1$ will have a time-lagged autocorrelation $\langle \tilde{\psi}_1 | \mathcal{K} | \tilde{\psi}_1 \rangle \leq \lambda_1$, as long as it is orthonormal to $\psi_0$, with equality holding iff $\tilde{\psi}_1 = \psi_1$. Thus we can approximate the first nontrivial eigenfunction $\psi_1$ by searching for a $\tilde{\psi}_1$ that maximizes its autocorrelation function subject to the orthonormality conditions. Refs. [53, 54] show how to calculate such matrix elements given unbiased MD data. Further modes can be learned in a similar manner but with more orthonormality conditions. Methods such as TICA[55] implement VAC directly by learning slow eigenvectors as linear combinations of pre-selected basis functions, and are at the heart of building Markov state models (MSMs). The variational approach for Markov processes (VAMP) principle[56] generalizes the mathematical framework of the VAC principle to nonstationary and nonreversible processes, while keeping the same key underlying intent of learning a transformation of X in which the dynamics is as Markovian as possible. VAMPnets use ANNs to implement the VAMP principle and automate the various steps involved in the construction of an MSM. Improvements on VAMPnets include methods such as state-free reversible VAMPnets (SRVs), which combine the strengths of VAC and VAMPnets for equilibrium systems and use ANNs to construct a full hierarchy of slow modes expressible as nonlinear functions of the input coordinates.

Since the training of ML models involves finding optimal parameters in a space with many local optima, a smart design of the objective function can make a model more likely to converge to an optimal one when there is a limited amount of data. For example, slow modes s as per VAC may also be equivalently defined as a mapping s(X) that maximizes the autocorrelation A(s) for a lag time $\tau$:

$A(s) = \frac{\mathbf{E}\left[\bar{s}(X_t)\, \bar{s}(X_{t+\tau})\right]}{\sigma(s(X_t))\, \sigma(s(X_{t+\tau}))}$    (1.9)

where $\bar{s} = s - \mathbf{E}(s)$ is the mean-free latent variable and $\sigma(s)$ is the standard deviation of s. Time-lagged autoencoders (TAEs) use an encoder-decoder framework to find the slow component so defined. However, instead of calculating the precise expectation value in Eq. 1.9, a TAE approximates the slow mode by minimizing the reconstruction loss $\mathcal{L}_R$:

$\mathcal{L}_R \equiv \mathcal{L}_{TAE} = \sum_t \left\| X_{t+\tau} - D(E(X_t)) \right\|^2$    (1.10)

Here E is the encoder, which maps the input configuration X to the low-dimensional s, and D is the decoder, mapping it back to coordinate space. Maximizing Eq. 1.9 and minimizing Eq. 1.10 are identical if the encoder and decoder are linear[57].
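To make Eq. 1.10 concrete, below is a minimal time-lagged autoencoder sketch in PyTorch. This is my own generic illustration: the layer sizes, the lag, and the random stand-in data are placeholders, not the architectures or datasets used later in this thesis:

```python
import torch
import torch.nn as nn

tau = 10                      # lag time in frames (placeholder value)
n_input, n_latent = 30, 1

encoder = nn.Sequential(nn.Linear(n_input, 32), nn.ReLU(),
                        nn.Linear(32, n_latent))
decoder = nn.Sequential(nn.Linear(n_latent, 32), nn.ReLU(),
                        nn.Linear(32, n_input))
opt = torch.optim.Adam(list(encoder.parameters()) +
                       list(decoder.parameters()), lr=1e-3)

# traj: (n_frames, n_input) order parameters; random stand-in here
traj = torch.randn(5000, n_input)
X_t, X_tp = traj[:-tau], traj[tau:]   # time-lagged pairs (X_t, X_{t+tau})

for epoch in range(100):
    opt.zero_grad()
    # Eq. 1.10: reconstruct the *future* frame from the encoded present
    loss = ((decoder(encoder(X_t)) - X_tp) ** 2).sum(dim=1).mean()
    loss.backward()
    opt.step()
```

The only difference from a plain autoencoder is the target: the decoder must predict $X_{t+\tau}$ rather than $X_t$, which forces the bottleneck to carry temporally predictive, i.e. slow, information.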
1.3.3.2 Variational dynamics encoder

The variational dynamics encoder (VDE)[58] also uses an encoder-decoder pair with time-lagged data, but instead of having only a term related to the reconstruction error $\mathcal{L}_R$, VDE includes two additional terms. Firstly, it borrows the idea of the variational autoencoder (VAE) to enforce a prior distribution on the latent variable $s_t$. The use of a Gaussian prior helps the distribution of $s_t$ to be smooth, allowing meaningful interpolation between states in latent space. Secondly, an autocorrelation loss $\mathcal{L}_{AC} = -A(s)$ is introduced. It encourages the learning of modes with high autocorrelation and makes the training process easier to converge.

Work from Olsson and Noé has extended the notion of encoding a global molecular configuration X into encoding several local configurations $X^1, X^2, \ldots, X^j$, each representing a partition of the original global molecular structure into a local substructure[59]. To model the time evolution of X, the propagator or conditional distribution $p(X_t | X_{t-\tau})$ is written in terms of the substructures:

$p(X_t \mid X_{t-\tau}) \propto e^{\sum_i X_t^i \left( \sum_j J_{ij}(\tau) X_{t-\tau}^j + h_i(\tau) \right)}$    (1.11)

where $J_{ij}$ is the coupling parameter between the ith and jth subsystems and $h_i$ describes the coupling between the ith subsystem and an external field. With this choice of model, the problem of determining the coupling parameters can be reduced to N logistic regression problems. A notable feature of this approach is that it seems capable of predicting molecular configurations that have not been incorporated into the training data.

The above-mentioned methods focus on learning the slow modes from MD data. Now I will review approaches that use ML not just to analyze existing MD-generated structures and trajectories, but also to actively enhance the sampling capacity of MD. In other words, these approaches use ML not just to learn from given data, but to actually generate statistically accurate information when the underlying processes are so slow that they simply cannot be sampled in unbiased MD, even with the best available computing resources.

1.3.3.3 Molecular enhanced sampling with autoencoders

The molecular enhanced sampling with autoencoders (MESA) approach uses an ANN with a nonlinear encoder and decoder to learn the RC from input data, which itself is generated through umbrella sampling along trial RCs[60]. Every round of ML leads to an improved RC along which new umbrella sampling is performed. The iterations continue until the free energy from umbrella sampling no longer varies with further iterations. Similar to MESA, nonlinear RCs learned by methods like VDE[61] can also be used to perform enhanced sampling, typically using TICA modes as input variables.

1.3.3.4 Neural networks-based variationally enhanced sampling

The ANN-based variationally enhanced sampling (VES) method stands out with respect to many of the other ML methods mentioned here, as it does not try to learn slow modes, but instead tries to express the bias as a smoothly differentiable ANN potential as a function of a pre-selected small number of CVs[62]. To learn such a bias, it optimizes the objective function introduced in [63].

1.3.3.5 Reinforcement learning based methods

The reinforcement learning based adaptive sampling (REAP) method of Ref. [64] learns relevant CVs on the fly as exploration of the landscape is carried out. REAP starts with a dictionary of order parameters (OPs) and associated trial weights. A round of unbiased MD is carried out and clustered into states, after which the weights are adjusted in order to maximize a reward function. In a nutshell, the reward function is designed to favor the least populated clusters and is iteratively adjusted as new clusters are visited.
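In the spirit of such adaptive sampling, the snippet below picks restart states for the next round of short simulations by favoring the least populated clusters. This is a loose sketch of the general idea only, not the actual REAP reward function of Ref. [64], and the cluster labels are synthetic:

```python
import numpy as np

rng = np.random.default_rng(3)

def pick_restarts(cluster_labels, n_restarts=10):
    """Choose cluster indices to restart from, weighting each cluster
    inversely to its population so rarely visited states get explored."""
    clusters, counts = np.unique(cluster_labels, return_counts=True)
    weights = 1.0 / counts
    weights /= weights.sum()
    return rng.choice(clusters, size=n_restarts, p=weights)

# Labels from clustering one round of short MD trajectories (synthetic)
labels = rng.choice([0, 1, 2, 3], p=[0.7, 0.2, 0.07, 0.03], size=2000)
print(pick_restarts(labels))
```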
1.3.3.6 Boltzmann Generators

The Boltzmann Generator is a very recent deep learning based approach that learns the equilibrium probability $P_0(X)$ without resorting to running long trajectories[65]. It leverages recent advances in generative modeling in which an invertible coordinate transformation $F_{xz}(X)$, mapping a microstate X onto a random variable z, is learned[66, 67]. If z follows a distribution that is straightforward to sample, we can draw a sample from P(z) and map it back to a microstate X with $F_{zx}(z) = F_{xz}^{-1}(z)$. The probability density of the generated samples is $P_x(X) = P_z(z)\, |J_{zx}|$, where $J_{zx}$ is the Jacobian of $F_{zx}$. The invertible transformation is learned such that $P_x(X)$ is close to the Boltzmann distribution $P_0(X) \propto e^{-\beta U(X)}$. Such an invertible transformation replaces Monte Carlo sampling in X-space with Monte Carlo sampling in z-space, where the Monte Carlo moves can be easily designed. The model can use this strategy to generate more samples in X-space and optimize the invertible transformation by learning from the new samples. Once the algorithm converges, $P_x(X)$ gives a good approximation of $P_0(X)$ and can be used to generate samples. The authors also propose a reweighting scheme to assign weights to the samples, so that even if $P_x(X)$ slightly deviates from $P_0(X)$, the weighted samples can still be used to estimate the free energy profiles of the system. Later improvements on Boltzmann Generators include introducing stochastic transformations to support more flexible learning[68] and designing a network structure that can adjust the learned distribution to satisfy the distribution at different temperatures[69].
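The change-of-variables bookkeeping at the heart of this idea fits in a few lines. Below is a minimal sketch in my own 1D toy setting: an affine map stands in for a trained flow, and a double-well potential stands in for the molecular U(X); it is not the Boltzmann Generator implementation of Ref. [65]:

```python
import numpy as np

rng = np.random.default_rng(4)
beta = 1.0

def U(x):
    """Toy double-well potential standing in for the molecular U(X)."""
    return (x**2 - 1.0)**2

# A trivially invertible "flow": x = F_zx(z) = a*z + b. A trained flow
# would stack many such invertible layers with learned parameters.
a, b = 0.8, 1.0
z = rng.normal(size=100000)
x = a * z + b

# Change of variables: P_x(x) = P_z(z) * |dz/dx| = N(z; 0, 1) / |a|
log_Pz = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)
log_Px = log_Pz - np.log(abs(a))

# Importance weights w ~ e^{-beta U(x)} / P_x(x) correct the residual
# mismatch between the generator's density and the Boltzmann target
log_w = -beta * U(x) - log_Px
w = np.exp(log_w - log_w.max())   # shift for numerical stability

# A weighted histogram now estimates the Boltzmann distribution of U
P_est, edges = np.histogram(x, bins=60, weights=w, density=True)
```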
1.4 Overview of the thesis

In this thesis, I will present my attempts at solving the problems mentioned in the previous sections. I have explored the possibility of taking help from artificial intelligence (AI) and statistical mechanics to develop frameworks that can accelerate the sampling of the configuration space of complex (bio)molecular systems, making it possible to calculate various thermodynamic and kinetic observables. I have also applied these enhanced sampling methods to study the dissociation of ligands from proteins and RNA, and compared the computational results with experimental results to better understand the mechanisms of ligand-receptor interactions. A summary of these efforts is as follows.

1. In Chapter 2, I will discuss the method reweighted autoencoded variational Bayes for enhanced sampling (RAVE), which can automatically learn a low-dimensional description of systems with hundreds of thousands of atoms and use this information to sample events that could not be studied with normal MD. RAVE is based on the framework of the information bottleneck[70, 71, 72]. I will use variational inference to reformulate the predictive information bottleneck (PIB) as a neural network and take advantage of AI to overcome the challenges faced by other sampling methods. This AI-based sampling method is applicable to generic biomolecular systems, and it is not limited by a lack of prior knowledge about the system.

2. In Chapter 3, I will focus on a theoretical study of RAVE. I will discuss the choice of the important time-delay parameter in the model and show that it can be chosen from a relatively wide range without significantly changing the RCs learned by the model. I will also derive a correction term to address the problem of learning from biased MD.

3. In Chapter 4, I will apply RAVE to study the thermodynamics, kinetics, and mechanisms of unbinding of two different ligands from the PreQ1 riboswitch. In this study, by applying AI-augmented MD and with the help of experimental techniques, I aim to better understand the interplay between RNA structure and its interaction with different ligands. I will show how MD simulation can reproduce quantities measured in experiment. Moreover, I will make predictions about critical mutants based on what we learn from the MD simulations and show how they have been validated in experiment.

4. In Chapter 5, I will explore using generative AI to learn from very high-dimensional data and generate samples. I use a generative AI model called the denoising diffusion probabilistic model (DDPM) to improve the sampling quality of replica exchange sampling. I will show that by treating temperature as a random variable instead of a control parameter, we can learn a joint distribution and discover transition states and metastable states that were previously unseen at the temperature of interest.

5. In Chapter 6, I will give a summary of the work presented in this thesis and suggest possible directions for future research.

Chapter 2: Reweighted autoencoded variational Bayes for enhanced sampling (RAVE)

2.1 Introduction

To address the time scale problem discussed in Sec. 1.3.1, I developed an AI-based enhanced sampling method called reweighted autoencoded variational Bayes for enhanced sampling (RAVE). In the following sections, I will introduce this method and show its application to different systems. I will first demonstrate the method by sampling alanine dipeptide and then by calculating the full dissociation process of benzene from the L99A mutant of the T4 lysozyme protein[73, 74]. In the latter system, using all-atom MD simulations taking barely a few hundred nanoseconds in total, and with minimal use of prior human intuition as in other related methods, I obtained accurate thermodynamic and kinetic information for a process that takes a few hundred milliseconds in reality.

This method is based on the ansatz that efficient sampling of the energy landscapes of molecular systems poses the same key underlying challenge as the one faced by a fly as it goes about surviving[75], or by the human brain trying to process how to catch a moving baseball[70]. Namely, given limited storage and computing resources, which memories should be preserved and which ones ignored in order to be best prepared for various possible future challenges? This can be paraphrased as the ability to rapidly learn a low-dimensional representation of a complex system that carries maximal information about its future state. Since storing and processing large amounts of information can be computationally, and thus energetically, expensive for the brain, it has been suggested that neurons in the brain separate predictive information from the non-predictive background, so that by encoding and processing a minimum amount of relevant information the brain can still be maximally prepared for future outcomes. The past-future (or predictive) information bottleneck framework, introduced and developed in many forms[70, 71, 72, 75], implements such neuronal models on an information-theoretic basis that can originally be traced back to Shannon's rate-distortion theory.

Reweighted autoencoded variational Bayes for enhanced sampling (RAVE)[76, 77] is an iterative ML-MD method motivated by the observation that many feature learning methods, in addition to classifying features, also provide the probability distribution in feature space[78]. The learned features and their probability distribution can then respectively be used as the RC and its fixed or static bias, which can be applied to U(X), leading to more ergodic exploration. To learn these features and their probabilities, RAVE uses a past-future information bottleneck optimization scheme that outputs a minimally complex yet maximally predictive model. An ANN decoder is trained to predict the future state of the system instead of only trying to recover the input data, and a linear encoder is used to get an interpretable projection from the space of order parameters to the RC. The active enhanced sampling (AES) approach[79] is another approach similar to RAVE, which uses well-tempered metadynamics[80, 81] and a stochastic kinetic embedding formalism.
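To illustrate the two ingredients RAVE hands back to the MD engine, a linear RC over order parameters and a static bias built from the RC's learned probability distribution, the snippet below sketches the bookkeeping only. The weights and samples are placeholders for what the PIB training and a previous biased round would actually produce; this is not the RAVE network itself:

```python
import numpy as np

kT = 2.494  # kJ/mol at 300 K

# Placeholder output of the ML step: weight of each order parameter (OP)
# in the linear, interpretable RC that a linear encoder would provide
w = np.array([0.8, -0.5, 0.3])
w /= np.linalg.norm(w)

def rc(ops):
    """Linear RC: a weighted combination of the OPs."""
    return ops @ w

def make_static_bias(s_samples, bins=50):
    """Turn the estimated P(s) into a static bias V(s) = kT ln P(s),
    which counteracts F(s) = -kT ln P(s) and flattens the landscape."""
    P, edges = np.histogram(s_samples, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    V = np.full_like(P, -np.inf)
    np.log(P, out=V, where=P > 0)   # V = ln P(s) in sampled bins only
    return centers, kT * V

# ops_traj: (n_frames, 3) OP time series; random stand-in here
ops_traj = np.random.default_rng(5).normal(size=(10000, 3))
centers, V = make_static_bias(rc(ops_traj))
# In the next RAVE round, V (interpolated) is applied along rc() in MD.
```

Because the bias is static rather than history-dependent, the same converged V can seed many independent production runs, which is what makes the kinetics estimates statistically robust.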
An ANN decoder is trained to predict the future state of the system instead of only trying to recover the input data, and a linear encoder is used to get an interpretable projection from the space of order parameters to the RC. The active enhanced sampling (AES) approach [79] is another approach similar to RAVE, which uses well-tempered metadynamics [80, 81] and a stochastic kinetic embedding formalism.

2.2 Using latent variables of VAE as RCs

Our first attempt at solving the rare event problem started with learning RCs using the unsupervised machine learning method of the variational autoencoder (VAE)[58] and constructing a biasing potential along the learned RCs to speed up the simulation[76]. The VAE comprises an encoder, q(z|X), mapping the original high-dimensional data into its low-dimensional latent space representation, and a decoder, p(X|z), mapping such a latent variable representation back into the original data space. By enforcing z to be close to a simple prior distribution such as a Gaussian, the z learnt by the model captures the data's main features. In the context of MD simulations, the latent variable representation describes a low-dimensional manifold for the molecular simulation trajectories within configuration space and thus can serve as a good approximation of the RCs. The learnt variables and their probability distribution can then respectively be used as the RC and its static bias, applied to the system to achieve more ergodic exploration. The RAVE protocol involves iterating between machine learning and MD, with each iteration learning a more refined low-dimensional RC representation and bias potential. These iterations continue until the biasing parameters are converged and sufficient to sample the desired ligand dissociation. One benefit of this approach is that a time-independent bias is produced that can be used to launch multiple independent production MD runs, which is helpful in extracting kinetic information with statistical robustness.

In this proof-of-concept study, RAVE showed its usefulness on several model systems, including two analytical potentials as well as a hydrophobic buckyball-substrate system in explicit water. All these systems have extremely high barriers (between 5 kBT and 30 kBT), and using RAVE, we can obtain near-ergodic sampling and converged free energy profiles both accurately and efficiently[76]. However, I would like to note that there are some shortcomings if one uses only the VAE framework to learn the RCs. By construction, the VAE is designed to reproduce the original sample, so it completely ignores the kinetics of the MD trajectory dataset. Thus, instead of learning the slow mode of a system, it might focus on learning the mode with the highest variance. Also, the restriction enforcing the distribution of the latent variable to be close to a Gaussian prior makes it hard to learn free energy landscapes with many basins. To overcome these limitations, I use a more general framework called the information bottleneck to replace the VAE in RAVE, thus allowing RAVE to find the RCs by learning from the kinetics of the system.
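A minimal sketch of the VAE construction just described, assuming the trajectory is available as a NumPy array of order parameters; all names, hyperparameters and the placeholder data are illustrative, not the exact implementation used in Ref. [76]:

```python
# Minimal sketch of a VAE whose 1-D latent variable serves as a trial RC.
# Assumes a trajectory of d order parameters in an array `ops` of shape
# (n_frames, d); the random data here is only a stand-in for real MD output.
import numpy as np
import torch
import torch.nn as nn

d = 4
ops = np.random.randn(10000, d).astype(np.float32)  # placeholder trajectory
X = torch.from_numpy(ops)

class VAE(nn.Module):
    def __init__(self, d, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d, hidden), nn.ReLU())
        self.mu, self.logvar = nn.Linear(hidden, 1), nn.Linear(hidden, 1)
        self.dec = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, d))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparametrize
        return self.dec(z), mu, logvar

model = VAE(d)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(50):
    recon, mu, logvar = model(X)
    recon_loss = ((recon - X) ** 2).sum(dim=1).mean()
    # KL divergence of q(z|X) from the standard normal prior
    kl = (-0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1)).mean()
    loss = recon_loss + kl
    opt.zero_grad(); loss.backward(); opt.step()

rc = model.mu(model.enc(X)).detach().numpy()  # latent mean used as trial RC
```

Note that the loss contains only a reconstruction term and a prior term; nothing in it rewards temporal predictivity, which is precisely the shortcoming discussed above.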
2.3 Principle of Past-Future Information Bottleneck

Here I interpret the RC in molecular systems as such a past-future information bottleneck [71]. I develop a sampling method that, for small biomolecules, simultaneously and with minimal use of human intuition learns this bottleneck, its thermodynamics and its kinetics. The central idea is that not all features of the past carry predictive value for the future. A complex model can be made to be very predictive; however, it will often obscure physical interpretability and also end up capturing noise. In order to address this task, I set up an optimization problem and demonstrate how to solve it through the principle of variational inference[82] implemented through deep neural networks. This makes it possible to learn a predictive information bottleneck[72], which we can interpret as the RC, that given a molecule's past trajectory is maximally predictive of its future behavior. The net product is an iterative framework on the lines of Ref. [76] that starts from a short MD simulation and, given this data, makes an estimate of the RC, its Boltzmann probability, and its associated causal Green's function valid for short times. This information is leveraged to perform systematically biased simulations with enhanced exploration of phase space, which can then be used to re-learn the RC along with its probability and propagator, iterating between MD and variational inference until optimization is achieved. At this point we have converged estimates of the most informative degrees of freedom, the associated metastable states and their equilibrium probabilities. Finally, through the use of a generalized transition state theory based framework on the lines of Ref. [83], we recover the unbiased kinetics for moving between different metastable states.

I formalize this problem in terms of a high-dimensional signal X characterizing the state of an N-particle system under some generic set of thermodynamic conditions. I take X as some d generalized coordinates or basis set elements, where 1 ≤ d ≤ N. Let the value of this signal measured at time t, or the past, be denoted by X_t, and at time t + Δt, or the future, by X_{t+Δt}. We call Δt the prediction time delay. We assume that X_t and X_{t+Δt} are jointly distributed as per some probability distribution P(X_t, X_{t+Δt}). The mutual information I(X_t, X_{t+Δt}) (see Appendix for more details) quantifies how much an observation at one instant of time t can tell us about an observation at another instant of time t + Δt. Furthermore, here we can restrict our attention to stationary systems, hence omit the choice of time origin and write X_t as X and X_{t+Δt} as X_Δt.

The principle of the predictive information bottleneck (PIB)[71, 72] postulates a bottleneck variable s which is related to X by an encoder function P(s|X). Given the bottleneck variable s, predictions of the future X_Δt can be made with a decoder P(X_Δt|s). PIB says that the optimal bottleneck variable is one which is as simple as possible in terms of the past it needs to know, yet as powerful as possible in terms of the future it can predict correctly. This intuitive principle can be formally stated through the optimization of an objective function L which is a difference of two mutual informations:

L ≡ I(s, X_Δt) − γ I(X, s)    (2.1)

The above objective function quantifies the trade-off between complexity and prediction through a parameter γ ∈ [0, ∞).

2.3.1 Variational inference and neural network architecture

Typically, both the encoder P(s|X) and the decoder P(X_Δt|s) can be implemented by fitting deep neural networks [84] to data in the form of a time series of X.
This work stands out in three fundamental ways from typical implementations of the information bottleneck principle, and more generally of AI methods for sampling biomolecules[85, 86, 87, 88]. Firstly, I use a stochastic deep neural network to implement the decoder P(X_Δt|s), but a simple deterministic linear encoder P(s|X) (see Fig. 2.1). The simple encoder ensures that the information bottleneck or RC we learn is actually physically interpretable, which is notably hard to achieve in machine learning. On the other hand, by introducing noise in the decoder, I can control the capacity of the model to ensure that the neural network delineates useful features from useless information instead of just memorizing the whole dataset.

Figure 2.1: Network architecture used for learning the predictive information bottleneck s. The decoder Q(X_Δt|s) is a stochastic deep neural network, while the encoder P(s|X) is of a simple deterministic and thus directly interpretable linear form.

Secondly, now that our encoder is a simple linear model, I completely drop the complexity term in Equation (2.1) and set γ = 0. Due to a reduced number of variables, this leads to a simpler and more stable optimization problem. Finally, the rare event nature of processes in biomolecules makes it less straightforward to use information bottleneck/AI methods for enhanced sampling. Here I develop a framework on the lines of [76] that makes it possible to maximize the objective function in Equation (2.1) through the use of simulations that are progressively biased using importance sampling as an increasingly accurate information bottleneck variable is learnt.

The typical starting point is an unbiased MD trajectory X = {X^1, ..., X^M} with M data points. We want to develop a low-dimensional mapping s of this high-dimensional space that maximizes the objective function L = I(s(X), X_Δt). At the heart of this mutual information lies the calculation of the decoder P(X_Δt|s), which can in principle be done exactly using Bayes' theorem (see Appendix). However, this becomes impractical as soon as the dimensionality of X increases, due to a fundamental problem in statistical mechanics and machine learning: the intractability of the partition function in high dimensions [89]. The principle of variational inference is one such elegant and powerful approach to surmount this problem [90]. Let us consider a generic encoder given by some conditional probability P_φ(s|X), where φ is a set of parameters. Our objective then is to find the optimal RC or, equivalently, the encoder φ which optimizes the PIB objective:

φ* = argmax_φ L(φ)    (2.2)

As mentioned above, this optimization problem is intractable for almost all cases of practical interest. However, it is possible to perform approximate inference by assuming an approximate decoder Q_θ(X_Δt|s) parametrized by the vector θ. For any choice of θ, I make straightforward use of Gibbs' inequality[82] to write down (see Appendix for derivation):

I(s, X_Δt) = H(P_φ(X_Δt)) − H(P_φ(X_Δt|s)) ≥ H(P_φ(X_Δt)) − C(P_φ(X_Δt|s) || Q_θ(X_Δt|s))    (2.3)

Here H and C denote Shannon entropy and cross entropy respectively. Note that the first term in Equation (2.3) is independent of our model parameters and hence can be completely ignored in the optimization. Focusing on the second term in Equation (2.3), we thus obtain a variational lower bound on the predictive information bottleneck objective function:

L ≥ L̂ ≡ −C(P_φ(X_Δt|s) || Q_θ(X_Δt|s))    (2.4)

Thus L̂ is a tractable lower bound to the true predictive information bottleneck objective function L, involving a variational approximation through the trial decoder parametrized by θ.
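A minimal sketch of this linear-encoder/stochastic-decoder architecture optimizing the bound of Eq. (2.4) on unbiased data, assuming a Gaussian decoder; this is an illustration under stated assumptions, not the exact implementation used here:

```python
# Sketch of the PIB model of Fig. 2.1: a linear encoder s = sum_i c_i * OP_i
# and a stochastic decoder predicting the frame k steps into the future.
# `ops` is an assumed (n_frames, d) array; the random data is a placeholder.
import numpy as np
import torch
import torch.nn as nn

d, k = 4, 20
ops = torch.from_numpy(np.random.randn(10000, d).astype(np.float32))
X, X_future = ops[:-k], ops[k:]

encoder = nn.Linear(d, 1, bias=False)          # deterministic, interpretable
decoder = nn.Sequential(nn.Linear(1, 64), nn.ReLU(),
                        nn.Linear(64, 2 * d))  # Gaussian mean and log-variance

params = list(encoder.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
noise = 0.1  # fixed noise fed to the decoder controls model capacity

for epoch in range(100):
    s = encoder(X) + noise * torch.randn(len(X), 1)
    mu, logvar = decoder(s).chunk(2, dim=1)
    # negative log-likelihood of the future frame under Q(X_dt | s);
    # minimizing it maximizes the variational bound L-hat of Eq. (2.4)
    nll = (0.5 * (logvar + (X_future - mu) ** 2 / logvar.exp())).sum(1).mean()
    opt.zero_grad(); nll.backward(); opt.step()

weights = encoder.weight.detach().numpy().ravel()  # RC weights c_i
```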
Equation (2.4) has a simple physical interpretation: we are attempting to learn a decoder probability function Q that mirrors the actual Bayesian inverse probability function P in terms of predicting the future state X_Δt of the system, given knowledge of the RC s. The difference between the two probability distributions is measured as a cross entropy. By maximizing the right-hand side of Equation (2.4) simultaneously with respect to the decoder and encoder parameters θ and φ respectively, we can solve the optimization problem posited in Equation (2.2) rigorously and identify the optimal RC.

It is clear that a model of a dynamical system X that attempts to capture just its stationary probability P(X) will be less informative and useful than one that captures the joint past-future probability distribution P(X, X_Δt). This is simply because the stationary probability can always be calculated by integrating P(X, X_Δt) over future outcomes X_Δt. What is, however, less clear is the choice of the time-delay Δt [87]. In biomolecular systems, it is likely that there will be a hierarchy of time scales, and thus of time-delays relevant to different types of structural and functional details. In principle, our formulation allows us to probe these various time-delays in a systematic manner. Here, for the purpose of enhanced sampling, I propose an approach for selecting Δt that is rooted in the reactive flux formalism of chemical kinetics[91, 92, 93]. This formalism applies to any system with stochastic transitions on a network of microstates with arbitrary, complex connectivity. Summarily, it states that the correlation function for a trajectory's population in any given state can be partitioned into three parts: (a) an initial inertial part, (b) an exponential decay, and (c) an intermediate plateau region between (a) and (b). A key insight from this formalism is that accurately capturing (c), i.e., the plateau part of a system's state-to-state dynamics, is necessary and sufficient to capture the temporal evolution at any timescale. Paraphrasing this argument in the context of the present work, we propose to learn our PIB model for gradually increasing values of the predictive time-delay Δt, and stop when the calculated bottleneck variable converges.

2.3.2 Variational inference on unbiased and biased trajectories

I now show how to calculate L̂ in practice. For a given unbiased trajectory {X^1, ..., X^{M+k}} with large enough M, we can easily show (see Appendix for derivation):

L̂ = (1/M) Σ_{n=1}^M log Q(X^{n+k}|s^n)    (2.5)

where s^n is sampled from P(s|X^n) and the time interval between X^n and X^{n+k} is Δt. For practical rare event systems, however, a typical MD trajectory will remain trapped in the state where it was started. Here we use our current best estimate of the PIB to perform importance sampling of the landscape, so that the system is more likely to sample different regions of configuration space, and use this enhanced sampling to iteratively improve the quality of the RC. However, the data so generated is biased by definition, and we need to reweight out the effect of the bias. We suppose that along with the time series {X^1, ..., X^{M+k}} we have also been provided the corresponding time series of the bias V applied to the system, {V^1, ..., V^{M+k}}. We can then use the principle of importance sampling[94] to write our PIB objective function L̂ as follows (see Appendix for derivation):

L̂ = {Σ_{n=1}^M e^{βV^n}}^{−1} Σ_{n=1}^M e^{βV^n} log Q(X^{n+k}|s^n)    (2.6)

where β is the inverse temperature. The above equation is, however, approximate, as it assumes P_biased(X^{n+k}|s^n) ≈ P_unbiased(X^{n+k}|s^n).
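In code, moving from Eq. (2.5) to Eq. (2.6) amounts to attaching a per-frame weight e^{βV} to the decoder log-likelihood. A sketch reusing the encoder and decoder of the previous sketch, with `bias` an assumed per-frame bias-energy tensor:

```python
# Sketch of the reweighted objective of Eq. (2.6): frames from a biased run
# are weighted by exp(beta * V) so the loss approximates its unbiased
# counterpart. `bias` holds the bias energy V at each frame (assumed input).
import torch

def reweighted_nll(X, X_future, bias, beta, encoder, decoder, noise=0.1):
    w = torch.exp(beta * bias[:len(X)])          # importance weights e^{beta V^n}
    w = w / w.sum()                              # normalization as in Eq. (2.6)
    s = encoder(X) + noise * torch.randn(len(X), 1)
    mu, logvar = decoder(s).chunk(2, dim=1)
    log_q = -(0.5 * (logvar + (X_future - mu) ** 2 / logvar.exp())).sum(1)
    return -(w * log_q).sum()                    # minimize = maximize Eq. (2.6)
```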
This approximation is exact as Δt → 0, and can be expected to be reasonably valid for small Δt, where the system on average will not have diffused too far from its starting position at the beginning of that interval. If the bias varies smoothly enough that its natural variation length scale is larger than or comparable to this diffusion distance, then for small enough Δt we can indeed make the aforementioned approximation. This means that we select the smallest possible Δt at which the RC estimate plateaus.

2.3.3 Patching it all

I now discuss our complete sampling algorithm, which accomplishes, in a seamless manner, the identification of the RC together with the sampling of its thermodynamics and kinetics. The first step is to perform an initial round of unbiased MD. This trajectory, expressed in terms of d order parameters {χ1, ..., χd} (where 1 ≤ d ≤ N), is fed to a deep learning module (Fig. 2.1). The deep learning module implements the optimization of L̂ in Equation (2.6) through the use of a multi-layer feed-forward neural network for the stochastic decoder Q, and a physically interpretable linear map for the deterministic encoder P (Fig. 2.1). Unlike the decoder, the encoder has no noise term and always maps {χ1, χ2, ..., χd} to s = Σ_i c_i χ_i, where {c_i} denote the weights of the different order parameters. I perform this optimization for gradually increasing values of the predictive time-delay Δt, and estimate the RC s (given by the values of the weights c_i) from the first plateau, i.e., when it ceases to depend on the choice of Δt. This value of Δt is then kept constant for the different rounds of our protocol. At this point we have an initial estimate of s and also its unbiased probability distribution P^u(s). These are both used to construct a bias potential V_bias(s) for the next iteration of MD:

V_bias(s) = k_B T log P^u(s)    (2.7)

With this bias potential added to the original Hamiltonian of the system, we run a biased MD simulation. This explores an increased amount of configuration space, since we have applied a bias along our estimated slow degree of freedom, viz. the PIB or the RC. This next round of MD trajectory is again fed to the deep learning module, but this time each data point carries a weight w = e^{βV_bias} to compensate for the applied bias. This now identifies an improved RC s and its unbiased probability through the use of importance sampling:

P^u(s) ∝ ⟨w δ(s − s(t))⟩_b / ⟨w⟩_b ∝ e^{−βF(s)}    (2.8)

where the subscript b denotes sampling under the biased ensemble with weight w = e^{βV_bias}, and F(s) is the free energy along s. From here, using the bias V_bias(s) = −F(s), our algorithm can enter further iterations of MD-deep learning-MD-... This looping continues until both the RC s and the free energy estimate F(s) along the RC have converged. We have thus obtained an optimized reaction coordinate and its Boltzmann probability density, or equivalently the free energy. Through these we can directly demarcate the relevant metastable states and quantify their relative propensities. Furthermore, we can also calculate the transition rates for moving between these metastable states. The central idea is to keep all transition states between the different metastable states, as identified through the RC, devoid of any bias. As we show in the examples, this can be easily achieved when implementing Equation (2.8), by ensuring that any barriers in the unbiased probability distribution of the estimated RC remain completely bias-free.
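A minimal sketch of this bias-construction step (Eqs. 2.7-2.8), assuming the RC time series and bias energies from the previous round are available as NumPy arrays; names and the grid are illustrative:

```python
# One RAVE bias-construction step: a weighted histogram of the current RC
# estimate gives the reweighted P^u(s) of Eq. (2.8), whose logarithm is the
# next bias potential via Eq. (2.7).
import numpy as np

kBT = 2.494  # kJ/mol at 300 K

def update_bias(s_traj, v_traj, bins=50):
    """s_traj: RC values from biased MD; v_traj: bias energy at each frame."""
    w = np.exp(v_traj / kBT)                       # unbiasing weights e^{beta V}
    hist, edges = np.histogram(s_traj, bins=bins, weights=w, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    p_u = np.clip(hist, 1e-12, None)               # reweighted P^u(s), Eq. (2.8)
    v_bias = kBT * np.log(p_u)                     # Eq. (2.7), equals -F(s)
    v_bias -= v_bias.max()                         # zero bias at the FE minimum
    # In practice, poorly sampled (barrier) regions are left strictly bias-free,
    # which is what preserves the unbiased state-to-state kinetics.
    return centers, v_bias
```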
Once we have done this, we take into account that, by virtue of it being the PIB, the RC already encapsulates all relevant, predictive modes in the system. Thus the hidden barriers which have invariably been corrupted through the addition of such a bias do not have any predictive power for the dynamics of the system, and are therefore not relevant to the process at hand. This then implies that (i) the biased dynamics preserves the state-to-state sequence one would have seen with unbiased dynamics, and (ii) through the use of a simple time-rescaling calculation[83, 95] discussed in Sec. 1.3.2, we can calculate the acceleration of rates achieved through biased simulations. We now demonstrate the use of the PIB framework with two biomolecular case studies, in both of which we simultaneously learn the RC, free energy and kinetic rate constants. In each case the RC s is constructed as a linear combination s = Σ_i c_i χ_i, where {c_i} denote the weights of the different pre-selected order parameters {χ1, χ2, ..., χd}.

2.4 Application of RAVE to two benchmark systems

2.4.1 Conformational transitions in a model peptide

First we consider the well-studied alanine dipeptide system (Fig. 2.2(a)). This system, as characterized by its Ramachandran dihedral angles, can exist in different metastable states with varying stabilities and hard-to-cross intermediate barriers. However, due to its small size it serves as a reliable benchmark where we can perform longer-than-microsecond unbiased MD simulations to benchmark our PIB calculations. Here we choose {cos φ, sin φ, cos ψ, sin ψ} as our order parameters, where φ and ψ are the backbone dihedral angles; I take trigonometric functions of the dihedral angles to avoid problems related to periodic boundary conditions. The PIB protocol used here is shown in Fig. 2.2. In the initial round, I perform a short unbiased MD simulation (see Fig. 2.2(a) for the trajectory and technical details). As discussed in Sec. 2.3, the RC is then determined as the linear encoder of the trained neural network with the smallest loss. In Fig. 2.2(b), we show how the weights rapidly converge as functions of the predictive time delay and reach a plateau in less than 2 ps, which we set as Δt. Fig. 2.2(c) shows the RC s as well as the bias V(s) learnt along it, to be used in the next round of MD. With this biasing potential, I perform a biased MD simulation as shown in Fig. 2.2(d).

Figure 2.2: Past-future information bottleneck framework results on alanine dipeptide. (a) Unbiased simulation trajectory for dihedral angles φ and ψ, in blue triangles and orange pluses respectively. The alanine dipeptide molecule is shown in the inset. (b) Absolute weights of the different order parameters in the first training round as a function of the predictive time delay Δt. Blue triangles, orange triangles, green circles and red squares correspond to cos φ, sin φ, cos ψ and sin ψ respectively. (c-f) Free energy along the adaptively learnt RC along with the corresponding biased trajectories for different training rounds. (g) Free energy along φ, ψ after the RC has converged, with energy contours every 4 kJ·mol⁻¹. (h) Kinetics from the post-training biased runs as well as reference unbiased runs; the two are essentially indistinguishable. Orange and blue dashed lines denote biased and unbiased data respectively, while red and black solid lines show the corresponding best fits.
This trajectory, through the use of Equation (2.8), leads to a more complicated bias structure, as shown in Fig. 2.2(e), along with the improved RC s. Biased simulation with this new RC and bias, as shown in Fig. 2.2(f), finally leads to escape from the starting metastable state. The final obtained RC is s = 0.02 cos φ + 0.97 sin φ − 0.25 cos ψ − 0.02 sin ψ. It is known for alanine dipeptide that φ is more relevant than ψ for capturing the conformational transitions, and the PIB-based RC estimate agrees with that. Now that we have achieved back-and-forth motion in terms of the rare event we intended to study, I use this final RC and bias to perform multiple sets of longer simulations with no further refinement of the RC. This yields the free energy surface (defined as −k_B T log P(φ, ψ), where P is the unbiased Boltzmann probability) shown in Fig. 2.2(g). This is in excellent agreement with previously published benchmarks for this system[94]. At the same time, I use the acceleration factor to rescale the biased time back to the unbiased time. In Fig. 2.2(h) we show the cumulative distribution functions of the first passage time from the deeper basin as obtained through this approach, and through much longer unbiased MD runs, which are feasible given the small size of this system. The distribution functions and their best-fit Poisson curves are nearly indistinguishable, and lead to excellent agreement in the values of the escape rate constant, given by k = 5.2 ± 0.8 μs⁻¹ and 5.8 ± 0.9 μs⁻¹ for biased and unbiased simulations respectively.
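The kinetics workflow just described can be sketched as follows. The acceleration-factor rescaling follows the time-rescaling idea of Refs. [83, 95]; all inputs and names are illustrative assumptions:

```python
# Sketch of the kinetics analysis: biased first-passage times are rescaled by
# the instantaneous acceleration factor e^{beta V}, and the empirical CDF of
# the rescaled times is fit to a Poisson (exponential) law.
import numpy as np
from scipy.optimize import curve_fit

kBT = 2.494  # kJ/mol
dt = 0.002   # ps, MD time step (assumed)

def rescaled_escape_time(v_traj, escape_step):
    """Sum of dt * e^{V/kBT} over biased frames up to the escape event."""
    return np.sum(dt * np.exp(v_traj[:escape_step] / kBT))

def fit_rate(times):
    """Fit the empirical CDF of escape times to 1 - exp(-t/tau)."""
    times = np.sort(times)
    ecdf = np.arange(1, len(times) + 1) / len(times)
    cdf = lambda t, tau: 1.0 - np.exp(-t / tau)   # Poisson first-passage CDF
    (tau,), _ = curve_fit(cdf, times, ecdf, p0=[times.mean()])
    return 1.0 / tau                               # escape rate constant
```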
2.4.2 Benzene dissociation from T4-L99A lysozyme

Now I apply our framework to a very challenging and important test case, namely the pathway and kinetic rate constant of benzene dissociation from the protein T4-L99A lysozyme in all-atom resolution[73, 74]. I will also demonstrate how the RC calculated through our approach can be directly used to perform a sensitivity analysis of the protein, and to predict the most important residues whose mutations could have a significant effect on the stability of the protein-ligand complex. Such an analysis has direct relevance to predicting, for instance, the mutations in a protein which could lead to a pharmacological drug losing its efficacy. For this problem we choose 11 fairly arbitrary order parameters, denoted {χ1, ..., χ11}. As shown in Fig. 2.3 and Table 2.1, eight of these are ligand-protein distances while three are intra-protein distances. The RC is learnt as a linear combination of these order parameters, namely s = Σ_i c_i χ_i.

Figure 2.3: Definition of order parameters for benzene-lysozyme dissociation. Our 11 order parameters comprise 8 protein-ligand contacts and 3 protein-protein contacts. The protein-ligand contacts are implemented through the distances between the centre-of-mass of the ligand and the different protein atoms labeled in the plot. The protein-protein contacts are implemented through the distances between the centres-of-mass of the helices labeled in the plot. Further details are provided in Table 2.1.

For this problem as well, I start with a short unbiased MD simulation. As shown in Fig. 2.4, the weights of the different order parameters in the RC learnt from this trajectory change as a function of the predictive time-delay Δt, but converge quickly. On the basis of this plot, I set Δt = 2 ps for all further calculations. I then iterate, using the same neural network architecture as for alanine dipeptide (Fig. 2.1), between rounds of learning an iteratively improved RC s1 together with its probability distribution, and running biased MD using that iteration's RC and probability distribution as the bias V1(s1) (Equation (2.7)). After 9 rounds, the bias saturates as a function of training rounds; that is, no further enhancement in ergodicity is achieved by performing additional rounds of the aforementioned iteration. This corresponds to the system reaching configurations where the previous PIB ceases to be effective. To learn a new PIB, I use the washing-out trick from [96] to learn a second RC s2 conditioned on our knowledge of the first RC s1. In the next few rounds of learning-MD iterations, I (a) keep s1 and V1(s1) fixed, and (b) do not account for V1(s1) when using Equation (2.8). Through this I construct a bias V(s1, s2) = V1(s1) + V2(s2). In principle we can lift this assumption and learn a more complicated non-separable V(s1, s2). Within a few rounds of training s2, I observed spontaneous dissociation of the ligand from the protein. I then use the RC (s1, s2) (shown in Fig. 2.5) and its learnt bias V(s1, s2) to directly study the pathway and kinetics of ligand dissociation.

Table 2.1: List of order parameters used to construct the RC and their weights in s1 and s2.

OP    Type             Definition                                   Weight in s1   Weight in s2
d1    Protein-ligand   Y88CA-ligand                                  0.4930         0.6524
d2    Protein-ligand   A99CA-ligand                                  0.5059         0.3984
d3    Protein-ligand   L133CA-ligand                                -0.4002        -0.4853
d4    Protein-ligand   L118CA-ligand                                 0.2335        -0.0792
d5    Protein-ligand   V111CA-ligand                                 0.0300         0.0399
d6    Protein-ligand   A130CA-ligand                                -0.4009        -0.3037
d7    Protein-ligand   N140CA-ligand                                 0.1229        -0.1544
d8    Protein-ligand   A146CA-ligand                                 0.2313        -0.2240
hd12  Protein-protein  Helix 1 (A82-S90) - Helix 2 (T115-Q123)      -0.2081        -0.0378
hd23  Protein-protein  Helix 2 (T115-Q123) - Helix 3 (W126-A134)    -0.1114        -0.0073
hd34  Protein-protein  Helix 3 (W126-A134) - Helix 4 (K147-T155)    -0.0204         0.0670

Figure 2.4: Order parameter weights as functions of time delay for benzene-lysozyme dissociation. The plot shows the absolute values of the weights of the 11 order parameters in the first round as a function of the predictive time delay.

For this I launch 20 independent biased simulations using (s1, s2) as the RC and V1(s1) + V2(s2) as the bias. I calculate the acceleration factor to recover the original timescale of the first passage time. As shown in Fig. 2.6, I fit the cumulative distribution function to a Poisson process and obtain an escape rate constant of 3.3 ± 0.8 s⁻¹, in good agreement with other methods [20, 97, 98]. I also obtain a range of free energies viewed as functions of different order parameters (Fig. 2.7). These are in excellent agreement with previously published results [20, 96, 97, 98].

Figure 2.5: The 2-component predictive information bottleneck for benzene-lysozyme dissociation, where the colors red and blue correspond to s1 and s2 respectively. The optimized weights for the different order parameters are illustrated after scaling all weights to keep c1 = 1 in s1.

A comment I would like to make here concerns the magnitude of Δt, especially in connection with the decorrelation time of the MD thermostat. Here I implemented the canonical ensemble using the velocity rescaling thermostat [22] with a time constant of 0.1 ps.
The predictive time-delay Δt is thus at least 20 times longer than the times for which the history of the thermostat would persist. Interestingly, as can be seen in Fig. 2.4, the estimate of the RC converges with longer time-delays, which would be even more accurate from the perspective of avoiding thermostat-induced noise, but less accurate due to the biasing-related errors explained earlier.

2.4.3 Predicting critical residues

On the basis of the predictive information bottleneck that I have now calculated, we can directly predict which protein residues have the most critical effect on the system. The guiding principle is that residues which carry higher mutual information with the PIB are more likely to have an impact on the stability of the system if, for instance, they were to be mutated.

Figure 2.6: Poisson fit to the 15 independent transition times obtained for the benzene-lysozyme dissociation, with net maximum bias V1^max + V2^max = 52 kJ·mol⁻¹. The k_off reported in the main text is calculated as the inverse of the time constant fitted here, 303 ± 76 ms.

Figure 2.7: Representative selected free energy surfaces for benzene-lysozyme dissociation obtained from our approach: (a) free energy along d1 and d2; (b) free energy along d2 and d3; (c) free energy along d1 and d8. All energies are in units of kJ·mol⁻¹ with contours every 5 kJ·mol⁻¹. These surfaces are in excellent agreement with previous benchmarks on this system [96].

I use the backbone dihedral angles to describe the motion of each residue, denoting them φ_i and ψ_i, where i is the index of a particular residue. We use the mutual information I(χ, s) to quantify the correlation between a dihedral angle χ (where χ can be φ or ψ) and the unbinding process as captured by the RC s. Each residue has two dihedral angles, and for each dihedral angle I calculate its mutual information with s1 or s2, so we have four quantities per residue: I(φ, s1), I(ψ, s1), I(φ, s2) and I(ψ, s2). I rank the importance of each residue by the maximum of these four quantities. Here I only consider the parts of the trajectory that are bias-free (s1 > 0.4 and s2 > 0.3), because we require the energy barriers between metastable states to be bias-free in order to recover the correct reweighted dynamics of the unbinding process. In the biased MD simulation we can only guarantee that the main energy barrier between the bound and unbound states carries zero bias, while bias can still be added to barriers between other metastable states. Restricting to the unbiased region allows us to reduce the influence of the biasing potential on the dynamics of the system and to focus more on the transition states. By performing such a scan of the mutual information between the predictive information bottleneck and the backbone dihedral angles of the different residues, we can rank them from most critical to least critical.
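A minimal sketch of this mutual-information scan, using a simple histogram MI estimator (the estimator actually used may differ); array names are illustrative:

```python
# Mutual-information scan used to rank residues: MI between each backbone
# dihedral and the RC is estimated from a 2-D histogram of their time series.
import numpy as np

def mutual_information(a, b, bins=30):
    """Histogram estimate of I(a, b) in nats for two 1-D time series."""
    p_ab, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = p_ab / p_ab.sum()
    p_a, p_b = p_ab.sum(axis=1), p_ab.sum(axis=0)
    mask = p_ab > 0
    return np.sum(p_ab[mask] * np.log(p_ab[mask] / np.outer(p_a, p_b)[mask]))

def rank_residues(phi, psi, s1, s2):
    """phi, psi: (n_frames, n_residues) dihedral series; s1, s2: RC series."""
    scores = []
    for i in range(phi.shape[1]):
        scores.append(max(mutual_information(phi[:, i], s1),
                          mutual_information(psi[:, i], s1),
                          mutual_information(phi[:, i], s2),
                          mutual_information(psi[:, i], s2)))
    return np.argsort(scores)[::-1]   # residue indices, most critical first
```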
As shown in Fig. 2.8, some of the important residues are (in order of decreasing relevance): Ser136, Lys135, Asn132, Leu133, Ala134, Phe114, Val57, Asp20, Leu118 and Val131. These residues can be classified into two broad groups. Firstly, we have group (a), comprising residues 114, 118, and 131-136; together these contribute to the breathing movement between the two helices through which the ligand leaves. Secondly, we have group (b), comprising residues 20 and 57, which lie in different disordered regions of the protein and have no obvious interpretation. The roles of the group (a) residues have been hinted at in previous works[20, 99, 100], and are thus yet another validation of our approach. The role of group (b) during the unbinding process remains to be seen.

Figure 2.8: Critical residue analysis for the benzene-lysozyme complex. The plot on the left shows, for every residue, the maximal mutual information between the PIB and either of the Ramachandran angles φ, ψ of that residue. The top 10 residues are highlighted through markers and, in the right plot, illustrated relative to the ligand in a typical intermediate pose.

In order to demonstrate the robustness of this calculation, I performed the full MD-deep learning-MD-deep learning iterative protocol with a new set of order parameters (defined in Table 2.2) that considered new protein-ligand distances and completely excluded any protein-protein distances, as the selection of the latter requires a more significant role for human intuition, for instance in anticipating protein breathing. In this new set of calculations, as shown in Fig. 2.9, I again obtained the same critical residues, including a residue from the same disordered region as in group (b) above. Whether the disordered regions are biophysically relevant to the unbinding of the ligand through the existence of a long-range allosteric communication pathway, or whether these residues are picked up merely due to noise in the calculations, needs a more detailed mutagenesis study in the future.

Table 2.2: List of order parameters used to construct the RC and their weights in s1 and s2.

OP   Type            Definition       Weight in s1   Weight in s2
d1   Protein-ligand  V87CA-ligand      0.0297         0.8604
d2   Protein-ligand  N101CA-ligand     0.1448         0.2590
d3   Protein-ligand  G110CA-ligand    -0.0583        -0.0813
d4   Protein-ligand  A119CA-ligand    -0.0261         0.2256
d5   Protein-ligand  A129CA-ligand    -0.4076         0.0282
d6   Protein-ligand  Y139CA-ligand     0.8846        -0.3548
d7   Protein-ligand  V149CA-ligand     0.1593        -0.0921

Figure 2.9: (a) Alpha carbons used to define the two sets of basis functions. Carbon atoms used in Table 2.1 are colored in blue and carbon atoms used in Table 2.2 are colored in red. (b) The 2-component predictive information bottleneck for benzene-lysozyme dissociation, with order parameters defined in Table 2.2. (c) Critical residue analysis from the new set of order parameters.

2.5 Discussion

In this chapter, I introduced a new framework for the simultaneous sampling of the reaction coordinate, free energy and rate constants in biomolecules with rare events. This work is grounded in the predictive information bottleneck framework, an information-theoretic approach for building minimally complex yet maximally predictive models from data. Such a framework has previously been found useful for modeling fruit fly movement and human vision. Here I exploit the commonality between these diverse problems and that of sampling complex biomolecular systems, namely the need to quickly predict the future state of a system given noisy and high-dimensional information. This method implements the framework through the use of a unique linear encoder-stochastic decoder model, where the latter is a deep neural network with inbuilt noise.
Here I demonstrated the applicability of the method by studying conformational transitions in a model peptide in vacuum and ligand dissociation from a protein in explicit water, with both systems in all-atom resolution. Through extremely short and computationally cheap simulations, I obtained thermodynamic and kinetic observables for slow biomolecular processes in excellent agreement with other methods, experiments and long unbiased MD. Last but not least, by virtue of having captured the most predictive degrees of freedom in the system, I could also make direct predictions of how protein sequence can impact dissociation dynamics, namely, which mutations in the protein would be most deleterious to the dissociation process.

I would also like to discuss some limitations of this approach that I will address in the following chapters. As discussed in Sec. 2.3.2, since I did not consider the correction for the propagator, Eq. 2.6 is a good approximation only when Δt is small enough that the system on average will not have diffused too far from its starting position. But it is also required that Δt be taken from the range where the RC reaches a plateau. Though I showed that in the two benchmark systems we can choose a relatively small Δt, as the RCs quickly converge to a constant as Δt increases, we can still ask whether this is generally true for other systems. In Chapter 3, I will discuss the Δt dependence of the objective function and derive a new formulation which also accounts for the correction to the propagator. In addition to better approximating the objective function from biased simulations, one other issue that we need to address is the choice of input for RAVE. Here I used a simple linear encoder, which preserves the interpretability of the RC. However, this comes at the cost of requiring smartly designed non-linear basis functions, or order parameters, which can often be domain-dependent. For example, different classes of basis functions were needed here for the alanine dipeptide conformation change (namely, torsions) and for ligand dissociation (namely, distances). Using a linear encoder on distances for alanine dipeptide, or torsions for ligand dissociation, leads to no discernible enhancement in sampling as we iterate through rounds of deep learning and MD. Thus, bad choices of basis functions for the protocol can be ruled out, at least in a heuristic manner, by checking whether they lead to more ergodic sampling or not. In Chapter 4, I will show that when applying RAVE to more challenging systems, we can use the method AMINO[101], developed in our group, to automatically select the basis functions.

Chapter 3: Understanding the role of predictive time delay and biased propagator in RAVE

3.1 Introduction

In this chapter, I take a closer look at the underlying formulation of RAVE with respect to crucial aspects which were introduced in our past work using intuitive arguments. Here I put these aspects on a rigorous footing, identifying their shortcomings and, where applicable, introducing simple corrections. These are as follows. RAVE[77] involves learning the PIB through optimization of an objective function that minimizes model complexity while still maximizing predictive power. Firstly, a key parameter in constructing the RAVE objective function is the so-called predictive time-delay Δt, or how far into the future the algorithm should try to predict.
In the previous chapter, I showed through numerical benchmarks that the precise choice of Δt did not matter, as long as Δt > 0 and was kept small. That is, the predicted PIB changed with Δt but rapidly plateaued. Here, by using a master-equation-based formulation of the system's dynamics, I will demonstrate rigorously why such a dependence on Δt is a direct and simple consequence of Markovianity along the learnt low-dimensional representation. In fact, the absence of a plateau in how the PIB depends on Δt can even be considered a tell-tale sign of incorrect results from RAVE. Secondly, at the initial stage of RAVE, when the input trajectory is unbiased, the RAVE objective function can be computed exactly by sampling over the input trajectory. However, after the first iteration, once the PIB is learnt and used to perform a round of biased MD, it becomes tricky to calculate the objective function exactly through sampling. In the applications of Sec. 2.4, I introduced an intuitive approximation, namely that at short timescales the propagator of the system along the putative PIB is invariant between biased and unbiased dynamics. Naturally this statement is exactly true as Δt → 0, but its fate for Δt > 0 is not at all obvious. Here, again by taking recourse to a master equation formalism, I will derive a framework for using biased sampling to estimate the reweighted, unbiased propagator needed in RAVE, valid for small Δt > 0. I will show how it reduces to the original expression in limiting cases, and how to compute it. I will also demonstrate through numerical examples when such a correction might matter and when it might not.

3.2 Dependence of the predictive information bottleneck on Δt

Maximizing the mutual information I(s, X_Δt) in Eq. 2.1 forces the learnt linear PIB or RC s to capture features that can be used to build a predictive model. In addition, introducing a slight time delay in RAVE, i.e., setting Δt > 0, emphasizes the contribution not just of features related to the static distribution of the data, but specifically of features that persist with time. As demonstrated in related work on time-lagged autoencoders[102], this amounts to learning slowly decorrelating aspects of the feature space X. Two questions immediately arise after introducing such a non-zero time delay Δt. First, what precise value of Δt should we adopt when constructing the objective function? Second, how do we approximate unbiased estimates of P(X_Δt|X) from a biased MD trajectory, which provides us Pbiased(X_Δt|X)? The second question arises because access to unbiased estimates of P(X_Δt|X) is critical to calculating the objective function L in Eq. 2.1. In this work we answer, with relative rigor, both of the questions above. For the first question, we demonstrate that given Markovianity in the higher-dimensional space X, the Markovianity of the PIB follows quite naturally. In other words, there exists a range of non-zero time-delay Δt values for which the PIB is independent of the exact choice of Δt. Thus any small enough Δt can be used to construct the PIB. For the second question, we propose a new objective function which corrects for the influence of the bias on the short-time propagator P(X_Δt|X), and test it on model systems. Reassuringly, we find that the effect of the correction derived here is minimal, and that our intuitive approximation Pbiased(X_Δt|s) ≈ Punbiased(X_Δt|s) made in Sec. 2.4 was not a bad one.
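In practice, the plateau criterion discussed above translates into a simple scan over time delays. A hedged sketch, where `train_pib` is a stand-in for any PIB optimization loop such as the one sketched in Sec. 2.3.1:

```python
# Sketch of the time-delay scan: retrain the PIB at increasing lags and
# declare convergence once the encoder weights stop changing.
import numpy as np

def scan_time_delay(ops, delays, train_pib, tol=0.05):
    """ops: (n_frames, d) order-parameter series; delays: lags in frames."""
    prev = None
    for k in delays:
        w = train_pib(ops[:-k], ops[k:])         # returns RC weights c_i
        w = np.abs(w) / np.linalg.norm(w)
        if prev is not None and np.linalg.norm(w - prev) < tol:
            return k, w                          # first plateau: use this delay
        prev = w
    raise RuntimeError("no plateau found; a possible sign of unreliable RCs")
```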
As discussed in Sec. 2.3, estimating the PIB, i.e., maximizing the objective function L in Eq. 2.1, is the same as maximizing its variational lower bound L̂, the negative cross entropy between P(X_Δt|s) and the approximate decoder Q_θ(X_Δt|s):

L̂ = ∫ P(X_Δt, s) ln Q_θ(X_Δt|s) dX_Δt ds    (3.1)

where θ indicates the parameters of the neural network. To understand how this objective function L̂ depends on the predictive time delay Δt, we need to analyze the Δt-dependence of P(X_Δt|s) and P(X_Δt, s). To do so we start with a master equation framework for the propagator in the high-dimensional space X, where we assume Markovianity holds:

P(X₂, t+Δt | X₁, t) = δ(X₂ − X₁)[1 − a(X₁, t)Δt] + W_t(X₂|X₁)Δt + O[(Δt)²]    (3.2)

Here W_t(X₂|X₁) is the transition probability per unit time from X₁ to X₂ at time t, and a(X₁, t) = ∫ W_t(X₂|X₁) dX₂. At this point we make two assumptions: (a) the distribution is stationary, so P(X₂, t+Δt | X₁, t) can be denoted P(X_Δt|X), and (b) Δt is small enough that higher-order terms in Eq. 3.2 can be ignored, giving:

P(X_Δt|X) = δ(X_Δt − X)[1 − a(X)Δt] + W(X_Δt|X)Δt    (3.3)

In addition, as a direct consequence of the Markovianity assumption in X space, the following property must hold:

P(X_Δt|s, X) = P(X_Δt|X)    (3.4)

This property reflects that X contains all the information needed to predict X_Δt, and that additionally using knowledge of s cannot improve the quality of the prediction. With this additional assumption and the use of Eq. 3.3, we have:

P(X_Δt, s) = ∫ P(s, X) P(X_Δt|s, X) dX = P(s, X_Δt)[1 − a(X_Δt)Δt] + Δt ∫ P(s, X) W(X_Δt|X) dX    (3.5)

By rearranging Eq. 3.5, P(X_Δt, s) can be expressed in terms of W(X_Δt|X) and P(s, X):

P(X_Δt, s) = ∫ P(s, X) W(X_Δt|X) dX / a(X_Δt) = ∫ P(s, X) W(X_Δt|X) dX / ∫ W(X′_Δt|X_Δt) dX′_Δt    (3.6)

In Eq. 3.6, both the numerator and the denominator depend on the predictive time-delay Δt only through the transition probability per unit time W, which per construction does not depend on Δt. As such, P(X_Δt, s) does not depend on Δt. As a direct consequence, the PIB estimated by maximizing L̂ in Eq. 3.1 should likewise not depend on the choice of Δt, as long as Δt is small enough that the master equation formalism of Eq. 3.3 is valid.

3.3 Correcting the propagator and the information bottleneck when using biased trajectories

So far the formalism assumes we have access to the various unbiased estimates needed to maximize L̂ in Eq. 3.1. However, since the whole framework of RAVE is based on iterations between machine learning and progressively biased MD, we need to be able to construct the PIB from biased trajectories as well. I now address how to do so, and develop a corrected form of Eq. 3.1. In Sec. 2.4 we worked with biased trajectories by making the assumption:

Pbiased(X^{n+k}|s^n) ≈ Punbiased(X^{n+k}|s^n)    (3.7)

which led to the following L̂ for a trajectory of length N + k:

L̂ = {Σ_{n=1}^N e^{βV^n}}^{−1} Σ_{n=1}^N e^{βV^n} ln Q(X^{n+k}|s^n)    (3.8)

Here Δt equals the time elapsed in k MD steps, and Q(X^{n+k}|s^n) is the probabilistic mapping given by the approximate decoder to approximate the real posterior probability P(X^{n+k}|s^n). The factor e^{βV^n} is a reweighting factor correcting for the bias in the stationary probability density estimate. The need for an approximate decoder arises from the principle of variational inference, wherein L is bounded from below by L̂, and L̂ reaches its maximum if and only if Q(X^{n+k}|s^n) = P(X^{n+k}|s^n). This guarantees that maximizing L̂ will also force the approximate decoder to be as close to the real posterior probability as possible.
Thus, in the limit that the neural network decoder is flexible enough to approximate the real decoder exactly, optimizing L and L̂ will give the same encoder. As such, we can assume the decoder is exact when deriving the correction for the propagator. Eq. 3.8 corrects for only one of the two biased aspects of the trajectory: (i) it reweights out the effect of the bias on the sampling probability at any given moment of time; (ii) it however ignores the effect of the bias on the short-term dynamical evolution of the system, as it assumes Eq. 3.7. Note that in the limit Δt → 0, Eq. 3.7 becomes exact. But as Δt increases, Pbiased(X_Δt|s) gradually deviates from Punbiased(X_Δt|s) due to the presence of the bias. To correct for the bias in Pbiased(X_Δt|s), we first consider the relationship between the transition probabilities in biased and unbiased MD. By discretizing the Smoluchowski equation along X, the transition probability can be written as [103, 104]:

W(X_Δt|X) = [D(X_Δt) + D(X)] / (2‖X_Δt − X‖²) · exp{−β[U(X_Δt) − U(X)]/2}    (3.9)

where U(X) is the equilibrium, unbiased free energy and D is the diffusion coefficient. In biased MD with a bias V(s(X)) applied as a function of a putative RC s, the underlying free energy is replaced by U(X) + V(s(X)). Note that the bias is a function of the RC s, which itself is a well-defined function of the basis functions or order parameters X that depend on the atomic coordinates, so the biasing force added to the atoms is continuously differentiable. Assuming that the diffusivity itself stays unchanged by the addition of the bias (a common assumption in dynamical reweighting algorithms such as Ref. [105]), we can write down a relation between the biased and unbiased transition probabilities in terms of the bias:

W_b(X_Δt|X) = √(e^{βV(s(X))} / e^{βV(s(X_Δt))}) W_u(X_Δt|X)    (3.10)

where the subscripts u and b denote unbiased and biased measurements respectively. Eqs. 3.3 and 3.10 together provide the relationship between P_b(X_Δt|X) and P_u(X_Δt|X):

P_b(X_Δt|X) = √(e^{βV(s(X))} / e^{βV(s(X_Δt))}) P_u(X_Δt|X)   if X_Δt ≠ X    (3.11)

P_b(X_Δt|X) = 1 − ∫_{X′_Δt ≠ X} P_b(X′_Δt|X) dX′_Δt   if X_Δt = X    (3.12)

Finally, we can write down the bias-corrected expression for the objective function L̂:

L̂_c = (c/N) Σ_{n=1}^N e^{βV^n} ln Q(X^n|s^n) − (c/N) Σ_{n=1}^N e^{β(V^{n+k}+V^n)/2} ln Q(X^n|s^n) + (c/N) Σ_{n=1}^N e^{β(V^{n+k}+V^n)/2} ln Q(X^{n+k}|s^n)    (3.13)

where c is an irrelevant constant independent of the neural network parameters. The last term in Eq. 3.13 is similar to Eq. 3.8, which corrects only for the sampled points and not for their dynamical propagation. The sum of the first two terms can be interpreted as the correction for the propagator. To gain some intuition, consider the effect of the biasing potential on the dynamics. It pushes the system from high-bias regions to low-bias regions, thereby spuriously increasing the conditional probability of being found in a low-bias region given that the system was earlier in a high-bias region. We thus want to reweight out this effect and learn only from the dynamics induced by the original potential, which is what Eq. 3.13 achieves. If the trajectory goes from a state X^n with higher bias V^n to a state X^{n+k} with lower bias V^{n+k}, the combined coefficient of ln Q(X^n|s^n) from the first two terms of Eq. 3.13 will be positive. In this regime, the first two terms together encourage the decoder to generate features close to X^n, while the third term favors the decoder generating features closer to X^{n+k}. This negates the spurious enhancement of the probability of X^{n+k} due to the biasing profile. A similar argument applies in the case V^{n+k} > V^n.
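A sketch of how Eq. 3.13 changes the training loss relative to Eq. 3.8; the decoder log-likelihood terms are assumed to come from a model like the one sketched in Sec. 2.3.1, and all names are illustrative:

```python
# Corrected objective of Eq. (3.13): a static reweighting term plus two terms
# that reweight the short-time propagator with the half-sum bias exponent.
import torch

def corrected_objective(log_q_now, log_q_future, bias_now, bias_future, beta):
    """log_q_now = ln Q(X^n | s^n); log_q_future = ln Q(X^{n+k} | s^n)."""
    w_static = torch.exp(beta * bias_now)                     # e^{beta V^n}
    w_dyn = torch.exp(0.5 * beta * (bias_now + bias_future))  # propagator term
    # maximize: static reweighting plus the propagator correction
    return (w_static * log_q_now
            - w_dyn * log_q_now
            + w_dyn * log_q_future).mean()
```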
I conclude this section by highlighting that Eq. 3.13 is a rough approximation valid for short Δt, and it cannot completely correct the effect of the bias on the dynamics. In the following examples I discuss the advantages and limitations of this new formula, and also analyze how valid the intuitive approximation Eq. 3.7, made in Sec. 2.4, really was.

3.4 Results

I now demonstrate the nature and effect of the corrections derived in the previous section through a simple illustrative numerical example with exact results available for the true reaction coordinate and its free energy. More specifically, I generated an unbiased trajectory on a model system and applied RAVE iterations to it. This system is simple enough that we could generate a long unbiased trajectory with sufficient sampling of the different metastable states. RAVE calculations on this well-sampled unbiased trajectory serve as our benchmark. In parallel, I add a biasing potential along the relevant slow mode of the system to generate a biased trajectory. Both of these trajectories are then subjected to the different treatments proposed in Sec. 3.3, as explained in the remainder of this section, in order to judge how much of a difference our corrections really make.

3.4.1 System set-up

The model system in this work has three degrees of freedom and a governing potential energy U given by:

U(x, y, z) = 6(x² − 1)² + 0.0375[(y − z)² − 8]² + 45(y + z)²    (3.14)

Fig. 3.1 shows the free energy profile along each degree of freedom and the corresponding trajectory from evolving Langevin dynamics[106] with an integration time step of 0.01 units at temperature k_BT = 1. In agreement with the significantly higher energy barrier along x, transitions along x happen much less frequently, and x accurately represents the dominant slow mode in this system. In this work I focus on the difference between optimizing Eq. 3.8 and Eq. 3.13. Therefore, I perform the biased MD by using the slow mode x as the reaction coordinate and constructing the bias potential on the basis of its analytical free energy. The bias potential is constructed with a maximum bias of 3 k_BT. The sum of the potential energy of the system and the bias potential is shown as a dashed line in Fig. 3.1(a), and the biased MD trajectory can be found in the supplementary information.

Figure 3.1: Free energy profile along each degree of freedom of the model system ((a), (c) and (e) respectively) and the corresponding trajectories from Langevin dynamics ((b), (d) and (f) respectively). In (a) the sum of the underlying and bias potentials is indicated with a dashed line.
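For concreteness, a minimal sketch of overdamped Langevin (Brownian) dynamics on the potential of Eq. (3.14), assuming a unit diffusion constant; the integrator details of the actual study may differ:

```python
# Euler-Maruyama integration of overdamped Langevin dynamics on the model
# potential U(x, y, z) of Eq. (3.14), with kBT = 1 and time step 0.01.
import numpy as np

kBT, dt, n_steps = 1.0, 0.01, 200000
rng = np.random.default_rng(0)

def grad_U(r):
    x, y, z = r
    dUdx = 24.0 * x * (x**2 - 1.0)                 # d/dx of 6(x^2 - 1)^2
    common = 0.15 * ((y - z)**2 - 8.0) * (y - z)   # d/dy of 0.0375[(y-z)^2-8]^2
    dUdy = common + 90.0 * (y + z)
    dUdz = -common + 90.0 * (y + z)
    return np.array([dUdx, dUdy, dUdz])

r = np.array([1.0, 1.4, -1.4])        # start near one metastable basin
traj = np.empty((n_steps, 3))
for i in range(n_steps):
    # drift down the gradient plus thermal noise
    r = r - grad_U(r) * dt + np.sqrt(2.0 * kBT * dt) * rng.standard_normal(3)
    traj[i] = r
```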
3.4.2 RAVE set-up

The input to RAVE comprised the three-dimensional time series of {x, y, z} as obtained from Langevin dynamics. The whitening procedure from Ref. [107] was used to reduce artifacts from high variance. The RC or PIB was learnt as a linear combination of {x, y, z}, thus characterized by the corresponding three weights. Due to the possibly non-convex and inherently stochastic nature of the optimization, there is no guarantee that a single training run gives us the most optimal RC. Indeed, here we trained 16 neural networks with randomly initialized weights and found significant run-to-run variation in the output weights for {x, y, z}. Out of the 16 choices we selected the run that gave the smallest loss. In practice, such a high number of trial runs is not needed, and a smaller number might very well be sufficient.

3.4.3 Calculated RC under different time-delays and approximation schemes

We use the unbiased and the biased trajectories as input to RAVE with different values of the predictive time-delay Δt. Fig. 3.2 shows the weights of x, y and z in the RC as obtained for these different set-ups. For the long, well-sampled unbiased trajectory, I use Eq. 3.8 setting V^n = 0 (Fig. 3.2(a)). This serves as a benchmark. For the biased trajectory as well, we first use Eq. 3.8 setting V^n = 0 (Fig. 3.2(b)); this illustrates the effect of biasing on the different modes. Next, for the biased trajectory, I use Eq. 3.8 taking the bias into account for reweighting the stationary probability, but not the propagator (Fig. 3.2(c)). Finally, I use the biased trajectory correcting for the effect of bias on the stationary density as well as the propagator by making use of Eq. 3.13 (Fig. 3.2(d)).

Figure 3.2: Reaction coordinates learned by RAVE as functions of predictive time delay. (a) RC learned from an unbiased trajectory using Eq. 3.8 setting V^n = 0. (b) RC learned from a biased trajectory using Eq. 3.8 with V^n = 0. (c) RC learned from a biased trajectory with Eq. 3.8, which takes the bias into account for reweighting the stationary probability, but not the propagator. (d) RC learned from a biased trajectory correcting for the effect of bias on the stationary density as well as the propagator by making use of Eq. 3.13.
I prove that the RC learned from RAVE should not depend on the choice of time delay, as long as time delay is small enough that the master equation formalism of Eq.3.3 is valid. This explains why in 2.4, the RC converged quickly as a function of predictive time delay. Also, by introducing the correction for the transition probability in biased MD, I derive a new objective function, which not only reweights the static distribution as in Sec. 2.4, but also gives a better estimation of the transition probabilities. I find that apart from reducing 64 the number of outlier solutions, the correction does not significantly improve upon the intuitive approximation introduced in Sec.2.4. It remains to be seen if this holds true in more complex systems, but given that the correction Eq. 3.13 involves no computational overhead relative to Eq.2.6, I will use it as a default in the following chapters. 65 Chapter 4: Interrogating RNA-small molecule interactions with RAVE 4.1 Introduction RNA regulates diverse cellular processes, far beyond coding for protein sequence or facilitating protein biogenesis[3]. In this role, RNA can interact with protein[108] or small molecule factors [109, 110] to directly or indirectly impact gene expression and cellular homeostasis[4]. For example, riboswitches are sequences found primarily in bacterial mRNAs that contain aptamers for small molecules and regulate transcription or translation[111]. In humans, mutations in noncoding RNAs can directly impact neurodegenerative diseases and cancer[112, 113]. The regulatory role of bacterial riboswitches, coupled with the broader link between RNA and human disease has also led to interest in RNA as a therapeutic target for small molecules[114, 115, 116]. However, in comparison to proteins, methods to computationally and experimentally interrogate RNA structure are relatively less mature[117, 118]. Thus, more robust tools are needed to better understand RNA structure, dynamics, and recognition of small molecules. In spite of its clear relevance to fundamental science and drug discovery efforts, the highly dynamic nature of RNA makes it difficult to study due to the diverse conformational ensembles it adopts.[119, 120, 121] Luckily, riboswitches represent a valuable model for RNA-small molecule recognition. When ligands bind to riboswitch aptamers, the RNA undergoes 66 a conformational change that modulates gene expression[122, 123]. The cognate ligand for a riboswitch is typically associated with its upstream gene, often part of the ligand?s biosynthetic pathway.[123, 124] This feedback mechanism, paired with well-established structural analyses, make riboswitches excellent models for RNA-ligand binding since the target ligand is known and often highly specific[123]. Although we can use riboswitches to model RNA-small molecule interactions, visualizing riboswitch-ligand dissociation pathways with high spatiotemporal resolution remains a challenge for in vitro and in silico experiments. Such a femtosecond and all- atom resolution is hard to directly achieve in in vitro experiments, while the associated timescales are several orders of magnitude too slow for molecular dynamics (MD) simulations performed even on the most powerful supercomputers. Previously, Ref. [125] used Go?-model simulations of PreQ1 riboswitches, specifically transcriptional Bacillus subtilis (Bs) and translational Thermoanaerobacter tengcongensis (Tt) aptamers. 
Their work suggests that in the respective cases the ligand binds at late and early stages of riboswitch folding, indicating that perhaps these riboswitches respectively fold via mechanisms of conformational selection and induced fit. Since Ref. [125], RNA force-fields have become more detailed and accurate, but the timescales involved in RNA-small molecule dissociation continue to stay beyond reach[126]. Encouraged by the success of applying RAVE to the two benchmark systems, I explore the possibility of using RAVE to better understand the ligand-riboswitch dissociation process. Particularly, in my collaboration with experimentalists, we combine (i) enhanced sampling methods combining statistical physics with artificial intelligence (AI)[47, 77, 101, 127] and (ii) experimental techniques to measure RNA flexibility and ligand binding[128, 129, 130].

Figure 4.1: Conformational change of the PreQ1 riboswitch and chemical structures of PreQ1 and synthetic dibenzofuran ligands.

The computational sampling techniques allow us to study the dissociation process in an accelerated but controlled manner, while Selective 2′-Hydroxyl Acylation analyzed by Primer Extension and Mutational Profiling (SHAPE-MaP), microscale thermophoresis (MST) and a fluorescence intensity assay (FIA) provide a rigorous validation of the computational findings. Specifically, I study the Tt-PreQ1 riboswitch aptamer interacting with its cognate ligand PreQ1 (7-aminomethyl-7-deazaguanine) and a synthetic ligand (2-[(dibenzo[b,d]furan-2-yl)oxy]-N,N-dimethylethan-1-amine)[131]. See Fig. 4.1 for an illustration of the riboswitch and the chemical structures of the two ligands.

4.2 Ligand binding induces flexibility change in riboswitch aptamer

The first test is to compare the nucleotide-specific flexibility from all-atom simulations to that from SHAPE-MaP measurements. I perform 2 μs of unbiased MD simulation (four independent 500 ns runs for each ligand) for the PreQ1 riboswitch bound to the cognate and to the synthetic ligand respectively. See the Appendix for details of the simulation set-up. While these simulations are not long enough to observe ligand dissociation, they can still give a robust measure of how the flexibility varies between the two different ligand-bound systems and provide nucleotide-level insight into how specific ligand binding induces different conformational changes in the riboswitch. Specifically, I compare the fluctuations of the distances between consecutive C2 atoms obtained from these simulations with the SHAPE-MaP measurements. Our experimental collaborators performed SHAPE-MaP analyses using a PreQ1 riboswitch construct consisting of the entire riboswitch (both the aptamer domain and the expression platform). Probing was performed using a recently reported SHAPE reagent, 2A3, that provides improved mutagenesis rates during reverse transcription[132]. I analyze data in the absence of any ligand, and in the presence of either the cognate ligand (PreQ1) or the synthetic ligand. The results are in good agreement with previous studies reporting SHAPE-MaP on PreQ1 riboswitches using different SHAPE reagents and a slightly shorter construct[133]. As discussed previously[134], base pairing interactions are the most important determinant of RNA structural constraints, and thus base-base distance fluctuations are a crucial metric of RNA backbone flexibility. Indeed, as shown in Fig.
4.2, we obtain good agreement between the C2-C2 distances from unbiased MD and the experimentally measured SHAPE flexibilities for all 3 systems: the ligand-free PreQ1 aptamer, the cognate-ligand-bound PreQ1 aptamer and the synthetic-ligand-bound aptamer. I would like to note that though unbiased MD simulations capture most of the fluctuations measured by SHAPE, some of the peaks shown by SHAPE are not observed in MD, which might be due to the timescale limitations of unbiased simulations or the kinetic regime measured by SHAPE reactivity. In addition to the agreement between MD and SHAPE, Fig. 4.2 also shows that, for all three systems, nucleotides in the ranges A13-C15, U21-A24, and A32-G33 are more flexible than other regions of the riboswitch[133]. The agreement between the MD and SHAPE approaches provides solid validation that our simulations using classical force-fields[135] accurately model the behavior of complex RNAs.

Figure 4.2: SHAPE reactivities for the Tt PreQ1 riboswitch aptamer obtained using the 2A3 reagent (red circles joined by solid lines) are compared with the fluctuations of C2-C2 distances (blue triangles joined by dashed lines) for the PreQ1 riboswitch (a) in the absence of ligand, (b) bound to cognate ligand, and (c) bound to synthetic ligand. Pearson correlation coefficients R are shown in the upper left corners.

4.3 Ligand binding affinity with riboswitch aptamer

Next I aim to calculate the free energy profiles and reaction coordinates for ligand dissociation from PreQ1. As these timescales are far beyond unbiased MD, here I use RAVE to learn the dominant slow degrees of freedom and simulate the whole dissociation event. I use AMINO[101] as a pre-processing method to reduce the number of order parameters (OPs) fed to RAVE. AMINO reduces the redundancy of the OP set through clustering: it uses a mutual information-based distance metric to quantify the distances between OPs, i.e., OPs that share similar information are closer to each other. With this distance metric, k-medoids clustering is used to group the OPs into different clusters, and the OP at the center of each cluster is selected as output to represent the other OPs in the same cluster. The number of clusters is determined by measuring the rate-distortion function. To describe the position of the ligand relative to the riboswitch, I choose the pairwise distances between each ligand heavy atom and the center of mass of the ribose of each individual nucleotide. Four 500 ns unbiased trajectories are used as input to learn the reduced set of OPs. The reduced sets of OPs for the riboswitch bound with the cognate ligand and with the synthetic ligand are indicated in Fig. 4.3.

Figure 4.3: Weights for different order parameters (OPs) learned by RAVE in the corresponding RC for the riboswitch aptamer with (a) cognate ligand and (b) synthetic ligand. The OPs are named in the format "ligand atom-residue name". Ligand structures and atom names are shown in the insets.

After four rounds of such iterations, I obtain converged RCs for both systems, expressed as linear combinations of different riboswitch-ligand heavy atom contacts. Fig. 4.3 shows the different contacts and their weights as obtained from the RAVE simulations. It is interesting to note that the RC for the cognate ligand has many more components than the RC for the synthetic ligand, highlighting differences between the molecules.
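To make the OP construction described above concrete, the following is a minimal sketch of how the ligand heavy atom to ribose center-of-mass distances could be assembled with MDAnalysis. The file names and the ligand residue name LIG are illustrative assumptions; this is a sketch of the idea, not the exact script used in this work.

```python
import numpy as np
import MDAnalysis as mda

# Hypothetical topology/trajectory names; LIG is an assumed ligand resname.
u = mda.Universe("riboswitch_ligand.pdb", "traj.xtc")
ligand_heavy = u.select_atoms("resname LIG and not name H*")
nucleotides = u.select_atoms("nucleic").residues

def op_frame():
    """Distances between every ligand heavy atom and the ribose
    center of mass of every nucleotide, for the current frame."""
    coms = np.array([res.atoms.select_atoms("name C1' C2' C3' C4' O4'")
                     .center_of_mass() for res in nucleotides])
    return np.linalg.norm(ligand_heavy.positions[:, None, :]
                          - coms[None, :, :], axis=-1).ravel()

ops = np.array([op_frame() for ts in u.trajectory])  # (n_frames, n_OPs)
```

This flat OP matrix is what a preselection method like AMINO would then prune before the surviving OPs are passed to RAVE.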
Most of the contacts in the RC are also present in the crystal structures (PDB 6E1W, 6E1U), except carbon 5-U2 and carbon 5-A23 for the cognate ligand and carbon 1-A23 for the synthetic ligand (ligand atom numbering corresponds to the numbering in the PDB files). I then used this optimized RC and ran 10 independent biased MD simulations, each using well-tempered metadynamics[47], to sample the free energy profiles of the systems. The probability distribution was estimated by using the reweighting technique introduced in Ref. [136].

Figure 4.4: Free energy profiles of the riboswitch with different ligands. Coordination number 0 (see Eq. 4.1) indicates the ligand fully dissociated from the riboswitch. The error bars are shown by the highlighted colors. The approximate range of coordination number corresponding to the unbound pose is highlighted in light green (C ≤ 5.6).

In Fig. 4.4 I show the free energy profiles as functions of the ligand-riboswitch coordination number for both systems. Note that for both systems there were 2 predominant pathways, as indicated through arrows in Fig. 4.5. A key difference between the two paths is that in the first path, the ligand unbinds through the space between stem 2 and the backbone connecting stem 1 and loop 1 (Fig. 4.5, blue arrows). In the second path, the ligand unbinds through the space between loop 2 and stem 2 (Fig. 4.5, red arrows). In order to compare the binding free energies of the two systems, I combine the information from both pathways by plotting the free energy as a function of the shared ligand-riboswitch coordination number C. During the dissociation process, the coordination number gradually decreases to 0 irrespective of which path is adopted. We can define the coordination number C as

C = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \frac{1}{1 + (r_{ij}/r_0)^4}    (4.1)

where N and M are the number of ligand and riboswitch atoms respectively, r_{ij} is the distance between ligand atom i and riboswitch atom j, and r_0 is set to 3 Å. It can be seen from Fig. 4.4 that when bound to the riboswitch, the cognate ligand has a higher C than the synthetic ligand does, illustrating that the synthetic ligand makes overall fewer contacts per ligand atom with the riboswitch, as also observed in the crystal structures (PDB 6E1W and 6E1U). Though I was not able to simulate the rebinding of the ligands to get a converged free energy profile, the probability distributions used to construct these free energies are obtained by averaging over multiple independent dissociation events. This leads to the relatively small error bars shown in Fig. 4.4, which indicate that we can use the free energy profile to give a rough estimate of the binding affinity. By considering C ≤ 5.6 as the unbound states, we can calculate that the free energy difference between the bound and unbound states is 2-4 kcal/mol higher for the cognate ligand than for the synthetic ligand. This is in quantitative agreement with relative affinity measurements (3.0 ± 1.4 kcal/mol) through fluorescence titrations as reported in Ref. [131].

Figure 4.5: Dissociation pathways for (a)-(b) the cognate ligand and (c)-(d) the synthetic ligand. Stem 1 (S1), stem 2 (S2), loop 1 (L1), loop 2 (L2), and loop 3 (L3) are colored green, cyan, magenta, gray, and orange respectively. During the dissociation processes, the ligands at different time frames are colored from white to dark blue and from white to red, for the top pathway and bottom pathway respectively.
The dissociation pathways are indicated with arrows. The predicted mutation sites U22 and A32 are highlighted in red and blue respectively.

4.4 Predicting critical nucleotides for mutagenesis experiments

Mutations in noncoding RNA can have disastrous consequences on the human body, leading to diseases such as cancer, dementia, and Alzheimer's[137]. An explosion of research exploring the functionality, sequencing, and regulation of noncoding RNA has been published in recent years[138, 139]; however, uncovering the specific nucleotides involved in target-ligand binding remains an open problem. In addition to its fundamental biochemical relevance, the discovery of critical nucleotide mutations can also shift the focus of drug design from small molecule therapeutics to highly specific, target-based therapies with lower side effect profiles and higher efficacy[140].

Figure 4.6: Critical nucleotide prediction. Change in RMSD for nucleotides U22 and A32 during the dissociation process, relative to the RMSD variation in the apo state. RMSD profiles as functions of coordination number are shown by the blue dashed line and the orange solid line for the systems with the cognate ligand and the synthetic ligand respectively. The change in RMSD is calculated by subtracting the mean RMSD of the corresponding nucleotide in the apo system from the RMSDs measured in the biased simulations.

We are interested not just in nucleotides we could have predicted from looking at the static structures of the ligand-bound complexes, but also in nucleotides distal from the binding site. Such nucleotides might play an essential role in determining the structural ensemble of the riboswitch, so changing their properties can alter the interaction between the ligand and other nucleotides. Such critical nucleotides are hard to predict from a purely structural perspective, but our all-atom resolution dissociation trajectories are ideally poised to make such predictions. I analyze the 10 independent dissociation trajectories obtained for each of the synthetic-ligand-bound and cognate-ligand-bound systems to predict critical nucleotides in the riboswitch that play a role in the dissociation process. Predictions made in this section are then validated through the mutation experiments reported in Sec. 4.5. From the collection of dissociation trajectories, we can observe that dissociation mainly happens through two pathways (Fig. 4.5), and that for the cognate ligand, 7 out of 10 simulations went through the top/blue pathway, while for the synthetic ligand, 6 out of 10 simulations went through the bottom/red pathway. To more rigorously quantify the role played by different nucleotides, I monitor the change in the RMSD of every nucleotide during the dissociation process, relative to its behavior in the bound complex. My motivation here is that the nucleotides showing the greatest relative movement during the dissociation process are the ones most likely to be impacted once mutated. As shown in Fig. 4.7, every nucleotide's RMSD changes with the coordination number C, which captures the extent of dissociation. These changes indicate the different conformations the riboswitch adopts during ligand dissociation. Through this procedure I identify U22 and A32, highlighted in Fig. 4.5, to be two such critical nucleotides for the cognate and synthetic ligands respectively. Fig. 4.6 shows the changes in the RMSDs of these two selected nucleotides during the dissociation process.
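The RMSD-versus-coordination-number analysis above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming frames have already been aligned on the riboswitch and extracted into plain coordinate arrays; it is not the exact analysis script used here.

```python
import numpy as np

def coordination_number(lig_xyz, rna_xyz, r0=3.0):
    """Eq. 4.1: C = (1/N) sum_i sum_j 1 / (1 + (r_ij/r0)^4), r in Angstrom.
    lig_xyz: (N, 3) ligand heavy atoms; rna_xyz: (M, 3) riboswitch atoms."""
    rij = np.linalg.norm(lig_xyz[:, None, :] - rna_xyz[None, :, :], axis=-1)
    return np.mean(np.sum(1.0 / (1.0 + (rij / r0) ** 4), axis=1))

def rmsd_vs_C(lig_traj, rna_traj, nuc_traj, nuc_ref, bins):
    """Bin a nucleotide's RMSD (vs. its bound-pose reference) by the
    coordination number C, tracing its motion along dissociation."""
    C = np.array([coordination_number(l, r)
                  for l, r in zip(lig_traj, rna_traj)])
    rmsd = np.sqrt(((nuc_traj - nuc_ref) ** 2).sum(-1).mean(-1))  # per frame
    idx = np.digitize(C, bins)
    return np.array([rmsd[idx == k].mean() for k in range(1, len(bins))])
```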
For U22, in the RAVE simulations the RMSD increases much more for the cognate-ligand-bound system than for the synthetic-ligand-bound system. However, the reverse is found to be true for the nucleotide A32, which displays more enhanced movement during the dissociation of the synthetic-ligand-bound complex relative to the cognate-ligand-bound complex. On the basis of the above observations, we can predict that for the cognate ligand, mutating U22 should display more tangible effects on the strength of the complex than mutating A32. For the synthetic ligand, we predict the opposite trend: mutating A32 should have a more pronounced effect on the interaction relative to mutating U22.

4.5 Validating predicted critical nucleotides through mutagenesis experiments

In Sec. 4.4 I predicted, on the basis of the dissociation trajectories, that mutating nucleotides U22 and A32 will have differing effects on the cognate-ligand-bound and synthetic-ligand-bound systems. Also, we would like to note that, as we can see from Fig. 4.6, the RMSD changes appear in the range of coordination number that corresponds to the bound state. These changes imply a role for these nucleotides in determining the bound state interactions and suggest that the effects of the mutations can be reflected in changes in binding affinity.

Figure 4.7: The change in RMSD (ΔRMSD, in Å) of each nucleotide (C1 through G33) during the dissociation process. RMSD as a function of coordination number is shown by the blue dashed line and the orange solid line for the systems with the cognate ligand and the synthetic ligand respectively.

Here I report in vitro tests of our predictions. As shown in Fig. 4.8, our experimental collaborators measured the equilibrium dissociation constant KD for six different complexes, namely the two wild-type complexes, U22A-cognate ligand, U22A-synthetic ligand, A32U-cognate ligand, and A32U-synthetic ligand. To experimentally test whether the A32U and U22A mutations impacted KD values, we used microscale thermophoresis (MST).

Figure 4.8: Affinity measurement of PreQ1 (cognate) and dibenzofuran (synthetic) ligands binding to the WT Tte riboswitch and the U22A and A32U mutants. MST of 5′-Cy5-labelled (a) WT Tte, (b) U22A, and (c) A32U mutants in the presence of the PreQ1 ligand. FIA of 5′-Cy5-labelled (d) WT Tte, (e) U22A, and (f) A32U mutants in the presence of dibenzofuran. Error bars represent the standard deviation of three replicate experiments.

KD values were measured using fluorescently labeled wild type, A32U, and U22A PreQ1 riboswitch aptamer constructs for the cognate ligand. For PreQ1, we observed a KD of 11.7 ± 0.35 nM for the WT construct.
In contrast, the A32U mutant had a KD of 22.5 ± 0.42 nM, while the U22A mutant had a KD of 49.4 ± 1.5 nM. For the synthetic ligand, changes were observed in the fluorescence of the labeled RNA upon incubation with the ligand, and thus a fluorescence intensity assay (FIA) was used to measure KD values rather than MST. Using FIA we observed a KD of 1.6 ± 0.7 μM for the wild type aptamer. In contrast, both the U22A and A32U mutant constructs displayed significantly weaker KD values (150 ± 1.3 μM and ≥ 100 μM respectively). These KD measurements directly validate the findings from Sec. 4.4. There we had predicted that mutating U22 would affect the cognate-ligand-bound system more than mutating A32, and indeed we find that the KD for U22A is about 5 times weaker than WT, while only about 2 times weaker for A32U. At the same time, we had predicted that mutating A32 would affect the synthetic-ligand-bound system more than mutating U22, and indeed we find that while the KD for U22A is about 75 times weaker than WT, there is effectively no tangible association at all for A32U with the synthetic ligand.

The in silico experiments also support the conclusion that the mutations alter the bound state ensemble, and give further insight into how the bound-state interactions change after the mutations. I performed MD simulations with the mutated systems to validate that the mutations change the bound state ensemble of both systems. I first used PyMOL[141] to generate the mutants U22A and A32U for the riboswitch with the cognate or synthetic ligand. I then performed four independent unbiased simulations for each system, each 500 ns long. The structures were saved every 10 ps. In the 500 ns simulations, the ligands stay in the bound pose. The statistics of the samples from these simulations can be used to illustrate the influence of the mutations on the bound state ensemble. Fig. 4.9 shows the free energy profiles of these systems and clearly indicates the changes in the bound state ensemble brought about by the mutations U22A and A32U. I also study the hydrogen bond occupancy between the ligand and the nucleotides of the riboswitch in the 2 μs of unbiased MD simulation. This captures the bound state ensemble better than just the crystal structure and shows how the mutations, even though distal, disrupt the hydrogen bonding patterns. The number of hydrogen bonds is calculated by VMD[142] with an angle cutoff of 30 degrees and a distance cutoff of 3.5 Å. The hydrogen bond occupancy is defined as (\sum_{i=1}^{n} N_i)/n, where n is the total number of frames and N_i is the number of hydrogen bonds in frame i. Fig. 4.10 shows that different hydrogen bonds between ligand-nucleotide pairs are formed in the bound state ensemble before and after the mutations. For example, the hydrogen bonds between the cognate ligand and C16 disappeared after the mutation, and the hydrogen bonds between the cognate ligand and A28 mainly showed up in the A32U mutant. Also, for the synthetic ligand, the hydrogen bond occupancy with U6 largely increases after the mutations while that with G4 decreases. So even though nucleotides 22 and 32 do not directly form hydrogen bonds with either ligand in the bound pose as seen in the crystal structure, interactions between the ligands and other nucleotides are still altered when they are mutated.

Figure 4.9: Free energy profiles from 2 μs of unbiased simulation for the wild type and mutated systems with (a) cognate ligand and (b) synthetic ligand. The error bars are indicated by the shaded region around each curve.
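As an illustration of the occupancy calculation just described, here is a minimal NumPy sketch of the geometric hydrogen bond criterion (donor-acceptor distance below 3.5 Å, donor-hydrogen-acceptor angle within 30 degrees of linear) applied to pre-extracted coordinate arrays; VMD's internal bookkeeping may differ in detail, so treat this as a sketch rather than a reimplementation.

```python
import numpy as np

def hbond_occupancy(donor, hydrogen, acceptor, r_cut=3.5, ang_cut=30.0):
    """Occupancy = (sum_i N_i) / n over n frames, where N_i counts the
    donor-acceptor pairs satisfying the geometric criterion in frame i.
    donor/hydrogen/acceptor: (n_frames, n_pairs, 3) coordinate arrays."""
    da = np.linalg.norm(acceptor - donor, axis=-1)        # D...A distance
    v1 = donor - hydrogen                                 # H -> D vector
    v2 = acceptor - hydrogen                              # H -> A vector
    cos = (v1 * v2).sum(-1) / (np.linalg.norm(v1, axis=-1)
                               * np.linalg.norm(v2, axis=-1))
    dha = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))  # D-H-A angle
    bonded = (da < r_cut) & (dha > 180.0 - ang_cut)       # near-linear bonds
    return bonded.sum(axis=-1).mean()   # average H-bond count per frame
```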
Figure 4.10: Hydrogen bond occupancy for the wild type and mutated systems calculated from the 2 μs long unbiased MD simulations with (a) cognate ligand and (b) synthetic ligand. The labels on the x-axis define the hydrogen bond pairs in the format "donor-acceptor".

4.6 Discussion

In this chapter, I showed how one can obtain detailed and robust insights into RNA structural dynamics through AI-augmented molecular simulations that maintain all-atom resolution for the RNA, ions and all water molecules. All my predictions are validated through a gamut of complementary experimental techniques that measure RNA flexibility and ligand binding. I specifically focused on the Tt-PreQ1 riboswitch aptamer system in its apo state and bound to two different ligands, where I first showed that 2 μs long unbiased MD simulations with classical all-atom force-fields for RNA and water can provide the same flexibility profile as measured in SHAPE experiments. Since the timescales for ligand dissociation are far slower than MD can typically access, I used the RAVE simulation method[77], which enhances sampling along a low-dimensional reaction coordinate learned on-the-fly through the past-future information bottleneck framework. Through RAVE we obtain multiple independent dissociation trajectories for both the cognate- and synthetic-liganded systems, demonstrating how the cognate and synthetic ligands invoke different aspects of the riboswitch's flexibility in order to dissociate. On this basis, I was able to predict pairs of mutations that would be expected to show contrasting behaviors for the cognate and synthetic ligands. The ability to make such predictions accurately and efficiently is of vital importance, as mutations that can disrupt RNA structural dynamics and ligand recognition processes have been linked to numerous human diseases[143, 144]. My predictions for the Tt-PreQ1 aptamer were validated by designing mutant aptamers and performing affinity measurements. This work thus demonstrates a pathway to gain predictive insights, as opposed to retrospective validation, from enhanced molecular dynamics RAVE simulations that maintain femtosecond and all-atom resolution while directly observing processes as slow as ligand dissociation. While the focus in this work was on predicting the thermodynamically averaged aspects of riboswitch flexibility, in future work it should also be possible to directly calculate rate constants using complementary approaches, and for more complicated RNA systems. More broadly, this work indicates that long timescale MD simulations can accurately model ligand dissociation from RNA and provide validatable hypotheses about the structural features that impact small molecule recognition of RNA. Thus, this strategy will be widely applicable not only to study natural systems such as riboswitches, but to more completely understand the structural features and dynamics that govern how synthetic small molecules interact with RNA.

Chapter 5: Mixing physics across temperatures with generative AI

5.1 Introduction

Using simulations or experiments performed at some set of temperatures to learn about the physics or chemistry at some other arbitrary temperature is a problem of immense practical and theoretical relevance. Here I develop a framework based on statistical mechanics and generative AI that allows solving this problem.
Specifically, this framework uses generative AI to learn to efficiently sample the high-dimensional, complex probability distributions of N-body molecular systems in a way that is valid across temperatures. There are two central ideas guiding this framework. Firstly, I do not treat the temperature T as just a control parameter. Noting that the temperature is a measure of the average kinetic energy, I instead work directly with the fluctuating kinetic energy, using the equipartition theorem to define an instantaneous, effective temperature which we still call T. For finite N, i.e., not yet in the thermodynamic limit, T so defined will display significant fluctuations proportional to 1/√N. Secondly, if we have experiments or simulations performed at temperatures T_1, T_2, ..., T_K, we can view them all together as being sampled from the same, but unknown, joint probability p(x, T) for N-body configurations x. I then use a generative AI method, specifically denoising diffusion probabilistic models (DDPM)[145, 146], to generate many more samples from p(x, T) given the sparse, high-dimensional dataset. By learning p(x, T) I am referring to the ability to generate many more samples from p(x, T) given the data set we have from simulations. DDPM has been shown to possess the ability to infer and learn the underlying relationships in complicated data, i.e., data that is high-dimensional with complex correlations, but also noisy and sparse[145, 146]. DDPMs learn two diffusion processes expressed through stochastic deep neural networks, called noising and denoising. The forward diffusion, or noising, converts samples from the high-dimensional, structured and unknown probability distribution into simpler and analytically tractable white noise. The backward diffusion, or denoising, learns the mapping back from noise to meaningful data. The central idea is that it is easy to generate numerous samples from the noisy distribution, which can then be denoised back to structured data. DDPMs have been demonstrated to be comparable to, or even to outperform, other generative AI models in generating high quality samples, and in their ability to model in an unsupervised way the underlying semantics behind meaningful data[145, 146]. Here I focus on REMD[38, 147], introduced in Sec. 1.3.2. REMD uses information generated at a ladder of temperatures to swap configurations across temperatures. REMD has been extremely powerful over the decades for the study of molecular systems with rough energy landscapes, in both fundamental science and practical applications[148, 149]. Numerous advances have been introduced over this basic idea in order to make it more efficient computationally[39, 40, 150, 151, 152, 153, 154, 155], and it continues to be an area of very active research. I will demonstrate how DDPM, applied to all-atom, femtosecond resolution REMD simulations of a small peptide and an RNA tetranucleotide, very significantly improves the quality of data generated across temperatures, including at temperatures at which the simulation was never performed and even at temperatures outside the ladder of temperatures. It significantly improves the estimates of free energies made through REMD and provides accurate sampling in parts of configuration space that were not previously visited in the lowest-temperature replica during REMD. These could be metastable states or transition states. I also show how this can be used to generate samples at temperatures not included in the ladder of replicas.
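To make the first idea concrete, here is a minimal sketch (in GROMACS-style units, and ignoring constraint corrections to the degree-of-freedom count) of the instantaneous effective temperature obtained from the equipartition theorem; the function name and unit conventions are illustrative assumptions.

```python
import numpy as np

KB = 0.0083144621  # Boltzmann constant in kJ/(mol K)

def instantaneous_temperature(velocities, masses):
    """Effective temperature from equipartition, T_inst = 2 KE / (3 N kB).
    velocities: (N, 3) array in nm/ps; masses: (N,) array in amu,
    so that amu * (nm/ps)^2 = kJ/mol exactly."""
    kinetic = 0.5 * np.sum(masses[:, None] * velocities ** 2)  # kJ/mol
    n_dof = 3 * len(masses)   # constraint corrections ignored for brevity
    return 2.0 * kinetic / (n_dof * KB)

# For finite N, T_inst sampled frame-by-frame fluctuates about the
# thermostat temperature with relative spread of order 1/sqrt(N).
```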
The samples the model generates have thermodynamic relevance and correspond to correct Boltzmann weights, as opposed to being just wild hallucinations[146, 156].

5.2 Temperature as a random variable

Here let us consider the equivalent p(x, β), where the inverse temperature β = 1/(k_B T) and k_B is Boltzmann's constant. MD or Monte Carlo (MC) methods allow sampling configurations x as per the equilibrium probability p_β(x) = e^(-βU(x))/Z, where U(x) is the potential energy of the N-body system and Z = ∫dx e^(-βU(x)) is the partition function. For systems of practical interest in biology, chemistry and materials science, if T is not large enough, it becomes nearly impossible to sample reliably from p_β(x), as many regions of interest in configuration space will have e^(-βU(x)) ≈ 0. To deal with this problem, in REMD one simulates K+1 replicas of the system at inverse temperatures β = β_0 > β_1 > ... > β_K. For low enough β_K, or equivalently high enough temperature T_K, the sampling from p_{β_K}(x) = e^(-β_K U(x))/Z_K is expected to be more ergodic. One then periodically exchanges conformations between consecutive pairs of replicas with a Metropolis-type acceptance probability that depends on the potential energies of the two replicas and their temperatures. This way even the low-temperature β_0 replica can explore configurations that it would not have otherwise visited. I would argue that by treating the temperature just as a control parameter, REMD is not making full use of the information gathered across the ladder of temperatures. In most current incarnations of REMD (excluding exceptions such as Ref. [157]), all that the higher temperatures do is help the lower temperature replicas discontinuously appear in different locations of the configuration space. Here I take a slightly different view of the temperature in REMD. I would like to point out that this perspective is also easily generalizable to other thermodynamic variables that show fluctuations for finite system size. By working with an approximate, effective temperature, we can treat it as a random variable instead of a control parameter. To do so we can work with an effective, instantaneous temperature of the system associated with its kinetic energy, instead of the temperature of the heat bath that we expect the thermostat to enforce on average. More rigorously, we are thus sampling the joint p(x, κ), where the per-particle kinetic energy κ is related through its ensemble average to the temperature, i.e. ⟨κ⟩ = 3/(2β). I would like to emphasize that this effective temperature is not the true thermodynamic temperature; however, for the sake of simplicity I still use the symbols T and β to denote this effective temperature and its inverse respectively. The motivation in doing so is that all replicas across different temperatures can be viewed as being sampled from the same joint probability p(x, κ), as opposed to different replicas sampling from their respective p_{β_K}(x). This change of perspective allows us to combine the data collected from different temperatures as having arisen from the same, although intractable, probability distribution p(x, κ).

5.3 Denoising diffusion probabilistic model

Our task at hand now is to learn the joint probability p(x, β) given the sampling that has already been performed in REMD across x-space and inverse temperatures β = β_0 > β_1 > ... > β_K. There are two main challenges in this.
The first challenge comes from the curse of dimensionality: the memory and computational resources needed to track a very high-dimensional distribution function increase exponentially with the number of degrees of freedom. For instance, for REMD of the small 9-residue peptide in explicit water that we study here, we exchange all 4749 atomic coordinates between the replicas, but for the purpose of analysis we set x to be 18 Ramachandran dihedral angles. This means we already have a 19-dimensional space where binning procedures are out of the question. The second challenge comes from the sparsity of the data. Most of our samples come from high probability regions of p(x, β), and we have very few samples for low-probability states due to inefficient exchange between replicas. In summary, we have sparse sampling of data points in a very high-dimensional (x, β)-space and wish to construct p(x, β) from this information so that one can create many more samples at any temperature of interest.

DDPM can generate many more samples from p(x, β). The main idea behind the use of DDPM here is to connect the true p(x, β) to a simple and easy-to-sample distribution P_simple(x, β). For notational simplicity, we denote s = {x, β} and refer to P_simple(s) henceforth. DDPM does this by learning to reverse a gradual, multi-step noising process that starts with the relatively limited number of samples generated from the distribution p(s) and diffuses to the simpler distribution P_simple(s) that is easy to sample. For instance, P_simple(s) could be an isotropic Gaussian. In addition to this noising process, DDPMs learn the reverse denoising process, which allows us to go back from samples generated using P_simple(s) to samples that would have been generated from the underlying p(s). Both the noising and denoising processes are modeled using diffusion processes that convert the probability distributions into one another, and are implemented using the architecture shown in Fig. 5.1, which is based on the standard architecture for DDPMs as described in Ref. [146].

Figure 5.1: The lower panel shows the neural network architecture used in this work. It has the basic structure of the U-Net model[158]. During the diffusion process, indicated by the direction of the blue arrows in the upper panel, noise is gradually added to the sampled data, in this case the picture of a very good boy, through a diffusion process labeled with diffusion step t_i. This changes the sampled distribution (for example, more pictures of dogs) to a simpler isotropic Gaussian distribution from which one can easily generate more samples. An AI model is then trained to reverse such a diffusion process and, starting from sampled noise, learns to generate images similar to the input image by following the direction indicated by the orange arrows. It takes a 1-d array s′, the noisy sample at diffusion step t_i, as input, and outputs the parameterization of a Gaussian distribution giving the reversed transition kernel q(s, t_{i-1} | s′, t_i). Each residual block consists of three components: two 1-d convolution operations with kernel size 3 and a group normalization[159] between them. The diffusion step t_i is added to each convolutional block after being transformed by the sinusoidal position embedding[160]. The final conv label denotes a convolution operation with kernel size equal to 1. Max-pooling reduces the size of the features by half, while up-sampling uses a transposed convolution to expand the size of the features.
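To make the preceding description concrete, here is a minimal, self-contained sketch of the standard DDPM forward-noising step and noise-prediction training loss in the parameterization of Ref. [146]; the linear noise schedule and the `model(s, t)` network interface are illustrative assumptions, standing in for the U-Net of Fig. 5.1.

```python
import torch

T_STEPS = 1000
betas = torch.linspace(1e-4, 0.02, T_STEPS)      # noise schedule beta_t
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product abar_t

def noise_sample(s0, t):
    """Closed form of the forward (noising) process:
    s_t = sqrt(abar_t) * s_0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)."""
    eps = torch.randn_like(s0)
    ab = alphas_bar[t].unsqueeze(-1)              # broadcast over features
    return torch.sqrt(ab) * s0 + torch.sqrt(1.0 - ab) * eps, eps

def ddpm_loss(model, s0):
    """Training objective: the network predicts the injected noise."""
    t = torch.randint(0, T_STEPS, (s0.shape[0],))
    st, eps = noise_sample(s0, t)
    return torch.mean((model(st, t) - eps) ** 2)
```

Here s0 would be a batch of s = {x, β} vectors, i.e. the 18 dihedral angles plus the effective inverse temperature in the peptide example below.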
Figure 5.2: (a) AIB9 in the left-handed (L) and right-handed (R) conformations in explicit TIP3P water. (b) Free energy profile of the residues (with residue 5 as an example) with two stable states labeled L, R and two excited states labeled La, Ra. When all residues are in the L state, the peptide chain is in the left-handed state, and similarly for the right-handed state. Free energies in (b), here as well as throughout the manuscript and Supporting Information, are provided in units of k_BT.

The noising diffusion process carried out in the space s = {x, β}, which converts p(x, β) to the simpler P_simple(x, β), can be decomposed into M discrete steps denoted by the corresponding transition probabilities p(s′, t_{i+1} | s, t_i), where i ∈ [0, M], P(s, t_0) ≡ p(x, β), and P(s, t_M) ≡ P_simple(x, β). In DDPM, this noising diffusion process, which converts sampled data to essentially noise, is set to be an Ornstein-Uhlenbeck (OU) process in which the transition probability follows a simple Gaussian form. One can then easily generate samples from P(s, t_M) ≡ P_simple(x, β). The tricky bit now is to convert these samples back to the original distribution. In Ref. [145], it was shown that the reversed diffusion transition kernel P(s, t_i | s′, t_{i+1}) can also be written in a Gaussian form. A deep neural network (Fig. 5.1) is then trained using variational inference to learn the approximate reversed transition kernel Q(s, t_i | s′, t_{i+1}) ≈ P(s, t_i | s′, t_{i+1}). Thus, by generating samples from a normal Gaussian distribution, which we can easily do in large numbers, and then passing these through the reversed transition kernel, we can generate samples that follow the target distribution p(x, β) as desired. Note that instead of learning the joint probability P(s_1, s_2), it can be advantageous to learn the conditional probability P(s_1 | s_2). This can be done through the protocol in Ref. [145] by adding a delta function that allows only a subset of s to change during the noising and denoising processes, i.e. δ(s_2 - s′_2) P(s_1, s_2, t_i | s′_1, s′_2, t_{i+1}). This is very useful when, for instance, we are interested in generating samples only at a certain temperature or only in certain regions of the configuration space. In the most general form of diffusion probabilistic models (DPM)[145], the reversed transition kernel Q(s, t_i | s′, t_{i+1}) is considered to be the Gaussian N(s; μ_θ(s′, t_{i+1}), Σ_θ(s′, t_{i+1})), and the neural network is trained to learn the mean μ_θ and covariance Σ_θ. In practice, however, there are many different ways to choose the Gaussian parameterization. In Ref. [146], DDPM was introduced with a new parameterization approach to reduce the complexity of the training task. DDPM got its name because such a design makes the learning task resemble a denoising score matching procedure[161]. In Ref. [146], it was shown that with such a design, DDPM can generate samples of a quality comparable to or even better than other generative models.

5.4 Results

I now demonstrate how the above protocol can be applied to mix data collected from different temperatures and configurations in REMD and significantly improve the quality of sampling. Specifically, I consider the following two challenging tasks: 1. Can we improve the sampling quality for the lowest-temperature replica, with more accurate probability estimates than directly seen after REMD? This includes being able to generate samples in low probability regions such as transition and metastable states, and reliably estimating their free energies. 2.
Can we generate samples at temperatures that are not even included in the replica ladder? This would include temperatures within the range of the replica ladder and also extrapolation to temperatures outside the range.

5.4.0.1 Peptide conformational transitions

To demonstrate the performance of DDPM, we first study a small peptide chain, Aib9, in explicit TIP3P water[162] using the CHARMM36m force-field[163]. This 9-residue system (Fig. 5.2 (a)) displays rich and complex conformational dynamics[164], including transitions between fully left-handed and right-handed helices. However, even in 4 μs of unbiased MD at 400 K, one can see only around 2-3 transitions between these two dominant equiprobable conformations, and even fewer, if any, transitions to the higher energy metastable states. To improve the sampling of this system, I perform REMD with 10 replicas at geometrically spaced temperatures ranging from 400 K to 518 K, with the temperature increased by 3% for each replica. Configuration exchanges were attempted every 20 ps, with an acceptance rate of around 1%-2% between neighboring replicas, which is intentionally kept lower than what one usually has in REMD. This is because I want to show that even in extreme cases, where the number of atoms N is so large that the replicas do not have enough overlap, or where one wants to reduce the number of replicas to save computational resources, our DDPM can still do a decent job of complementing REMD and reconstructing the true probability distribution at any temperature of interest. To benchmark our results, I ran unbiased MD at 400 K for 4 μs and at 500 K for 0.6 μs.

Figure 5.3: Samples (dots) from REMD (left) and DDPM (right) at 400 K. The Boltzmann weights for the different samples are indicated through their free energy (contour lines, separated by 0.74 k_BT). (a) Samples generated from DDPM and free energy profile projected on the dihedral angles of residue 5. (b) Samples generated from DDPM and free energy profile projected on the dihedral angles of residue 8. DDPM was able to generate samples in states that are not present in the training dataset for both residues, indicated with thick black arrows in the right panels.

To quantify the quality of sampling, we can focus on the 18 dihedral angles corresponding to all 9 residues (φ_1, ψ_1, φ_2, ψ_2, ..., φ_9, ψ_9). As shown in the Ramachandran plot in Fig. 5.2 (b), the free energy surface along any pair of dihedrals is mainly characterized by four metastable states: two equiprobable low-energy ground states (L, R) and two excited states (La, Ra). The Ramachandran plots for all nine residues in the system look qualitatively similar, but the middle residues of the peptide are known to be less flexible, with higher energy barriers[165, 166]. I thus focus on the sampling for residues 5 and 8, as shown in Fig. 5.3. I train the DDPM on a REMD trajectory. At the lowest temperature of interest (400 K), this trajectory has not yet achieved sufficient sampling. As shown in Fig. 5.3, DDPM successfully mixes information from all temperatures and configurations, and generates samples in states that are not present in the training dataset for both residues, indicated with thick black arrows in the right panels of Fig. 5.3. Specifically, for residue 5 we can see that the transition states between state R and state La, which were not being sampled in the 400 K replica, are populated in samples from DDPM. For residue 8 the improvement is even more striking, as the state Ra, which was simply not sampled in REMD, now gets populated after DDPM.
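As a small aside, the geometrically spaced ladder described above is straightforward to generate; the sketch below reproduces the 400 K to 518 K, 10-replica setup with the ratio inferred from the endpoints (about 3% per step).

```python
import numpy as np

def geometric_ladder(t_min=400.0, t_max=518.0, n_replicas=10):
    """REMD ladder T_k = T_min * r^k with r = (T_max/T_min)^(1/(n-1)),
    i.e. a constant ~3% increment between neighboring replicas."""
    r = (t_max / t_min) ** (1.0 / (n_replicas - 1))
    return t_min * r ** np.arange(n_replicas)

print(np.round(geometric_ladder(), 1))  # 400.0 K up to 518.0 K
```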
To further quantify the improvement gained due to DDPM, I compare the free energy differences between different configurations from both REMD and DDPM against the much longer reference unbiased MD at 400 K. As shown in Figs. 5.4 and 5.5, the free energy differences calculated from samples generated by DDPM are much closer to those from the reference MD. Thus, DDPMs are able to accomplish the first task highlighted above. I also want to highlight that while some spurious samples are generated, seen through dots outside the free energy contours in Fig. 5.3, their Boltzmann weights are very low. Thus, our model "dreams" new configurations without hallucinating spurious configurations.

Figure 5.4: Comparing ΔG calculated from samples of REMD, DDPM and long unbiased MD. See the text for precise definitions of the various states. Shown is the free energy difference ΔG_L-Ra (in k_BT) between the ground state L (orange box) and the excited state Ra (green box). Green empty crosses are reference values from the 4 μs long unbiased MD at 400 K. Green filled plus signs show values after REMD and DDPM, while orange circles show values after REMD without DDPM.

Figure 5.5: Comparing ΔG calculated from samples of REMD, DDPM and long unbiased MD. See the main text for precise definitions of the various states. Shown is the free energy difference ΔG_L-TS (in k_BT) between the ground state L (orange box) and the transition state (purple box) at 2 different temperatures, 400 K and 500 K.

I now move to the second task from the list above, and test DDPM's ability to generate samples by interpolating or even extrapolating across temperatures not considered in the ladder of replicas. In the first example, I use it to generate samples at 500 K; in the training set, with 10 replicas at geometrically spaced temperatures between 400 and 518 K, there is no replica at 500 K. As shown in Fig. 5.5, the ΔG calculated from the samples of DDPM is in good agreement with that from the reference MD at 500 K. In the second example, I completely removed the samples from the 400 K replica from the training set and used the rest to train a new model, which is then used to generate samples at 400 K. Even though the lowest temperature in the training set is 412 K, the model can make good predictions of the free energy differences between states, as shown in Fig. 5.6. This finding is particularly encouraging as, in general, sampling at lower temperatures tends to be harder than at higher temperatures.

Figure 5.6: Free energy differences ΔG between different metastable states calculated from samples from DDPM by extrapolating to a temperature of 400 K, which is lower than any temperature included in the dataset. The DDPM model was trained with REMD trajectories with 9 replicas starting at the lowest temperature of 412 K. Green empty crosses and purple empty triangles are reference free energy differences ΔG from the 4 μs long unbiased MD at 400 K. Green filled plus signs and purple diamonds show the free energy differences from DDPM. ΔG_L-Ra is the free energy difference between the ground state L (-2.2 < φ < -0.1, -1.6 < ψ
< 0.9) and the excited state Ra (-1.6 < φ < -0.2, 1.4 < ψ < 3.14); ΔG_L-TS is the free energy difference between the ground state L (-2.2 < φ < -0.1, -1.6 < ψ < 0.9) and a transition state (-1.8 < φ < -0.8, 0.8 < ψ < 1.5).

RNA conformational transitions. As a second example to illustrate the general applicability of our DDPM+REMD approach, I test this framework on the sampling of an RNA conformational ensemble. Rare RNA structures have been previously shown to be biologically relevant[167, 168, 169], but estimating the conformational ensemble still remains computationally intractable using traditional sampling techniques[170]. RNA dynamics may occupy a wide range of timescales, from several hours for conformational changes that require breaking base pairs, to picoseconds for more continuous deformations[169]. As a consequence, identifying rare transient structures and estimating their contribution to the RNA ensemble has proved to be difficult.

Figure 5.7: Projection of samples and free energy profile on two of the dihedral angles (Table 5.1) of A2 in GACC at 325 K. The Boltzmann weights for the different samples are indicated through their free energy (contour lines, separated by 0.74 k_BT). (a) The structure of GACC. (b) Samples from REMD. (c) Samples from the benchmark 5 μs long unbiased MD. (d) Samples from DDPM. The metastable states that are not present in the training dataset are indicated with thick black arrows.

In this example, I consider the GACC tetranucleotide as our model system (Fig. 5.7 (a)). As a single-stranded RNA consisting of four nucleotides labeled G1, A2, C3 and C4, GACC has been previously used as a challenging test system for REMD-based sampling methods owing to its conformational flexibility[171]. Despite the fact that GACC has been widely studied, it still remains challenging to effectively sample its possible configurations, and it is a good system on which to test new methods[170]. For example, in a previous study, in order to get converged structural populations, a multidimensional replica exchange molecular dynamics (M-REMD) simulation was performed with 192 replicas and around 1 μs of simulation time per replica, thus totalling almost 192 microseconds of all-atom simulations[171]. Here we show that with DDPM we can better estimate the free energy landscape using fewer computational resources, totalling only 12 microseconds of all-atom simulations summed over all replicas. I trained our DDPM model on 250 ns REMD trajectories from 48 replicas with temperatures ranging from 277 K to 408 K (see Appendix A.1.4 for the details of the MD simulation setup).

Figure 5.8: DDPM outperforms direct reweighting (Eq. 5.1). Free energy surface (FES) in k_BT along one of the dihedral angles (Table 5.1) of (a) A2 and (b) C3 from GACC calculated from REMD trajectories after direct reweighting (orange dashed line) and from DDPM samples (blue solid line) at 277 K. Similarly, (c) and (d) show the respective plots at 297 K for A2 and C3. (c) and (d) also show benchmark results from 5 μs of unbiased MD (red dots) at 297 K. Unbiased MD at 277 K was far too slow to provide any information.

In each frame of the trajectory, the structure of GACC is characterized by 6 dihedral angles for each nucleotide (defined in Table 5.1).

Table 5.1: Definition of dihedral angles
β: P (H5T) - O5′ - C5′ - C4′
ε: C4′ - C3′ - O3′ - P (H3T)
ν1: O4′ - C1′ - C2′ - C3′
2′-OH: C1′ - C2′ - O2′ - HO2′
δ: C5′ - C4′ - C3′ - O3′
γ: O5′ - C5′ - C4′ - C3′
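Each entry of Table 5.1 is a standard four-atom torsion; for illustration, a minimal sketch of the torsion angle computation (and of turning a histogram of such angles into a free energy profile in units of k_BT) is given below, assuming coordinates have already been extracted into NumPy arrays.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Torsion angle in radians for four atom positions, e.g. the
    P - O5' - C5' - C4' (beta) entry of Table 5.1."""
    b1, b2, b3 = p1 - p0, p2 - p1, p3 - p2
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)
    m = np.cross(n1, b2 / np.linalg.norm(b2))
    return np.arctan2(np.dot(m, n2), np.dot(n1, n2))

# Free energy along one dihedral, in units of kBT:
# angles = np.array([dihedral(*atoms) for atoms in traj])   # per frame
# p, edges = np.histogram(angles, bins=60,
#                         range=(-np.pi, np.pi), density=True)
# F = -np.log(p + 1e-12); F -= F.min()
```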
In these REMD simulations, the sampling is not sufficient, especially for the replicas at very low temperatures. Figs. 5.9 and 5.10 show the free energy profiles of GACC projected on two of the dihedral angles (Table 5.1) of A2 and C3 at different temperatures. We can see that the expected high-energy metastable states, indicated by black arrows in Figs. 5.9 and 5.10, were not sampled in REMD. In contrast, as shown in Figs. 5.7 and 5.11 for different target temperatures, DDPM successfully generates the ensemble of such high free energy states that were never even visited at low temperatures. Here as well, any spurious configurations can be seen through dots outside the free energy contours in Fig. 5.7, and carry negligible Boltzmann weights.

It is also natural to ask if information at different temperatures could be combined through a Boltzmann-type reweighting, where a sample from any inverse temperature β with potential energy U(x) is assigned the weight

w = e^{(\beta - \beta_0) U(x)}    (5.1)

to estimate the free energy at inverse temperature β_0. The reference point of the potential energy U(x) is chosen such that the minimum of the potential energy in each replica is equal to 0. This way we can estimate the free energy by simply mixing samples from different temperatures without the use of any generative AI.

Figure 5.9: Projection of REMD samples and free energy profile on two dihedral angles of A2 in GACC at each replica.

Figure 5.10: Projection of REMD samples and free energy profile on two dihedral angles of C3 in GACC at each replica.

Fig. 5.8 shows the results for such an operation and compares them with the results from DDPM and from long unbiased MD where possible. For the A2 dihedral at 297 K (Fig. 5.8 (c)), the direct reweighting method gives a good estimate of the free energy of the metastable state around a dihedral value of 1 rad, but almost entirely misses out on any sampling of the barriers, thereby heavily overestimating their heights. For the other dihedral angles, even the free energy of the metastable states is overestimated by the direct approach.

Figure 5.11: Free energy profiles of all dihedral angles in GACC calculated from 4 μs of unbiased MD (blue dashed line), 250 ns of REMD (green solid line) and samples generated from DDPM (orange solid line).

In conclusion, such a reweighting operation very significantly overestimates the free energy differences between metastable states, thus unequivocally demonstrating the usefulness of the DDPM approach. At the temperature where long unbiased MD is possible, the use of direct reweighting (Eq. 5.1) gives an energy scale at least 10 k_BT higher than DDPM or unbiased MD, while there is near perfect agreement between the latter two.

5.5 Discussion

I have presented a generative AI-based approach that combines physics from simulations performed at different temperatures to generate reliable new molecular configurations and accurate thermodynamic estimates at any arbitrary temperature, even if no actual simulation was performed there. The central idea is to not work with the thermodynamic temperature T of the system as a parameter set by the heat bath, but instead to work with an effective temperature, calculated from the instantaneous kinetic energy of the system. This effective temperature shows non-trivial fluctuations for a finite size system, and on average equals the thermodynamic temperature.
Given sparse sampling from the high-dimensional space comprising the configurational coordinates and the effective temperature, we train a generative AI model that generates countless more samples of configurations at any temperature of interest. While the idea is generally applicable, here I demonstrate its usefulness in the context of the widely used replica exchange molecular dynamics (REMD) framework, improving the sampling of REMD through a post-processing step. I show how this significantly improves the quality of sampling at low temperatures, and even generates samples in states that had not been visited in the replicas and at temperatures not considered in the ladder of replicas. I also show that this approach is far superior to a direct reweighting scheme, as we do not compute weights involving exponentials of the full system's kinetic or potential energy, thereby avoiding well-known issues with the convergence of exponentially averaged quantities[172]. The generative AI framework of DDPM used here belongs to the broad class of flow-type methods, which have been shown to have the ability to generate samples from high dimensional spaces with many interdependent degrees of freedom. Compared with other flow-type models such as normalizing flows[65, 173], which use deterministic functions to map from an easy-to-sample distribution to the target distribution, the stochastic nature of DDPM avoids the restriction of preserving the topology of configuration space and thus allows the learning of significantly more complicated distributions. At the same time, the design of the transition kernels in DDPM reduces the learning task to just learning the means of Gaussian kernels. This makes the training easier compared to other methods[68] while at the same time being able to learn more complicated transition kernels. I also believe that the generative AI model, while "dreaming" thermodynamically relevant structures at different temperatures, avoids the so-called hallucinations suffered by other generative AI models, i.e., the model does not generate meaningless, unphysical structures with significant thermodynamic weights[156]. I believe this is achieved through the use of relatively simple transition kernels, which avoids overparameterization of the model, and through utilizing molecular basis functions instead of all-atom coordinates, which reduces the space that needs to be sampled. The issue of generating out-of-distribution samples, which has been problematic in other methods attempting to generate molecular structures, such as the Boltzmann generator, is usually avoided by discarding the translational and rotational degrees of freedom and reweighting the samples[65, 68]. However, calculating the weight of each sample requires knowledge of all the coordinates of the system, which may become an issue when the system contains explicit water molecules; in such a configuration space the samples will again become sparse. Finally, I would point out that this work shows the possibility of learning generative models in the space of generic thermodynamic ensembles, by following the simple recipe that control parameters can also be viewed as fluctuating variables. As long as one is not in the thermodynamic limit (something we do not have to worry about in molecular simulations), this should thus be a practical and useful procedure for problems far beyond replica exchange molecular dynamics.
Chapter 6: Summary and future directions

6.1 AI-augmented MD simulations

I presented in this thesis two AI-based computational methods to address the timescale problem in MD simulation studies of biomolecular systems. In the first method, RAVE, I showed that based on the framework of the information bottleneck, we can use a neural network to learn a system's RCs. By enhancing fluctuations along the RCs, we can accelerate transitions between metastable states and thus sample the system better with limited computational power. One significant advantage of this method is that it does not require much prior knowledge. By combining it with OP preselection methods like AMINO[101], RAVE requires only an initial structure as input and automatically learns the optimal RCs by iterating between machine learning and enhanced MD. In the second method, I used a generative model to address the problem of using high temperature simulations to infer observations about low temperatures. I demonstrated how generative artificial intelligence can mix information from simulations conducted at a set of temperatures and generate molecular configurations at any temperature of interest, including temperatures at which simulations were never performed. In particular, the numerical results on a small peptide chain and a nucleotide chain show that if we treat the instantaneous temperature T as a random variable and learn its joint distribution with the CVs, we can use the learned model to generate configurations with correct Boltzmann weights at the target temperatures. Since the model can infer the underlying relationship between samples from different temperatures, it can infer the distribution of data in poorly sampled regions, thus significantly enhancing the sampling of REMD. As a method that does not require much prior knowledge, REMD can easily be applied to a wide range of systems. The combination of DDPM and REMD can further reduce the computational cost and make it possible to study systems of larger size. It is particularly well suited to biomolecular systems where the dynamics are complicated and simple RCs cannot be found, such as intrinsically disordered proteins. It also has potential in the study of conformational ensembles or in structure optimization. I also believe the framework is extensible to generic simulations and experiments for mixing control parameters other than temperature.

6.2 Study of ligand dissociation from RNA

With the AI-augmented sampling method RAVE, I investigated the dissociation of two ligands from the PreQ1 riboswitch. Unlike ligand dissociation from proteins, which is widely studied with MD simulations, there are few computational studies of ligand dissociation from nucleic acids, owing to the scarcity of solved structures and relatively poor force-field parameterization. In this collaborative study with experimentalists, I used MD simulation to provide a high spatiotemporal resolution description of the interaction between the ligands and the PreQ1 riboswitch, both in the bound pose and along the dissociation pathways. The MD simulations successfully capture the relative stabilization of different regions of the PreQ1 riboswitch, in good agreement with experimental SHAPE measurements. From the dissociation trajectories simulated with RAVE, I found two unbinding pathways, and the affinity estimates obtained from these trajectories are in good agreement with experiment.
Analyzing the trajectories provided insights into how different nucleotides interact with the ligands, and based on these observations I made predictions of mutations that could change the binding affinity. Both in vitro and in silico mutagenesis experiments validated the predictions and revealed how the mutations change the binding ensemble of the ligand-RNA complexes. This study indicates the potential of using MD simulations to model and study the interplay between structural features and small-molecule recognition of RNA. Such a protocol is also transferable to studying the features and dynamics that govern how small synthetic molecules interact with RNA.

6.3 Future work

Based on the studies presented in this thesis, there are many directions worth exploring in the future. These include improving the methods themselves and applying them to more challenging problems; I summarize them as follows.

Firstly, there is ample need and room for improvement if we want RAVE to be a method that the broad scientific community can apply to generic systems of arbitrary complexity. One potential improvement is designing new structures for the encoder to enable the learning of more complicated features. A linear encoder has the advantage that we can directly use the linear combination coefficients to interpret how different OPs contribute to the RC (a short illustrative sketch follows later in this section). Also, the success of applying RAVE to different systems indicates that these linear RCs can give a relatively good approximation of the underlying reaction process. However, this might not be true for more complicated cases; we then need multiple linear RCs or a nonlinear RC to capture the more complex features. I would note, though, that introducing a more complicated encoder may come at the cost of losing both the interpretability and the stability of the solutions. One possible way out of this dilemma is to develop a strategy for carefully detecting when one or even two reaction coordinates are insufficient to describe the process of interest. This would allow us to gradually add complexity to the encoder so as to capture only the essential physics.

Secondly, with the abundant data generated by these enhanced sampling methods, new analysis tools are required to interpret the mechanisms hidden behind the high-dimensional data. The strength and mechanism of a ligand-receptor interaction are the consequence of many features of these molecules, and only a small subset of those features was examined in my study of the dissociation events. Many other features, such as the geometric properties of the ligands and receptors, the presence of stacking interactions, and solvent effects, are also crucial in determining how ligand and receptor interact with each other. Being able to extract information from a long list of such features will help us better understand their respective contributions, or even uncover new mechanisms. Machine learning is an ideal tool for such an analysis and has already been used to guide human intuition [174, 175, 176]. At the same time, the involvement of enhanced sampling methods further enriches this analysis by allowing us to look at the distributions of different features or to add the dimension of time. Moreover, MD simulations can also serve as a flexible but powerful experimental platform to validate the relationships learned by AI models.
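As a toy illustration of how linear coefficients make such analyses directly interpretable (the data and weights below are made up for illustration, not results from this thesis), a linear RC is just a normalized weighted sum of OPs, and the learned weights can be read off as per-OP contributions:

import numpy as np

# Hypothetical linear RC: s = sum_i c_i * OP_i. The coefficients c_i
# directly report how much each order parameter contributes to the RC.
rng = np.random.default_rng(0)
ops = rng.standard_normal((5000, 4))        # toy trajectory with 4 OPs per frame
c = np.array([0.9, 0.4, 0.1, 0.0])          # placeholder learned coefficients
c = c / np.linalg.norm(c)                   # normalized, as RAVE-style RCs are
s = ops @ c                                 # RC value for every frame

print("RC range:", s.min(), s.max())
# Interpretation is immediate: OP 1 dominates the RC, OP 4 is irrelevant.
for i, ci in enumerate(c):
    print(f"OP {i+1}: coefficient {ci:+.2f}")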
Finally, the power of DDPM in learning complicated distributions and making physically meaningful interpolations can also be applied quite generally to data coming from simulations or experiments at different temperatures. It should also be extensible to mixing data across control parameters other than temperature, possibly including concentration, pressure, and volume. It can further be extended to learn a system's kinetics from biased simulations. For example, we can apply DDPM to learn a distribution $P(X_{t+1}|X_t, \lambda)$, where $\lambda$ denotes control parameters such as the temperature or the magnitude of an external potential. Such a learned model can then generate a new trajectory under the desired conditions. Moreover, in an ongoing project, I use DDPM to build a model that can convert the time series of one observable into the time series of another observable. More specifically, the DDPM is trained to learn the conditional distribution $P(s^2_N \mid s^1_N, s^1_{N-1}, \ldots, s^1_0)$, where $s^1$ and $s^2$ are two variables and the subscripts denote the time steps. This method is grounded in the Takens delay embedding theorem [177] and its probabilistic counterpart [178]. It will be particularly useful in experiments, as we can use such a model to predict hard-to-measure variables from the time series of easy-to-measure variables. For example, we can use the fluorescence resonance energy transfer (FRET) signal [179], which reflects the distance between two ends of a molecule, to estimate other quantities that are of greater interest but cannot be directly measured in experiment.

Appendix A:

A.1 MD setups

A.1.1 Alanine dipeptide

I follow [63] to set up the simulation of alanine dipeptide in vacuum. The simulations are performed with the software GROMACS 2016/GROMACS 5.0 [180, 181], patched with PLUMED 2.4 [182]. I constrain bonds involving hydrogens using the LINCS algorithm [183] and employ an integration time step of 2 fs, with order parameters saved every time step. The temperature is kept constant at 300 K using the velocity rescaling thermostat (relaxation time of 0.1 fs) [106]. I employ no periodic boundary conditions and no cut-offs for the electrostatic and non-bonded van der Waals interactions. In each round, four independent simulations with different initial velocities, randomized as per the Maxwell-Boltzmann distribution, were performed to improve the quality of the free energy sampling.

A.1.2 L99A T4L-benzene

I follow [96] to set up the simulation of L99A T4L-benzene. The simulations are performed with the software GROMACS 2016/GROMACS 5.0 [180, 181] patched with PLUMED 2.4 [182]. The simulations are done in the constant number, pressure, temperature (NPT) ensemble at a temperature of 298 K and a pressure of 1.0 bar. Constant pressure is maintained using the Parrinello-Rahman barostat, while constant temperature is maintained using the v-rescale thermostat (modified Berendsen thermostat) [22]. The simulation box, with periodic boundary conditions, is filled with TIP3P water; the side lengths of the box are 10 Å and there are around 10,000 water molecules. The interactions are described by the CHARMM22* force field. The integration time step here as well was taken to be 2 fs. Similar to the study of alanine dipeptide, in each round four independent simulations with different initial velocities, randomized as per the Maxwell-Boltzmann distribution, were performed to improve the quality of the free energy sampling.

A.1.3 AIB9

The simulation of AIB9 was set up following a previous study [166].
The PDB file was taken from the authors with permission, and the simulations were done with the CHARMM36m all-atom force field [184] using TIP3P water molecules [162], a Parrinello-Rahman barostat [185], and a Nosé-Hoover thermostat [21, 186] under the NPT ensemble. Simulations were performed using GROMACS 2016 [180]. The structures of AIB9 were saved every 0.2 ps and the dihedral angles were calculated using PLUMED 2.4 [187].

A.1.4 GACC

The simulations of GACC were done with the AMBER ff12 all-atom force field, using TIP3P water molecules [162], a Parrinello-Rahman barostat [185], and a Bussi-Parrinello velocity rescaling thermostat [106] under the NPT ensemble. The simulations were performed using GROMACS 2016 [180]. The AMBER force field was chosen because it exhibited more conformational variability than the CHARMM force field in a previous study [188]. The structures of GACC were also saved every 0.2 ps, and the dihedral angles defined in Table S1 were calculated using PLUMED 2.4 [187]. The PDB file of the GACC structure that served as the starting point for the simulations was generated using PyMOL. The initial structure was assumed to be at a temperature of 10 K, and the system's energy was minimized with positional restraints of 25 kcal mol⁻¹ Å⁻² in a two-step process: first the steepest descent algorithm was applied for 1000 steps, followed by the conjugate gradient algorithm for another 1000 steps. Next, the system was equilibrated to 150 K over 100 ps under the NVT ensemble with positional restraints of 25 kcal mol⁻¹ Å⁻². Then, GACC was equilibrated from 150 K to 277 K at 1 atm under the NPT ensemble for 100 ps with positional restraints of 5 kcal mol⁻¹ Å⁻². Finally, a long 5 ns equilibration was performed at 298 K under the NPT ensemble with positional restraints of 0.5 kcal mol⁻¹ Å⁻², after which the system was copied to 48 replicas, and each replica was equilibrated to its target temperature under the NPT ensemble with positional restraints of 0.5 kcal mol⁻¹ Å⁻². The replica temperatures were geometrically spaced, ranging from 277 K to 408 K, with the temperature increased by 1% for each successive replica. Exchanges of configurations were attempted every 10 ps, an interval determined by checking the time correlation function of the potential energy.

A.1.5 Architecture of RAVE

A densely connected layer without an activation function was used to linearly encode the order parameters X into the reaction coordinate s. Gaussian noise was added to s before it was passed to the decoder. The decoder consisted of two hidden layers and an output layer, all densely connected, with ELU as the activation function for the hidden layers. We assume $Q_\phi(X_{\Delta t}|s) = \mathcal{N}(X_{\Delta t}; f_\phi(s), \sigma^2)$, where $f_\phi(s)$ corresponds to the decoder part of the neural network, which maps states on the reaction coordinate to states in order parameter space. With this assumption, maximizing the objective function is equivalent to minimizing the mean square error between $X_{\Delta t}$ and the network prediction $f_\phi(s)$. Hyper-parameters in RAVE included the variance of the Gaussian noise, the number of neurons in the hidden layers, the initializer of the weights of each layer, and the learning rate for the RMSprop algorithm [90]. In all three case studies, all these hyper-parameters were set to be the same: the variance of the Gaussians was kept at 0.005, each hidden layer had 128 neurons, the learning rate was set to 0.003, and the initial weights of each layer were drawn from a uniform distribution within the range [−0.005, 0.005].
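The following is a minimal sketch of the architecture just described, with the stated hyper-parameters; the number of OPs, the toy data, and the single training step are placeholder assumptions, and the sketch is illustrative rather than the exact implementation used in this thesis.

import torch
import torch.nn as nn

N_OPS, NOISE_VAR, LR = 10, 0.005, 0.003        # N_OPS is a placeholder

class RAVE(nn.Module):
    def __init__(self, n_ops):
        super().__init__()
        # linear encoder, no activation: s = c . X
        self.encoder = nn.Linear(n_ops, 1, bias=False)
        # two ELU hidden layers of 128 neurons plus a linear output layer
        self.decoder = nn.Sequential(
            nn.Linear(1, 128), nn.ELU(),
            nn.Linear(128, 128), nn.ELU(),
            nn.Linear(128, n_ops))
        # uniform weight initialization in [-0.005, 0.005]
        for layer in self.modules():
            if isinstance(layer, nn.Linear):
                nn.init.uniform_(layer.weight, -0.005, 0.005)
    def forward(self, x):
        s = self.encoder(x)
        s_noisy = s + NOISE_VAR ** 0.5 * torch.randn_like(s)   # Gaussian noise on the RC
        return self.decoder(s_noisy)

model = RAVE(N_OPS)
opt = torch.optim.RMSprop(model.parameters(), lr=LR)
# With Q(X_dt|s) Gaussian, maximizing the objective reduces to the MSE
# between the decoder prediction and the actual future state X_dt:
x_now, x_future = torch.randn(256, N_OPS), torch.randn(256, N_OPS)  # toy data
loss = ((model(x_now) - x_future) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()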
The transferability of hyper-parameters between different systems without much tuning reflects that the method is not very sensitive to the choice of neural network hyper-parameters. From our experience, we suggest that the variance of the Gaussian noise should not be too large, as it will wash out meaningful features. For more complicated systems, a deeper (more layers) or wider (more neurons per layer) decoder might be needed. The tuning of the learning rate was done by monitoring how the order parameter weights change during training: if the learning rate was too high, the order parameter weights did not converge. Independent simulations often explored different configurations due to the finite and small simulation time; we considered the trajectory with the highest variance to have maximal ergodic exploration and used it to train the reaction coordinate for the next round. As with other non-convex optimization problems, the results could converge to a local minimum or even a saddle point. To safeguard against such spurious solutions learned by the neural network, we performed independent training runs with random initial weights in the layers. The RC was then taken to be the linear encoder of the trained neural network with the smallest loss. We will be refining this procedure in future work, as, strictly speaking, a lower loss is no guarantee of a better solution.

A.2 Definitions of quantities in information theory and variational lower bound

In this section I define the various information-theoretic terms used in this thesis. Throughout, I use $P(X)$ and $P(X, Y)$ to denote, respectively, the probability distribution of a random variable X and the joint probability distribution of two random variables X and Y. Other variables are the same as defined in the main text.

A.2.1 Mutual information

This is a commonly used information-theoretic measure of how much information is shared between two random variables X and Y. It is defined as:

$$I(X, Y) = \int P(X, Y) \ln \frac{P(X, Y)}{P(X) P(Y)} \, dX \, dY \quad (A.1)$$

A.2.2 Shannon entropy

The Shannon entropy of a random variable X is defined as:

$$H(X) = -\int P(X) \ln P(X) \, dX \quad (A.2)$$

A.2.3 Cross entropy

The cross entropy between two probability distributions $P(X)$ and $Q(X)$ is given by:

$$C(P, Q) = -\int P(X) \ln Q(X) \, dX \quad (A.3)$$

A.2.4 Kullback-Leibler divergence

The Kullback-Leibler (KL) divergence $D_{KL}(P \| Q)$ between two probability distributions $P(X)$ and $Q(X)$ is given by:

$$D_{KL}(P \| Q) = \int P(X) \ln \frac{P(X)}{Q(X)} \, dX \quad (A.4)$$

A.2.5 Exact decoder definition

The exact decoder can be defined as the conditional probability distribution of $X_{\Delta t}$ given an RC value s. It can be calculated using Bayes' theorem:

$$P(X_{\Delta t}|s) = \frac{P(s, X_{\Delta t})}{P(s)} = \frac{\int P(s|X) \, P(X, X_{\Delta t}) \, dX}{\int P(s|X) \, P(X, X_{\Delta t}) \, dX \, dX_{\Delta t}} \quad (A.5)$$

A.2.6 Gibbs's inequality and variational lower bound

Here I provide the missing details of various derivations given in Sec. 2.3. The bottleneck function $\mathcal{L}$ defined in the main text for our neural network architecture is given by a difference of two Shannon entropies [82]:

$$\mathcal{L} = I(s, X_{\Delta t}) = H(P(X_{\Delta t})) - H(P_\theta(X_{\Delta t}|s)) \quad (A.6)$$

Gibbs's inequality [82] guarantees that the KL divergence between two probability distributions is always non-negative. We thus have:

$$D_{KL}(P_\theta(X_{\Delta t}|s) \| Q_\phi(X_{\Delta t}|s)) = \int P_\theta(X_{\Delta t}, s) \ln \frac{P_\theta(X_{\Delta t}|s)}{Q_\phi(X_{\Delta t}|s)} \, dX_{\Delta t} \, ds$$
$$= \int P_\theta(X_{\Delta t}, s) \ln P_\theta(X_{\Delta t}|s) \, dX_{\Delta t} \, ds - \int P_\theta(X_{\Delta t}, s) \ln Q_\phi(X_{\Delta t}|s) \, dX_{\Delta t} \, ds$$
$$= -H(P_\theta(X_{\Delta t}|s)) + C(P_\theta(X_{\Delta t}|s), Q_\phi(X_{\Delta t}|s)) \geq 0 \quad (A.7)$$

Only in the limit that our approximate decoder is exactly the same as the exact inverse-Bayes decoder does $D_{KL}(P_\theta \| Q_\phi)$ equal 0.
By combining Eq. A.6 and Eq. A.7, we get the relationship:

$$\mathcal{L} \geq \mathcal{L} - D_{KL}(P_\theta(X_{\Delta t}|s) \| Q_\phi(X_{\Delta t}|s)) = H(P(X_{\Delta t})) - C(P_\theta(X_{\Delta t}|s), Q_\phi(X_{\Delta t}|s)) \equiv H(P(X_{\Delta t})) + \mathcal{L}^\phi \quad (A.8)$$

Here, $H(P(X_{\Delta t}))$ depends only on the data set and is independent of the parametrization of the encoder and the decoder. Thus the term $H(P(X_{\Delta t}))$ can be completely ignored while optimizing the parameters $\theta$ and $\phi$. Maximizing the objective function $\mathcal{L}^\phi = -C(P_\theta(X_{\Delta t}|s), Q_\phi(X_{\Delta t}|s))$ is then equivalent to maximizing the lower bound on $\mathcal{L}$, which is the expression stated in Eq. 4 in the main text.

A.3 PIB objective $\mathcal{L}^\phi$ for an unbiased trajectory

For an unbiased trajectory $\{X^1, X^2, \ldots, X^{M+k}\}$ and a given encoder $\theta$, we can compute the corresponding $\{s^1, s^2, \ldots, s^M\}$ using the linear encoder $s^n = \sum_i c_i X^n_i$. We also have a sampling of the states of the system after the corresponding $\Delta t$ intervals: $\{X^{1+k}, X^{2+k}, \ldots, X^{M+k}\}$. Together the pairs $(s^n, X^{n+k})$ sampled from the data set follow the distribution $P_\theta(X_{\Delta t}|s)$. With this, we have:

$$\mathcal{L}^\phi = -C(P_\theta(X_{\Delta t}|s), Q_\phi(X_{\Delta t}|s)) = \int P_\theta(X_{\Delta t}, s) \ln Q_\phi(X_{\Delta t}|s) \, dX_{\Delta t} \, ds = \frac{1}{M} \sum_{n=1}^{M} \log Q(X^{n+k}|s^n) \quad (A.9)$$

A.4 PIB objective $\mathcal{L}^\phi$ for a biased trajectory

For a biased trajectory $\{X^1, X^2, \ldots, X^M\}$ with corresponding biasing potential values $\{V^1, V^2, \ldots, V^M\}$, the unbiased probability distribution of X can be calculated as:

$$P(X) = \frac{\sum_i \delta(X^i - X) \, e^{\beta V^i}}{\sum_i e^{\beta V^i}} \quad (A.10)$$

The encoder $P(s|X)$ and the decoder $P(X_{\Delta t}|s)$ are taken to be independent of the bias. The first assumption is strictly true, while the second is valid for small enough $\Delta t$, as explained in the main text. Therefore:

$$\mathcal{L}^\phi = \int P_\theta(X_{\Delta t}|s) \, P(s) \ln Q_\phi(X_{\Delta t}|s) \, dX_{\Delta t} \, ds = \left( \sum_{n=1}^{M} e^{\beta V^n} \right)^{-1} \sum_{n=1}^{M} e^{\beta V^n} \log Q(X^{n+k}|s^n) \quad (A.11)$$

Publications from Doctoral Research

1. Ribeiro, João Marcelo Lamim, Bravo, Pablo, Wang, Yihang, and Tiwary, Pratyush. "Reweighted autoencoded variational Bayes for enhanced sampling (RAVE)." The Journal of Chemical Physics 149, no. 7 (2018): 072301.

2. Ribeiro, João Marcelo Lamim, Tsai, Sun-Ting, Pramanik, Debabrata, Wang, Yihang, and Tiwary, Pratyush. "Kinetics of ligand-protein dissociation from all-atom simulations: Are we there yet?" Biochemistry 58, no. 3 (2018): 156-165.

3. Wang, Yihang, Ribeiro, João Marcelo Lamim, and Tiwary, Pratyush. "Past-future information bottleneck for sampling molecular reaction coordinate simultaneously with thermodynamics and kinetics." Nature Communications 10, no. 1 (2019): 1-8.

4. Wang, Yihang, Ribeiro, João Marcelo Lamim, and Tiwary, Pratyush. "Machine learning approaches for analyzing and enhancing molecular dynamics simulations." Current Opinion in Structural Biology 61 (2020): 139-145.

5. Wang, Yihang, and Tiwary, Pratyush. "Understanding the role of predictive time delay and biased propagator in RAVE." The Journal of Chemical Physics 152, no. 14 (2020): 144102.

6. Smith, Zachary, Wang, Yihang, Ravindra, Pavan, Cooley, Rory, and Tiwary, Pratyush. "Discovering protein conformational flexibility through artificial-intelligence-aided molecular dynamics." The Journal of Physical Chemistry B 124, no. 38 (2020): 8221-8229.

7. Pant, Shashank, Wang, Yihang, Smith, Zachary, Tajkhorshid, Emad, and Tiwary, Pratyush. "Confronting pitfalls of AI-augmented molecular dynamics using statistical physics." The Journal of Chemical Physics 153, no. 23 (2020): 234118.

8. Wang, Yihang, Parmar, Shaifaly, Schneekloth, John S., and Tiwary, Pratyush. "Interrogating RNA-small molecule interactions with structure probing and AI augmented-molecular simulations." bioRxiv (2021).

9. Wang, Yihang, and Tiwary, Pratyush. "From data to noise to data: mixing physics across temperatures with generative artificial intelligence." arXiv preprint arXiv:2107.07369 (2021).
Bibliography

[1] B. Alberts, J. H. Wilson, A. Johnson, T. Hunt, J. Lewis, K. Roberts, M. Raff, and P. Walter. Molecular Biology of the Cell. Garland Science, 2008. [2] Emilio Gallicchio and Ronald M. Levy. Recent theoretical and computational advances for modeling protein-ligand binding affinities, volume 85. 2011. [3] A focus on regulatory RNAs [editorial]. Nature Cell Biology, 21(5):535, 2019. [4] Anthony M. Mustoe, Charles L. Brooks, and Hashim M. Al-Hashimi. Hierarchy of RNA functional dynamics. Annual Review of Biochemistry, 83(1):441-466, 2014. PMID: 24606137. [5] Xing Du, Yi Li, Yuan-Ling Xia, Shi-Meng Ai, Jing Liang, Peng Sang, Xing-Lai Ji, and Shu-Qun Liu. Insights into protein-ligand interactions: mechanisms, models, and methods. International Journal of Molecular Sciences, 17(2):144, 2016. [6] Aron W. Fenton. Allostery: an illustrated definition for the "second secret of life". Trends in Biochemical Sciences, 33(9):420-425, 2008. [7] Hesam N. Motlagh, James O. Wrabl, Jing Li, and Vincent J. Hilser. The ensemble nature of allostery. Nature, 508(7496):331-339, 2014. [8] F. Ulrich Hartl. Protein misfolding diseases. Annual Review of Biochemistry, 86:21-26, 2017. [9] Piya Lahiry, Ali Torkamani, Nicholas J. Schork, and Robert A. Hegele. Kinase mutations in human disease: interpreting genotype-phenotype relationships. Nature Reviews Genetics, 11(1):60-74, 2010. [10] Jack W. Scannell, Alex Blanckley, Helen Boldon, and Brian Warrington. Diagnosing the decline in pharmaceutical R&D efficiency. Nature Reviews Drug Discovery, 11(3):191-200, 2012. [11] Robert A. Copeland. The drug-target residence time model: a 10-year retrospective. Nature Reviews Drug Discovery, 15(2):87-95, 2016. [12] Michael M. Pierce, C. S. Raman, and Barry T. Nall. Isothermal titration calorimetry of protein-protein interactions. Methods, 19(2):213-221, 1999. [13] Priyabrata Pattnaik. Surface plasmon resonance. Applied Biochemistry and Biotechnology, 126(2):79-92, 2005. [14] William J. Checovich, Randall E. Bolger, and Thomas Burke. Fluorescence polarization: a new tool for cell and molecular biology. Nature, 375(6528):254-256, 1995. [15] Xing Du, Yi Li, Yuan-Ling Xia, Shi-Meng Ai, Jing Liang, Peng Sang, Xing-Lai Ji, and Shu-Qun Liu. Insights into protein-ligand interactions: mechanisms, models, and methods. International Journal of Molecular Sciences, 17(2):144, 2016. [16] Ron O. Dror, Robert M. Dirks, J. P. Grossman, Huafeng Xu, and David E. Shaw. Biomolecular simulation: A computational microscope for molecular biology. Annual Review of Biophysics, 41(1):429-452, 2012. [17] Ron O. Dror, Albert C. Pan, Daniel H. Arlow, David W. Borhani, Paul Maragakis, Yibing Shan, Huafeng Xu, and David E. Shaw. Pathway and mechanism of drug binding to G-protein-coupled receptors. Proc. Natl. Acad. Sci., 108(32):13118-13123, 2011. [18] Yibing Shan, Eric T. Kim, Michael P. Eastwood, Ron O. Dror, Markus A. Seeliger, and David E. Shaw. How does a drug molecule find its target binding site? J. Amer. Chem. Soc., 133(24):9181-9183, 2011. [19] Dow P. Hurst, Alan Grossfield, Diane L. Lynch, Scott Feller, Tod D. Romo, Klaus Gawrisch, Michael C. Pitman, and Patricia H. Reggio. A lipid pathway for ligand binding is necessary for a cannabinoid G protein-coupled receptor. J. Biol. Chem., pages jbc-M109, 2010. [20] Jagannath Mondal, Navjeet Ahalawat, Subhendu Pandit, Lewis E. Kay, and Pramodh Vallurupalli.
Atomic resolution mechanism of ligand binding to a solvent inaccessible cavity in t4 lysozyme. PLoS computational biology, 14(5):e1006180, 2018. [21] Denis J Evans and Brad Lee Holian. The nose?hoover thermostat. The Journal of chemical physics, 83(8):4069?4074, 1985. [22] Giovanni Bussi, Davide Donadio, and Michele Parrinello. Canonical sampling through velocity rescaling. J Chem Phys, 126(1):014101?101107, 2007. [23] H. J. C. Berendsen, J. P. M. Postma, W. F. van Gunsteren, A. DiNola, and J. R. Haak. Molecular dynamics with coupling to an external bath. The Journal of Chemical Physics, 81(8):3684?3690, 1984. [24] D Quigley and MIJ Probert. Langevin dynamics in constant pressure extended systems. The Journal of chemical physics, 120(24):11432?11441, 2004. 123 [25] Baron Peters. Reaction Coordinates and Mechanistic Hypothesis Tests. Annual Review of Physical Chemistry, 67(1):669?690, 2016. [26] Peter G. Bolhuis, David Chandler, Christoph Dellago, and Phillip L. Geissler. T RANSITION P ATH S AMPLING : Throwing Ropes Over Rough Mountain Passes, in the Dark . Annual Review of Physical Chemistry, 53(1):291?318, 2002. [27] Martin Karplus and Gregory A Petsko. Molecular dynamics simulations in biology. Nature, 347(6294):631?639, 1990. [28] Maya Shamir, Yinon Bar-On, Rob Phillips, and Ron Milo. Snapshot: timescales in cell biology. Cell, 164(6):1302?1302, 2016. [29] Karina Mart??nez-Mayorga, Michael C Pitman, Alan Grossfield, Scott E Feller, and Michael F Brown. Retinal counterion switch mechanism in vision evaluated by molecular simulations. Journal of the American Chemical Society, 128(51):16502? 16503, 2006. [30] Ron O Dror, Robert M Dirks, JP Grossman, Huafeng Xu, and David E Shaw. Biomolecular simulation: a computational microscope for molecular biology. Annual review of biophysics, 41:429?452, 2012. [31] David E Shaw, Peter J Adams, Asaph Azaria, Joseph A Bank, Brannon Batson, Alistair Bell, Michael Bergdorf, Jhanvi Bhatt, J Adam Butts, Timothy Correia, et al. Anton 3: twenty microseconds of molecular dynamics simulation before lunch. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1?11, 2021. [32] Matt J Harvey, Giovanni Giupponi, and G De Fabritiis. Acemd: accelerating biomolecular dynamics in the microsecond time scale. Journal of chemical theory and computation, 5(6):1632?1639, 2009. [33] Thomas A Halgren and Wolfgang Damm. Polarizable force fields. Current Opinion in Structural Biology, 11(2):236 ? 242, 2001. [34] Justin A Lemkul, Jing Huang, Beno??t Roux, and Alexander D MacKerell Jr. An empirical polarizable force field based on the classical drude oscillator model: development history and recent applications. Chemical reviews, 116(9):4983? 5013, 2016. [35] Jay W Ponder, Chuanjie Wu, Pengyu Ren, Vijay S Pande, John D Chodera, Michael J Schnieders, Imran Haque, David L Mobley, Daniel S Lambrecht, Robert A DiStasio Jr, et al. Current status of the amoeba polarizable force field. The journal of physical chemistry B, 114(8):2549?2564, 2010. [36] Oleg Borodin. Polarizable force field development and molecular dynamics simulations of ionic liquids. The Journal of Physical Chemistry B, 113(33):11463? 11478, 2009. 124 [37] Mark E Tuckerman. Ab initio molecular dynamics: basic concepts, current trends and novel applications. Journal of Physics: Condensed Matter, 14(50):R1297, 2002. [38] Yuji Sugita and Yuko Okamoto. Replica-exchange molecular dynamics method for protein folding. Chemical Physics Letters, 314(1):141 ? 
151, 1999. [39] Pu Liu, Byungchan Kim, Richard A Friesner, and BJ Berne. Replica exchange with solute tempering: A method for sampling biological systems in explicit water. Proceedings of the National Academy of Sciences, 102(39):13749?13754, 2005. [40] Lingle Wang, Richard A Friesner, and BJ Berne. Replica exchange with solute scaling: a more efficient version of replica exchange with solute tempering (rest2). The Journal of Physical Chemistry B, 115(30):9431?9438, 2011. [41] Andrew J Ballard and Christopher Jarzynski. Replica exchange with nonequilibrium switches. Proceedings of the National Academy of Sciences, 106(30):12224?12229, 2009. [42] G.A. Huber and S. Kim. Weighted-ensemble brownian dynamics simulations for protein association reactions. Biophysical Journal, 70(1):97 ? 110, 1996. [43] Daniel M. Zuckerman and Lillian T. Chong. Weighted ensemble simulation: Review of methodology, applications, and software. Annual Review of Biophysics, 46(1):43?57, 2017. PMID: 28301772. [44] Anton K. Faradjian and Ron Elber. Computing time scales from reaction coordinates by milestoning. The Journal of Chemical Physics, 120(23):10880? 10889, 2004. [45] E.V. Anslyn, D.A. Dougherty, E.V. Dougherty, and University Science Books. Modern Physical Organic Chemistry. University Science Books, 2006. [46] Alessandro Laio and Michele Parrinello. Escaping free-energy minima. Proceedings of the National Academy of Sciences, 99(20):12562?12566, 2002. [47] Omar Valsson, Pratyush Tiwary, and Michele Parrinello. Enhancing important fluctuations: Rare events and metadynamics from a conceptual viewpoint. Annual Review of Physical Chemistry, 67(1):159?184, 2016. PMID: 26980304. [48] Pratyush Tiwary and Michele Parrinello. From metadynamics to dynamics. Phys. Rev. Lett., 111:230602, Dec 2013. [49] G.M. Torrie and J.P. Valleau. Nonphysical sampling distributions in monte carlo free-energy estimation: Umbrella sampling. Journal of Computational Physics, 23(2):187 ? 199, 1977. [50] Ao Ma and Aaron R. Dinner. Automatic method for identifying reaction coordinates in complex systems. The Journal of Physical Chemistry B, 109(14):6769?6779, 2005. PMID: 16851762. 125 [51] Hendrik Jung, Roberto Covino, and Gerhard Hummer. Artificial intelligence assists discovery of reaction coordinates and mechanisms from molecular dynamics simulations. arXiv preprint arXiv:1901.04595, 2019. [52] Stefan Klus, Feliks Nu?ske, Pe?ter Koltai, Hao Wu, Ioannis Kevrekidis, Christof Schu?tte, and Frank Noe?. Data-driven model reduction and transfer operator approximation. Journal of Nonlinear Science, 28(3):985?1010, 2018. [53] Frank Noe? and Feliks Nuske. A variational approach to modeling slow processes in stochastic dynamical systems. Multiscale Modeling & Simulation, 11(2):635?655, 2013. [54] Feliks Nu?uske, Bettina G Keller, Guillermo Pe?rez-Herna?ndez, Antonia SJS Mey, and Frank Noe?. Variational approach to molecular kinetics. Journal of chemical theory and computation, 10(4):1739?1752, 2014. [55] Guillermo Pe?rez-Herna?ndez, Fabian Paul, Toni Giorgino, Gianni De Fabritiis, and Frank Noe?. Identification of slow molecular order parameters for markov model construction. The Journal of chemical physics, 139(1):07B604 1, 2013. [56] Andreas Mardt, Luca Pasquali, Hao Wu, and Frank Noe?. Vampnets for deep learning of molecular kinetics. Nature Communications, 9(1):5, 2018. [57] Wei Chen, Hythem Sidky, and Andrew L. Ferguson. Capabilities and limitations of time-lagged autoencoders for slow mode discovery in dynamical systems. 
The Journal of Chemical Physics, 151(6):064123, Aug 2019. [58] Carlos X. Herna?ndez, Hannah K. Wayment-Steele, Mohammad M. Sultan, Brooke E. Husic, and Vijay S. Pande. Variational encoding of complex dynamics. Phys. Rev. E, 97:062412, Jun 2018. [59] Simon Olsson and Frank Noe?. Dynamic graphical models of molecular kinetics. Proceedings of the National Academy of Sciences, 116(30):15001?15006, 2019. [60] Wei Chen and Andrew L. Ferguson. Molecular enhanced sampling with autoencoders: On-the-fly collective variable discovery and accelerated free energy landscape exploration. Journal of Computational Chemistry, 39(25):2079?2102, 2018. [61] Carlos X. Herna?ndez, Hannah K. Wayment-Steele, Mohammad M. Sultan, Brooke E. Husic, and Vijay S. Pande. Variational encoding of complex dynamics. Phys. Rev. E, 97:062412, Jun 2018. [62] Luigi Bonati, Yue-Yu Zhang, and Michele Parrinello. Neural networks-based variationally enhanced sampling. Proceedings of the National Academy of Sciences, 116(36):17641?17647, 2019. [63] Omar Valsson and Michele Parrinello. Variational approach to enhanced sampling and free energy calculations. Physical review letters, 113(9):090601, 2014. 126 [64] Zahra Shamsi, Kevin J Cheng, and Diwakar Shukla. Reinforcement learning based adaptive sampling: Reaping rewards by exploring protein conformational landscapes. The Journal of Physical Chemistry B, 122(35):8386?8395, 2018. [65] Frank Noe?, Simon Olsson, Jonas Ko?hler, and Hao Wu. Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science, 365(6457):eaaw1147, 2019. [66] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016. [67] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215?10224, 2018. [68] Hao Wu, Jonas Ko?hler, and Frank Noe?. Stochastic normalizing flows. arXiv preprint arXiv:2002.06707, 2020. [69] Manuel Dibak, Leon Klein, and Frank Noe?. Temperature steerable flows and boltzmann generators. arXiv preprint arXiv:2108.01590, 2021. [70] Stephanie E Palmer, Olivier Marre, Michael J Berry, and William Bialek. Predictive information in a sensory population. Proc. Natl. Acad. Sci., 112(22):6908?6913, 2015. [71] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. Preprint at arXiv:physics/0004057, 2000. [72] Susanne Still. Information bottleneck approach to predictive inference. Entropy, 16(2):968?989, 2014. [73] A Elisabeth Eriksson, Walter A Baase, Xue-Jun Zhang, Dirk W Heinz, MPBE Blaber, Enoch P Baldwin, and Brian W Matthews. Response of a protein structure to cavity-creating mutations and its relation to the hydrophobic effect. Science, 255(5041):178?183, 1992. [74] Victoria A Feher, Enoch P Baldwin, and Frederick W Dahlquist. Access of ligands to cavities within the core of a protein is rapid. Nature Structural and Molecular Biology, 3(6):516, 1996. [75] Gordon J Berman, William Bialek, and Joshua W Shaevitz. Predictability and hierarchy in drosophila behavior. Proceedings of the National Academy of Sciences, 113(42):11943?11948, 2016. [76] Joa?o Marcelo Lamim Ribeiro, Pablo Bravo, Yihang Wang, and Pratyush Tiwary. Reweighted autoencoded variational bayes for enhanced sampling. J. Chem. Phys., 149(7):072301?072309, 2018. 127 [77] Yihang Wang, Joa?o Marcelo Lamim Ribeiro, and Pratyush Tiwary. 
Past?future information bottleneck for sampling molecular reaction coordinate simultaneously with thermodynamics and kinetics. Nature Communications, 10(1):3573, 2019. [78] Sebastian J Wetzel. Unsupervised learning of phase transitions: From principal component analysis to variational autoencoders. Physical Review E, 96(2):022140, 2017. [79] Jing Zhang and Ming Chen. Unfolding hidden barriers by active enhanced sampling. Physical review letters, 121(1):010601, 2018. [80] Omar Valsson, Pratyush Tiwary, and Michele Parrinello. Enhancing important fluctuations: Rare events and metadynamics from a conceptual viewpoint. Annual Review of Physical Chemistry, 67(1):159?184, 2016. PMID: 26980304. [81] Pratyush Tiwary and Michele Parrinello. A time-independent free energy estimator for metadynamics. The Journal of Physical Chemistry B, 119(3):736?742, 01 2015. [82] David JC MacKay and David JC Mac Kay. Information theory, inference and learning algorithms. Cambridge university press, 2003. [83] Pratyush Tiwary and Michele Parrinello. From metadynamics to dynamics. Phys. Rev. Lett., 111(23):230602?230606, 2013. [84] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. Preprint at arXiv:1612.00410, 2016. [85] Andreas Mardt, Luca Pasquali, Hao Wu, and Frank Noe?. Vampnets for deep learning of molecular kinetics. Nat. Comm., 9(1):5?12, 2018. [86] Wei Chen and Andrew L Ferguson. Molecular enhanced sampling with autoencoders: On-the-fly collective variable discovery and accelerated free energy landscape exploration. Journal of computational chemistry, 39(25):2079?2102, 2018. [87] Christoph Wehmeyer and Frank Noe?. Time-lagged autoencoders: Deep learning of slow collective variables for molecular kinetics. The Journal of Chemical Physics, 148(24):241703, 2018. [88] Mohammad M. Sultan, Hannah K. Wayment-Steele, and Vijay S. Pande. Transferable neural networks for enhanced sampling of protein dynamics. Journal of Chemical Theory and Computation, 14(4):1887?1894, 2018. PMID: 29529369. [89] Nigel Goldenfeld. Lectures on phase transitions and the renormalization group. CRC Press, 2018. [90] Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning, volume 1. MIT press Cambridge, 2016. 128 [91] Bruce J Berne, Michal Borkovec, and John E Straub. Classical and modern methods in reaction rate theory. J. Phys. Chem., 92(13):3711?3725, 1988. [92] John A Montgomery Jr, David Chandler, and Bruce J Berne. Trajectory analysis of a kinetic theory for isomerization dynamics in condensed phases. The Journal of Chemical Physics, 70(9):4056?4066, 1979. [93] Christoph Dellago, Peter G Bolhuis, and David Chandler. On the calculation of reaction rate constants in the transition path ensemble. The Journal of chemical physics, 110(14):6617?6625, 1999. [94] Omar Valsson, Pratyush Tiwary, and Michele Parrinello. Enhancing important fluctuations: Rare events and metadynamics from a conceptual viewpoint. Annual Review of Physical Chemistry, 67(1):159?184, 2016. PMID: 26980304. [95] Giovanni Bussi, Alessandro Laio, and Pratyush Tiwary. Metadynamics: A Unified Framework for Accelerating Rare Events and Sampling Thermodynamics and Kinetics, pages 1?31. Springer International Publishing, Cham, 2018. [96] Joao Marcelo Lamim Ribeiro and Pratyush Tiwary. Towards achieving efficient and accurate ligand-protein unbinding with deep learning and molecular dynamics through rave. Journal of chemical theory and computation, 2019. 
[97] Yong Wang, Joao Miguel Martins, and Kresten Lindorff-Larsen. Biomolecular conformational changes and ligand binding: from kinetics to thermodynamics. Chemical science, 8(9):6466?6473, 2017. [98] Zachary Smith, Debabrata Pramanik, Sun-Ting Tsai, and Pratyush Tiwary. Multi-dimensional spectral gap optimization of order parameters (sgoop) through conditional probability factorization. J. Chem. Phys., 2018. [99] Guillaume Bouvignies, Pramodh Vallurupalli, D Flemming Hansen, Bruno E Correia, Oliver Lange, Alaji Bah, Robert M Vernon, Frederick W Dahlquist, David Baker, and Lewis E Kay. Solution structure of a minor and transiently formed state of a t4 lysozyme mutant. Nature, 477(7362):111, 2011. [100] Marcus D Collins, Gerhard Hummer, Michael L Quillin, Brian W Matthews, and Sol M Gruner. Cooperative water filling of a nonpolar protein cavity observed by high-pressure crystallography and simulation. Proceedings of the National Academy of Sciences, 102(46):16668?16671, 2005. [101] Pavan Ravindra, Zachary Smith, and Pratyush Tiwary. Automatic mutual information noise omission (amino): generating order parameters for molecular systems. Mol. Syst. Des. Eng., pages ?, 2020. [102] Christoph Wehmeyer and Frank Noe?. Time-lagged autoencoders: Deep learning of slow collective variables for molecular kinetics. J. Chem. Phys., 148(24):241703, 2018. 129 [103] D. J. Bicout and Attila Szabo. Electron transfer reaction dynamics in non-debye solvents. J. Chem. Phys., 109(6):2325?2338, 1998. [104] Pratyush Tiwary and B. J. Berne. Predicting reaction coordinates in energy landscapes with diffusion anisotropy. J. Chem. Phys., 147(15):152701, 2017. [105] Edina Rosta and Gerhard Hummer. Free energies from dynamic weighted histogram analysis using unbiased markov state model. Journal of chemical theory and computation, 11(1):276?285, 2015. [106] Giovanni Bussi and Michele Parrinello. Accurate sampling using langevin dynamics. Phys. Rev. E, 75:056707, May 2007. [107] Wei Chen, Hythem Sidky, and Andrew L. Ferguson. Capabilities and limitations of time-lagged autoencoders for slow mode discovery in dynamical systems. J. Chem. Phys., 151(6):064123, 2019. [108] Natalia Sa?nchez De Groot, Alexandros Armaos, Ricardo Gran?a-Montes, Marion Alriquet, Giulia Calloni, R Martin Vabulas, and Gian Gaetano Tartaglia. Rna structure drives interaction with proteins. Nature communications, 10(1):1?13, 2019. [109] Samantha M Meyer, Christopher C Williams, Yoshihiro Akahori, Toru Tanaka, Haruo Aikawa, Yuquan Tong, Jessica L Childs-Disney, and Matthew D Disney. Small molecule recognition of disease-relevant rna structures. Chemical Society Reviews, 49(19):7167?7199, 2020. [110] James P Falese, Anita Donlic, and Amanda E Hargrove. Targeting rna with small molecules: from fundamental principles towards the clinic. Chemical Society Reviews, 50(4):2224?2243, 2021. [111] Brian J Tucker and Ronald R Breaker. Riboswitches as versatile gene control elements. Current opinion in structural biology, 15(3):342?348, 2005. [112] Evangelia Lekka and Jonathan Hall. Noncoding rnas in disease. FEBS letters, 592(17):2884?2900, 2018. [113] Manel Esteller. Non-coding rnas in human disease. Nature reviews genetics, 12(12):861?874, 2011. [114] Kenneth F Blount and Ronald R Breaker. Riboswitches as antibacterial drug targets. Nature biotechnology, 24(12):1558?1564, 2006. [115] Colleen M Connelly, Michelle H Moon, and John S Schneekloth Jr. The emerging role of rna as a therapeutic target for small molecules. Cell chemical biology, 23(9):1077?1090, 2016. 
[116] Aline Umuhire Juru and Amanda E Hargrove. Frameworks for targeting rna with small molecules. Journal of Biological Chemistry, page 100191, 2021. 130 [117] Jiri Sponer, Giovanni Bussi, Miroslav Krepl, Pavel Bana?s?, Sandro Bottaro, Richard A Cunha, Alejandro Gil-Ley, Giovanni Pinamonti, Simo?n Poblete, Petr Jurec?ka, et al. Rna structural dynamics as captured by molecular simulations: a comprehensive overview. Chemical reviews, 118(8):4177?4338, 2018. [118] Sandro Bottaro and Kresten Lindorff-Larsen. Biophysical experiments and biomolecular simulations: A perfect match? Science, 361(6400):355?360, 2018. [119] Nils G. Walter. The blessing and curse of rna dynamics: past, present, and future. Methods, 49(2):85?86, 2009. RNA Dynamics. [120] Kevin M Weeks. Advances in rna structure analysis by chemical probing. Current opinion in structural biology, 20(3):295?304, 2010. [121] Hashim M Al-Hashimi and Nils G Walter. Rna dynamics: it is about time. Current opinion in structural biology, 18(3):321?329, 2008. [122] Alexander Serganov and Evgeny Nudler. A decade of riboswitches. Cell, 152(1- 2):17?24, 2013. [123] Colleen M. Connelly, Tomoyuki Numata, Robert E. Boer, Michelle H. Moon, Ranu S. Sinniah, Joseph J. Barchi, Adrian R. Ferre?-D?Amare?, and John S. Schneekloth. Synthetic ligands for preq1 riboswitches provide structural and mechanistic insights into targeting rna tertiary structure. Nature Communications, 10(1):1501, 2019. [124] Adam Roth and Ronald R. Breaker. The structural and functional diversity of metabolite-binding riboswitches. Annual review of biochemistry, 78:305?334, 2009. [125] Krishna C Suddala, Arlie J Rinaldi, Jun Feng, Anthony M Mustoe, Catherine D Eichhorn, Joseph A Liberman, Joseph E Wedekind, Hashim M Al-Hashimi, Charles L Brooks III, and Nils G Walter. Single transcriptional and translational preq1 riboswitches adopt similar pre-folded ensembles that follow distinct folding pathways into the same ligand-bound structure. Nucleic acids research, 41(22):10462?10475, 2013. [126] Griffin M Schroeder, Debapratim Dutta, Chapin E Cavender, Jermaine L Jenkins, Elizabeth M Pritchett, Cameron D Baker, John M Ashton, David H Mathews, and Joseph E Wedekind. Analysis of a preQ1-I riboswitch in effector-free and bound states reveals a metabolite-programmed nucleobase-stacking spine that controls gene regulation. Nucleic Acids Research, 48(14):8146?8164, 06 2020. [127] Alessandro Barducci, Giovanni Bussi, and Michele Parrinello. Well-tempered metadynamics: A smoothly converging and tunable free-energy method. Phys. Rev. Lett., 100:020603, Jan 2008. 131 [128] Nathan A Siegfried, Steven Busan, Greggory M Rice, Julie A E Nelson, and Kevin M Weeks. Rna motif discovery by shape and mutational profiling (shape- map). Nature Methods, 11(9):959?965, 2014. [129] Matthew J Smola, Greggory M Rice, Steven Busan, Nathan A Siegfried, and Kevin M Weeks. Selective 2?-hydroxyl acylation analyzed by primer extension and mutational profiling (shape-map) for direct, versatile and accurate rna structure analysis. Nature Protocols, 10(11):1643?1669, 2015. [130] Michelle H Moon, Thomas A Hilimire, Allix M Sanders, and John S Schneekloth Jr. Measuring rna?ligand interactions with microscale thermophoresis. Biochemistry, 57(31):4638?4643, 2018. [131] Colleen M. Connelly, Tomoyuki Numata, Robert E. Boer, Michelle H. Moon, Ranu S. Sinniah, Joseph J. Barchi, Adrian R. Ferre?-D?Amare?, and John S. Schneekloth. 
Synthetic ligands for preq1 riboswitches provide structural and mechanistic insights into targeting rna tertiary structure. Nature Communications, 10(1):1501, 2019. [132] Tycho Marinus, Adam B Fessler, Craig A Ogle, and Danny Incarnato. A novel shape reagent enables the analysis of rna structure in living cells with unprecedented accuracy. Nucleic acids research, 49(6):e34?e34, 2021. [133] Griffin M Schroeder, Debapratim Dutta, Chapin E Cavender, Jermaine L Jenkins, Elizabeth M Pritchett, Cameron D Baker, John M Ashton, David H Mathews, and Joseph E Wedekind. Analysis of a preq1-i riboswitch in effector-free and bound states reveals a metabolite-programmed nucleobase-stacking spine that controls gene regulation. Nucleic acids research, 48(14):8146?8164, 2020. [134] Giovanni Pinamonti, Sandro Bottaro, Cristian Micheletti, and Giovanni Bussi. Elastic network models for rna: a comparative assessment with molecular dynamics and shape experiments. Nucleic acids research, 43(15):7260?7269, 2015. [135] Dazhi Tan, Stefano Piana, Robert M. Dirks, and David E. Shaw. Rna force field with accuracy comparable to state-of-the-art protein force fields. Proceedings of the National Academy of Sciences, 115(7):E1346?E1355, 2018. [136] Pratyush Tiwary and Michele Parrinello. A time-independent free energy estimator for metadynamics. The Journal of Physical Chemistry B, 119(3):736?742, 2015. [137] Manel Esteller. Non-coding rnas in human disease. Nature Reviews Genetics, 12(12):861?-874, 2011. [138] Qi Yan, Chengming Zhu, Shouhong Guang, and Xuezhu Feng. The functions of non-coding rnas in rrna regulation. Frontiers in Genetics, 10:290, 2019. 132 [139] Xuanyu Liu, Yi Ma, Kunlun Yin, Wenke Li, Wen Chen, Yujing Zhang, Changsheng Zhu, Tianjiao Li, Bianmei Han, Xuewen Liu, Shuiyun Wang, and Zhou Zhou. Long non-coding and coding rna profiling using strand-specific rna-seq in human hypertrophic cardiomyopathy. Scientific Data, 6(1):90, 2019. [140] Colleen M. Connelly, Michelle H. Moon, and John S. Schneekloth. The emerging role of rna as a therapeutic target for small molecules. Cell Chemical Biology, 23(9):1077?1090, 2016. [141] Warren L DeLano and Sarina Bromberg. Pymol user?s guide. DeLano Scientific LLC, 629, 2004. [142] William Humphrey, Andrew Dalke, and Klaus Schulten. Vmd: visual molecular dynamics. Journal of molecular graphics, 14(1):33?38, 1996. [143] Laura R Ganser, Megan L Kelly, Daniel Herschlag, and Hashim M Al-Hashimi. The roles of structural dynamics in the cellular functions of rnas. Nature reviews Molecular cell biology, 20(8):474?489, 2019. [144] Matthew Halvorsen, Joshua S Martin, Sam Broadaway, and Alain Laederach. Disease-associated mutations that alter the rna structural ensemble. PLoS genetics, 6(8):e1001074, 2010. [145] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256?2265. PMLR, 2015. [146] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239, 2020. [147] Ulrich HE Hansmann. Parallel tempering algorithm for conformational studies of biological molecules. Chemical Physics Letters, 281(1-3):140?150, 1997. [148] Cameron Abrams and Giovanni Bussi. Enhanced sampling in molecular dynamics using metadynamics, replica-exchange, and temperature-acceleration. Entropy, 16(1):163?199, 2014. [149] Robert Abel, Lingle Wang, Edward D Harder, BJ Berne, and Richard A Friesner. 
Advancing drug discovery through enhanced free energy calculations. Accounts of chemical research, 50(7):1625?1632, 2017. [150] Andrew J Ballard and Christopher Jarzynski. Replica exchange with nonequilibrium switches. Proceedings of the National Academy of Sciences, 106(30):12224?12229, 2009. [151] Simon Trebst, Matthias Troyer, and Ulrich HE Hansmann. Optimized parallel tempering simulations of proteins. The Journal of chemical physics, 124(17):174903, 2006. 133 [152] Walter Nadler and Ulrich HE Hansmann. Optimizing replica exchange moves for molecular dynamics. Physical Review E, 76(5):057102, 2007. [153] Jaegil Kim, Thomas Keyes, and John E Straub. Generalized replica exchange method. The Journal of chemical physics, 132(22):224107, 2010. [154] John D Chodera and Michael R Shirts. Replica exchange and expanded ensemble simulations as gibbs sampling: Simple improvements for enhanced mixing. The Journal of chemical physics, 135(19):194110, 2011. [155] Alejandro Gil-Ley and Giovanni Bussi. Enhanced conformational sampling using replica exchange with collective-variable tempering. Journal of chemical theory and computation, 11(3):1077?1085, 2015. [156] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org. [157] Emilio Gallicchio, Michael Andrec, Anthony K Felts, and Ronald M Levy. Temperature weighted histogram analysis method, replica exchange, and transition paths. The Journal of Physical Chemistry B, 109(14):6722?6731, 2005. [158] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234?241. Springer, 2015. [159] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018. [160] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, ?ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998?6008. [161] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661?1674, 2011. [162] William L Jorgensen and Julian Tirado-Rives. Potential energy functions for atomic-level simulations of water and organic and biomolecular systems. Proceedings of the National Academy of Sciences, 102(19):6665?6670, 2005. [163] Jing Huang, Sarah Rauscher, Grzegorz Nawrocki, Ting Ran, Michael Feig, Bert L De Groot, Helmut Grubmu?ller, and Alexander D MacKerell. Charmm36m: an improved force field for folded and intrinsically disordered proteins. Nature methods, 14(1):71?73, 2017. [164] Virgiliu Botan, Ellen HG Backus, Rolf Pfister, Alessandro Moretto, Marco Crisma, Claudio Toniolo, Phuong H Nguyen, Gerhard Stock, and Peter Hamm. Energy transport in peptide helices. Proceedings of the National Academy of Sciences, 104(31):12749?12754, 2007. 134 [165] Mithun Biswas, Benjamin Lickert, and Gerhard Stock. Metadynamics enhanced markov modeling of protein dynamics. The Journal of Physical Chemistry B, 122(21):5508?5514, 2018. [166] Shams Mehdi, Dedi Wang, Shashank Pant, and Pratyush Tiwary. Accelerating all- atom simulations and gaining mechanistic understanding of biophysical systems through state predictive information bottleneck, 2021. [167] Sylvie Bannwarth and Anne Gatignol. HIV-1 TAR RNA: The target of molecular interactions between the virus and its host. 
Current HIV Research, 3(1):61?71, January 2005. [168] Ursula Schulze-Gahmen and James H. Hurley. Structural mechanism for HIV-1 TAR loop recognition by tat and the super elongation complex. Proceedings of the National Academy of Sciences, 115(51):12973?12978, December 2018. [169] Laura R. Ganser, Megan L. Kelly, Daniel Herschlag, and Hashim M. Al-Hashimi. The roles of structural dynamics in the cellular functions of RNAs. Nature Reviews Molecular Cell Biology, 20(8):474?489, June 2019. [170] Jir??? S?poner, Giovanni Bussi, Miroslav Krepl, Pavel Bana?s?, Sandro Bottaro, Richard A. Cunha, Alejandro Gil-Ley, Giovanni Pinamonti, Simo?n Poblete, Petr Jurec?ka, Nils G. Walter, and Michal Otyepka. RNA structural dynamics as captured by molecular simulations: A comprehensive overview. Chemical Reviews, 118(8):4177?4338, January 2018. [171] Christina Bergonzo, Niel M Henriksen, Daniel R Roe, and 3rd Cheatham, Thomas E. Highly sampled tetranucleotide and tetraloop motifs enable evaluation of common rna force fields. RNA (New York, N.Y.), 21(9):1578?1590, 09 2015. [172] Christopher Jarzynski. Rare events and the convergence of exponentially averaged work values. Physical Review E, 73(4):046105, 2006. [173] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International conference on machine learning, pages 1530?1538. PMLR, 2015. [174] Giorgio Visani, Enrico Bagli, Federico Chesani, Alessandro Poluzzi, and Davide Capuzzo. Statistical stability indices for lime: Obtaining reliable explanations for machine learning models. Journal of the Operational Research Society, pages 1? 11, 2020. [175] Filip Karlo Dos?ilovic?, Mario Brc?ic?, and Nikica Hlupic?. Explainable artificial intelligence: A survey. In 2018 41st International convention on information and communication technology, electronics and microelectronics (MIPRO), pages 0210?0215. IEEE, 2018. 135 [176] Alex Davies, Petar Velic?kovic?, Lars Buesing, Sam Blackwell, Daniel Zheng, Nenad Tomas?ev, Richard Tanburn, Peter Battaglia, Charles Blundell, Andra?s Juha?sz, et al. Advancing mathematics by guiding human intuition with ai. Nature, 600(7887):70?74, 2021. [177] Floris Takens. Detecting strange attractors in turbulence. In Dynamical systems and turbulence, Warwick 1980, pages 366?381. Springer, 1981. [178] Krzysztof Baran?ski, Yonatan Gutman, and Adam S?piewak. A probabilistic takens theorem. Nonlinearity, 33(9):4940, 2020. [179] Robert M Clegg. Fluorescence resonance energy transfer. Current opinion in biotechnology, 6(1):103?110, 1995. [180] M Abraham, B Hess, D van der Spoel, and E Lindahl. Gromacs reference manual, version 2016.4. Department of Biophysical Chemistry, University of Groningen, 2016. [181] Mark James Abraham, Teemu Murtola, Roland Schulz, Szila?rd Pa?ll, Jeremy C. Smith, Berk Hess, and Erik Lindahl. Gromacs: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX, 1-2:19 ? 25, 2015. [182] Gareth A. Tribello, Massimiliano Bonomi, Davide Branduardi, Carlo Camilloni, and Giovanni Bussi. Plumed 2: New feathers for an old bird. Computer Physics Communications, 185(2):604 ? 613, 2014. [183] Berk Hess, Henk Bekker, Herman J. C. Berendsen, and Johannes G. E. M. Fraaije. Lincs: A linear constraint solver for molecular simulations. Journal of Computational Chemistry, 18(12):1463?1472, 1997. [184] Jing Huang, Sarah Rauscher, Grzegorz Nawrocki, Ting Ran, Michael Feig, Bert L De Groot, Helmut Grubmu?ller, and Alexander D MacKerell. 
CHARMM36m: an improved force field for folded and intrinsically disordered proteins. Nature Methods, 14(1):71-73, 2017. [185] M. Parrinello and A. Rahman. Crystal structure and pair potentials: A molecular-dynamics study. Phys. Rev. Lett., 45:1196-1199, Oct 1980. [186] William G. Hoover. Canonical dynamics: Equilibrium phase-space distributions. Physical Review A, 31(3):1695, 1985. [187] Massimiliano Bonomi. Promoting transparency and reproducibility in enhanced molecular simulations. Nature Methods, 16(8):670-673, 2019. [188] Christina Bergonzo, Niel M. Henriksen, Daniel R. Roe, and Thomas E. Cheatham. Highly sampled tetranucleotide and tetraloop motifs enable evaluation of common RNA force fields. RNA, 21(9):1578-1590, June 2015.