ABSTRACT

Title of dissertation: MOLECULAR DYNAMICS SIMULATION AND MACHINE LEARNING STUDY OF BIOLOGICAL PROCESSES
Mahdi Ghorbani, Doctor of Philosophy, 2022
Dissertation directed by: Professor Jeffery B. Klauda, University of Maryland, Chemical Engineering

In this dissertation, I use computational techniques, especially molecular dynamics (MD) and machine learning, to study important biological processes. MD simulations can be used effectively to investigate biologically relevant systems at length scales and timescales that are otherwise inaccessible to experimental techniques. These include, but are not limited to, the thermodynamics and kinetics of protein folding, protein-ligand binding free energies, the interaction of proteins with membranes, and the rational design of new therapeutics. The first chapter gives a detailed description of the computational methods, including MD, Markov state modeling, and deep learning. In the second chapter, we studied membrane-active peptides using MD simulation and machine learning. Two cell-penetrating peptides, MPG and Hst5, were simulated in the presence of a membrane. We showed that MPG enters the model membrane through its N-terminal hydrophobic residues, while Hst5 remains attached to the phosphate layer; formation of a helical conformation helps MPG insert more deeply into the membrane. Natural language processing (NLP) and deep generative modeling with a variational-attention-based variational autoencoder (VAE) were used to generate novel antimicrobial peptides. These in silico generated peptides are of high quality, with physicochemical properties similar to those of real antimicrobial peptides. In the third chapter, we studied the kinetics of protein folding using Markov state models (MSMs) and machine learning. MSM analysis of misfolding in β2-microglobulin (β2m) revealed metastable states in which the outer strands are unfolded and the hydrophobic core is exposed to solvent and highly amyloidogenic. In the next part of this chapter, we propose a machine learning model, the Gaussian mixture variational autoencoder (GMVAE), for simultaneous dimensionality reduction and clustering of MD simulations. The last part of this chapter describes a novel machine learning model, GraphVAMPNet, which combines graph neural networks with the variational approach to Markov processes for kinetic modeling of protein folding. In the last chapter, we study two membrane proteins, the spike protein of SARS-COV-2 and the EAG potassium channel, using MD simulations. Binding free energy calculations using MMPBSA showed a higher binding affinity of the receptor binding domain (RBD) of SARS-COV-2 for its receptor ACE2 than that of SARS-COV, which is one of the major reasons for its higher infection rate; hotspots of interaction were also identified at the interface. Glycans on the spike protein shield the spike from antibodies. Our MD simulations of the full-length spike showed that glycan dynamics give the spike protein an effective shield; however, network analysis revealed breaches in this shield at the RBD in the open state that could be exploited by therapeutics. In the last section, we study ligand binding to the PAS domain of the EAG potassium channel and show that residue Tyr71 blocks the binding pocket; ligand binding inhibits the current through the EAG channel.
MOLECULAR DYNAMICS SIMULATION AND MACHINE LEARNING STUDY OF BIOLOGICAL PROCESSES by Mahdi Ghorbani Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2022 Advisory Committee: Professor Jeffery B. Klauda, Chair/Advisor Dr. Bernard R. Brooks, Co-Advisor Prof. Srinivasa R. Raghavan Prof. Taylor J. Woehl Prof. Pratyush Tiwary ?c Copyright by Mahdi Ghorbani 2022 Dedication I would like to dedicate this dissertation to my family and friends especially my mother who has always been a source of inspiration and hard work to me. They have taught me to have a callous mind, never give up when confronting daunting challenges and know that hard work always pays off. I also dedicate this dissertation to all my friends who supported me throughout the process. I appreciate what they have done for me especially my dear friends Hamid Doosthosseini and Ehsan Faegh who always encouraged me to pursue science. ii Acknowledgments It is impossible to acknowledge everyone that supported me during my PhD. Firstly, I would like to thank my advisor Prof. Klauda. He has been a wonderful mentor who has always helped me in successfully completing my PhD and I am proud of working with him. I would also like to thank Dr. Bernard Brooks my Co-advisor at NIH. I consider myself lucky to work with such intelligent and caring people. Working alongside them and group members at NIH and UMD gave me the expertise I needed to complete my degree and without their help this wouldn?t have been possible. I would like to thank my collaborators Dr. Karlsson at UMD and Dr. Brelidze at Georgetown university. I would like to thank the computing resources at UMD (deepthought2) and NIH (biowulf and lobos) for providing the computational time for my projects. iii List of Figures 1.1 Periodic boundary conditions . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2 Illustration of replica exchange molecular dynamics (REMD) . . . . . . . . 10 1.3 An illustration of metadynamics simulation technique . . . . . . . . . . . . 12 1.4 Representation of a perceptron . . . . . . . . . . . . . . . . . . . . . . . . 22 1.5 multilayer perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 1.6 Gradient descent algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 24 1.7 Convolutional neural networks . . . . . . . . . . . . . . . . . . . . . . . . 25 1.8 Unrolling a RNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 1.9 Illustration of Long short term memory unit . . . . . . . . . . . . . . . . . 27 1.10 Variational autoencoder with encoder, decoder and a gaussian latent space . 30 2.1 Secondary structure of Hst5 and MPG . . . . . . . . . . . . . . . . . . . . 34 2.2 Starting conformations of peptides . . . . . . . . . . . . . . . . . . . . . . 39 2.3 Translocation of CPPs using HMMM . . . . . . . . . . . . . . . . . . . . . 40 2.4 Heatmaps for insertion of peptides into the model membrane . . . . . . . . 41 2.5 Heatmap plots for Hst5 and MPG starting as helices . . . . . . . . . . . . . 42 2.6 Heatmaps for MPG and Hst5 starting from unfolded . . . . . . . . . . . . . 44 2.7 Heatmap for Hst5 for orientation 2 and 3 . . . . . . . . . . . . . . . . . . . 45 2.8 Heatmap for MPG for orientation 2 and 3 . . . . . . . . . . . . . . . . . . 46 2.9 Cellular uptake studies done in Karlsson lab . . . . . . . . . . . . . . . . . 
46 iv 2.10 Effect of time and cargo orientation on translocation . . . . . . . . . . . . . 47 2.11 snapshots of MPG with DOPC(80%)-DOPG(20%) . . . . . . . . . . . . . 49 2.12 insertion depth and secondary structure of MPG at 100 and 80 DOPC . . . 49 2.13 insertion depth and secondary structure of MPG at 60 and 40 DOPC . . . . 50 2.14 classification network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 2.15 Training and validation accuracy . . . . . . . . . . . . . . . . . . . . . . . 62 2.16 AMP generative model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 2.17 Evaluation of the AMP-generative model over different values of ?KL and ?a 64 2.18 Physico-Chemical properties of the generated 10,000 AMP sequences . . . 66 3.1 feature selection with VAMP-2 score over 3 lagtimes (10,20,50 ns) . . . . . 74 3.2 Optimal choice of hyperparameters for MSM . . . . . . . . . . . . . . . . 74 3.3 Free energy landscape in the space of 4 TICs . . . . . . . . . . . . . . . . . 75 3.4 A) implied timescales B) CK test . . . . . . . . . . . . . . . . . . . . . . . 76 3.5 Eigenvectors of the transition matrix . . . . . . . . . . . . . . . . . . . . . 77 3.6 fraction of native contact for different states . . . . . . . . . . . . . . . . . 77 3.7 metastable state assignment according to PCCA++ over the TICA space . . 78 3.8 hydrophobic SASA over the TICA landscape . . . . . . . . . . . . . . . . 79 3.9 hydrophobic SASA and flux over different states . . . . . . . . . . . . . . . 81 3.10 representative structures and the timescale of transition between different states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 3.11 RMSF of different metastable states from sampled snapshots. . . . . . . . . 82 3.12 ? -sheet content of different states A) S1 B) S2) C) S3 and D) S4 . The grey area in each figure shows the ? -sheet content of state S1 . . . . . . . . . . 83 3.13 Representative structures of states S1, S2 and S3 . . . . . . . . . . . . . . . 86 3.14 graphical model for inference and generative parts of GMVAE . . . . . . . 94 v 3.15 Native folded structure of studied proteins. . . . . . . . . . . . . . . . . . . 94 3.16 Training and validation loss for Trp-Cage example . . . . . . . . . . . . . . 96 3.17 Reconstruction loss vs latent space dimension . . . . . . . . . . . . . . . . 97 3.18 Results of GMVAE for Trp-cage. . . . . . . . . . . . . . . . . . . . . . . . 98 3.19 rp-cage folding transitions . . . . . . . . . . . . . . . . . . . . . . . . . . 100 3.20 clusters of Trp-cage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 3.21 Results of GMVAE for BBA . . . . . . . . . . . . . . . . . . . . . . . . . 102 3.22 BBA folding and unfolding transitions . . . . . . . . . . . . . . . . . . . . 103 3.23 RMSD of different parts of BBA . . . . . . . . . . . . . . . . . . . . . . . 104 3.24 GMVAE embedding results for villin . . . . . . . . . . . . . . . . . . . . . 106 3.25 Transitions between different states in villin . . . . . . . . . . . . . . . . . 106 3.26 RMSD distribution of helices in Villin . . . . . . . . . . . . . . . . . . . . 108 3.27 Overview of the architecture of GraphVAMPNet . . . . . . . . . . . . . . . 117 3.28 results of GraphVAMPNet for TrpCage . . . . . . . . . . . . . . . . . . . . 126 3.29 training and validation losses from SchNet based VAMPNet TrpCage . . . . 127 3.30 Representative structure of each metastable state in TrpCage . . . . . . . . 130 3.31 Results of GraphVampNet for Villin . . . . . . . . . . . . . . . . . 
. . . . 132 3.32 Representative structure of each metastable state in Villin . . . . . . . . . . 133 3.33 Results of GraphVAMPNet for NTL9 . . . . . . . . . . . . . . . . . . . . 135 3.34 Representative structure of each metastable state in NTL9 . . . . . . . . . . 137 3.35 Comparison of implied timescales . . . . . . . . . . . . . . . . . . . . . . 141 3.36 Comparison of implied timescales from GraphVAMPNet and standard VAMP- Net for A)TrpCage B)Villin C)NTL9 . . . . . . . . . . . . . . . . . . . . . 142 4.1 superposition of RBD of SARS-COV . . . . . . . . . . . . . . . . . . . . . 147 4.2 Sequence comparison of the receptor binding motif (RBM) . . . . . . . . . 149 vi 4.3 C? RMSD plots for nCOV-2019 and SARS-COV and mutants of SARS- COV-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 4.4 RMSF plots for nCOV-2019-WT and mutants . . . . . . . . . . . . . . . . 156 4.5 Mapping the principal components of RBD . . . . . . . . . . . . . . . . . 158 4.6 Dynamic cross correlation maps for nCOV-2019, SARS-COV and mutants . 159 4.7 Binding energy decomposition per residue for RBD of nCOV-2019 and SARS-COV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 4.8 Total binding energy of SARS-COV, nCOV-2019 and mutants . . . . . . . 163 4.9 H bonds between RBD of nCOV-2019 and ACE2 . . . . . . . . . . . . . . 167 4.10 Binding energy decomposition for systems: nCOV-2019, T478I and N439K. 171 4.11 Structure of spike protein and its glycosylation pattern . . . . . . . . . . . . 178 4.12 RMSD of different regions of the spike protein . . . . . . . . . . . . . . . . 184 4.13 Snapshot of spike in open state after 1000ns . . . . . . . . . . . . . . . . . 184 4.14 RMSF of glycans in the open and closed state . . . . . . . . . . . . . . . . 185 4.15 2-dimensional PCA for the open state of spike protein head . . . . . . . . . 186 4.16 porcupine plot for RBD-up and distribution of distances between NTD of each monomer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187 4.17 Glycan occupacy of different regions . . . . . . . . . . . . . . . . . . . . . 189 4.18 Centrality measurements (eigenvector and betweenness) for A) RBD-up and B) RBD-down states. . . . . . . . . . . . . . . . . . . . . . . . . . . . 190 4.19 network centralities for the head region . . . . . . . . . . . . . . . . . . . . 192 4.20 microdomains in the open state of spike head . . . . . . . . . . . . . . . . 194 4.21 Antibody overlap analysis for the open state. . . . . . . . . . . . . . . . . . 196 4.22 Avg number of clashes of each glycan with the overlaid antibody . . . . . . 198 vii 4.23 Structure of full-length EAG channel embedded in membrane. VSD stands for voltage sensing domain which includes S1-S4 transmembrane helices. Pore domain (PD) includes transmembranes S1 and S2 . . . . . . . . . . . 204 4.24 Replica exchange solute tempering. A) distribution of enthalpies for differ- ent replicas showing the high overlap B) transition of Tyr71 residue from the crystal structure to a state after REST simulations where it no longer blocks the binding pocket C) Random walk of resplicas 1,2 and 3 in the replica space with their neighbors over the course of simulations D) Dock- ing CPZ to the binding pocket of PAS after REST simulation. . . . . . . . 209 4.25 Docked poses of the 5 ligands . . . . . . . . . . . . . . . . . . . . . . . . . 
210 4.26 Binding free energy decomposition for residues with higher than 0.5 kcal/mol contribution to binding free energy of different ligands . . . . . . . . . . . 211 4.27 electrostatic nature of binding pocket . . . . . . . . . . . . . . . . . . . . . 212 4.28 RMSD of different regions of the EAG channel for apo and bound states . . 213 4.29 RMSF of different regions of the PAS . . . . . . . . . . . . . . . . . . . . 214 4.30 H-bonds and salt-bridges between PAS and CNBHD for apo and bound state 214 4.31 Current flow analysis for the bound state and the current flow plots . . . . . 216 4.32 current flow difference . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217 viii List of Tables 2.1 Membrane lipid components for yeast membrane . . . . . . . . . . . . . . 36 2.2 BLEU, accuracy and perplexity of a few selected models . . . . . . . . . . 65 2.3 self-BLEU (sBLEU) for 3,4 and 5-grams and KL divergence . . . . . . . . 68 3.1 stationary probability and free energy of different metastable states . . . . . 78 3.2 Chosen hyperparameters for each protein . . . . . . . . . . . . . . . . . . . 93 3.3 Hyperparameters for each system in GraphVAMPNet . . . . . . . . . . . . 125 3.4 Average VAMP-2 score for each system . . . . . . . . . . . . . . . . . . . 125 3.5 Implied timescales calculated for TrpCage (at lagtime of 20ns), Villin (at lagtime of 20ns) and NTL9 (lagtime of 200ns) from SchNet based Graph- VAMPNet and standard VAMPNet . . . . . . . . . . . . . . . . . . . . . . 128 4.1 Binding free energy decomposition in kcal/mol for nCOV-2019, SARS- COV and mutants of SASR-COV-2 . . . . . . . . . . . . . . . . . . . . . . 163 4.2 H-bonds and salt bridges between nCOV-2019(salt bridges are shown as bold)166 4.3 Binding free energies kcal/mol . . . . . . . . . . . . . . . . . . . . . . . . 210 ix Table of Contents Dedication ii Acknowledgements iii 1 Introduction 1 1.1 Molecular dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Force Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.1.2 Hamiltonian equations of motion . . . . . . . . . . . . . . . . . . . 3 1.1.3 Integrators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.1.4 Periodic boundary conditions . . . . . . . . . . . . . . . . . . . . . 5 1.1.5 Cutoff methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.1.6 Thermostat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.1.7 Barostats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.1.8 Enhanced sampling methods . . . . . . . . . . . . . . . . . . . . . 9 1.2 Kinetic modeling and Markov State Models . . . . . . . . . . . . . . . . . 14 1.2.1 MSM construction from MD simulations . . . . . . . . . . . . . . 18 1.3 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 1.3.1 Artificial Neural network . . . . . . . . . . . . . . . . . . . . . . . 21 1.3.2 Convolutional neural networks . . . . . . . . . . . . . . . . . . . . 24 1.3.3 Recurrent neural networks . . . . . . . . . . . . . . . . . . . . . . 25 1.3.4 Variational Autoencoders . . . . . . . . . . . . . . . . . . . . . . . 28 1.3.5 Graph neural networks . . . . . . . . . . . . . . . . . . . . . . . . 31 1.4 Dissertation overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2 Machine learning and MD for membrane active peptides 33 2.1 Molecular dynamics of cell penetrating peptide interaction with model mem- branes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . 33 2.1.1 Simulation methods . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.1.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.1.3 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . 48 2.2 Deep generative models for Antimicrobial peptide discovery . . . . . . . . 53 x 2.2.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 2.2.2 AMP prediction model . . . . . . . . . . . . . . . . . . . . . . . . 55 2.2.3 Variational autoencoder . . . . . . . . . . . . . . . . . . . . . . . . 57 Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 Variational Attention . . . . . . . . . . . . . . . . . . . . . . . . . 59 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 2.2.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . 60 Antimicrobial prediction . . . . . . . . . . . . . . . . . . . . . . . 60 Training the generating network . . . . . . . . . . . . . . . . . . . 61 2.2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 3 Markov modeling and machine learning 70 3.1 Markov modeling of conformational fluctuations in ?2-microglobulin . . . . 70 3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 3.2.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 3.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 MSM construction and validation . . . . . . . . . . . . . . . . . . 73 3.2.3 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . 82 3.3 Variational embedding of protein folding simulations using Gaussian mix- ture variational autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . 87 3.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 3.3.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 3.3.3 Model parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 3.3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 3.3.5 Trp-cage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 3.3.6 BBA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 3.3.7 Villin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 3.3.8 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . 108 3.4 GraphVAMPNet, using graph neural networks and variational approach to markov processes for dynamical modeling of biomolecules . . . . . . . . . 112 3.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 3.4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 3.4.3 Protein Graph representation . . . . . . . . . . . . . . . . . . . . . 120 Graph Convolution layer . . . . . . . . . . . . . . . . . . . . . . . 121 SchNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 3.4.4 Model selection and Hyperparameters . . . . . . . . . . . . . . . . 124 3.4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 3.4.6 Trpcage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 3.4.7 Villin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 3.4.8 NTL9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 3.4.9 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . 
136 xi 4 Molecular dynamics study of membrane proteins 145 4.1 Critical Sequence hotspots for binding of novel coronavirus to Angiotensin Converter Enzyme as Evaluated by Molecular Simulations . . . . . . . . . 145 4.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 4.1.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 Sequence comparison and mutant preparation . . . . . . . . . . . . 151 Molecular dynamics simulations . . . . . . . . . . . . . . . . . . . 152 Gibbs free energy and correlated motions . . . . . . . . . . . . . . 153 Binding free energy from MMPBSA method . . . . . . . . . . . . 154 4.1.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 Structural dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . 155 PCA and approximate free energy landscape . . . . . . . . . . . . . 156 Dynamic Cross Correlation Maps (DCCM) . . . . . . . . . . . . . 157 Binding free energies . . . . . . . . . . . . . . . . . . . . . . . . . 159 Important interactions at the RBD-ACE2 interface . . . . . . . . . . 162 4.1.4 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . 166 4.2 Exploring dynamics and network analysis of spike glycoprotein of SARS- COV-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 4.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 4.2.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 Molecular dynamcis simulations . . . . . . . . . . . . . . . . . . . 179 Solvent accessible surface area (SASA) . . . . . . . . . . . . . . . 180 Network analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 Glycan-antibody overlap analysis . . . . . . . . . . . . . . . . . . 182 4.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182 Dynamical motions of the spike protein . . . . . . . . . . . . . . . 182 Occupancy of spike protein by glycans . . . . . . . . . . . . . . . . 188 Glycan-Glycan interaction and network analysis of glycans . . . . . 189 Antibody overlap analysis . . . . . . . . . . . . . . . . . . . . . . 195 4.2.4 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . 197 4.3 Molecular dynamics of ligand binding to PAS domain of EAG channel . . . 202 4.3.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 4.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 Full length EAG simulation . . . . . . . . . . . . . . . . . . . . . . 212 4.3.3 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . 217 5 Conclusions and open problems 221 Bibliography 225 xii Chapter 1: Introduction 1.1 Molecular dynamics Proteins are responsible for nearly all biological processes that is essential for life. They metabolize nutrients, regulate genetics, recognize pathogens and sense the outside world. This is remarkable considering proteins are linear polymers of 20 building blocks called amino acids. Functionality of proteins depends on spatial and temporal structure of the protein. However, a 3D shape of protein is not the only determinant of protein function. 
Conformational flexibility is an inherent property of all proteins, and it is essential for the function of many of them, such as transport proteins, signal transduction proteins, proteins involved in cellular recognition, and numerous enzymes.[345] Allosteric proteins such as GPCRs undergo large-scale conformational changes upon ligand binding to their binding site, which in turn triggers a cascade of intracellular responses.[219] There are numerous experimental techniques for studying protein dynamics, such as nuclear magnetic resonance (NMR)[141], fluorescence resonance energy transfer (FRET)[183], atomic force microscopy, and optical tweezers[126]. Despite the number of experimental techniques available, there are spatio-temporal limitations on the time and length scales of the conformational space these methods can access. Moreover, details about the pathways connecting different conformations remain unknown. Molecular dynamics (MD), on the other hand, has been called a computational microscope,[163] revealing the detailed microscopic interactions that play major roles in folding, ligand binding, and other biological problems. In fact, the Nobel Prize in Chemistry in 2013 was awarded to Martin Karplus, Michael Levitt, and Arieh Warshel for their pioneering work on MD methodology for biomolecular systems. In short, MD computes the interactions between particles whose positions are given by the 3D Cartesian coordinates of individual atoms. The motion of each particle is governed by the potential energy, whose derivatives are calculated to obtain the forces between particles, which are then used to solve Newton's equations of motion. Solving these equations in consecutive steps generates a trajectory of the dynamics of the system under study. All-atom MD using classical force fields has allowed the study of dynamics in systems ranging from small molecules and short peptides to large assemblies such as virus capsids.[38, 187] The accuracy of an MD simulation depends on two factors. The first is the empirical force field that provides the parameters for the interactions between particles in the system. The other is the simulation time, which should be long enough to overcome the local energy barriers that otherwise lead to quasi-ergodicity. One straightforward approach is to run long simulations on supercomputers.[273] Enhanced sampling approaches have also been developed to sample conformational space more efficiently. The following subsections give a broad overview of the basic concepts and techniques in MD simulation.

1.1.1 Force Fields

A force field describes the parameters of interaction between particles in the system. Empirical force fields represent biomolecules at atomistic resolution. These additive potential energy functions consist of a large number of force-field parameters obtained from empirical and quantum mechanical studies of small molecules. Some commonly used force fields are CHARMM, AMBER, and GROMOS.[115] These force fields may involve different terms and definitions of the potential energy function. For example, the CHARMM force field takes the form:

V(R) = \sum_{\mathrm{bonds}} K_b (b - b_0)^2 + \sum_{\mathrm{angles}} K_\theta (\theta - \theta_0)^2 + \sum_{\mathrm{dihedrals}} K_\phi [1 + \cos(n\phi - \delta)] + \sum_{\mathrm{Urey\text{-}Bradley}} K_{UB} (S - S_0)^2 + \sum_{\mathrm{impropers}} K_{\mathrm{imp}} (\omega - \omega_0)^2 + \sum_{\mathrm{non\text{-}bonded}} \left[ \epsilon_{ij} \left( \left(\frac{R_{\min,ij}}{r_{ij}}\right)^{12} - 2\left(\frac{R_{\min,ij}}{r_{ij}}\right)^{6} \right) + \frac{q_i q_j}{\epsilon_1 r_{ij}} \right]   (1.1)

In the above equation, V(R) is the total energy in the CHARMM force field. The first three terms are the potential energies for bonds, angles, and dihedral torsions, respectively. The remaining terms are the Urey-Bradley, improper dihedral, non-bonded van der Waals (Lennard-Jones), and electrostatic contributions.
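To make the functional form concrete, the short sketch below evaluates a toy version of a few of these terms (one bond, one angle, and a single Lennard-Jones/Coulomb pair) with NumPy. The parameter values and coordinates are arbitrary illustrative numbers chosen for this example, not values from any actual CHARMM parameter file.

```python
import numpy as np

# Illustrative parameters only -- not taken from a real CHARMM parameter file.
K_b, b0 = 300.0, 1.53               # bond force constant (kcal/mol/A^2), equilibrium length (A)
K_theta, theta0 = 50.0, 1.911       # angle force constant, equilibrium angle (rad)
eps_ij, rmin_ij = 0.15, 3.8         # LJ well depth (kcal/mol) and distance at the minimum (A)
q_i, q_j, eps_1 = 0.25, -0.25, 1.0  # partial charges (e) and dielectric constant

def bond_energy(r1, r2):
    b = np.linalg.norm(r1 - r2)
    return K_b * (b - b0) ** 2

def angle_energy(r1, r2, r3):
    v1, v2 = r1 - r2, r3 - r2
    theta = np.arccos(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    return K_theta * (theta - theta0) ** 2

def nonbonded_energy(r1, r2):
    r = np.linalg.norm(r1 - r2)
    lj = eps_ij * ((rmin_ij / r) ** 12 - 2.0 * (rmin_ij / r) ** 6)
    coulomb = 332.0636 * q_i * q_j / (eps_1 * r)  # 332.0636 converts e^2/A to kcal/mol
    return lj + coulomb

atoms = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.1, 1.3, 0.0], [6.0, 0.0, 0.0]])
total = (bond_energy(atoms[0], atoms[1])
         + angle_energy(atoms[0], atoms[1], atoms[2])
         + nonbonded_energy(atoms[0], atoms[3]))
print(f"toy potential energy: {total:.2f} kcal/mol")
```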
1.1.2 Hamiltonian equations of motion

For a system of N particles, the Hamiltonian of the system is the sum of the potential and kinetic energies:

H(q_1, \dots, q_N; p_1, \dots, p_N) = \sum_{i=1}^{N} \frac{p_i^2}{2 m_i} + U(q_1, \dots, q_N)   (1.2)

where m_i is the mass, q_i the coordinates, and p_i the momenta of particle i. The Hamiltonian equations of motion are given by:

\dot{q}_i = \frac{\partial H}{\partial p_i} = \frac{p_i}{m_i}   (1.3)

\dot{p}_i = -\frac{\partial H}{\partial q_i} = -\frac{\partial U}{\partial q_i} = F_i   (1.4)

where \dot{q}_i and \dot{p}_i are time derivatives and F_i is the net force on particle i. Solving these equations yields the trajectory of all particles in the system. If the system is isolated, the total energy is conserved, \partial H/\partial t = 0, and all microstates are visited with equal probability. If, however, the system is coupled to an external heat bath, energy is exchanged between the system and the bath, which corresponds to a canonical ensemble. In this situation, the phase space is explored with probability

P(q, p) \propto e^{-H(q,p)/k_B T}   (1.5)

In the canonical (NVT) ensemble, the number of particles, the volume, and the temperature are held constant, while in the isothermal-isobaric (NPT) ensemble the pressure, rather than the volume, is kept constant.

1.1.3 Integrators

Numerical integration methods are needed to find an approximate solution of the ordinary differential equations given a timestep and the initial positions and velocities of the atoms in the system. Two of the most widely used numerical integrators are the Verlet algorithm[109] and leapfrog.[140] The Verlet integrator uses two Taylor series expansions to derive the positions:

q(t + \Delta t) = 2 q(t) - q(t - \Delta t) + \Delta t^2 \ddot{q}(t) + O(\Delta t^4)   (1.6)

Velocities are calculated by a first-order central difference:

\dot{q}(t) = \frac{1}{2\Delta t}\left[q(t + \Delta t) - q(t - \Delta t)\right] + O(\Delta t^2)   (1.7)

The leapfrog integrator improves on the Verlet integrator by computing velocities at time t + \Delta t/2 and positions at time t + \Delta t:

\dot{q}(t + \tfrac{1}{2}\Delta t) = \dot{q}(t - \tfrac{1}{2}\Delta t) + \frac{\Delta t}{m} F(t)   (1.8)

q(t + \Delta t) = q(t) + \Delta t\, \dot{q}(t + \tfrac{1}{2}\Delta t)   (1.9)

In this way, the velocities and positions are updated with an offset of half a timestep. Velocities at time t are computed as:

\dot{q}(t) = \frac{1}{2}\left[\dot{q}(t + \tfrac{1}{2}\Delta t) + \dot{q}(t - \tfrac{1}{2}\Delta t)\right]   (1.10)

This algorithm is more efficient and is time-reversible.

1.1.4 Periodic boundary conditions

It is not computationally feasible to simulate a real system containing on the order of a mole of molecules (10^23 atoms), but periodic boundary conditions (PBC) allow us to extend the simulation box so that the unit cell is effectively embedded in an infinite space.[343] An illustration of PBC is shown in Figure 1.1.

Figure 1.1: Periodic boundary conditions.

1.1.5 Cutoff methods

The calculation of non-bonded forces is usually the most time-consuming part of an MD simulation. Most interactions, such as van der Waals forces, decay with increasing distance r, so a spherical cutoff r_c can be used and forces computed only within this cutoff. Three different cutoff methods are used in MD simulations: truncation, shifting, and switching.[232, 227] In the truncation method, if the distance is greater than the cutoff r_c, the forces are simply set to zero. This scheme is problematic, however, because it causes a discontinuity at r_c. In the shifting method, the potential is shifted linearly such that the force is zero at the cutoff r_c:

U_{SF}(r) = \begin{cases} U_{vdw}(r) - (r - r_c)\, U'_{vdw}(r_c) - U_{vdw}(r_c) & r \le r_c \\ 0 & r > r_c \end{cases}   (1.11)

Another approach is to switch off the potential within a distance cutoff by applying a switching function to the potential function. This method can also be applied to the electrostatic potential and forces. Long-range electrostatic interactions are treated using the particle mesh Ewald (PME) summation scheme.[94]
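A minimal sketch of the shifted-force truncation in eq. (1.11), applied to a Lennard-Jones potential, is given below; the ε, σ, and cutoff values are toy numbers (roughly argon-like) chosen only for illustration. The check at the end confirms that both the potential and the force go to zero continuously at the cutoff.

```python
import numpy as np

eps, sigma, r_c = 0.238, 3.4, 10.0   # toy LJ parameters and cutoff (kcal/mol, Angstrom)

def u_lj(r):
    return 4.0 * eps * ((sigma / r) ** 12 - (sigma / r) ** 6)

def du_lj(r):
    # analytical derivative dU/dr of the Lennard-Jones potential
    return 4.0 * eps * (-12.0 * sigma**12 / r**13 + 6.0 * sigma**6 / r**7)

def u_shifted_force(r):
    """Shifted-force potential of eq. (1.11): zero value and zero slope at r_c."""
    r = np.asarray(r, dtype=float)
    u = u_lj(r) - (r - r_c) * du_lj(r_c) - u_lj(r_c)
    return np.where(r <= r_c, u, 0.0)

# potential vanishes smoothly at the cutoff: value ~ 0 and numerical force ~ 0
print("U_SF at cutoff:", u_shifted_force(r_c))
print("force near cutoff:",
      -(u_shifted_force(r_c) - u_shifted_force(r_c - 1e-4)) / 1e-4)
```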
1.1.6 Thermostat

In an MD simulation, the temperature of the system is held constant in the NPT or NVT ensemble by coupling the system to a thermostat at a fixed temperature. Several thermostats can be employed:

I) Berendsen thermostat: The Berendsen algorithm[168] (known as the weak-coupling scheme) controls the temperature by multiplying the velocities at each step by a factor λ of the form:

\lambda = \left[1 + \frac{\Delta t}{\tau_T}\left(\frac{T_0}{T(t)} - 1\right)\right]^{1/2}   (1.12)

where τ_T is the coupling parameter, which determines the degree of coupling between the system and the bath, and Δt is the timestep of the simulation. If τ_T ≫ Δt, the system is weakly coupled to the thermostat. This method is used for systems that are far from equilibrium (the equilibration step). It suffers, however, from suppression of the fluctuations of the kinetic energy, which is not consistent with the canonical ensemble.

II) Velocity-rescaling thermostat: This thermostat is similar to the Berendsen thermostat but produces the correct canonical ensemble.[39] Here the velocities of each particle at each timestep, or every n_TC steps, are scaled by a time-dependent factor λ:

\lambda = \left[1 + \frac{n_{TC}\,\Delta t}{\tau_T}\left(\frac{T_0}{T(t - \tfrac{1}{2}\Delta t)} - 1\right)\right]^{1/2}   (1.13)

The parameter τ_T is related to the time constant of the temperature coupling τ as:

\tau_T = \frac{2 C_V \tau}{N_{df}\, k}   (1.14)

where C_V is the total heat capacity of the system, k is Boltzmann's constant, and N_df is the total number of degrees of freedom of the system. This thermostat modifies the kinetic energy by ΔE_k = (λ² − 1)E_k. The velocity-rescaling thermostat can be viewed as a Berendsen thermostat with an additional stochastic term that ensures the correct kinetic energy distribution.[39]

III) Langevin thermostat: This thermostat can be viewed as a heat bath of small fluid particles undergoing Brownian motion that affect the diffusive behavior of the molecules in the system. In this method, two terms are added to the Hamiltonian equations of motion: the first is a viscous drag force −γp_i, which acts opposite to the direction of the momentum,[72] and the second is a random noise R_i(t) due to stochastic collisions with solvent molecules:

dp_i = -\frac{\partial H}{\partial q_i}\, dt - \gamma p_i\, dt + R_i(t)\, dt   (1.15)

Here γ is the friction coefficient, or coupling constant, which measures the degree of coupling. R_i is the random collision force with amplitude \sqrt{2\gamma m_i k_B T}\, dW_i/dt, where W_i is a vector of independent Wiener processes (Brownian motions) with zero mean and covariance \langle R_i(t) R_j(t') \rangle = 2\gamma m_i k_B T\, \delta(t - t')\, \delta_{ij}. The random forces are uncorrelated in time and between particles. With this thermostat, the system is not only coupled globally to a heat bath but is also subject to random noise. The Langevin thermostat produces a canonical ensemble when converged, but the parameter γ considerably affects the diffusive behavior.
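A minimal sketch of the Berendsen weak-coupling rescaling of eq. (1.12) is shown below; the target temperature, coupling time, masses, and initial velocities are arbitrary toy values in self-consistent units, and constraints are ignored when counting degrees of freedom.

```python
import numpy as np

kB = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def berendsen_rescale(velocities, masses, T0=300.0, dt=0.002, tau_T=0.1):
    """Rescale velocities toward the target temperature T0 (eq. 1.12)."""
    n_df = 3 * len(masses)                       # ignoring constraints for simplicity
    kinetic = 0.5 * np.sum(masses[:, None] * velocities**2)
    T_inst = 2.0 * kinetic / (n_df * kB)         # instantaneous temperature
    lam = np.sqrt(1.0 + (dt / tau_T) * (T0 / T_inst - 1.0))
    return lam * velocities

rng = np.random.default_rng(0)
masses = np.full(100, 12.0)
vel = rng.normal(scale=0.5, size=(100, 3))
for _ in range(500):                             # repeated coupling drives T toward T0
    vel = berendsen_rescale(vel, masses)
print("temperature after coupling:",
      np.sum(masses[:, None] * vel**2) / (3 * len(masses) * kB))
```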
1.1.7 Barostats

In an MD simulation in the NPT ensemble, the pressure of the system is maintained constant by coupling the system to a barostat. The pressure of a system in a cubic box of finite size can be derived from the virial theorem:

W_{tot} = -3 N k_B T = -3 P V + \Xi   (1.16)

where W_tot is the total work done by the system, V is the box volume, −3PV is the external virial due to the interactions between the particles and the wall, and Ξ is the inner virial due to particle-particle interactions, \Xi = \sum_{i}^{N} F_i \cdot q_i. The pressure is obtained from:

P = \frac{1}{3V}\left(3 N k_B T + \sum_{i}^{N} F_i \cdot q_i\right)   (1.17)

The Berendsen barostat controls the pressure of the system by scaling the inter-particle distances. In this scheme, the volume of the simulation cell is scaled by a rescaling factor μ:

\mu(t) = 1 - \frac{\Delta t}{\tau_P}\,\kappa\,\left(P_0 - P(t)\right)   (1.18)

where Δt is the timestep, τ_P is the relaxation time constant of the barostat, κ is the isothermal compressibility, and P_0 is the reference pressure of the barostat. In a simulation with isotropic scaling, the coordinates and box vectors are scaled by μ^{1/3}. The Berendsen barostat is useful for equilibrating systems to the desired pressure, but it does not generate the correct thermodynamic ensemble. The Parrinello-Rahman barostat is an extended-ensemble pressure-coupling algorithm[196] in which the simulation can be carried out under both isotropic and anisotropic conditions and the pressure fluctuations are captured correctly. This barostat is preferable for production runs.

1.1.8 Enhanced sampling methods

Despite the success of MD in studying biological processes, there are still limitations on the timescales that can be reached. Inadequate sampling of conformational states in turn limits a full understanding of the functional properties of the system under study. Large-scale conformational changes are usually complicated, slow processes that are commonly beyond the capabilities of standard MD, and enhanced sampling techniques are often required. Several enhanced sampling techniques, specifically replica exchange molecular dynamics and metadynamics, are used in this dissertation and are briefly described below.[19]

Replica exchange molecular dynamics (REMD): In REMD, simulations are performed at different temperatures, and exchanges of coordinates between simulations at different temperatures are attempted to enhance the sampling of configurational space. An illustration of REMD is shown in Figure 1.2.

Figure 1.2: Illustration of replica exchange molecular dynamics (REMD), where multiple simulations at different temperatures are run in parallel and exchanges are attempted between these replicas.
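The sketch below illustrates the Metropolis exchange step used in standard temperature REMD; the temperature ladder and potential energies are toy values, and the criterion shown is the plain temperature-REMD form rather than the REST2 expression derived in the next section.

```python
import numpy as np

kB = 0.0019872041  # kcal/(mol*K)

def remd_swap_probability(E_i, E_j, T_i, T_j):
    """Acceptance probability for swapping configurations between two replicas:
    min(1, exp[(beta_i - beta_j) * (E_i - E_j)])."""
    delta = (1.0 / (kB * T_i) - 1.0 / (kB * T_j)) * (E_i - E_j)
    return min(1.0, np.exp(delta))

rng = np.random.default_rng(1)
temperatures = [300.0, 310.0, 320.5, 331.5]        # toy temperature ladder (K)
energies = [-5200.0, -5185.0, -5170.0, -5140.0]    # toy potential energies (kcal/mol)

# attempt swaps between neighboring replicas
for i in range(len(temperatures) - 1):
    p = remd_swap_probability(energies[i], energies[i + 1],
                              temperatures[i], temperatures[i + 1])
    accepted = rng.random() < p
    print(f"swap {i}<->{i+1}: p = {p:.3f}, accepted = {accepted}")
```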
Replica exchange solute tempering (REST): In REMD, the number of required replicas scales with the square root of the number of degrees of freedom in the system. This means that even for a small system, tens of replicas are needed to maintain an acceptable acceptance ratio.[287] One can instead redefine the Hamiltonian in each replica so that only a small subset of the system is subject to parallel tempering. In REST1,[186] the potential energy of the system is decomposed into three parts: the protein intramolecular interactions (E_pp), the interactions between the protein and the water molecules in the solvation shell (E_pw), and the interactions between water molecules (E_ww). The potential energy E_m of the replica at temperature T_m is defined as:

E_m(X) = E_{pp}(X) + \frac{\beta_0 + \beta_m}{2\beta_m} E_{pw}(X) + \frac{\beta_0}{\beta_m} E_{ww}(X)   (1.19)

where X denotes the coordinates of the entire system, β_m = 1/k_B T_m, and β_0 = 1/k_B T_0. In REST2, the potential energy is defined as:

E_m(X) = \frac{\beta_m}{\beta_0} E_{pp}(X) + \sqrt{\frac{\beta_m}{\beta_0}}\, E_{pw}(X) + E_{ww}(X)   (1.20)

In REST1, different replicas run at different temperatures; in REST2,[314] however, the temperature is the same for all replicas, and all simulations sample the final ensemble distribution at temperature T_m. The scaling is performed on the bonded and non-bonded interactions in the potential function. For a system of M replicas with an effective temperature T_m in replica m, the equilibrium distribution is:

P_m(X_m) = \frac{e^{-\beta_m E_m(X_m)}}{Z}   (1.21)

where Z is the partition function. The acceptance ratio for an exchange between replicas m and n in REST2 is based on the ratio of transition probabilities that satisfies detailed balance:

\frac{P_{i \to f}}{P_{f \to i}} = \frac{P_n(X_m)\, P_m(X_n)}{P_m(X_m)\, P_n(X_n)} = e^{-\Delta_{nm}}   (1.22)

\Delta_{nm} = (\beta_m - \beta_n)\left[E_{pp}(X_n) - E_{pp}(X_m) + \frac{\sqrt{\beta_0}}{\sqrt{\beta_n} + \sqrt{\beta_m}}\left(E_{pw}(X_n) - E_{pw}(X_m)\right)\right]   (1.23)

Applying a Metropolis criterion for the exchange results in:

P_{i \to f} = \begin{cases} 1 & \Delta_{nm} \le 0 \\ \exp(-\Delta_{nm}) & \Delta_{nm} > 0 \end{cases}   (1.24)

REST2 has been used for conformational sampling of proteins and of protein-membrane interactions in multiple applications.[146, 129, 313]

Metadynamics: In a metadynamics simulation, an external history-dependent bias potential in the space of a few collective variables (CVs) that capture the slowest motions of the system is added to the Hamiltonian, which allows the system to escape from its current conformation and sample other parts of the conformational space.[9, 10] An illustration of metadynamics is shown in Figure 1.3.

Figure 1.3: An illustration of the metadynamics simulation technique.

The biasing potential is constructed as a sum of Gaussian kernels deposited along the collective variables s(q):

V(s, t) = \sum_{k'} W(k')\, \exp\left(-\sum_{i=1}^{d} \frac{\left(s_i - s_i(q(k'))\right)^2}{2\sigma_i^2}\right)   (1.25)

where W(k') and σ_i are the height and width of the deposited Gaussians.

1.2 Kinetic modeling and Markov State Models

In a Markov state model, the t_i are the relaxation times of the system in decreasing order, and the ψ_i are the eigenfunctions of the transition matrix, with eigenvalues λ_i(τ) = e^{−τ/t_i}. The m dominant eigenfunctions ψ_1, ..., ψ_m are the slowest collective variables that characterize the dynamics on timescales t ≫ t_{m+1}. The eigenvalues and eigenfunctions of the Markov model can be computed by a constrained optimization problem known as the variational approach to conformational dynamics,[215] which maximizes

R_m = \max_{f_1, \dots, f_m} \sum_{i=1}^{m} E_\mu\left[f_i(x_t)\, f_i(x_{t+\tau})\right], \quad E_\mu[f_i(x_t)^2] = 1, \quad E_\mu[f_i(x_t) f_j(x_t)] = 0 \;\; (\text{for } i \ne j)   (1.36)

In the equations above, E_μ is the expectation over x_t sampled from the stationary distribution, and R_m is the Rayleigh trace. The variational principle states that eigenvalues estimated from trial functions are always underestimated, so it can be used to maximize the estimated eigenvalues of the Markov operator. The eigenfunctions of the Markov operator are approximated by a linear combination of basis (feature) functions χ = (χ_1, ..., χ_m)^T:

f_i(x) = \sum_{j=1}^{m} b_{ij}\, \chi_j(x) = b_i^T \chi(x)   (1.37)

The expansion coefficients b_i and the eigenvalues of the Markov model can be computed by solving a generalized eigenvalue problem:

C(\tau)\, B = C(0)\, B\, \hat{\Lambda}   (1.38)

where C(0) and C(τ) are the instantaneous and time-lagged covariance matrices of the basis functions, \hat{\Lambda} = \mathrm{diag}(\hat{\lambda}_1, \dots, \hat{\lambda}_m) is the diagonal matrix of eigenvalues, and B = (b_1, ..., b_m). Inserting these coefficients yields an approximation of the eigenfunctions, and P(τ) = C(0)^{-1} C(τ) is the transition matrix of the MSM. The lag time τ should be long enough to ensure that the dynamics is Markovian, yet short enough to resolve the dynamics we are interested in. The implied timescale t_i approximates the decorrelation time of the i-th process and is computed from the eigenvalues of the MSM transition matrix as:

t_i = -\frac{\tau}{\ln|\lambda_i(\tau)|}   (1.39)

The implied timescales (ITS) can be used to choose a lag time, namely one beyond which the ITS become approximately constant. Once the lag time is chosen, we can check whether the transition probability matrix P(τ) is Markovian with the Chapman-Kolmogorov test:

P(k\tau) = P^k(\tau)   (1.40)

The validated transition matrix is then decomposed into eigenvalues and eigenvectors. The largest eigenvalue is always λ_1(τ) = 1, corresponding to the eigenvector of the stationary distribution π with the property:

\pi^T P(\tau) = \pi^T   (1.41)

All other eigenvalues λ_{i>1} are real with norm less than one and are related to the characteristic implied timescales of dynamical processes within the system.
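As a small numerical illustration of these spectral quantities, the sketch below builds a toy three-state transition matrix, extracts the stationary distribution from its leading left eigenvector, and converts the remaining eigenvalues into implied timescales via eq. (1.39); the matrix entries and lag time are arbitrary.

```python
import numpy as np

# Toy 3-state row-stochastic transition matrix (arbitrary values) and lag time (ns).
P = np.array([[0.90, 0.08, 0.02],
              [0.10, 0.85, 0.05],
              [0.02, 0.08, 0.90]])
lag = 10.0

eigvals, left_vecs = np.linalg.eig(P.T)          # left eigenvectors of P = eigenvectors of P.T
order = np.argsort(-eigvals.real)
eigvals, left_vecs = eigvals.real[order], left_vecs[:, order].real

pi = left_vecs[:, 0] / left_vecs[:, 0].sum()     # stationary distribution, pi^T P = pi^T
timescales = -lag / np.log(np.abs(eigvals[1:]))  # implied timescales (eq. 1.39)

print("stationary distribution:", np.round(pi, 3))
print("implied timescales (ns):", np.round(timescales, 1))
```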
1.2.1 MSM construction from MD simulations

Over the last 20 years, researchers have developed a pipeline for the construction and validation of a Markov state model from molecular dynamics trajectories, which involves the steps described below.

I) Feature selection: In order to build a Markov model of long-timescale kinetics, one must first choose a few features or collective variables that are important for the system under study. These features can include, but are not limited to, distances, torsions, and Cartesian coordinates. The variational approach to Markov processes provides a score (VAMP-2) that allows comparison of different featurizations and selection of the best set of features based on a cross-validated VAMP-2 score.[262]

II) Dimensionality reduction: Featurization of the molecular system leads to a high-dimensional space. Discretizing a very high-dimensional space by clustering is inefficient and can lead to low-quality discretizations that do not accurately describe the dynamics of the system. Therefore, we usually first reduce the dimensionality with a linear coordinate transformation. In this transformation, we look for a set of basis vectors U = [u_1, ..., u_m], where u_i is a collective coordinate; after the transformation, y(t) = U^T x(t) are the new coordinates. Common linear dimensionality reduction methods are PCA and TICA. PCA transforms the data into an orthogonal, uncorrelated basis that retains the variance of the dataset. However, this does not describe the molecular kinetics: we are interested in preserving the slow motions rather than the high-amplitude motions emphasized by PCA. For example, consider a small peptide that is highly flexible at its termini and undergoes a rare, concerted torsional transition at its center. We are interested in the kinetics of this rare event rather than in the high-variance fluctuations at the termini. TICA is a form of the variational approach to conformational kinetics (VAC) and is the optimal linear method for finding the slow reaction coordinates and the relaxation timescales. TICA is similar to PCA but uses a time-lagged correlation matrix C(τ), where C_{ij}(\tau) = \langle \tilde{x}_i(t)\, \tilde{x}_j(t+\tau) \rangle_t. A generalized eigenvalue problem,

C(\tau)\, u_i = C(0)\, \lambda_i(\tau)\, u_i   (1.42)

is solved, and the new coordinates are uncorrelated at the lag time τ. The kinetic variance retained by the transformation can be defined as KV = \sum_{i=1}^{m} \lambda_i^2 / TKV, where TKV = \sum_i \lambda_i^2 is the total kinetic variance, which roughly measures the total number of slow processes.

III) Clustering: State decomposition happens at the clustering step. The features in the TICA space are grouped into a set of clusters using a clustering algorithm; k-means clustering is usually the method of choice in this step.

IV) Building the transition matrix: After clustering, the state space is discretized into discrete trajectories s(t) jumping between n microstates, where n is the number of clusters. The conditional transition probability between microstates at lag time τ is defined as:

P_{ij}(\tau) = P\left(s(t+\tau) = j \,|\, s(t) = i\right)   (1.43)

A Markov model predicts the kinetics at longer timescales using the Markov property:

P\left(s(t+k\tau) = j \,|\, s(t) = i\right) = \left[P^k(\tau)\right]_{ij}   (1.44)

The MSM also predicts the equilibrium probabilities π_i through the stationary vector satisfying π^T = π^T P(τ).
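To make step IV concrete, the sketch below estimates a (non-reversible) maximum-likelihood transition matrix from a toy discrete trajectory by counting transitions at a chosen lag time and row-normalizing; in practice, packages such as PyEMMA or deeptime perform this estimation with reversibility constraints and the error analysis described next.

```python
import numpy as np

def estimate_transition_matrix(dtraj, n_states, lag):
    """Count transitions i -> j separated by `lag` steps and row-normalize."""
    counts = np.zeros((n_states, n_states))
    for t in range(len(dtraj) - lag):
        counts[dtraj[t], dtraj[t + lag]] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0          # avoid division by zero for unvisited states
    return counts / row_sums

# toy discrete trajectory hopping among 3 microstates
rng = np.random.default_rng(2)
true_P = np.array([[0.95, 0.04, 0.01],
                   [0.05, 0.90, 0.05],
                   [0.01, 0.04, 0.95]])
dtraj = [0]
for _ in range(20000):
    dtraj.append(rng.choice(3, p=true_P[dtraj[-1]]))
dtraj = np.array(dtraj)

P_hat = estimate_transition_matrix(dtraj, n_states=3, lag=1)
print(np.round(P_hat, 3))
```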
In order to compute error bars on the timescales, a Bayesian MSM samples from the posterior distribution P(P(\tau)\,|\,C(\tau)) \propto \prod_{i,j=1}^{n} P_{ij}^{\,c_{ij}(\tau)}, where c_{ij}(τ) is the number of transitions observed between states i and j at lag time τ. The lag time τ must be chosen such that the relaxation times t_i(\tau) = -\tau/\ln \lambda_i(\tau) are approximately constant, within statistical error, for longer lag times. The model is validated using the Chapman-Kolmogorov (CK) test, in which the model estimated at lag time τ must be able to predict estimates performed at longer timescales kτ, which can be written as:

P(k\tau) = P^k(\tau)   (1.45)

V) Coarse-graining the MSM: It is often desirable to describe the molecular process with a few states that contain the essential structural, thermodynamic, and kinetic information. However, the number of microstates generated during the clustering step of building an MSM is usually in the range of hundreds to thousands. On the other hand, a coarse-grained model is important for computing quantities such as mean first passage times (MFPT) from one set of states to another. A fuzzy assignment, in which each microstate i has an assignment probability to macrostate I, has been proposed[156, 341] to preserve the slow kinetics in the coarse-grained kinetic model, where m_{iI} = P(macro I | micro i). The membership probabilities are computed from a linear combination of the first m eigenvectors of the transition matrix by the PCCA++ method,[245] which exactly preserves the relaxation kinetics of the m slowest processes.

1.3 Machine Learning

Introduction to neural networks: Machine learning (ML) involves using computational methods to learn from data without explicit programming. ML is being used in nearly all fields of science and has enabled important and notable breakthroughs. Deep learning is a subfield of ML concerned with algorithms that loosely mimic the human brain, called deep neural networks (DNNs). A classical application of ML is image classification, where the model tries to associate a label with each image using the features in the pixel data. The underlying idea is that there is an explicit relation between the set of pixels and the associated label, which the model tries to learn. The same idea can be extended to the molecular space, where a full description of atomic or molecular features dictates the chemical properties. ML techniques have been used in many aspects of molecular simulation, such as enhanced sampling,[21, 244] force-field optimization,[179] and kinetic modeling.[194] Machine learning techniques can be grouped into several categories:

Supervised learning: In supervised learning, the model is provided with the inputs and the labels for all input samples, and the task is usually to predict the desired property (label) of a given input. After proper training, the model is able to predict the target (label) of unseen samples. The most common supervised learning tasks are regression and classification.

Unsupervised learning: Unlike supervised learning, in unsupervised learning the data do not contain labels, and the task is to identify patterns or similarities and differences within the data. Clustering, for example, is an unsupervised learning problem; methods such as k-means clustering fall into this category. Another important task in unsupervised learning is dimensionality reduction, where the goal is to find a reduced representation of the data that carries most of the important or relevant information.
Reinforcement learning: This type of machine learning is different from the other categories: an agent takes actions in an environment, and the goal is to maximize a reward function. The agent learns to take actions that yield a reward and to avoid those that produce a negative reward, or punishment. In this type of machine learning there is no training data (labeled or unlabeled), and the agent improves itself through trial and error.

1.3.1 Artificial Neural network

The structure of artificial neural networks is inspired by the neural connections in the brain. Neural networks consist of multiple layers of simple units. A "perceptron" is the simplest neural network, with a single layer.[249] In a perceptron, the model takes one or more inputs, computes a weighted sum of the inputs, and finally applies a non-linear activation function to compute a single output. This is shown in Figure 1.4.

Figure 1.4: Representation of a perceptron.

y = f\left(\sum_i x_i w_i + b\right)   (1.46)

In the above equation, the x_i are the inputs, the w_i are the weights, and b is a bias term. f is a nonlinearity, also called the activation function. Many different activation functions are commonly used in machine learning, such as ReLU, tanh, and sigmoid, and the choice depends on the task and the type of network.

A feed-forward neural network (FFNN) (Figure 1.5) is a collection of multiple perceptrons stacked together; an FFNN is also called a multi-layer perceptron (MLP). The universal approximation theorem[133] states that an MLP containing as little as one hidden layer with a finite number of neurons can approximate any continuous function under mild assumptions on the activation functions used.

Figure 1.5: Illustration of a multilayer perceptron with 4 hidden layers and 1 output.

Cost functions: Training a neural network involves minimizing a loss or cost function, which usually measures the discrepancy between the actual values and the values output by the network. Typical cost functions are the mean squared error for regression and the categorical cross-entropy loss for classification tasks.

Backward propagation: When training a deep neural network, we need to update the weights and biases of the network (w, b). The question is how to compute the gradient of the loss function L with respect to the parameters of the inner layers of the network. This is done with the backpropagation algorithm, in which the chain rule is used to compute the gradients of the inner layers. In this approach, the derivative of the loss L is first taken with respect to the net input u_i of the output node and then, using the chain rule, with respect to the parameters we want to optimize. For a weight w_ij in layer l this can be written as:

\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial u_i}\,\frac{\partial u_i}{\partial w_{ij}} = \delta_i\, \frac{\partial}{\partial w_{ij}}\left[b_i + \sum_{j'} w_{ij'}\, y_{j'}\right] = \delta_i\, y_j   (1.47)

where δ_i = ∂L/∂u_i. The same can be written for the bias parameters.

Gradient descent: Once we have the gradients of the loss with respect to the model parameters, we can minimize the loss. If θ denotes all the parameters of the neural network, then, given the initial parameters, the most basic gradient descent scheme updates them as:

\theta_{k+1} = \theta_k - \eta\, \nabla_{\theta} L(\theta_k)   (1.48)

Figure 1.6: Gradient descent algorithm. Image credit: Science magazine.

where η is the learning rate, which controls the size of the training step and is treated as a hyperparameter. The minus sign in the equation ensures that the parameters are updated so as to minimize the loss. Other variants of the gradient descent algorithm have been proposed, such as stochastic gradient descent (SGD), in which a mini-batch of data is used for the error computation at each training iteration. Other variants include Adam and RMSProp, which extend SGD with momentum techniques to speed up training and avoid getting stuck in local minima.
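The following self-contained toy example assembles these ingredients: a one-hidden-layer perceptron trained by gradient descent with manually derived backpropagation on a small regression task. The layer sizes, learning rate, and data are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# toy regression data: y = sin(x) on [-pi, pi]
X = rng.uniform(-np.pi, np.pi, size=(256, 1))
y = np.sin(X)

# one hidden layer with tanh activation (a minimal multilayer perceptron)
W1, b1 = rng.normal(scale=0.5, size=(1, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.5, size=(16, 1)), np.zeros(1)
eta = 0.05  # learning rate

for epoch in range(2000):
    # forward pass (eq. 1.46 applied layer by layer)
    h = np.tanh(X @ W1 + b1)           # hidden activations
    y_hat = h @ W2 + b2                # network output
    loss = np.mean((y_hat - y) ** 2)   # mean squared error

    # backward pass (chain rule, eq. 1.47)
    d_out = 2.0 * (y_hat - y) / len(X)         # dL/dy_hat
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * (1.0 - h ** 2)      # tanh'(u) = 1 - tanh(u)^2
    dW1 = X.T @ d_h
    db1 = d_h.sum(axis=0)

    # gradient descent update (eq. 1.48)
    W1, b1 = W1 - eta * dW1, b1 - eta * db1
    W2, b2 = W2 - eta * dW2, b2 - eta * db2

print(f"final training MSE: {loss:.4f}")
```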
1.3.2 Convolutional neural networks

This is a specialized neural network architecture for grid-structured data with strong spatial dependencies. The architecture is widely used for one-dimensional time-series data, two-dimensional images, and three-dimensional video data.[110] In convolutional neural networks (CNNs), we apply different kernels (or filters) to the data; these are simply matrices that learn features such as edges or lines from the data. These basic features are then combined to build more complicated shapes and patterns. The convolution operation is a simple dot product in which a filter (kernel) is moved across the image. A 2D convolution involves length, width, and depth parameters, where the length and width describe the convolutional kernel and the depth corresponds to the number of input channels; for example, images with RGB color values have a depth of 3. The use of convolutional kernels also gives the model translational invariance, so that features are detected regardless of their location in the input image.

Figure 1.7: Convolutional neural networks for handwritten digits.

1.3.3 Recurrent neural networks

Feed-forward neural networks fail to capture the sequential behavior of data in which the order is important, such as protein sequences or the time series of MD simulation trajectories. Recurrent neural networks (RNNs) are particularly useful when dealing with sequential data. In an RNN, the data are provided in a sequential manner, and the network uses the inputs from the previous timesteps to make a prediction or decision at the current timestep. An RNN is shown in Figure 1.8, where a single RNN is unrolled to show the information processing at each timestep through multiple copies of the network.[267]

Figure 1.8: Unrolling an RNN.

In the RNN in Figure 1.8, x_t is the input at timestep t, y_t is the output at each timestep, and h_t is the hidden state at time t, which is calculated as:

h_t = f(U x_t + W h_{t-1})   (1.49)

where f is a non-linear activation function such as tanh or ReLU, and U and W are weight matrices learned during training. In an RNN, the weights are shared across all timesteps, which greatly reduces the model complexity. Training an RNN involves a special type of backpropagation called backpropagation through time.
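The recurrence in eq. (1.49) is written out directly in NumPy below for a toy input sequence; the dimensions and the added output projection y_t = V h_t are illustrative choices, and all weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(4)
n_in, n_hidden, n_out, T = 8, 32, 2, 50   # toy dimensions and sequence length

U = rng.normal(scale=0.1, size=(n_hidden, n_in))
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.1, size=(n_out, n_hidden))

x_seq = rng.normal(size=(T, n_in))        # toy input sequence
h = np.zeros(n_hidden)                    # initial hidden state

outputs = []
for t in range(T):
    h = np.tanh(U @ x_seq[t] + W @ h)     # eq. (1.49): the same U, W are shared across timesteps
    outputs.append(V @ h)                 # per-timestep output

print("last hidden state norm:", np.linalg.norm(h).round(3))
print("output at final step:", np.round(outputs[-1], 3))
```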
LSTM networks: Vanilla RNNs suffer from the vanishing gradient problem, which causes the model to forget long-term dependencies in the data. To circumvent this issue, several extensions to RNNs have been proposed, such as long short-term memory (LSTM)[130] and gated recurrent units (GRU).[58] Long short-term memory (Figure 1.9) addresses some of the problems of RNNs, namely 1) long-term dependencies and 2) vanishing and exploding gradients. An LSTM[130] consists of a cell, an input gate, an output gate, and a forget gate. The cell stores information, whereas the gates manipulate it. In an LSTM, information is selectively allowed through each gate unit using a sigmoid function.

Figure 1.9: Illustration of a long short-term memory unit.

Forget gate: The first step is deciding what information to discard from the cell state. The forget gate takes h_{t-1} and x_t as input and outputs a value between 0 and 1 for each entry of the cell state C_{t-1}:

f_t = \sigma\left(W_f [h_{t-1}, x_t] + b_f\right)   (1.50)

where h_{t-1} is the output of the previous cell, x_t is the input of the current cell, and σ is the sigmoid function.

Input gate: This step decides how much new information will be added to the cell state. First a sigmoid layer determines which information needs to be updated, and then a tanh layer is applied:

i_t = \sigma\left(W_i [h_{t-1}, x_t] + b_i\right)   (1.51)

\tilde{C}_t = \tanh\left(W_C [h_{t-1}, x_t] + b_C\right)   (1.52)

C_t = f_t \circ C_{t-1} + i_t \circ \tilde{C}_t   (1.53)

Output gate: Here we first use a sigmoid to determine which part of the cell state will be output and then process the cell state with a tanh function:

o_t = \sigma\left(W_o [h_{t-1}, x_t] + b_o\right)   (1.54)

h_t = o_t \circ \tanh(C_t)   (1.55)
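Equations (1.50)-(1.55) are assembled into a single LSTM cell step in the NumPy sketch below; the weight shapes follow the concatenated [h_{t-1}, x_t] convention used above, and all values are randomly initialized for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM step following eqs. (1.50)-(1.55); z = [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(params["Wf"] @ z + params["bf"])        # forget gate
    i_t = sigmoid(params["Wi"] @ z + params["bi"])        # input gate
    C_tilde = np.tanh(params["Wc"] @ z + params["bc"])    # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde                    # new cell state
    o_t = sigmoid(params["Wo"] @ z + params["bo"])        # output gate
    h_t = o_t * np.tanh(C_t)                              # new hidden state
    return h_t, C_t

rng = np.random.default_rng(5)
n_in, n_hidden = 10, 20
params = {name: rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in))
          for name in ("Wf", "Wi", "Wc", "Wo")}
params.update({name: np.zeros(n_hidden) for name in ("bf", "bi", "bc", "bo")})

h, C = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(30, n_in)):   # toy sequence of 30 timesteps
    h, C = lstm_step(x_t, h, C, params)
print("hidden state after sequence:", np.round(h[:5], 3))
```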
1.3.5 Graph neural networks

Many kinds of data can be represented as graphs, such as the structures of molecules and proteins or social networks. Graph neural networks are increasingly being used in areas such as protein structure prediction and drug design. A graph G is defined as G = (V, E, A), where V is the set of nodes, E is the set of edges and A is the adjacency matrix. Graph convolutional networks (GCN), introduced by Kipf et al.,[152] rely on message passing between neighbors in a graph. Each node has a feature vector that represents its message, and messages are passed between neighbors during each graph convolution (message passing) layer. Multiple types of GCN can be formulated based on how the messages are passed between nodes and edges. In a simple GCN, where only the nodes carry feature vectors, a GCN layer can be defined as:

H^{(l+1)} = σ( D̂^{−1/2} Â D̂^{−1/2} H^{(l)} W^{(l)} )    (1.64)

where H^{(l)} contains the node features from the previous layer and W^{(l)} are the weight parameters. Â is the adjacency matrix including self-connections, Â = A + I. The messages are averaged over neighbors using the diagonal degree matrix D̂, where D̂_ii is the number of connections of node i, and σ denotes a nonlinear activation function.

1.4 Dissertation overview

In chapter 2, I study membrane active peptides using molecular dynamics and machine learning. In the first section of this chapter, I study two different cell penetrating peptides (MPG and Hst5) and their interaction with a membrane using MD simulations. In the second part of this chapter, I develop a deep learning model, an attention-based variational autoencoder, to generate new antimicrobial peptides and test the efficiency of the model for generating effective peptide sequences. The third chapter deals with Markov state models and kinetic modeling of biomolecules and is divided into three sections. In the first part, I study the conformational dynamics of β2-microglobulin using MD simulations and Markov state modeling to find metastable states that contribute to amyloid formation.
In the second section, I explore a machine learning model called gaussian mix- ture variational autoencoder (GMVAE) for dimensionality reduction and clustering of MD trajectories of protein folding and show that the latent space from GMVAE can be used for building a Markov model. The last part of this chapter introduces a novel neural network approach to replace the pipeline of building a Markov model called GraphVAMPNet where we used graph neural network as feature representation for protein folding trajectories. Chapter 4 of this dissertation is about membrane proteins spike protein of SARS-COV-2 and the EAG potassium channel. In the first section of this chapter, I study the interaction of spike protein with its receptor and the hot spots of interaction. In the second section I study the dynamics of glycans in the spike protein of SARS-COV-2 and its impact on shielding the protein from antibodies. In the last section, I study the inhibition of EAG channel by small molecule drugs through MD simulation, docking and binding free energy calculations. Chapter 5 concludes the work presented in the dissertation and gives future directions for continuing some of the projects. 32 Chapter 2: Machine learning and MD for membrane active peptides 2.1 Molecular dynamics of cell penetrating peptide interaction with model membranes Membrane active peptides (MAPs) are peptides with activity toward membrane either by translocation (cell penetrating peptides, CPPs) or even by disruption of the membrane (Antimicrobial peptides, AMPs).[32] CPPs which are mostly short cationic or amphipathic, have the ability to enter cells without large extent disruption of the membrane. [241] How effective a pharmaceutical treatment is, often is related to its membrane permeability which prevents biomolecules from reaching their specific intracellular targets. In this context, CPPs are often used to carry biomolecular cargos with various sizes and shapes inside the cell.[79, 101] Most CPPs are primary or secondary aphipathic in nature depending on the position of hydrophobic residues in their sequence. These usually possess a sufficient num- ber of positively charged amino acid residues necessary for their adsorption onto the neg- atively charged lipid membranes. Among various CPPs, MPG and Histatin 5 (Hst5) have been utilized to deliver proteins, fluorescein labels, and siRNAs into cell and are the focus of this study.[204, 209] MPG is a short amphipathic peptide with cationic residues in its C-terminal half and hydrophobic residues in the N-terminal half and was shown to strongly interact with cell membranes and spontaneously insert into natural membranes.[208] Hst5 is shown to have both antifungal activity and bacterial effects and possess a cell penetration ability.[86] 33 Figure 2.1: Secondary structure of A) MPG B) Hst5 predicted from PEPFOLD3. Orange and red spheres show the N-terminus and C-terminus respectively. Peptide secondary structure is colored based on hydrophobic (white) hydrophilic (green) acidic (red) and basic (blue) residues Candida albicans are opportunistic pathogens which cause infections for immunocom- promised patients. Current drugs for Candida can lead to toxicity [261] or cells can develop resistance to drugs on excessive use.[99] An essential feature of an effective therapeutic is the ability to successfully deliver across cell membranes to intracellular targets. CPPs have been proposed as an alternative treatment strategy. 
MPG and Histatin 5 (Hst5) have been used to deliver the fluorescent protein cargo GFP into fungal cells, and both have previously been used to deliver cargo into cells.[209, 276] MPG has been shown in the Karlsson lab to deliver large biomolecules such as GFP into Candida albicans cells through recombinant production of CPP-GFP. The predicted structure (using PEP-FOLD3) and the sequences of MPG and Hst5 are shown in figure 2.1.

Translocation of CPPs through cell membranes depends on several factors, including the specific sequence, the concentration of CPP, the cell type, the secondary structure of the CPP and the cargo being translocated. Understanding the folding of CPPs is a major step toward characterizing their mode of action and is important for their membrane interaction and efficacy of internalization. For instance, the secondary structure of penetratin is highly dependent on experimental conditions such as lipid type, buffer conditions and the technique used, and it can adopt α-helical or β-strand conformations.[90, 189] This structural polymorphism has been suggested as an important factor in the internalization route of CPPs, as one route might be preferred by the peptide over others depending on its secondary structure.[80] Understanding how CPPs interact with cell membranes and insert into cells is crucial for designing new CPPs or utilizing CPPs for intracellular delivery of molecular cargos. Obtaining atomic-level, nanosecond-timescale information about the interaction of peptides with membranes is difficult and costly with experimental techniques. In contrast, MD provides detailed structural and dynamic information on peptide-membrane systems. Ulmschneider and coworkers [46, 299] used unbiased MD to rationally tune the functional properties of pore-forming AMPs. Based on their results, AMPs can assemble in multiple architectures near the membrane, and their relative populations can provide insights into their mechanism of action. Structural properties, and specifically the conformation of CPPs when interacting with cell membranes, play a major role in their cellular uptake mechanisms.[108] Previous studies have shown a direct impact of peptide conformation, which modulates the amphipathicity and membrane insertion of CPPs.[90, 347]

To better understand the details of the interaction between CPPs and the cell membrane, multiple MD simulations were run for the peptides MPG and Hst5. Initially, we used the highly mobile membrane mimetic (HMMM),[220] in which lipid dynamics are enhanced by replacing the acyl tails in the bilayer center with an organic solvent designed to mimic the membrane interior. This accelerates the membrane insertion process while maintaining the detailed energetics of peptide-membrane interactions.[15, 220] Simulations included the peptides without the fusion construct. The membrane model closely mimicked a yeast plasma membrane with the composition shown in table 2.1.[206] To avoid bias from the starting secondary structure, Hst5 and MPG were also started from an extended conformation in the solvent phase using the HMMM model.

Table 2.1: Membrane lipid components for the yeast membrane
Lipid type    # lipids per leaflet
ERG           60
YOPA          7
DYPC          18
POPE          20
POPI          18
POPS          27
Total         150

The concentration of the different lipid types can affect the secondary structure of peptides during their interaction with the membrane.
In order to prevent the artifact of hydrophobic solvent at the center of HMMM membrane, we also studied the interaction of MPG with several concentrations of DOPC/DOPG membrane using long- timescale molecular dynamics. 2.1.1 Simulation methods Simulations starting from the predicteed secondary structure: In the HMMM sim- ulations in this study, 1,1-dichloroethane (DCLE) was used as the hydrophobic core of the membrane. Short tailed lipids were used as headgroups where the composition of the membrane closely mimicked a Baker?s yeast model membrane which is the closest to C. albicans membrane[206] and consists of 150 lipids per leaflet. The lipid composi- tion of the membrane is described in Table 1 along the corresponding number of lipids in each leaflet. In HMMM, a scaling factor of 1.2 and an acyl tail carbon length of 8 was used. The scaling factor has the effect of increasing the area per lipid than the fully de- tailed atomistic model and the shortened acyl chain increases lipid diffusion by exposing more of the hydrophobic core. All systems were build using CHARMM-GUI HMMM Builder.[238, 165, 143] The simulations were performed using NAMD simulation engine and Charmm36 for lipid and membrane parameters.[153] TIP3P model was used for water 36 and Na+ and Cl? ions were added to system to neutralize the charges on protein and lipids. [144] In the HMMM systems, upon insertion of peptide, the system ran under NPAT en- semble using Langevin thermostat to maintain the temperature at 298K and a constrain to maintain a constant lateral surface area. Langevin piston was used to maintain the pres- sure at 1bar.[97] A force switching function of 10 to 12 A? was used for van der Waals and electrostatic interaction.[283] Long range electrostatic interactions were computed with Particle Mesh Ewald (PME) method.[69] An integration time-step of 2 fs was used for all simulations using SHAKE algorithm to constrain hydrogen atoms.[252] The HMMM sys- tems were equilibrated using standard 6 step CHARMM-GUI input parameters for 225 ps. For the first series of simulations, PEPFOLD3 was used to predict the secondary structure of both peptides which were predicted to be ?-helical for both MPG and Hst5. Peptides were inserted into aqueous phase, with at least 12 A? distance relative to the phosphate plane of nearest leaflet with three different orientations of the peptide with respect to mem- brane to avoid bias. The main axis of the peptide was aligned either perpendicular or at a 45 ? tilt with respect to membrane plane with either N or C-terminus being closer to the membrane. The production run lasted for 300 ns at NPAT ensemble after which they were converted to full-membrane models using CHARMM-GUI HMMM Builder.[238] Furthermore, MPG systems were simulated for an extra 100 ns under NPT ensemble at 298K. MPG-membrane systems were converted to full-membrane model and equilibrated with six-step CHARMM-GUI protocol for 225 ps except for the last step which lasted 10 ns. An extra 200 ns full-atomic membrane simulation was performed after conversion for MPG-membrane systems. Simulations from extended peptide conformation: The simulations with full mem- brane model started with an extended conformation of peptide randomly placed at least 12 A? above the nearest leaflet with three different replicates. The systems were equilibrated using the typical six-step CHARMM-GUI protocol for 225 ps. The production run lasted 37 for 1 ?s for all systems. 
The membrane had the same concentration of lipids as described previously and the simulation parameters are as before for the full-membrane model. Long timescale simulation of MPG using Anton: In order to study the effect of mem- brane composition on the secondary structure of MPG during interaction, we simulated membranes with different compositions of DOPC and DOPG and their interaction with MPG using Anton2 supercomputer. Production run using Anton2 ran under NVT ensemble and semi-isotropic pressure. The Anton2 multigrator [182] framework and Nose-Hoover thermostat [217, 132] and MTK barostat [197] were used with a timestep of 2.5 fs. Short range electrostatic interactions were calculated with a cutoff of 9 A? and long-range elec- trostatic were computed using u-series approach.[273] The membrane had 100 lipids per leaflet with different concentrations of DOPC And DOPG. 1)DOPC(100%)-DOPG(0%) 2)DOPC(80%)-DOPG(20%) 3)DOPC(60%)-DOPG(40%) 4)DOPC(40%)-DOPG(60%). An- ton simulations ran for 11 ?s for each system. 2.1.2 Results Simulations starting from predicted peptide secondary structure: For our initial studies of MPG and Hst5, we used a PEPFOLD3 which predicted both peptides to be ?- helical with a small turn in N-term of MPG. The peptides were placed at least 12 A? above the membrane phosphate plane in 3 different orientations with respect to membrane to avoid bias. These conformations were parallel and tilted 45? with respect to the membrane plane. The starting orientations of MPG and HST5 are shown in figure 2.2. The HMMM simulations were performed for 300ns. Figure 2.3 shows snapshots of the first replicates of Hst5 and MPG at four different timepoints (0, 50, 100, 300ns) during the simulation with HMMM model. As shown Hst5 binds to the membrane after 50ns however it fails to enter the membrane after 300 ns of simulation. On the other hand, MPG inserts into membrane 38 Figure 2.2: starting conformations of peptides MPG and Hst-5 with respect to the membrane. A,B,C for Hst5 and C,D,E for MPG through its hydrophobic N-terminus after about 50ns of simulation and adopts a vertical conformation in the membrane after 300ns. Since MPG showed penetration into membrane, we converted the HMMM to full- membrane model and simulated for additional 100ns to study its interaction with a full atomic membrane. After conversion to full-membrane model, MPG maintained its deep insertion below the phosphate plane which was consistent throughout the additional 100ns all-atom simulation.To study the translocation of MPG and HST5, we calculated the dis- tance of each residue during simulation to the phosphate plane of the closest leaflet. (fig- ure 2.4) MPG shows insertion into membrane after 50 ns through its N-term hydrophobic residues. On the other hand, HST5 which is highly charged, does not show penetration into membrane. Heatmap plots of distance for other starting orientations of MPG and Hst5 are in figure 2.5. Other initial orientations do not show deep insertion into membrane. In the second orientation of MPG which faces the membrane from C-terminal charged residues, the N-term binds to the membrane after 100ns (figure 2.5C). However, it does not enter the membrane during the 400ns of simulation. MPG with parallel orientation to membrane 39 Figure 2.3: Translocation of CPPs using HMMM model at various timepoints. Phosphate head- groups of the membrane are represented as tan spheres and the membrane acyl chains are cyan lines. Residues on the peptide are colored according to their charge. 
MPG inserts into membrane after 300ns failed to enter the membrane and binds to membrane through its N-term. The interac- tion energies of all the residues in MPG with membrane were calculated for the last 10ns of full-membrane model (figure 2.4) C-term residues have high interaction energies with membrane which is due to electrostatic interaction of these charged residues with phosphate headgroups of the membrane. N-term hydrophobic residues such as L3, F4, F7, L8 have fa- vorable interaction energies with membrane which is due to the hydrophobic interaction of these residues with membrane core which drives the peptide to penetrate into membrane. In summary, the preliminary simulations starting from ?-helical MPG and Hst5 showed that MPG can enter the membrane through its hydrophobic N-term, however, HST5 remains attached to the membrane surface and does not show insertion into membrane. Simulations starting from extended conformation with HMMM membrane:Structural properties and specifically the conformation of CPPs when interacting with cell membranes play a major role in their cellular uptake mechanisms.[108] Previous studies have shown that there is a direct impact of peptide conformation that modulates the amphipathicity and 40 Figure 2.4: Insertion of peptides into the model membrane A) Heatmap of distance of residues in MPG with respect to phosphate plane of hte closest leaflet with respect to simulation time for orientation 1. B) Heatmap of distance for Hst5 orientation 1. C) Interaction energies of residues in MPG with membrane from last 10ns of simulation with full-membrane model D) Snapshot of simulation for MPG at 300ns (red spheres show phosphate groups and grey lines show acyl tail of the membrane E) Snapshot of simulation for Hst5 at 300ns (showed as sticks are the K and R residues on Hst5 which are interacting with phosphate plane of the membrane 41 Figure 2.5: Heatmap plots for Hst5 and MPG starting with helical conformation predicted by PEP- FOLD3 a) Hst-5 in orientation 2 b) Hst-5 in orientation 3 c) MPG in orientation 2 d) MPG in orientation 3 membrane interaction of CPPs.[347] Hst5 is known to be disordered in aqueous solution and adopts a helical conformation in trifluoroethanol and DMSO solution.[239] MPG has been seen to be unstructured in aqueous solution and adopts a partially ? -sheet confor- mation upon interaction with vesicles made of phospholipids of DOPC/DOPG. It is worth mentioning that the CD experiments for MPG was done using vesicles of DOPC and DOPG which does not represent the full structure of fungal cell membranes.[276] Other stud- ies demonstrated that MPG remains random coil upon interaction with fungal cells.[108] Swiecicki et al. studied the effect of membrane composition on the internalization of sev- eral cationic CPPs using fluorescence quenching and showed that the internalization effi- cacy of CPPs such as penetratin and Tat greatly depend on the membrane composition.[288] The membrane composition of the cell can affect the conformation of cell penetrating pep- tides as well as their internalization efficacy.[53, 306] To this end, we investigated the effect 42 of conformation on peptide-membrane interaction of MPG and Hst5 using MD simulation with HMMM model. Since these initial simulations were biased to the initial structures being helical, we started new simulations from extended conformations of the peptides in the water phase. 
To this end, we investigated the effect of conformation on peptide- membrane interaction of MPG and Hst5 using MD with the HMMM membrane. 1 ?s simulations were performed in three replicates of extend conformations of MPG and Hst5 using HMMM model for the fungal membrane composition studied in the first part. Simi- lar to simulations starting from helical peptides, these results showed that MPG entered the membrane from its N-terminal hydrophobic residues, while Hst5 binds to the membrane surface and does not show penetration into the membrane. The simulations for MPG were converted to full-membrane model and simulated for an extended 200 ns to study its inter- action with a full-membrane model. The heatmap for distances of every residue in MPG to the phosphate plane of the entering leaflet are shown in Figure 2.6. Secondary structures of MPG during interaction with membrane was analyzed for the 1 ?s trajectories and is shown in Figure 2.6C. MPG adopts a random coil conformation in the C-terminal domain and interaction with membrane induces conformational change to ? -sheet in few residues such as A2, L3, F4, L8, G9, A10 from 100 to 600 ns of simulation. After 600 ns of simula- tion and deeper insertion to the membrane, most of the N-terminal residues adopt a helical conformation. Upon formation of ?-helical conformation, the peptide inserts deeper into membrane, as shown in the heatmap plot for MPG after 900 ns (Figure 2.6A), and most of the peptide residues are below the phosphate headgroup plane of the membrane. MPG adopts a helical conformation in the N-terminal region at 200 ns. Secondary structure and penetration of other orientations of MPG and Hst5 (extended conformation with HMMM membrane) are shown in figure 2.7 (MPG) and 2.8 (Hst5). In the second orientation of MPG after 400 ns the peptide loses its secondary structure in N-terminal region and adopts a helical structure in the C-terminal region (Figure 2.8C). Replicate 3 of MPG shows bind- 43 Figure 2.6: Results for MPG (A,C) and Hst5 (B,D) starting from extended peptide conformations in solvent A) Heatmap of distances of MPG first replica with respect to phosphate plane of hte nearest leaflet B) heatmap of distances of Hst5 first replicate C) secondary structure of MPG during simulation D) secondary structure of Hst5 during simulation E) Snapshots of MPG during all-atom simulation ing to the phosphate plane of membrane after 250 ns as shown in the heatmap plot (Figure 2.8B). Interestingly, it adopts a ? -sheet conformation in residues F7, L8, G9, S13, T14, and M15, and loses this ? -sheet conformation after about 600 ns of simulation which coincides with deeper insertion of MPG into membrane from the N-terminal region (figure 2.8D). In contrast to MPG which shows entry into membrane, Hst5 only binds to the phosphate plane of the membrane and fails to insert below the headgroup region of membrane. How- ever, a few hydrophobic residues such as Y10 and F14 insert below the headgroup region of membrane as shown in figure 2.6B,D and figure 2.7. Hst5 does not form any confor- 44 Figure 2.7: a) heatmap plot for Hst-5 (orientation-2) showing distance of every residue with respect to phosphate plane of nearest leaflet b) heatmap plot for Hst-5 (orientation-3) c) secondary structure of Hst-5 in orientation 2 during simulation d) secondary structure for Hst-5 orientation 3 mation when interacting with membrane during the 1 ?s simulation which is consistent in all replicas of Hst5-membrane systems. 
The Karlsson lab recombinantly expressed CPP-GFP fusions and tested cellular uptake in Candida albicans cells using flow cytometry. Based on their results, MPG-GFP significantly improved the translocation of GFP into the cells, while Hst5-GFP had no significant effect on GFP translocation into Candida albicans (figure 2.9). Interestingly, they showed that the orientation of MPG in the MPG-GFP construct affects the translocation efficacy: MPG attached to the N-terminus significantly improves cargo translocation, whereas in the construct with MPG at the C-terminus the translocation of GFP was insignificant compared to the control with no CPP attached. The effect of orientation on translocation of the MPG-GFP constructs is shown in figure 2.10.

Figure 2.8: a) Heatmap plot for MPG (orientation 2) showing the distance of every residue with respect to the phosphate plane of the nearest leaflet b) heatmap plot for MPG (orientation 3) c) secondary structure of MPG in orientation 2 during simulation d) secondary structure of MPG in orientation 3.

Figure 2.9: Cellular uptake studies done in the Karlsson lab. The flow cytometry data for 24 h incubation of samples were analyzed for 7 replicates to quantify translocation and membrane permeabilization in C. albicans. The percentage of fluorescence-positive cells was used to evaluate GFP delivery efficacy. The permeability of the cells was evaluated after treatment with the fusion protein using propidium iodide (PI). A) Translocation data at 24 h for 7 replicates showed significantly higher uptake of MPG-GFP compared to both GFP and Hst5-GFP. B) Propidium iodide uptake recorded at the same times showed no significant uptake of PI. Error bars represent the standard error of the mean for 7 replicates in panels (A, B). Image credit: Karlsson lab.

Figure 2.10: Effect of time and cargo orientation on MPG-mediated delivery of GFP to C. albicans. Purified protein (100 μM) with GFP attached to MPG at either the N-terminus or C-terminus, and controls with no CPP, were incubated with cells. Translocation was quantified using flow cytometry. Image credit: Karlsson lab.

Effect of membrane composition on internalization and secondary structure of MPG: Various experimental and simulation techniques have shown that the structural state of most CPPs is highly dependent on the lipid types. For instance, penetratin adopts a variety of conformations, from β-sheet to α-helical and unstructured, in the presence of different concentrations of charged lipids.[53] Swiecicki et al. studied the effect of membrane composition on the internalization of several cationic CPPs using fluorescence quenching and showed that the internalization efficacy of CPPs such as penetratin and Tat depends greatly on the membrane composition.[288] The membrane composition of the cell can thus affect the conformation of cell penetrating peptides as well as their internalization efficacy.[306] This is also true for MPG, where CD experiments with vesicles of DOPC and DOPG showed the presence of a β-sheet conformation.[208] However, experiments with fungal cells showed a helical conformation for MPG.[108] Furthermore, the results from the HMMM model could be biased by the highly hydrophobic solvent in the core of the membrane, which can induce the formation of α-helical structure.
In the next step of this study, we investigated the interaction and secondary structure of MPG in the presence 47 full membrane model of various compositions of the negatively-charged phosphatidylglyc- erol (PG). In this part of our study, we investigated the structure and interaction of MPG with 4 different membrane compositions 1) DOPC/DOPG(1:0) 2) DOPC/DOPG(4:1) 3) DOPC/DOPG(3:2) 4) DOPC/DOPG (2:3). We simulated all these system for 11 ?s each with a special-purpose supercomputer called Anton2 and using CHARMM36 forcefield for both lipids and protein. The results were consistent with our earlier study with HMMM. We have observed that the presence of charged DOPG lipids in the membrane induces per- sistent and long ?-helical conformation in the N-terminal of MPG. Formation of a helical conformation is also concurrent with deeper insertion of MPG into membrane. Although the peptide is mostly surface bound for all the charged lipids, we think that the occurrence of TM state for MPG requires longer simulation time or a higher temperature which was shown by Ulmschneider et. al. [299] Moreover, we also observed short ? -sheet forma- tion in N-terminal of MPG which was transient and replaced with ?-helical conformation upon deeper insertion of MPG into membrane. Snapshots of MPG in DOPC/DOPG(4:1) is shown in figure 2.11. Initially a ? -sheet structure forms when MPG contacts the phosphate plane of the membrane. This is transient and after about 2 ?s the peptide becomes unfolded on top of the membrane and then forms a helical structure at about 3 ?s. Formation of a fully helical structure at the N-terminal region coincides with deeper penetration of MPG into membrane. Heatmaps of distances of MPG residues to the phosphate plane and pep- tide secondary structure are shown for two membrane compositions DOPC/DOPG(1:0) or DOPC-100 and DOPC/DOPG(4:1) or DOPC-80 in figure 2.12 and 2.13. 2.1.3 Discussion and Conclusion Here we studied the interaction of CPPs MPG and Hst5 with model membranes using MD simulations. HMMM model was used for the membrane in our initial study to accel- 48 Figure 2.11: snapshots of MPG with DOPC(80%)-DOPG(20%) Figure 2.12: Heat map of insertion depth and secondary structure of MPG for A, C) 100% DOPC membrane B,D) 80% DOPC-20%DOPG membrane E) snapshot of MPG interaction with 100% DOPC F) snapshot of interaction of MPG with 80% DOPC 20% DOPG membrane 49 Figure 2.13: Heat map of insertion depth and secondary structure of MPG for A, C) 60%DOPC- 40%DOPG membrane B,D) 40% DOPC-60%DOPG membrane E) snapshot of MPG interaction with 60%DOPC-40%DOPG F) snapshot of interaction of MPG with 40% DOPC-60% DOPG mem- brane 50 erate the membrane insertion of peptides. MPG and Hst5 were predicted to adopt a helical conformation using PEPFOLD3. It was shown that during the simulation MPG inserts into the membrane from its hydrophobic N-terminus. However, Hst5 fails to insert into membrane and remains attached to the phosphate plane. Experiments done in Karlsson lab confirms the simulation results. The translocation of MPG with GFP as the cargo protein was significantly higher than GFP alone whereas Hst5 made no significant improvement of the GFP translocation. On the other hand, they showed that the orientation of MPG in the MPG-GFP constructs affects the translocation where MPG at the N-terminal had a sig- nificantly higher translocation than MPG at the C-terminal of the constructs. This is also in line with our simulation results where MPG enters the membrane from its hydrophobic N-terminus. 
It is therefore reasonable to assume that placing MPG at the C-terminus of the construct, prevents effective interaction of hydrophobic residues at the N-terminus of MPG with membrane and lowers its translocation. Secondary structure of CPPs, plays a crucial role in their uptake mechanism as well as their translocation efficacy. Structural studies have been performed before for MPG and Hst5. As discussed before, MPG has been shown to adopt a partially ? -sheet confor- mation upon interaction with vesicles made of phospholipids of DOPC/DOPG in exper- iments. Other studies have shown that MPG remains unstructured upon interaction with fungal cells. Membrane composition of cells can affect the conformation of cell penetrat- ing peptides as well as their internalization efficacy.[53] Simulations starting from extended peptide conformation and HMMM fungal cell membrane showed that MPG has a partially folded ? -sheet conformation when interacting with the phosphate plane of the membrane but upon deeper insertion into the membrane core, it adopts a helical conformation. This shows that the conformational change of peptide from ? -sheet which is mostly at the in- terface with phosphate headgroups to ?-helical when inside the membrane facilitates the translocation and deeper insertion of MPG. Hst5 on the other hand does not form any 51 secondary structure and remains unstructured during the simulation in all replicas. The hy- drophobic solvent used in HMMM model is likely to affect the secondary structure of pro- teins inserted into membrane. In order to avoid the artifact of HMMM model and also study the effect of different concentrations of charged lipids (DOPC/DOPG ratio) we ran long- timescale simulations of MPG interaction with different concentrations of DOPC/DOPG using Anton2 supercomputer. These simulations showed that MPG has a ? -sheet con- formation upon making contact with the membrane but after deeper insertion it adopts a helical conformation. Moreover, the helical conformation was maximum at 20% DOPG concentration which is the natural concentration of negatively charged lipids. The 100% DOPC concentration of membrane showed the smallest helical conformation and also the slowest penetration of peptide into membrane which points to the importance of charged lipids for helical conformation of MPG and its insertion into membrane. With the com- bined knowledge gained from experimental and simulation studies we are better equipped to design better CPPs and study their translocation into the fungal pathogen. This study will further motivate the use of both experiments and simulations to design better CPPs to enable cargo delivery. 52 2.2 Deep generative models for Antimicrobial peptide discovery Antimicrobial resistance causes ?2.8 million resistant infections yearly which leads to more than 700,000 deaths globally. This is expected to rise to 10 million deaths per year by 2050 if the current trend continues.[70, 161, 221] Of particular importance is Multi- drug resistant Gram-negative bacteria. Naturally occurring antimicrobial peptides (AMPs) have remained effective to combat pathogens despite their ancient origins and continuous contact with pathogens. Therefore, AMPs are deemed as ?drugs of last resort? for their ability to combat multi-drug resitant bacteria. AMPs are usually 12-50 amino acids long and are typically rich in cationic residues (R and K) as well as hydrophobic (A, C and L) amino acids. 
The mechanism of action of AMPs depends on their sequence but they generally act by disrupting the membrane or through other routes such as binding to DNA and essential cytoplasmic protein and inhibiting their function.[171]. There have been numerous studies on generating new AMPs and/or improving their activity which resulted in some successful AMPs.[75, 293] These have been generated usu- ally through expert knowledge and rational design approaches which could be very costly due to vast space of peptides. There are some limitations in using current AMPs such as their relatively low half-lives, unknown toxicity to human cells, and relatively high produc- tion costs.[195, 107, 198] On the other hand, due to the vast space of peptide sequences, computational techniques are necessary for discovery of novel AMPs with desired prop- erties. Generative models in artificial intelligence have previously shown great promise in material and drug discovery.[270, 255, 352] Deep learning have been previously used in peptide identification, property prediction and peptide generation.[52] Specifically, deep generative models have been used for generating antimicrobial, anticancer, immunogenic and signal peptides to name a few. Computational methods using recurrent neural networks (RNNs)[210], VAEs[71] and generative adversarial networks (GANs) [113, 297] showed 53 the promise of these methods for AMP discovery in silico. In this study, we use variational autoencoders (VAEs) to learn a meaningful latent space of AMPs and generate novel AMPs from this latent space. A variational autoencoder[151] encodes data into a latent space and decodes it back to the original data and optimizes a vari- ational lower bound of the log-likelihood of the data. Since we are dealing with sequences as done in natural language processing (NLP), we use recurrent neural networks (RNNs) as both encoders and decoders in what is known as sequence-to-sequence models. Due to complexity of natural language and sequential nature of data, these models are harder to train than other types of neural networks. However, Bowman et al.[31] showed that a seq- to-seq VAE is able to generate meaningful and novel sentences from the learnt continuous latent space. Attention mechanism proposed for translation originally has made a great leap in NLP tasks.[303] In attention mechanism, source information is summarized into a context vector using a weighted sum where weights are learned probabilistic distributions. This context vector is used during the decoding process to guide the decoder into what word in the sequence was most important during decoding. Attention was shown to signifi- cantly improve almost every task in seq2seq models such as translation [64], summarization [251], etc. However, Bahuleyan et al.[5] showed that using a deterministic attention where the source information is directly provided during decoding can lead to a phenomenon called the ?bypassing? where the variational latent space is not meaningful since the atten- tion mechanism is too powerful. Thus, they proposed a variational attention mechanism to address this problem where the attention vector (context vector) is modeled as a random variable by imposing a prior Gaussian distribution. They evaluated this model on question generation and dialog systems and showed that the variational attention achieves a higher diversity than deterministic attention while retaining high quality of generated sentences. 
In this study, we used a variational-attention variational autoencoder in a seq2seq approach to generate novel, high-quality and diverse AMPs. Moreover, we trained a binary classifier network using an attention mechanism to evaluate the peptides produced by the generative model. The generated peptides were also analyzed for their physicochemical properties and compared with real antimicrobial peptides.

2.2.1 Methods

The training data for the AMP prediction model were assembled by combining AMPs from multiple databases, including DRAMP [148], LAMP2 [342], DBAASP [234] and APD3 [311]. All AMP sequences had a length of 5-30 amino acids. To exclude repetitive sequences from our dataset, we used CD-HIT with a cutoff of 0.35, which resulted in 16,808 AMP sequences. Since there is no established dataset of non-AMPs, we built a non-antimicrobial dataset from UniProt by excluding the keywords antimicrobial, antibiotic, antibacterial, antiviral, antifungal, antimalarial, antiparasitic, anti-protist, anticancer, defense, defensin, cathelicidin, histatin, bacteriocin, microbicidal and fungicide. The final non-AMP dataset had 16,808 examples. We also made sure the positive and negative datasets have similar length distributions to avoid bias. The code for AMP prediction and generation can be found at https://github.com/ghorbanimahdi73/AMPGen.

2.2.2 AMP prediction model

We trained a model on both the AMP and non-AMP datasets for antimicrobial prediction. The architecture of the model is shown in figure 2.14. The antimicrobial classification network contains an embedding layer, a 2D convolution and a bidirectional LSTM with a context attention layer, followed by a sigmoid activation for binary classification of peptide sequences. The dataset was split into a training set (70%) and a validation set (30%). The output of the prediction model is a probability score for each sequence: sequences with a score > 0.5 are considered AMP and those with a score < 0.5 are considered non-AMP.

Figure 2.14: An illustration of the classification network used for evaluating the generated AMPs.

We used a binary cross-entropy loss and an Adam optimizer for training the network. Early stopping was applied if the validation loss did not improve for 5 consecutive epochs during training. The weights of the model with the best validation accuracy were selected as the optimal model, and a 10-fold cross-validation was applied to tune the hyperparameters. In the AMP prediction network, the peptide sequences are first transformed into sequences of integers from 1 to 20, which are then embedded into 2D matrices in the embedding layer of the network. The optimal embedding size was found to be 64 dimensions. A 2D convolution is then applied to the embedded sequences using 64 convolutional filters of size 3. The output of the convolutional layer goes into a bidirectional LSTM, which processes the matrices for each residue in both the forward and backward directions, and the output is the sum of the two directions. The tuned bi-LSTM hidden dimension was 64. The attention layer then gathers the hidden states of the bi-LSTM and computes a weighted sum of all hidden states as:

α_j = exp(a_j) / Σ_{j′} exp(a_{j′})    (2.1)

where a_j is the attention score obtained by applying a linear transformation to the bi-LSTM outputs followed by a ReLU activation, and α_j is the attention weight. The output of the attention layer is then the weighted sum Σ_j α_j h_j, where h_j is the j-th hidden state of the bi-LSTM output. The output of the attention layer finally goes into the sigmoid activation for antimicrobial prediction.
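A minimal PyTorch sketch of this classifier is given below. The layer sizes follow the text (embedding 64, 64 filters of width 3, bi-LSTM hidden size 64); the use of a width-3 Conv1d in place of the 2D convolution, and all remaining details, are simplifying assumptions rather than the exact implementation.

import torch
import torch.nn as nn

class AMPClassifier(nn.Module):
    """Sketch of the embedding -> convolution -> bi-LSTM -> attention -> sigmoid
    classifier described above (sizes follow the text, details are assumed)."""

    def __init__(self, vocab_size=21, embed_dim=64, n_filters=64, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(n_filters, hidden, batch_first=True, bidirectional=True)
        self.att_score = nn.Linear(hidden, 1)      # linear transform for scores a_j
        self.out = nn.Linear(hidden, 1)
        self.hidden = hidden

    def forward(self, x):                          # x: (batch, seq_len) token ids 1-20
        e = self.embed(x)                          # (batch, seq_len, embed_dim)
        c = torch.relu(self.conv(e.transpose(1, 2))).transpose(1, 2)
        h, _ = self.lstm(c)                        # (batch, seq_len, 2*hidden)
        h = h[..., :self.hidden] + h[..., self.hidden:]   # sum the two directions
        a = torch.relu(self.att_score(h))          # pre-normalized attention scores
        alpha = torch.softmax(a, dim=1)            # attention weights, eq. (2.1)
        context = (alpha * h).sum(dim=1)           # weighted sum of hidden states
        return torch.sigmoid(self.out(context)).squeeze(-1)

Training then pairs this module with nn.BCELoss and the Adam optimizer, with early stopping on the validation loss as described above.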
The final model had 93% accuracy under 10-fold cross-validation, which is comparable to other AMP-prediction models such as AMPlify and ACEP with 93.7% and 92.6% accuracy, respectively. However, the goal of this study is not antimicrobial prediction; this network was trained in order to evaluate the peptides generated by our generative model.

2.2.3 Variational autoencoder

A traditional VAE, as proposed by Kingma and Welling, encodes data into a latent space and then decodes it to reconstruct the input data. The network is trained to optimize the variational lower bound of the log-likelihood of the data. Since we are dealing with sequences, as in natural language processing (NLP), recurrent neural networks (RNNs) are typically used as encoders and decoders in what are known as sequence-to-sequence (seq2seq) models. Bowman et al.[31] trained a seq2seq VAE and used the continuous latent space to generate new text. A model with useful information in the latent space will have a non-zero KL term and a relatively small cross-entropy term. However, in a standard VAE the KL term becomes vanishingly small and the model reduces to an RNN language model: the decoder learns to ignore the latent vector z and to rely only on the input data provided at each step of decoding. Two techniques proposed by Bowman et al. to mitigate this issue are both used here: 1) KL annealing and 2) word dropout. For KL annealing, we add a variable weight to the KL term in the loss function during training. This weight is set near zero at the beginning of training and then increases to a maximum value toward the end of the training process, which ensures that at the beginning the model learns enough information from the latent space. Word dropout weakens the decoder by removing some of the conditional input information during training, forcing the model to rely more on the latent code.

The attention mechanism has transformed natural language processing, enabling the training of enormous models and achieving high accuracies. In the attention mechanism, the source information is summarized in an attention vector using a weighted sum of the hidden states of the source sentence, where the weights are learned probabilistic distributions. This attention vector is then fed directly to the decoder at each step during decoding. The attention mechanism has been shown to improve the performance of models in translation,[64] summarization [251] and other NLP tasks. However, it was also shown that deterministic attention can serve as a bypassing mechanism, so that the latent space fails to learn the distribution of the data because the attention is too powerful.[5] Here we use a variational autoencoder with variational attention for the generation of novel antimicrobial peptides. The different parts of the network are described in detail below.

Encoder

In our model, the encoder is a GRU parameterized by θ_E. The encoder network takes the input sequence x = x_1, ..., x_n and outputs the hidden representations of the sequence h = h_1, ..., h_n, where n is the length of the sequence:

h_i = GRU(h_{i−1}, x_i; θ_E)    (2.2)

Two dense layers are then used to learn the mean vector μ_z and the standard deviation vector σ_z, and the latent variable z is sampled from the Gaussian distribution N(μ_z, diag(σ_z²)).
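A sketch of this encoder with its Gaussian latent code is shown below. The hidden and latent sizes (128 and 32) follow the text, the reparameterization trick is used for sampling z, and the remaining details are assumptions rather than the exact implementation.

import torch
import torch.nn as nn

class PeptideEncoder(nn.Module):
    """Sketch of the recurrent encoder with a Gaussian latent code (eq. 2.2).
    Hidden size 128 and latent size 32 follow the text; the rest is assumed."""

    def __init__(self, vocab_size=24, embed_dim=64, hidden=128, latent=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent)        # mean vector mu_z
        self.to_logvar = nn.Linear(hidden, latent)    # log of sigma_z^2

    def forward(self, x):                             # x: (batch, seq_len) token ids
        h_all, h_last = self.gru(self.embed(x))       # h_all holds h_1 ... h_n
        mu = self.to_mu(h_last[-1])
        logvar = self.to_logvar(h_last[-1])
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
        return h_all, z, mu, logvar

The full set of encoder hidden states h_all is returned because the variational attention described next needs them.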
Variational Attention

The attention mechanism tries to dynamically align the output sequence x̃ = x̃_1, ..., x̃_n during generation. During decoding, the attention weight at step j of the decoder is computed between all hidden states of the encoder and the hidden state at step j of the decoder as:

α_{ji} = exp(e_{ji}) / Σ_{i′=1}^{n} exp(e_{ji′})    (2.3)

In the above equation, e_{ji} is the pre-normalized score calculated as e_{ji} = (h_j^{x̃})^T W h_i^{x}, where h_j^{x̃} and h_i^{x} are the j-th and i-th hidden representations of the decoder and encoder, respectively, and W is a bilinear term that captures their specific relation. The attention vector is then calculated as a weighted sum:

a_j = Σ_{i=1}^{n} α_{ji} h_i    (2.4)

The posterior q_A(a_j|x) is modeled as another Gaussian distribution:

a_j ∼ q_A(a_j|x) = N(μ_{a_j}, diag(σ_{a_j}²))    (2.5)

Decoder

The decoder is a single-layer GRU. At each step the decoder is provided with the latent code of the encoder, the attention vector computed at that step and the true input sequence:

h̃_j = GRU(h̃_{j−1}, y_{j−1}, a_j, z)

A softmax function at the end is used to predict the next token x̃_j in the sequence given the hidden representation h̃_j:

p(x̃_t) = softmax(W_h h̃_t + b_h)    (2.6)

The loss function of the model with attention at each step of the decoder can be written as:

L_j(θ_D, θ_E, x) = −KL(q_E(z, a|x) || p(z, a)) + E_{q_E(z,a|x)}[log p_D(x̃|z, a)]    (2.7)
                = −KL(q_E(z|x_i) || p(z)) − KL(q_E(a|x_i) || p(a)) + E_{q_E(z|x_i) q_E(a|x_i)}[log p_D(x̃_i|z, a)]

In the above equations, KL is the Kullback-Leibler divergence between two distributions. The posterior q_E(z, a|x) = q_E(z|x) q_E(a|x) is factorized into two distributions since a and z are conditionally independent given x, so sampling can be performed separately for a and z. The overall objective of the VAE with variational attention can then be written as:

L(θ_D, θ_E) = L_rec(θ_D, θ_E, x̃) + β_KL KL[q_E(z|x) || p(z)] + β_a Σ_{j=1}^{n} KL[q_E(a_j|x) || p(a_j)]    (2.8)

The hyperparameters β_KL and β_a are the weights on the KL and attention terms of the loss function. Annealing is applied to the β_KL weight while the β_a weight is kept constant; we used a monotonic annealing schedule from 0 to the maximum weight of the KL term.
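The combined objective of eq. (2.8) is just the reconstruction loss plus the two weighted KL terms, both of which have the closed Gaussian form; a minimal sketch follows, in which the tensor shapes and the averaging convention are assumptions.

import torch
import torch.nn.functional as F

def vae_attention_loss(logits, targets, mu_z, logvar_z, mu_a, logvar_a,
                       beta_kl=0.08, beta_a=0.5, pad_idx=0):
    """Sketch of eq. (2.8): token reconstruction loss plus weighted KL terms for
    the latent code z and the attention vectors a_j (standard-normal priors).
    mu_a/logvar_a are assumed to have shape (batch, seq_len, latent)."""
    rec = F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=pad_idx)
    kl_z = -0.5 * torch.mean(torch.sum(1 + logvar_z - mu_z**2 - logvar_z.exp(), dim=-1))
    kl_a = -0.5 * torch.mean(torch.sum(1 + logvar_a - mu_a**2 - logvar_a.exp(), dim=(-2, -1)))
    return rec + beta_kl * kl_z + beta_a * kl_a

During training, beta_kl is annealed from near zero to its maximum value while beta_a stays fixed, matching the schedule described above.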
The number of convolutional kernels in the con- volution layer were 64 with a size of 3. We also employed dropout with a dropout rate of 0.3 to avoid overfitting. The training and validation accuracy during training is shown in figure 2.13. This model achieved an accuracy of 93.5% under a 10-fold cross validation. The accuracy of our model is comparable to other AMP prediction models such as AMPlify [172] and ACEP [102] which use deep learning models and report accuracies of 92.79% and 91.16 (for sequences less than 30), respectively. Training the generating network For the generative model, we only used the AMP dataset consisting of 16,808 known AMP sequence. For training the generative VAE with variational attention, our architec- 61 Figure 2.15: Training and validation accuracy of the AMP-prediction over the training epochs. Figure 2.16: Illustration of AMP generative model with Encoder, Decoder and variational attention parts ture consisted of 128 hidden units for the Encoder and Decoder which are both single directional LSTM networks, a latent space dimension of 32. The model was trained for 100 epochs. The training process takes about 3 hours on a Tesla V100 GPU. During gener- ation we experimented with different sampling methods such as Beam-search, Temperature sampling, Top-K sampling and Top-P sampling. Top-K sampling gave better results than other sampling methods with K = 5. During the training process, we tokenize the sequences of amino acids into all the twenty natural amino acids and three additional tokens representing the start of the se- quence ??, the end of a sequence ??, the padding ??. In the word- 62 dropout technique, we randomly replace an amino acid with ?? token during train- ing. Since the peptides have different length, we added a padding token to the sequences so that they all have a fixed length of 30 during training and evaluation. The standard VAE where the weight on the KL term is 1, suffers from a KL vanishing problem which leads to i) an encoder that produces posteriors almost identical to Gaussian prior for all sam- ples and ii) Decoder ignores the latent variable and the model reduces to a simple RNN Encoder-Decoder. Bowman et al.[31] proposed two approaches to deal with this problem. A word-dropout which randomly replaces the words in sequences with an unknown token ?? during training to avoid overfitting in the model. And the second approach is KL-annealing where at the start of the training process, the weight of the KL term is small where z is learned to capture useful information for reconstructing x during training. Then the KL-weight increases monotonically to a maximum value. During training, we used a monotonic annealing where the weight increases from a small value (0.001) to a maxi- mum value. Not annealing the KL-term led to a posterior collapse and an uninformative latent space. We experimented with different weights on the KL and attention KL term and the results of generative model evaluation are shown in figure 2.15. For each set of hyperparameters we generated 10,000 peptide sequences from the generative model. The AMP-prediction model was used to predict what fraction of the generated AMPs are in fact antimicrobial. As shown in figure 2.15, increasing the ?KL increases the accuracy of the generative model in generating antimicrobial peptides. In most ?KL values, increasing ?a also increases the accuracy of AMP prediction. BLEU is another metric used here to evaluate the generated peptides. 
BLEU is another metric used here to evaluate the generated peptides. It was originally proposed for evaluating machine translation by comparing the similarity between sentences translated by the model and true human references. Here we used the BLEU score to compare the generated AMPs with the training data of real AMP sequences. The BLEU score is calculated for every sample in the generated dataset S_gen against all the real AMP references S_ref as:

BLEU(S_gen, S_ref) = (1 / |S_gen|) Σ_{s ∈ S_gen} BLEU(s, S_ref)    (2.9)

A higher BLEU score implies more overlap of n-grams between the generated data and the real AMPs. BLEU scores for the VAE without annealing, the VAE with annealing and β_KL = 0.08, and the VAE-attn model with different combinations of β_KL and β_a are reported in table 2.2.

Figure 2.17: Evaluation of the AMP generative model over different values of β_KL and β_a. A) Accuracy of the generative model over 10,000 generated sequences using the trained AMP-prediction model. B) Average perplexity of the generated sequences using an external language model. C) Average BLEU score (BLEU-2 to BLEU-5) of generated sequences. D) Average self-BLEU score (BLEU-2 to BLEU-5) of generated sequences.

External language model: In NLP, given a sequence of words x = (w_1, ..., w_n), a language model estimates the probability distribution P(x) over it. A popular choice is autoregressive language modeling, where:

P(x) = P(w_1, ..., w_n) = Π_{i=1}^{n} p(w_i | w_1, ..., w_{i−1})    (2.10)

The likelihood P(x) of a sequence of words can be used as a proxy for its quality. RNNs and LSTMs are common architectures for autoregressive modeling: they are trained to predict the next word given the current word w_i and the hidden state of the previous step h_{i−1}, which is equivalent to maximizing the marginal likelihood P(x) of the sequences in the training data. RNN language models have been used to compute the perplexity of generated text, which is a measure of the fluency of machine-generated text.[349] We trained an LSTM language model on the AMP dataset: the data were split into a training (70%) and a held-out test (30%) set, a character-level LSTM was trained on the training set, and the perplexity was calculated on the held-out set. Our best model achieved a perplexity of 6.0 on the held-out set. Figure 2.17B shows the change in perplexity as β_KL and β_a are varied in the VAE-attn model; a higher β_KL gives a higher perplexity.

Table 2.2: BLEU, accuracy and perplexity of a few selected models
Model                              BLEU-3   BLEU-4   BLEU-5   BLEU    ACC     PPL
VAE (no anneal)                    0.999    0.984    0.911    0.974   99.5    5.3
VAE (β_KL=0.08)                    0.998    0.962    0.826    0.947   96.2    8.09
VAE-attn (β_KL=0.04, β_a=0.5)      0.997    0.921    0.711    0.907   88.7    11.82
VAE-attn (β_KL=0.04, β_a=1.5)      0.998    0.928    0.722    0.912   89.8    11.48
VAE-attn (β_KL=0.06, β_a=0.5)      0.998    0.943    0.769    0.927   93.6    10.26
VAE-attn (β_KL=0.06, β_a=1.5)      0.998    0.944    0.773    0.929   92.84   10.13
VAE-attn (β_KL=0.08, β_a=0.5)      0.998    0.957    0.818    0.943   95.6    8.8

Table 2.2 shows the perplexity of a standard VAE (without annealing), a β-VAE with β_KL = 0.08, and other selected models.
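Given a trained autoregressive language model, the perplexity reported above is simply the exponential of the average per-token negative log-likelihood. A minimal sketch follows; the interface of the lm object (token ids in, next-token logits out) and the padding convention are assumptions.

import math
import torch
import torch.nn.functional as F

def perplexity(lm, sequences, pad_idx=0):
    """Perplexity = exp(average per-token negative log-likelihood) under an
    external character-level language model `lm` (assumed interface)."""
    nll, count = 0.0, 0
    with torch.no_grad():
        for seq in sequences:                      # seq: 1D tensor of token ids
            logits = lm(seq[:-1].unsqueeze(0))     # predict each next token
            loss = F.cross_entropy(logits.squeeze(0), seq[1:],
                                   ignore_index=pad_idx, reduction='sum')
            nll += loss.item()
            count += int((seq[1:] != pad_idx).sum())
    return math.exp(nll / count)

Lower perplexity means the generated peptides look more fluent to the external model, which is why the posterior-collapsed standard VAE scores deceptively well on this metric.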
This shows that the standard VAE has collapsed to a denoising autoencoder which is just reconstructing back the original data. In order to evaluate the diversity of the generated AMPs we used self-BLEU score which assesses the similarity between every generated sequence and the rest of the gener- ated dataset. Lower self-BLEU score implied higher diversity of the generated text. The BLEU and self-BLEU scores for different ?KL and ?a weights are shown in figure 2.15C,D. Increasing ?KL increases the BLEU and self-BLEU scores. We noticed that at higher ?KL 66 values the difference between different ?a values are more noticeable. The comparison of self-BLEU for real AMPS, standard VAE (no annealing), ? -VAE (?=0.08) and other selected models is shown in table 2.3. The self-BLEU for real-AMPs is 0.968. This high value is due to the choosing a small cutoff (0.35) for removing repetitive AMPs in the data curation process which was chosen to maximize the number of selected AMPs for the generative model. Standard VAE without annealing shows a higher self-BLEU than the real AMPS which shows a very low diversity of generated sequences. ?VAE also shows a higher self-BLEU than other VAE-attn models. The KL divergence for each model was also calculated against the validation AMP dataset. A higher KL implies a higher differ- ence between real and generated AMP distributions. However, a very high KL could lead to the model only generating random sequences. A comparison of KL for different models is shown in table 2.3. Standard VAE have a KL of 0 which shows the posterior collapse and the model becoming a RNN language model. The KL for ?VAE is 1.1 which is also lower than VAE-attn models which points to lower divergence more similarity of ?VAE generated sequences to the real AMPs. We also investigated the generated antimicrobial peptides through their physicochem- ical properties such as length, charge, hydrophobicity and hydrophobic moment. Specifi- cally the generated sequences are rich in amino acids such as Lys, Leu, Arg, Ala, Ile, Gly, Val, Phe and Trp in that order. (figure 2.16) As shown in the distribution of charges for generated peptides, most of them have a positive net charge due to presence of Lys and Arg residues. Furthermore, the hydrophobic moment shows that most of these are ?-helix peptides. These observations highlight the close properties of the generated peptides with the dataset of real antimicrobial peptides which point to the fact that the generated peptides have antimicrobial activities. 67 Table 2.3: self-BLEU (sBLEU) for 3,4 and 5-grams and KL divergence Model sBLEU-3 sBLEU-4 sBLEU-5 sBLEU KL real AMPS 0.998 0.967 0.909 0.968 - VAE (no anneal) 0.998 0.984 0.911 0.973 0 VAE (?KL=0.08) 0.998 0.962 0.826 0.946 1.1 VAE-attn (?KL=0.04, ? =0.5) 0.999 0.929 0.764 0.923 13.1a VAE-attn (?KL=0.04, ? =1.5) 0.994 0.9333 0.771 0.924 12.4a VAE-attn (?KL=0.06, ? =0.5) 0.993 0.941 0.819 0.938 7.0a VAE-attn (?KL=0.06, ? 0.994 0.944 0.828 0.942 6.4a=1.5) VAE-attn (?KL=0.08, ? =0.5) 0.998 0.957 0.818 0.943 3.2a 2.2.5 Conclusion Antimicrobial peptides have shown great potential as alternative therapeutics for bacte- rial resistance. In this study, we use deep learning generative model attention based vari- ational autoencoder to generate novel and high quality sequences of AMPs. A bypassing phenomena has been observed when using deterministic attention in a VAE framework. 
2.2.5 Conclusion

Antimicrobial peptides have shown great potential as alternative therapeutics against bacterial resistance. In this study, we used a deep generative model, an attention-based variational autoencoder, to generate novel, high-quality AMP sequences. A bypassing phenomenon is observed when deterministic attention is used in a VAE framework: the bypassing makes the latent space uninformative, and sampling from this space gives essentially random results during generation. Since attention has been shown to improve many NLP tasks, it is tempting to include an attention mechanism in our model, but deterministic attention is not suitable for the sequence generation task on the same dataset. We therefore opted for the variational attention approach of Bahuleyan et al.,[5] in which the attention vector is modeled as a random variable with an imposed Gaussian prior, and used this variational-attention VAE to generate novel AMPs. The generated AMPs from our best model were evaluated with an antimicrobial prediction network, which assigned them a probability of being antimicrobial above 95%. We also used evaluation metrics such as BLEU, self-BLEU and perplexity, which indicated the high quality of the generated sequences. Moreover, we compared the physicochemical properties of the generated peptides with those of real AMPs and found close agreement. Future directions of this work include post-generation evaluation models, such as regression models predicting antimicrobial (MIC) activity and a toxicity predictor, to select a few promising peptides, followed by further evaluation with MD simulations and experimental validation.

Chapter 3: Markov modeling and machine learning

3.1 Markov modeling of conformational fluctuations in β2-microglobulin

3.2 Introduction

β2-microglobulin (β2m) is a 99-residue protein subunit of the major histocompatibility complex I.[229] Upon renal failure, the concentration of β2m in the serum increases by 60-fold, which causes fibril formation. Individuals with kidney impairment undergoing haemodialysis have a high serum concentration of β2m, about 30–50 mg/mL compared with the normal level of 0.3–30 mg/mL, because the dialysis membrane cannot effectively remove the protein.[18] The high concentration of β2m is known to be the major cause of the fibrillogenesis associated with dialysis-related amyloidosis (DRA).[123, 63] One of the first steps in protein aggregation involves partial unfolding or misfolding of monomeric species to initiate aggregation and amyloid formation.[85] Moreover, monomers of β2m are highly stable under physiological conditions, even at high concentrations in vitro.[89] Intermediate states in the folding of β2m have been identified that adopt a non-native trans conformation at Pro32.[4, 278] This intermediate is a known precursor of enhanced β2m fibrillogenesis and is known to form stable isomers.[56, 166] However, the trans conformation alone is not sufficient to induce amyloid formation, as mutants of β2m have been identified in which the trans conformation is dominant yet spontaneous amyloid formation is not observed.[235] This indicates that other structural changes are involved in the misfolding of β2m, and that the structural, thermodynamic and kinetic properties of β2m conformational changes need to be investigated to unveil its amyloid propensity.[63] Here we set out to study the folding landscape of β2m to identify metastable misfolded states with potential aggregation propensity. Experimental techniques such as NMR and cryo-EM only provide static snapshots of the most populated conformational states, while other techniques such as FRET are limited in resolution and cannot give atomic-level detail of the dynamics.
On the other hand, molecular dynamics simulation has proven to be a useful tool for providing atomic-level detail of biological processes such as protein folding and protein conformational heterogeneity. This usually requires generating large amounts of data on fast supercomputers or distributed computing platforms such as Folding@home,[160] which makes interpretation and extraction of biologically relevant information a challenging task. Recently, Markov state models (MSMs) have been increasingly adopted for analyzing the high-dimensional data from MD simulations. In this framework, the dynamics of the biological system is described by memoryless jumps between discrete conformational states. In this study we characterize the dynamic conformational landscape of β2m to identify aggregation-prone intermediate metastable states using molecular dynamics simulations. MSM analysis was applied to obtain the thermodynamics and kinetics of β2m misfolding, which gives important insights into the first stage of β2m aggregation. Metadynamics simulations were performed to sample misfolded and near-folded conformations of β2m and to seed the MSM simulations. We then accumulated 250 μs of MD simulation trajectories of β2m to perform the MSM analysis.

3.2.1 Methods

Metadynamics and conventional simulations of β2m: Metadynamics simulation was used to sample different conformations of β2m close to the native state. In metadynamics,[9] one picks a few relevant (slow) collective variables (CVs), and an external history-dependent bias potential, constructed as a sum of Gaussian kernels, is added to the simulation in the space of these CVs. The idea of metadynamics is to push the system away from local minima and to visit new states in collective-variable space. More details on metadynamics are given in the enhanced sampling section of chapter 1 of this dissertation. Three collective variables were chosen following a previous study on β2m:[162] 1) the β-sheet content of the protein, 2) a phipsi collective variable containing all φ and ψ backbone torsion angles and 3) the RMSD with respect to the folded state. We performed metadynamics simulations with two collective variables at a time, combining every pair of the CVs above, for a total of three metadynamics simulations of 500 ns each: metadynamics-1 (β-sheet content and phipsi CVs), metadynamics-2 (β-sheet content and RMSD CVs) and metadynamics-3 (RMSD and phipsi CVs). In each metadynamics simulation, Gaussians were deposited every 2 ps with a height of 2 kJ/mol and a bias factor of 10. All simulations were run at a temperature of 340 K to further enhance the conformational transitions of β2m. This temperature is below the experimental melting point of β2m (357.6 K), which avoids complete unfolding of the protein.[257] Metadynamics simulations were performed using GROMACS and PLUMED.[294]
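The bias-deposition protocol above can be illustrated with a minimal, self-contained sketch of well-tempered metadynamics on a single toy collective variable. The Gaussian height (2 kJ/mol), deposition pace (2 ps at a 2 fs timestep) and bias factor (10) mirror the values quoted in the text, while the double-well potential, the Gaussian width and the overdamped toy dynamics are illustrative assumptions, not the actual GROMACS/PLUMED setup.

```python
# Toy well-tempered metadynamics on one 1D collective variable (sketch only).
import numpy as np

kB = 0.008314               # kJ/(mol K)
T, gamma = 340.0, 10.0      # temperature and bias factor, as in the protocol
height0, sigma = 2.0, 0.1   # initial Gaussian height (kJ/mol) and assumed width
dt, pace = 0.002, 1000      # 2 fs timestep; deposit every 1000 steps = 2 ps

def dU(s):                  # gradient of a toy double-well potential U = s^4 - 2 s^2
    return 4.0 * s * (s * s - 1.0)

centers, heights = [], []

def bias(s):
    if not centers:
        return 0.0
    c, h = np.array(centers), np.array(heights)
    return np.sum(h * np.exp(-0.5 * ((s - c) / sigma) ** 2))

def dbias(s):
    if not centers:
        return 0.0
    c, h = np.array(centers), np.array(heights)
    g = h * np.exp(-0.5 * ((s - c) / sigma) ** 2)
    return np.sum(g * -(s - c) / sigma ** 2)

rng = np.random.default_rng(0)
s = -1.0
for step in range(200_000):
    # overdamped Langevin step on the biased potential U(s) + V_bias(s)
    force = -(dU(s) + dbias(s))
    s += force * dt + np.sqrt(2.0 * kB * T * dt) * rng.normal()
    if step % pace == 0:
        # well-tempered scaling: deposited height shrinks as the bias accumulates
        h = height0 * np.exp(-bias(s) / (kB * T * (gamma - 1.0)))
        centers.append(s)
        heights.append(h)

print(f"deposited {len(centers)} Gaussians; final height {heights[-1]:.3f} kJ/mol")
```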
The CHARMM36m[23] force field was used for the protein and the TIP3P[144] water model for the solvent. Na+ and Cl− ions were added to neutralize the system. Prior to running metadynamics, we minimized the system with the steepest descent algorithm, followed by 0.25 ns of equilibration with a 1 fs timestep and a further 20 ns of equilibration with a 2 fs timestep in the NPT ensemble. For all simulations, we used a velocity-rescaling thermostat to maintain the temperature at 340 K with a coupling constant of 0.1 ps⁻¹. Pressure was maintained at 1 bar using the Parrinello-Rahman barostat with a coupling constant of 5 and a compressibility of 4.5 × 10⁻⁵ bar⁻¹.[226] The resulting 500 ns simulations for each CV pair were combined into a 1.5 μs trajectory and clustered with K-means into 300 structures to seed the Markov-model simulations. The seed structures were solvated and each simulated for 500 ns to build an MSM. These simulations were performed at 340 K and 1 bar using the velocity-rescaling thermostat, the Parrinello-Rahman barostat and a 2 fs integration timestep. The initial Markov model resulted in disconnected states in TICA space, so we seeded further states from the low-populated intermediates in the TICA space and simulated another 200 structures. In total, we accumulated 500 trajectories with a total simulation time of 250 μs for wild-type β2m. Snapshots were saved every 200 ps and every 1 ns snapshot was used for building the MSM.

3.2.2 Results

MSM construction and validation

In an MSM, our purpose is to model the slow dynamics of the system.[215] The variational principle of conformational dynamics provides a scalar score (VAMP-2)[218] for comparing featurizations in order to find the Markov-model hyperparameters that yield a kinetic model with the highest kinetic variance. To find the optimal hyperparameters for featurization and TICA, we used this variational scoring with cross-validation to evaluate model quality.[139] The following trajectory featurizations were considered for the optimization: 1) Cartesian coordinates of the Cα atoms, 2) pairwise Cα–Cα distances, 3) dihedral angles, 4) transformed pairwise distances f(d_ij) = exp(−d_ij) and 5) inverse pairwise distances between Cα atoms. A 50:50 train-test shuffle-split cross-validation scheme was used to evaluate the MSM hyperparameters and avoid overfitting: the model is fitted to the training data and the test set is transformed according to the fitted model. We repeated the shuffle split five times to obtain standard deviations of the out-of-sample model performance. Because the MSM and its VAMP-2 score depend strongly on the chosen lag time, we repeated the process for three different lag times (10, 20 and 50 ns).

Figure 3.1: Feature selection with the VAMP-2 score over three lag times (10, 20, 50 ns)

Figure 3.2: Optimal choice of hyperparameters for the MSM

The result of the feature optimization is shown in figure 3.1. Based on this analysis, Cα Cartesian coordinates consistently outperform the other feature types at all lag times. After selecting Cα coordinates as the featurization, we need to select the remaining MSM hyperparameters: the number of TICA components, the number of microstates for clustering and the lag time for building the MSM. The optimal hyperparameters were obtained using a cross-validated VAMP-2 score. Based on this analysis (figure 3.2), the VAMP-2 score is maximized at a lag time of 20 ns with 80 microstates and 4 TICs. Featurization and TICA were performed with the PyEMMA software.[263] After finding the optimal hyperparameters for the kinetic model, we transformed the data into the 4D TIC space to reduce the dimensionality of the feature space. TICA projects the dynamics onto a few components (TICs) while preserving the long-timescale dynamics of the system. The TICA lag time, number of components (TICs) and number of cluster centers were optimized using the 5-fold cross-validated VAMP-2 score (20 ns lag time, 4 TICs and 80 clusters).
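A sketch of the cross-validated VAMP-2 feature comparison described above is given below using PyEMMA (version 2.5 or later). File names, the number of shuffle splits and the two candidate feature sets shown are placeholders rather than the exact inputs of this study.

```python
# Sketch of shuffle-split cross-validated VAMP-2 scoring of featurizations.
import numpy as np
import pyemma

def score_cv(data, lag, dim, n_splits=5, val_fraction=0.5):
    """Cross-validated VAMP-2 score for one featurization."""
    scores = []
    n_trajs = len(data)
    n_val = max(1, int(n_trajs * val_fraction))
    for _ in range(n_splits):
        val = set(np.random.choice(n_trajs, size=n_val, replace=False))
        train = [d for i, d in enumerate(data) if i not in val]
        test = [d for i, d in enumerate(data) if i in val]
        v = pyemma.coordinates.vamp(train, lag=lag, dim=dim)   # fit on training half
        scores.append(v.score(test))                           # score on held-out half
    return np.mean(scores), np.std(scores)

top = "b2m.pdb"                                      # placeholder topology file
trajs = ["traj_%03d.xtc" % i for i in range(500)]    # placeholder trajectory list

# two of the candidate featurizations compared in the text
feat_ca = pyemma.coordinates.featurizer(top)
feat_ca.add_selection(feat_ca.select_Ca())            # Ca Cartesian coordinates

feat_dist = pyemma.coordinates.featurizer(top)
feat_dist.add_distances_ca()                          # pairwise Ca-Ca distances

for name, feat in [("Ca coords", feat_ca), ("Ca distances", feat_dist)]:
    data = pyemma.coordinates.load(trajs, features=feat)
    for lag_ns in (10, 20, 50):
        lag = lag_ns                                  # in frames, assuming 1 ns per frame
        mean, std = score_cv(data, lag=lag, dim=4, n_splits=5)
        print(f"{name:12s} lag={lag_ns:3d} ns  VAMP-2 = {mean:.2f} +/- {std:.2f}")
```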
Figure 3.3: Free energy landscape in the space of the 4 TICs

The free energy landscape (FEL) in the space of the TICs was obtained by histogram analysis over the TIC dimensions, as shown in figure 3.3 for different combinations of TIC components. Multiple low-energy basins are found in the FEL, with transition regions between the metastable states. The 4D TICA space was then clustered into 80 microstates with K-means. All trajectories were discretized onto these 80 microstates and an MSM transition matrix was built using a Bayesian Markov-modeling scheme. Thermodynamic and kinetic properties of the system can then be extracted from the eigendecomposition of the transition matrix. To choose a proper lag time for the final MSM, we plotted the implied timescales (ITS) as a function of lag time, as shown in figure 3.4A. We select the smallest lag time at which the implied timescales have converged; here they converge after about 75 ns, which is the lag time used to build the final MSM. Diagonalization of the 80-microstate transition matrix yields 4 leading timescales (eigenvalues) followed by a spectral gap, which motivated a 5-macrostate MSM for further analysis. The Chapman-Kolmogorov (CK) test was performed on the diagonal of the MSM transition matrix to check the self-consistency of the constructed MSM (figure 3.4B).

Figure 3.4: A) Implied timescales B) CK test

Figure 3.5 shows the top four eigenvectors of the transition matrix projected onto the TIC space. The first eigenvector corresponds to the stationary distribution and gives the free energy landscape (FEL). The timescale of each subsequent eigenvector corresponds to a timescale in the ITS plot at a lag time of 75 ns.

Figure 3.5: Eigenvectors of the transition matrix projected onto the TIC space. The timescales t1 to t4 correspond to the timescale of each eigenvector, where t1 is the timescale of the second eigenvector (the first eigenvector is the stationary distribution)

To identify the folded and unfolded regions of the TICA landscape, we computed the fraction of native contacts for each snapshot and colored the TICA landscape with this quantity. Figure 3.6 shows the fraction of native contacts over four combinations of TICs.[22] This shows that the first implied timescale of 8.2 ± 0.9 μs, seen in the eigenvector visualization (figure 3.5), corresponds to going from the globally folded state of the protein to an unfolded (misfolded) region. The second implied timescale of 3.0 ± 0.9 μs corresponds to transitions between two different misfolded states.

Figure 3.6: Fraction of native contacts for the different states

The 80-state transition matrix was further coarse-grained into 5 states using PCCA++ clustering over the first 4 eigenvectors of the transition matrix. PCCA++ is a fuzzy clustering algorithm that gives the probability of each microstate belonging to each of the 5 macrostates; we use the maximum assignment probability to assign the 80 microstates to the 5 macrostates. A visualization of the macrostate assignment over the TICA space is given in figure 3.7 for a few TIC dimensions. The stationary probability and free energy of each macrostate are given in Table 3.1.

Figure 3.7: Metastable-state assignment according to PCCA++ over the TICA space

Table 3.1: Stationary probability and free energy of the metastable states

State   Probability   Free energy / kT
S1      0.027         3.62
S2      0.079         2.53
S3      0.192         1.65
S4      0.231         1.47
S5      0.471         0.75
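The TICA, clustering and MSM-validation steps above can be sketched with PyEMMA as follows. The `data` object is assumed to be the Cα-coordinate features loaded as in the previous snippet, with one frame per nanosecond, so lag times are given in frames.

```python
# Sketch of TICA -> K-means -> Bayesian MSM -> ITS/CK validation with PyEMMA.
import pyemma

tica = pyemma.coordinates.tica(data, lag=20, dim=4)        # 20 ns TICA lag, 4 TICs
tica_output = tica.get_output()                            # list of per-trajectory arrays

cluster = pyemma.coordinates.cluster_kmeans(tica_output, k=80, max_iter=100)
dtrajs = cluster.dtrajs                                    # discretized trajectories

# implied timescales with Bayesian error bars, to pick the MSM lag time
its = pyemma.msm.its(dtrajs, lags=[10, 20, 50, 75, 100, 150], errors='bayes')
pyemma.plots.plot_implied_timescales(its, units='ns')

# final Bayesian MSM at the converged lag time (75 ns) and its validation
msm = pyemma.msm.bayesian_markov_model(dtrajs, lag=75)
print("slowest timescales (ns):", msm.timescales(4))
cktest = msm.cktest(5)                                     # Chapman-Kolmogorov test, 5 sets
pyemma.plots.plot_cktest(cktest)
```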
State S1 has the smallest population, only 0.027, with a free energy of 3.62 kBT, while the folded state S5 has a population of 0.471 and a free energy of 0.75 kBT. Exposure of hydrophobic residues in the misfolded states of monomeric proteins is an important factor for aggregation and amyloid formation. We therefore computed the solvent accessible surface area (SASA) of the hydrophobic residues of β2m over the TICA landscape (figure 3.8).

Figure 3.8: Hydrophobic SASA over the TICA landscape

Figure 3.9A shows the average hydrophobic SASA of each metastable state in β2m folding. State S1, the unfolded (misfolded) state, has the highest hydrophobic SASA, and S5, the folded state, has the lowest. We applied transition path theory (TPT) to analyze the flux from the unfolded S1 state to the folded S5 state; the results are shown in figure 3.9B. The most probable folding pathways run from the unfolded states S1 and S3 to the folded S5 state, while the other pathways carry smaller fluxes. To visualize the metastable-state structures, we sampled from the center of each metastable state to generate a representative structure. Figure 3.10 shows these structures along with the mean first passage times (MFPTs) between states. In the misfolded state S1, the outer strands A and D are unfolded. This unfolding of the outer strands exposes the hydrophobic core of the protein to the solvent, which makes the protein prone to aggregation. The hydrophobic residues Leu54, Phe56, Trp60 and Phe62 on the DE loop, as well as residue Phe30, are the dimerization hotspots of the protein.[95] We also computed the root mean square fluctuation (RMSF) relative to the folded state for 10,000 structures sampled from each metastable state; the results are shown in figure 3.11. State S1 has the highest fluctuation in strand A, which completely detaches from the core of the protein. Detachment of strand A is a hallmark of aggregation for the ΔN6 variant of β2m. The other metastable states show lower RMSF in strand A, so an unfolded strand A is a structural characteristic of state S1. The other misfolded states show high RMSF in strand D, which is also unfolded. This strand is in fact the first to unfold and has high RMSF in all metastable states, even in the folded state S5. This causes larger fluctuations in the DE loop, which exposes its hydrophobic residues to the solvent. We computed the β-sheet content of the protein in each metastable state (figure 3.12). Strand A in state S1 is unstructured in more than half of this structural ensemble, and the second half of strand D is also unstructured; strand G likewise has a lower sheet content than in the folded ensemble S5. State S2 has an unstructured strand D, with an even lower sheet probability than in state S1; an unstructured strand D is thus characteristic of this state. States S3 and S4 also have a partially unfolded strand D and a fully folded strand A.

Figure 3.9: A) Hydrophobic SASA of the different metastable states B) Network representing the flux from the misfolded S1 to the folded S5 state; arrows show the flux between states and the size of each state corresponds to its stationary probability

Figure 3.10: Representative structures and the timescales of the transitions between states. Strand A is shown in blue and strand D in red. The thickness of each arrow is proportional to the transition rate between states and the diameter of each circle corresponds to the population of the metastable state.

Figure 3.11: RMSF of the different metastable states from sampled snapshots.
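The coarse-graining and transition-path-theory analysis behind figures 3.9 and 3.10 can be sketched as below, assuming `msm` is the Bayesian MSM estimated in the earlier snippet. The choice of which metastable set is treated as "misfolded" and which as "folded" is a placeholder; in practice it is assigned from native contacts and RMSD.

```python
# Sketch of PCCA++ coarse-graining, state populations, MFPTs and TPT flux.
import numpy as np
import pyemma

msm.pcca(5)                                   # PCCA++ into 5 macrostates
assignments = msm.metastable_assignments      # microstate -> macrostate (crisp)
sets = msm.metastable_sets                    # list of microstate index arrays

# stationary probability and free energy (in kT) of each macrostate
for i, s in enumerate(sets):
    pi = msm.stationary_distribution[s].sum()
    print(f"S{i + 1}: population {pi:.3f}, free energy {-np.log(pi):.2f} kT")

misfolded, folded = sets[0], sets[4]          # placeholder choice of source/sink sets
flux = pyemma.msm.tpt(msm, misfolded, folded) # reactive flux from S1 to S5

# MFPTs between the two sets (in frames; multiply by the frame spacing for ns)
print("MFPT misfolded -> folded:", msm.mfpt(misfolded, folded))
print("MFPT folded -> misfolded:", msm.mfpt(folded, misfolded))

# dominant folding pathways carrying 95% of the reactive flux
paths, capacities = flux.pathways(fraction=0.95)
for p, c in zip(paths, capacities):
    print("pathway", p, "carries flux", c)
```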
3.2.3 Discussion and Conclusion

High concentrations of β2m are suggested to be the major cause of fibrillogenesis in dialysis patients undergoing haemodialysis.[18, 89] Monomers of wild-type β2m are highly stable under physiological conditions, with almost no tendency to form aggregates even at elevated concentrations in vitro.[89] Partial unfolding or misfolding of monomeric species is widely believed to be the first stage of aggregation for globular proteins.[4, 278] In this regard, a long-lived metastable intermediate in the folding of β2m has been identified with a non-native trans-Pro32 conformation.[145] Although this folding intermediate has been recognized as an important amyloidogenic precursor, it is not the only factor driving the amyloidogenic properties of β2m. Structural and dynamical studies are needed to investigate the misfolded states of β2m with highly amyloidogenic character. Marchand et al.[162] studied the aggregation properties of the D76N mutant of β2m by combining ssNMR and ensemble-modeling molecular dynamics. Their results pointed to major conformational exchanges occurring on the μs–ms timescale. The metastable states in their study are characterized by the loss of β-strand structure in the outer strands of D76N. They proposed that destabilization of the outer strands of D76N β2m increases its aggregation propensity, as it impairs the protection of the hydrophobic core of the protein and exposes it to solvent. In the excited state of D76N, the D and A strands were unstructured and the C-terminus was partially detached. This exposed the aggregation-prone core strands (B, E and F), which lose the protection of the aggregation-resistant edge strands A, D and G. Despite numerous structural and mutational studies on β2m conformational dynamics, the structural details that dictate the formation of amyloidogenic species of β2m remain elusive.

Figure 3.12: β-sheet content of the different states: A) S1, B) S2, C) S3 and D) S4. The grey area in each panel shows the β-sheet content of state S1

Markov state models are statistical models that estimate conformational changes as Markovian transitions in a discrete state space. They can overcome the timescale limitations inherent to long unbiased MD, because an MSM can be estimated from multiple short MD trajectories, allowing the sampling of different states to be conducted in parallel, which is highly efficient on modern supercomputers. To our knowledge, this is the first study applying MSMs to the folding landscape of β2m, using 250 μs of accumulated MD trajectories. Initially, we ran three metadynamics simulations, each with two collective variables: metadynamics-1 (β-sheet content and phipsi CVs), metadynamics-2 (β-sheet content and RMSD CVs) and metadynamics-3 (RMSD and phipsi CVs). The data from all metadynamics simulations were combined and clustered to generate 300 seeds for conventional MD simulations of 500 ns each. The initial MSM built from these trajectories resulted in a free energy landscape with disconnected states, so we adaptively sampled further from the low-energy states and ran 200 additional simulations. The total simulation time of 250 μs, although perhaps too short to describe the full folding landscape of β2m, is probably sufficient to characterize the near-folded and misfolded states and the transitions between them.
Constructing a Markov model of the folding and misfolding trajectories involves selecting multiple hyperparameters, such as the number of TICs and the number of discrete states. The variational approach to conformational dynamics (VAC) allows different models to be compared and the optimal featurization and hyperparameters for the MSM to be chosen. We therefore computed a 5-fold cross-validated VAMP-2 score for different feature types as well as for the number of TICs and the number of clusters. This led to choosing Cα coordinates as features, 4 TICs and 80 clusters for the MSM construction. Projection of the data onto the 4D TICA space (figure 3.3) shows multiple low-energy states with transition regions between them. To choose a proper lag time for the final MSM and check its Markovian properties, we conducted implied-timescale and CK tests. At the chosen lag time of 75 ns the implied timescales converge, and the CK test shows that the model can predict multiple lag times into the future, confirming the Markovian property of the model. The eigenvectors of the MSM transition matrix projected onto the TICA space show that the transition from the unfolded to the folded state is the slowest process, with a timescale of 8.2 ± 0.9 μs, and that the second slowest process is the transition between different misfolded states (figure 3.5). The implied timescales show a gap between the 4th and 5th timescales, which led us to construct a 5-state coarse-grained MSM. The 80-state MSM was coarse-grained into 5 metastable states using the PCCA++ algorithm. The metastable states projected onto the TICA space are shown in figure 3.7, with the probability and free energy of each state given in table 3.1. We investigated the aggregation propensity of the different states by computing the hydrophobic SASA of samples in each state. The computed hydrophobic SASA projected onto the TICA space is shown in figure 3.8, and the average SASA of each state in figure 3.9A. The unfolded state S1 has the highest hydrophobic SASA and the folded state S5 the lowest. The transition from the unfolded S1 to the folded S5 occurs either directly or through the intermediate S3 state, as shown by the TPT analysis in figure 3.9B. We investigated the structural properties of the intermediate states by sampling a conformation from the center of each metastable state. Figure 3.10 shows the structures of the metastable states as well as the MFPTs of the transitions between them. State S1 has an unfolded D strand and an unstructured, detached A strand, reminiscent of the ΔN6 intermediate state. The detachment of strand A in S1 is the slowest process in the misfolding landscape of β2m. The RMSF of the different states, shown in figure 3.11, confirms that a high RMSF of strand A is characteristic only of state S1. The secondary-structure analysis of the different states, given as the percentage of β-sheet content (figure 3.12), shows that state S1 has an unfolded strand A and a partially folded strand D, while state S2 has an unfolded strand D. The other states have a low β-sheet content in strand D, which indicates that strand D is highly flexible. Representative structures of states S1, S2 and S3 are shown in figure 3.13. The unfolding of strand A is reminiscent of the ΔN6 variant of β2m.[95, 93] Strand A plays a major role in aggregation by acting as a hook in dimer assembly.[88] Unfolding or detachment of strand A from the core exposes the hydrophobic residues Pro5, Leu7 and Val9 on this strand to the solvent (figure 3.13).
It has been proposed that the high aggregation potential of ΔN6 is due to its ability to populate one or more aggregation-prone intermediate states.[254] Estacio et al.[95] performed a computational study of the dimerization of the intermediate state using Monte-Carlo ensemble docking (MC-ED) and constructed contact maps of the dimer interface.

Figure 3.13: Representative structures of states S1, S2 and S3

For wild-type β2m, dimerization was driven mainly by the DE loop, and other studies have also shown the importance of the DE loop for dimerization and aggregation of β2m.[257] Hotspot residues Phe56, Trp60, Phe62, Tyr63 and Leu65, on or near the DE loop, were identified, which also assist docking of hβ2m to the MHC-I heavy chain.[235] The aromatic residues Phe56, Trp60, Phe62 and Tyr63 all lie in an aggregation-prone sequence and make contact with the MHC-I heavy chain.[259] Residues Phe62, Tyr63 and Leu65 were shown to play major roles in fibril nucleation,[250] and Trp60 in that study had the largest number of intermolecular contacts between β2m monomers. Phe30 also belongs to the same hydrophobic cluster near the DE loop and is important for aggregation. Structural mapping of the dimerization interface using computational techniques suggested residues Tyr10, His13, Phe30 and His84 as hotspots for ΔN6 amyloidosis.[95] Our Markov state model also shows the involvement of hydrophobic residues, especially Phe56, Trp60, Phe62 and Tyr63, in the misfolded states of the β2m folding landscape, with μs transition times for unfolding and folding. Since the simulation temperature has a direct impact on the transition rates, the folding and unfolding MFPTs, which are on the order of tens of μs, are expected to be underestimated. Nevertheless, this study gives important insights into the misfolding pathway and the metastable states of misfolding β2m.

3.3 Variational embedding of protein folding simulations using Gaussian mixture variational autoencoders

Conformational sampling of biomolecules using molecular dynamics simulations often produces a large amount of high-dimensional data that is difficult to interpret using conventional analysis techniques. Dimensionality reduction methods are thus required to extract useful and relevant information. Here, we devise a machine learning method, the Gaussian mixture variational autoencoder (GMVAE), that can simultaneously perform dimensionality reduction and clustering of biomolecular conformations in an unsupervised way. We show that the GMVAE can learn a reduced representation of the free energy landscape of protein folding, with highly separated clusters that correspond to the metastable states visited during folding. Since the GMVAE uses a mixture of Gaussians as its prior, it directly acknowledges the multi-basin nature of the protein-folding free energy landscape. To make the model end-to-end differentiable, we use a Gumbel-softmax distribution. We test the model on three long-timescale protein-folding trajectories and show that the GMVAE embedding resembles the folding funnel, with the folded states at the bottom of the funnel and the unfolded states outside the funnel path. Additionally, we show that the latent space of the GMVAE can be used for kinetic analysis, and Markov state models built on this embedding produce folding and unfolding timescales in close agreement with other rigorous dynamical embeddings such as time-lagged independent component analysis (TICA).¹

¹Taken from a published paper: Ghorbani, M., Prasad, S., Klauda, J. B., and Brooks, B. R. (2021).
Variational embedding of protein folding simulations using Gaussian mixture variational autoencoders. The Journal of Chemical Physics, 155(19), 194108.

3.3.1 Introduction

In recent years, computer simulations of biomolecular systems have gained enormous attention due to advances in theoretical methods, algorithms and computer hardware, enabling efficient exploration of atomic-scale processes using molecular dynamics (MD) simulations.[134] In an MD simulation, one integrates Newton's equations of motion, where the forces between atoms are described by a parameterized force field. Exploration of the high-dimensional configuration space typically requires long-timescale simulations or enhanced sampling techniques.[339, 19] These simulations usually generate a large amount of high-dimensional data, making the analysis of important features of protein folding, such as the free energy landscape (FEL) and the identification of metastable states, a challenging task.[105] Therefore, dimensionality reduction techniques are often used to describe processes such as folding and conformational transitions of proteins.[169] The ideal FEL should consist of heavily clustered data points, where each cluster is positioned in a local free energy minimum and corresponds to a long-lived metastable state separated from the others by kinetic bottlenecks (i.e. free energy barriers).[124] This ideal FEL is the cornerstone of many kinetic models that describe the dynamics of the system, for example Markov state models (MSMs).[61, 60, 59] Traditional methods for capturing the FEL rely on identifying relevant collective variables (CVs) that are well suited to describe the physical process or to distinguish different states. However, finding the right collective variables for a system of interest requires physical/chemical intuition about the process.[85, 222] This makes it necessary to define a low-dimensional representation of the system that can capture the essential degrees of freedom or the important CVs of the system of interest. There are various methods for dimensionality reduction and for finding optimal representations of complex FELs, such as PCA,[1] TICA,[269, 228] Isomap,[6] sketch-map[42] and diffusion maps.[212, 213] PCA-based methods assume an underlying linear manifold, which is generally not correct. Some nonlinear manifold methods like Isomap assume the data to be isomorphic to a hyperplane, which leads to topological instabilities. Moreover, these methods involve computing distances (geodesic or other kernel-based) between all pairs of points, which makes them unscalable to larger MD simulation trajectories. In diffusion maps, one needs to calculate Gaussian kernels, which can be computationally expensive and does not scale to large MD datasets. Machine learning (ML) has recently emerged as a powerful alternative for learning informative representations, and variational autoencoders (VAEs) in particular have shown great potential for unsupervised representation learning.[151] An autoencoder has two parts: an encoder and a decoder. The encoder network reduces the input data to a low-dimensional latent space and the decoder maps the latent representation back to the original data. In the VAE framework, a regularization is added to the model by forcing the latent space to be similar to a pre-defined probability distribution (e.g. a Gaussian), which is called the prior.
VAEs have recently been used for CV discovery in MD simulations,[47, 266, 50] enhanced sampling[244, 26] and dimensionality reduction.[24, 302] In a simple VAE, the prior is a standard distribution, which can lead to over-regularization of the posterior distribution and result in posterior collapse.[112] This makes the output of the decoder almost independent of the latent embedding and can result in poor reconstruction and highly overlapping clusters in the latent space.[24] On the other hand, a Gaussian prior is limited because the learnt representation can only be unimodal and cannot capture the multimodal nature of data such as protein folding simulations, where multiple metastable states exist during the folding process.[83] In this work, we employ a Gaussian mixture variational autoencoder (GMVAE) that directly acknowledges the multimodal nature of protein folding simulations and can construct the ideal multi-basin FEL. This is achieved by modeling the latent space as a mixture of Gaussians, using a categorical variable that identifies which mode each data point comes from. The GMVAE model therefore simultaneously performs dimensionality reduction and clustering.[84] The features in our model are the normalized distance maps between the Cα atoms of the protein. We test our model on three long-timescale protein folding simulations from the D. E. Shaw group:[180] Trp-cage (208 μs), BBA (325 μs) and Villin (125 μs). We show that the model can learn the funnel-shaped landscape of protein folding and cluster the conformational space with high accuracy into states that correspond to different structural features of the protein. Furthermore, we show that even though the GMVAE embedding does not use any dynamical information, it is able to describe the kinetics of protein folding, and the folding and unfolding timescales obtained by building a Markov model on this embedding are in close agreement with other works that use a rigorous dynamical model to describe the kinetics.

3.3.2 Methods

Variational inference methods convert an intractable inference problem into an optimization problem. While classical variational methods are limited to conjugate priors and likelihoods, VAEs allow the use of arbitrary function approximators (i.e. neural networks) for the conditional posterior.[151] VAEs can be approached from two perspectives: variational inference and neural networks. In variational inference, the main idea is to learn a distribution in the latent space that truly captures the distribution of the dataset. In particular, given a dataset x, the goal of variational inference is to infer the latent representation z, i.e. to accurately model p(z|x). Bayes' theorem gives the relation between the posterior p(z|x), the prior p(z) and the likelihood p(x|z) as:

p(z|x) = \frac{p(x|z)\, p(z)}{p(x)}    (3.1)

The denominator p(x) is called the evidence; it requires marginalization over all latent variables and is therefore intractable. In variational inference one thus seeks an approximate posterior q_\phi(z|x) with learnable parameters \phi and minimizes the Kullback-Leibler (KL) divergence between the approximate and true posteriors. The KL divergence quantifies the difference between two probability distributions and is defined as:

D_{KL}\left(q_\phi(z|x) \,\|\, p(z|x)\right) = \mathbb{E}_{q}\left[\log \frac{q_\phi(z|x)}{p(z|x)}\right]    (3.2)

Rewriting this equation using Bayes' rule gives:

\log p(x) = D_{KL}\left(q_\phi(z|x) \,\|\, p(z|x)\right) - \mathbb{E}_{q}\left[\log \frac{q_\phi(z|x)}{p(x,z)}\right]    (3.3)
Due to Jensen's inequality, the KL divergence is non-negative, which makes the last term in this equation, called the evidence lower bound (ELBO), a lower bound on the log-likelihood of the evidence:

\mathrm{ELBO} = \mathbb{E}_{q}\left[\log \frac{p(x,z)}{q_\phi(z|x)}\right]    (3.4)

Therefore, equation 3.3 can be written as:

\log p(x) = D_{KL}\left(q_\phi(z|x) \,\|\, p(z|x)\right) + \mathrm{ELBO}    (3.5)

This implies that minimizing the KL divergence, or maximizing the log-likelihood of the evidence, can be achieved by maximizing the ELBO. The graphical model of the GMVAE is shown in figure 3.14A. In the generative part (decoder) of the network, a sample z is drawn from the latent distribution p_\beta(z|y) of cluster y, parameterized by \beta through the decoder part of the neural network. This sample is then used to generate the conditional distribution p_\theta(x|z), parameterized by another neural network \theta. The generative process for the GMVAE can be written as:

p_{\theta,\beta}(x,z,y) = p_\theta(x|z)\, p_\beta(z|y)\, p(y)    (3.6)
p_\beta(z|y) = N\left(z \mid \mu_\beta(y), \sigma^2_\beta(y)\right)    (3.7)
p_\theta(x|z) = N\left(x \mid \mu_\theta(z), \sigma^2_\theta(z)\right)    (3.8)
p(y) = \mathrm{Cat}(\pi)    (3.9)

In these equations, \pi = 1/K is the uniform categorical distribution, where K is the number of clusters, and Cat(\pi) is the categorical distribution of the discrete variable y. N(\cdot) denotes a normal distribution, where \mu_\beta, \mu_\theta, \sigma^2_\beta and \sigma^2_\theta are the means and variances learned by the neural networks parameterized by \beta and \theta. Variational inference for the GMVAE is performed by maximizing the ELBO, which can be written as:

\mathrm{ELBO} = \mathbb{E}_{q}\left[\log \frac{p_{\theta,\beta}(x,z,y)}{q_{\phi,\psi}(z,y|x)}\right]    (3.10)

The approximate posterior of the inference model, q_{\phi,\psi}(z,y|x), can be factorized into two distributions:

q_{\phi,\psi}(z,y|x) = q_\phi(y|x)\, q_\psi(z|x,y)    (3.11)

q_\phi(y|x) gives the cluster assignment probabilities, so \sum_{k=1}^{K} q_\phi(y=k|x) = 1. q_\psi(z|x,y) is a Gaussian mixture whose parameters (\mu_\psi, \sigma^2_\psi) are learned by the encoder part of the neural network. In this model, the categorical variable y is a discrete node, which cannot be backpropagated through and is therefore substituted with a Gumbel-softmax distribution, which approximates the categorical distribution with a continuous one:

y_i = \frac{\exp\left((\log \pi_i + g_i)/\tau\right)}{\sum_{j=1}^{K} \exp\left((\log \pi_j + g_j)/\tau\right)} \quad \text{for } i = 1, \ldots, K    (3.12)

Here the temperature parameter \tau controls the smoothness of the distribution: at small temperatures samples are close to one-hot encoded, while at large temperatures the distribution is smoother. The g_i are samples drawn from a Gumbel(0,1) distribution. Using the generative and inference models, the ELBO can be written as:

\mathrm{ELBO} = \mathbb{E}_{q}\left[\log \frac{p_\theta(x|z)\, p_\beta(z|y)\, p(y)}{q_\phi(y|x)\, q_\psi(z|x,y)}\right]    (3.13)

\mathrm{ELBO} = \mathbb{E}_{q}\left[\log p(y) - \log q_\phi(y|x) + \log \frac{p_\beta(z|y)}{q_\psi(z|x,y)} + \log p_\theta(x|z)\right]    (3.14)

The second term in the loss is the cross-entropy term and the last term is the mean squared error between the true and the reconstructed data.
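The Gumbel-softmax relaxation of Eq. 3.12 can be illustrated with the small numpy sketch below, which shows how the temperature τ controls how close a sample is to a one-hot vector. This is a didactic example in plain numpy, not the TensorFlow implementation used in the model; the toy cluster probabilities are arbitrary.

```python
# Minimal illustration of Gumbel-softmax sampling (Eq. 3.12).
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_sample(logits, tau):
    """Draw one relaxed categorical sample y from class log-probabilities."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))   # Gumbel(0,1) noise
    z = (logits + g) / tau
    z -= z.max()                                            # numerical stability
    e = np.exp(z)
    return e / e.sum()

pi = np.array([0.2, 0.5, 0.3])          # toy cluster probabilities (K = 3)
logits = np.log(pi)

for tau in (5.0, 1.0, 0.1):
    y = gumbel_softmax_sample(logits, tau)
    print(f"tau={tau:3.1f} -> y = {np.round(y, 3)}")
# high tau: y is nearly uniform and smooth; low tau: y approaches a one-hot vector,
# while remaining differentiable with respect to the logits.
```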
3.3.3 Model parameters

The model architecture is shown in figure 3.14B. The GMVAE model was implemented in TensorFlow. Convolutional layers were applied along with pooling for their ability to recognize features in images.

Figure 3.14: A) Graphical model for the inference and generative parts of the GMVAE; grey circles represent the observed data. B) Schematic of the GMVAE architecture. In this architecture, q(y|x) refers to the cluster assignment probabilities, q(z|x,y) is the approximate posterior, μ and σ are the mean and variance of each Gaussian in the approximate posterior of the encoder network, p(z|y) is the prior Gaussian, and μp and σp are the means and variances of the prior Gaussians in the decoder network

Figure 3.15: Native folded structures of the studied proteins: A) Trp-cage, B) BBA, C) Villin headpiece

An exponential linear unit (ELU) activation function was used in each layer, and a softmax activation was used for the cluster assignment probability. The means and variances of the distributions were obtained using no activation and a softplus activation, respectively. Adam was used as the optimizer in all models.[150] We optimized the hyperparameters of the model based on the reconstruction loss; the chosen hyperparameters for each protein are shown in Table 3.2.

Table 3.2: Chosen hyperparameters for each protein

System    Layers  Neurons  Latent dim  Clusters  Batch size  Temperature  Kernel size  Learning rate  Filters      Pooling
Trp-cage  2       64       5           8         5000        0.1          [3,3]        0.001          [64,64]      [1,1]
BBA       2       64       6           9         5000        0.1          [3,3]        0.001          [64,64]      [2,2]
Villin    3       64       5           6         2500        0.05         [3,3,3]      0.001          [64,64,32]   [2,2,1]

During training, we split the data into a training set (fraction 0.8) and a validation set (0.2). The latent space dimension was chosen by grid search to minimize the reconstruction loss on the validation set for each protein. The number of clusters is another hyperparameter that must be specified before training. Varolgüneş et al.[302] used a thresholding scheme to keep only the clusters whose class probabilities exceed a pre-defined cutoff, and we adopted a similar procedure here. To select this hyperparameter, we first start with an arbitrary number of clusters (e.g. 10) and compute the membership probability of each input point. We then use a cutoff value (0.95) to count the number of clusters with membership probabilities above the cutoff, and retrain the model with the recovered number of clusters. We found that this number is highly robust to the other hyperparameters of the model, and that after the first round of training the number of recovered clusters does not change when the same probability cutoff is applied. Each model was trained for 100 epochs. The temperature parameter in the Gumbel-softmax controls the smoothness of the distribution. We also tried annealing the temperature, starting from a high value (5) and lowering it to 0.1 during the first 40 epochs of training and then keeping it fixed for the rest of training; however, the model diverged after a few epochs, and a fixed, small temperature gave the best results. Since the GMVAE gives a probabilistic cluster assignment, i.e. the probability of each data point belonging to each cluster (fuzzy clustering), we used a k-nearest-neighbors method to compute a hard cluster assignment from the neighborhood of each point in the embedding. For the kinetic analysis, we used the PyEMMA package[264] to build the transition matrix. In each case, the embedding was discretized using 500 K-means cluster points and the transition probability matrix was built by counting the transitions between states at lag time τ. The implied timescales are computed from the eigenvalues of the transition probability matrix:

t_i(\tau) = -\frac{\tau}{\ln |\lambda_i(\tau)|}    (3.15)

To test the Markovianity of the transition matrix, the implied timescales are plotted against the lag time and the smallest τ is chosen such that the implied timescales have converged. A coarse-grained transition matrix is later built by assigning the K-means points to the closest GMVAE clusters, yielding a coarse-grained view of the dynamics. The folding and unfolding timescales are obtained from this coarse-grained matrix.
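The kinetic analysis on the GMVAE embedding described in this paragraph can be sketched as follows. Here `latent` is assumed to be a list of per-trajectory arrays of latent coordinates and `soft_labels` the corresponding GMVAE cluster probabilities; both names, the choice of which GMVAE cluster is "folded", and the assumption that all 500 microstates end up in the MSM active set are placeholders.

```python
# Sketch of kNN hard assignment, MSM construction on the embedding and MFPTs.
import numpy as np
import pyemma
from sklearn.neighbors import KNeighborsClassifier

# hard cluster assignment from the fuzzy GMVAE output via k-nearest neighbors
X = np.concatenate(latent)
argmax_labels = np.concatenate([p.argmax(axis=1) for p in soft_labels])
knn = KNeighborsClassifier(n_neighbors=500).fit(X, argmax_labels)
hard_labels = knn.predict(X)

# discretize the embedding with 500 K-means centers and build an MSM
kmeans = pyemma.coordinates.cluster_kmeans(latent, k=500, max_iter=100)
dtrajs = kmeans.dtrajs
its = pyemma.msm.its(dtrajs, lags=[50, 100, 160, 220, 300])   # pick a converged lag
msm = pyemma.msm.estimate_markov_model(dtrajs, lag=160)       # lag in frames

# map each K-means microstate to its nearest GMVAE cluster and compute MFPTs
micro_to_gmvae = knn.predict(kmeans.clustercenters)
folded = np.where(micro_to_gmvae == 5)[0]       # placeholder: GMVAE cluster 5 = folded
unfolded = np.where(micro_to_gmvae != 5)[0]
print("folding MFPT  :", msm.mfpt(unfolded, folded))
print("unfolding MFPT:", msm.mfpt(folded, unfolded))
```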
3.3.4 Results

Here we test the performance of the GMVAE model for dimensionality reduction and clustering of three protein folding systems: Trp-cage (PDB: 2JOF),[11] BBA (PDB: 1FME)[260] and Villin (PDB: 2F4K).[157] The native folded structures of these proteins are shown in figure 3.15. We show that the GMVAE embedding captures the free energy landscape of these proteins with well-separated clusters, and we analyze the structural properties of each cluster to show that each cluster corresponds to a different structural feature of the protein. The total loss, cross-entropy loss and reconstruction loss all decrease for both the training and validation sets for all three proteins; the loss for Trp-cage is shown as an example in figure 3.16.

Figure 3.16: Training and validation loss for the Trp-cage example

For visualizing the latent space of the GMVAE, we used a low-dimensional latent space (2 or 3) and show that this embedding mimics the funnel-shaped landscape of protein folding, where the folded state resides at the bottom of the funnel and the unfolded states lie outside it. For the rest of the analysis of each protein, we used an optimized latent-space dimension based on a cross-validated reconstruction loss. Figure 3.17 shows the cross-validated reconstruction loss as a function of latent space dimension for each protein. Higher-dimensional embeddings give a better reconstruction loss for all proteins, meaning that a high-dimensional latent space is needed in our GMVAE model to capture the complex protein-folding landscape.

Figure 3.17: Reconstruction loss vs latent space dimension for A) Trp-cage, B) BBA, C) Villin headpiece

To test whether the GMVAE clusters give meaningful structural information, we sampled 5000 data points from the center of each cluster and compared the RMSD distributions of the whole protein and of specific domains, relative to the folded state, for each cluster. Moreover, we show that building a Markov model on the GMVAE embedding produces folding and unfolding timescales in close agreement with those obtained from a Markov model constructed on a dynamical embedding such as TICA.

3.3.5 Trp-cage

As the first example, we test our GMVAE model on an ultra-long 208 μs explicit-solvent simulation of the K8A mutant of the 20-residue Trp-cage TC10b at 290 K from D. E. Shaw Research.[180] Numerous experimental and computational studies have been performed on Trp-cage.[201, 92, 274]

Figure 3.18: Results of the GMVAE for Trp-cage. A) 3D embedding (zdim=3) colored by RMSD with respect to the folded state. B) First two dimensions of the latent space (zdim=3) colored by RMSD. C) Free energy landscape of the first two dimensions of the embedding (zdim=3). D) t-SNE visualization of the 5D latent space colored by the argmax of the cluster assignment probabilities (only points with more than 0.75 membership probability are shown). E) RMSD distribution of Trp-cage in the different clusters. F) Implied timescale (ITS) plot for the MSM construction

The folded state of Trp-cage, shown in figure 3.18A, contains an α-helix (residues 2-8), a 3₁₀-helix and a polyproline II helix, and the tryptophan residue is caged at the center of the protein.
Two different folding mechanisms have been identified for Trp-cage to date:[78] in the first, Trp-cage goes through a hydrophobic collapse into a molten globule, followed by formation of the N-terminal helix and the native core (nucleation-condensation); in the second, the helix pre-forms from the extended conformation, followed by the joint formation of the 3₁₀-helix and the hydrophobic core (diffusion-collision). The second mechanism has been identified as the dominant folding pathway for Trp-cage. Here we investigated the Trp-cage folding trajectories using the GMVAE model for embedding and clustering. The features are the normalized distances between the Cα atoms of Trp-cage in the trajectories. The hyperparameter K, which sets the number of clusters, is unknown a priori. To choose a reasonable number for each protein, we started from a higher estimate of the number of clusters (e.g. 10) and trained the model. We then used a cutoff (0.95) to find the number of clusters with membership probability above the cutoff value; only 8 out of 10 clusters exceeded 0.95. Next, we trained the model again with 8 clusters, and at this stage all clusters had membership probabilities above our original cutoff. Moreover, we found the same number of clusters regardless of the other hyperparameters of the model, such as the number of layers. Although 2D or 3D latent spaces are used for visualization, higher-dimensional latent embeddings are needed to describe the folding energy landscape more accurately. To choose an optimal latent-space dimension, we computed a cross-validated reconstruction loss for latent dimensions from 2 to 10; the results for Trp-cage are shown in figure 3.17A, and we chose a 5-dimensional latent space for clustering this protein. Other hyperparameters such as batch size, learning rate, number of layers, Gumbel-softmax temperature, kernel size, number of filters and pooling sizes were optimized by grid search based on the reconstruction loss; the chosen hyperparameters for each protein are listed in Table 3.2. The total, reconstruction and cross-entropy losses using these hyperparameters are shown in figure 3.16. The reconstruction and cross-entropy losses for both training and validation data decrease steadily, demonstrating the convergence of the model after 100 epochs of training. Figure 3.18A shows the 3-dimensional embedding (zdim=3) of the Trp-cage trajectories colored by RMSD with respect to the crystal structure. The gradual change of color from high RMSD (red) to low RMSD (blue) across the landscape demonstrates that the low-dimensional embedding captures the protein folding process. Figure 3.18B shows the first two dimensions of the latent embedding colored by RMSD; the high-RMSD and low-RMSD regions are well separated on this landscape. The folded state has a narrow distribution and forms the narrow wedge of the folding funnel. We computed the free energy landscape on the first two dimensions of the latent space (figure 3.18C); it shows multiple wells separated by diffuse regions in between.

Figure 3.19: Trp-cage folding transitions; the thickness of the lines corresponds to the transition probability between the two states. Transitions with probabilities less than 0.05 are not shown for clarity
The wells correspond to the centers of the GMVAE clusters, and the diffuse regions are the transition regions between different conformational states. The hard cluster assignment in the 3D latent space is shown in figure 3.20A. Next, based on figure 3.17A, we used a 5-dimensional latent space for clustering Trp-cage. To visualize the 5D latent space, we take only data points with membership assignment probabilities above 0.75 and use t-distributed stochastic neighbor embedding (t-SNE)[300] to project the 5-dimensional embedding into two dimensions. The t-SNE results for Trp-cage are shown in figure 3.18D; the clusters are highly separated on this landscape. To ensure that the GMVAE clusters correspond to different structures sampled during folding, we sampled 5000 points from the center of each cluster and computed the RMSD distribution of the protein with respect to the folded state (figure 3.18E). The folded state (cluster 5) has a narrow distribution, while the other unfolded and misfolded states have wider distributions at higher RMSD values. Representative structures of each cluster are shown in figure 3.19. We also computed the RMSD distribution of residues 11-15, comprising the 3₁₀-helix, for the different states; the results are shown in figure 3.20B.

Figure 3.20: A) Clusters of Trp-cage with 8 clusters and 3D embedding (zdim=3) B) RMSD of residues 11-15, comprising the 3₁₀-helix, for selected clusters of Trp-cage

Next, we built an MSM on the 5D embedding by choosing 300 K-means points and discretizing the trajectories according to this clustering of the GMVAE embedding. The implied timescales for this transition matrix are shown in figure 3.18F; based on this, we chose a lag time of 160 ns to build the MSM. To compute the mean first passage times (MFPTs) between the GMVAE clusters, we coarse-grained the 300-state transition matrix into 8 states corresponding to the GMVAE clusters. The folding and unfolding times based on the coarse-grained Markov model are 11.62 and 4.85 μs, respectively. These are in good agreement with the values reported by Lindorff-Larsen et al.,[180] who obtained folding and unfolding times of 14.4 and 3.1 μs for this protein from the average lifetimes of the folded and unfolded states observed in the trajectories, using a contact-based definition of the folded and unfolded states. A visualization of the 8 metastable states found by the GMVAE model is shown in figure 3.19. The arrows between states show the transitions between conformations, and the arrow thickness relates to the transition probability obtained by coarse-graining the Markov model into the 8 GMVAE clusters. The native folded state S5 accounts for about 18% of the total distribution and the unfolded ensemble represents the remaining 82%. Folding mostly proceeds via the molten globule state S0 or the near-folded state S4.

Figure 3.21: A) 2D embedding of BBA colored by RMSD to the folded state B) 2D free energy landscape of BBA based on the 2D embedding C) Clusters in the 2D embedding of BBA using kNN for cluster assignment D) t-SNE visualization of the 6D latent space colored by the argmax of the cluster assignment probabilities (only points with more than 0.75 membership probability are shown) E) Histograms of RMSD for the different clusters F) ITS plot based on the 6D latent space

3.3.6 BBA

The second example is the ββα-fold protein (BBA), a 28-residue fast-folding protein. The NMR structure of this protein is shown in figure 3.15B.
This protein contains an antiparallel β-sheet at its N terminus and a helical conformation at its C terminus. To find the optimal number of clusters, we first trained the model with 10 clusters and only 9 clusters were recovered using the 0.95 cutoff. Next, we trained the model with 9 clusters and found that all clusters had probabilities above our cutoff. We also observed that training the model with different hyperparameters yields the same number of clusters. To better visualize the latent space, we trained the model with 2 dimensions. The resulting latent space, colored by RMSD with respect to the folded state, is shown in figure 3.21A; unfolded and folded states are well separated on this 2D embedding.

Figure 3.22: BBA transitions; the arrows show the transitions between clusters and the arrow thickness represents the transition probability between the corresponding clusters. Transitions with probabilities less than 0.1 are not shown for clarity.

The free energy landscape on this embedding is shown in figure 3.21B. All clusters reside in the wells of the free energy landscape. There are also some diffuse, high-energy regions between the wells, which correspond to transitions between different metastable states; these regions are also where the model is least certain about the cluster assignment. To transform the fuzzy clustered output of the GMVAE into a hard cluster assignment, we used a k-nearest-neighbors algorithm and assigned each point to the most likely cluster in its neighborhood using 500 neighbors. The result is shown in figure 3.21C, which exhibits highly separated, non-overlapping clusters in the 2D embedding. In this embedding, state 8 corresponds to the folded state, state 6 is the near-folded (misfolded) state, and all other states are unstructured or unfolded conformations. The highly non-overlapping clusters in the GMVAE landscape showcase the ability of this model to separate a vastly diverse set of protein conformations from a protein folding trajectory. The 2D latent embedding cannot fully capture the complex folding landscape; therefore, we optimized the latent space dimension based on the cross-validated reconstruction loss in figure 3.17B. Based on this result, we used a 6-dimensional latent space for the rest of our analysis. The t-SNE visualization of this 6-dimensional landscape is shown in figure 3.21D. We studied the structural properties of each cluster by sampling 5000 data points from its center. Figure 3.21E shows the RMSD distribution of each cluster with respect to the folded state. Cluster 0 is the folded state, with the sharpest and lowest RMSD distribution; the other clusters have wider and higher RMSD distributions and correspond to misfolded or unfolded states. Representative structures for each cluster are shown in figure 3.22. We also investigated the structural features of each cluster in more detail by calculating the RMSD distribution of specific domains in BBA. Figure 3.23 shows the RMSD distribution of the antiparallel β-sheet (residues 7 to 14, left panel) and of the α-helical part of BBA (residues 16 to 26, right panel) with respect to the folded structure. The folded state (cluster 0) has the lowest RMSD in both domains, while cluster 4 has a low RMSD in the antiparallel β-sheet domain but a higher RMSD in the α-helical domain.

Figure 3.23: A) RMSD of the antiparallel β-sheet residues (7-14) with respect to the folded state of BBA B) RMSD of the α-helical domain of BBA (residues 16 to 26) with respect to the folded state, for the different clusters
To build a Markov model on this embedding, we first clustered the embedding using 500 K-means centers and discretized the trajectories onto these points. To choose the proper lag time for the MSM, we plotted the implied timescales (figure 3.21F), picked 220 ns and built the transition probability matrix. Next, to compute the transition timescales between the GMVAE clusters, we assigned each of the 500 K-means clusters to the closest GMVAE cluster and then computed the mean first passage times (MFPTs) between clusters. The folding and unfolding timescales calculated here are 15.2 and 7.42 μs, respectively, which are in close agreement with the values reported by the D. E. Shaw group.[180] Figure 3.22 illustrates the representative structures of each cluster, sampled from the mean of each distribution in the latent space. The transitions between states are shown with arrows, where the width of each arrow represents the transition probability.

3.3.7 Villin

The last example is the 35-residue villin headpiece subdomain, one of the smallest proteins that can fold autonomously. It is composed of three α-helices, denoted helix 1 (residues 4-8), helix 2 (residues 15-18) and helix 3 (residues 23-32), and a compact hydrophobic core. The observed experimental folding timescale for wild-type villin is about 4 μs, and the replacement of two lysine residues (Lys65 and Lys70) with uncharged norleucine (Nle) yields a mutant with a folding time of less than one microsecond.[158] The folding landscape of the villin double mutant has been studied both by experiments and by computer simulations.[280, 167, 62, 16] The folding of a double mutant of villin was studied using long-timescale molecular dynamics by the D. E. Shaw group and is used here.[180] The number of clusters for villin was found as described for the other proteins: we started with 7 clusters and found that only 6 clusters were recovered using the 0.95 cutoff on cluster probability. The latent embedding using a 3D latent space is shown in figure 3.24A, where each point is colored by RMSD with respect to the folded structure. The first two dimensions of this 3D embedding, colored by RMSD, are shown in figure 3.24B.

Figure 3.24: GMVAE embedding results for villin. A) 3D latent space (zdim=3) colored by RMSD B) First two dimensions of the 3D latent space colored by RMSD C) FEL based on the first 2 dimensions of the latent space D) t-SNE plot for the 5D latent space (only points with more than 0.75 membership probability are shown) E) Distribution of RMSD for villin with respect to the folded state F) ITS plot for the Markov model construction based on the 5D embedding

Figure 3.25: Transitions between the different states in the villin headpiece simulation. The thickness of the arrows corresponds to the transition probability between the two states. Transitions with less than 0.1 probability are not shown for clarity.
The optimum latent space dimension for villin was found to be 5 (figure 3.17C). The other hyperparameters for villin were optimized based on a cross-validated reconstruction loss, and the chosen hyperparameters are shown in table 1. The t-SNE visualization of this 5D latent space is shown in figure 3.24D, which shows highly separated clusters. Figure 3.24E shows the RMSD distribution of each cluster in the 5D latent space with respect to the folded structure. Cluster 3 corresponds to the folded state, where the RMSD distribution is the narrowest and smallest. Figure 3.25 shows the representative structure of each cluster in the 5D latent space.

Structural properties of specific domains in different clusters were studied using the RMSD distributions of helices 1, 2 and 3 with respect to the folded structure. The results are shown in figure 3.26. Each cluster has a different, approximately Gaussian distribution for the helical residues of the protein. Cluster S0 has a low RMSD for helices 1 and 2 but higher RMSD values for helix 3. Secondary structure calculations showed that S0 has a folded helix 1 and helix 2 but an unfolded helix 3. Most clusters have a folded or near-folded helix 1, except for cluster S4. Cluster S3 is the folded state, where all helices are folded with more than 80% probability. Helix 3 is only folded in S3 and S5, which shows the importance of this helix in the proper folding of villin.

Next, we built a Markov model on this embedding by choosing 500 K-means cluster centers for discretizing the trajectories. The implied timescales for this discretization are shown in figure 3.24F. A lag time of 220 ns was chosen to build the transition matrix. The 500 K-means clusters were then assigned to their nearest GMVAE clusters to build a coarse-grained transition matrix. The folding and unfolding times obtained from the MSM constructed on this embedding are 2.25 µs and 1.54 µs, respectively, which are in good agreement with the values reported by the D. E. Shaw group (2.8 µs) and by others building a Markov model using TICA.[180, 286, 223]

Figure 3.26: RMSD distributions of helices 1, 2 and 3 in different clusters with respect to the folded state. From left to right, the panels correspond to helices 1, 2 and 3 in the villin headpiece.

Figure 3.25 shows the structures of each cluster and the transition probabilities between different states. The highest transition probability, S3 → S0, corresponds mostly to unfolding of helix 3. Therefore, proper folding of helix 3 leads to the formation of native contacts and native helices. Piano et al.[231] studied the double mutant (Nle/Nle) of villin and found a sparsely populated intermediate that involved formation of helix 3 and the turn between helices 2 and 3. This corresponds to cluster S2 in our analysis, which has a near-folded helix 3. Mori and coworkers[207] studied the molecular mechanics of folding of villin and the Nle/Nle double mutant. They found that the Lys → Nle mutations speed up the folding transition by rigidifying helix 3.

3.3.8 Discussion and Conclusion

Here we demonstrated the use of a deep learning algorithm, the Gaussian mixture variational autoencoder (GMVAE), to help analyze and interpret the highly complex landscape of protein folding trajectories. The variational autoencoder framework has been used extensively in the field of molecular dynamics simulations for dimensionality reduction,[24, 302] enhanced sampling [244, 26] and collective variable discovery.[47, 266, 50]
Noé and coworkers proposed a time-lagged autoencoder (TAE) that can find a low-dimensional embedding for high-dimensional data while capturing the slow dynamics of the underlying processes.[325] However, Ferguson et al.[48] showed that the TAE is limited in finding the optimal embedding for a dynamical system and in general finds a mixture of slow and maximum-variance modes. Ward et al. introduced DiffNets, deep autoencoders that identify structural features for predicting biochemical differences between protein variants from MD simulation trajectories.[319]

The GMVAE model acknowledges the multi-basin nature of protein folding by enforcing a mixture of multiple Gaussians as the prior for the variational autoencoder. We applied our model to three long-timescale protein folding trajectories, namely Trp-cage, BBA and the villin headpiece, all of which have been extensively characterized in previous studies.[180] In all cases, we showed that the model is able to characterize different structural features that correspond to folded, misfolded or unfolded states. The low-dimensional embedding obtained by GMVAE for these proteins resembles the folding funnel, where the folded states lie at the bottom of the funnel and the unfolded ensemble lies outside it. This can be understood intuitively from the point of view of conformational entropy. The unfolded state has larger structural variations, which causes the variance of the Gaussian learned by GMVAE to be larger than that of the folded cluster, which has a narrower distribution. This, along with the continuity of the latent space, makes the landscape funnel-shaped.

To verify that the clusters obtained by GMVAE correspond to different structural features of proteins during folding, we computed the global and local RMSD of each cluster with respect to the folded structure. As expected, the RMSD distribution of each cluster is approximately Gaussian, where the folded state has the lowest and narrowest RMSD distribution and the unfolded (extended) structures have the highest and widest RMSD distributions.

We used normalized distance maps as the features in our machine learning model, which are a practical way to represent a protein simulation dataset. Other features such as contact maps can also be used as input to the model, but they would give a lower-resolution embedding because contact maps carry less information than distance maps. In our model, we used convolutional operations, which are well suited to recognizing and processing image-like data.

It is worth noting that our GMVAE model is different from a simple Gaussian mixture model (GMM). In a GMM, the parameters of the model are optimized iteratively through the expectation-maximization algorithm.[77] GMMs have been used to cluster the FEL of proteins. Delemotte et al. used a GMM to construct and cluster the FEL of Ca2+ binding to calmodulin and found a novel pathway involving salt-bridge breakage and formation.[326] However, a GMM requires a few handcrafted features, and a high number of collective variables can lead to over-fitting. On the other hand, since the GMVAE model is trained by gradient descent and is a deep learning architecture, it does not suffer from the same shortcomings as a GMM.
Unlike the GMVAE model proposed by Varolgüneş et al.,[302] which learns the cluster assignment through a stochastic layer, we replace this with a deterministic layer using the Gumbel-softmax distribution, which makes the model end-to-end differentiable and leads to better performance.[98, 142] The temperature parameter of the Gumbel-softmax was tuned along with the other model hyperparameters during training. The best hyperparameters for each protein were chosen based on a cross-validated reconstruction loss.

The number of clusters is a hyperparameter of the GMVAE. To find an optimum number of clusters, we first start with a high estimate of the number of clusters for each protein. Then, using a cutoff on the cluster assignment probability, we find the number of clusters with membership probability higher than the defined cutoff. Next, we train the model with the number of clusters recovered in the previous step. We showed that at this stage all clusters have membership probabilities higher than the chosen cutoff (0.95), which means the model has converged to the optimum number of clusters for the system. Notably, the number of recovered clusters was the same regardless of the other hyperparameters in the model. However, the number of clusters can depend on the chosen cutoff. This can be viewed as a hierarchical clustering, where the clustering resolution correlates with the cutoff value, so that at lower resolution different structures are embedded in the same cluster. The latent space dimension is another important hyperparameter that needs to be optimized. To find the optimum latent space dimension for each protein, we calculated a cross-validated reconstruction loss for different values of the latent space dimension. The reconstruction loss decreases as the latent space dimension increases until it reaches a plateau; for each protein, we pick the latent space dimension where the reconstruction loss reaches this plateau.

Beyond the static characterization of the protein folding trajectories, we tested whether the model is able to characterize the kinetics of protein folding. We built a high-resolution Markov model on the embedding obtained by GMVAE and computed the MFPTs between different states. Interestingly, the folding timescales obtained by the model are in good agreement with the folding times reported by other groups constructing an MSM on a TICA landscape, which characterizes the dynamics of folding. We note that our model does not use any lag time to construct the low-dimensional embedding; nevertheless, it is able to describe the folding timescales with reasonable accuracy. However, for some of the most dynamic proteins, such as villin with its fast folding timescales, only the first two implied timescales converge after 220 ns, and the other implied timescales fall below the resolution limit of the model, which makes the model unable to give meaningful information about these faster processes. This might be remedied by adding dynamical information to the model through a lag time in the training process. Further improvements to the model could include a graph embedding of protein structures instead of a distance map. This will be studied in our future work.
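As a side note on the deterministic cluster-assignment layer discussed in this section, a minimal PyTorch sketch using the built-in Gumbel-softmax relaxation might look like this; the encoder features and temperature schedule are placeholders.

```python
# Minimal sketch of a Gumbel-softmax cluster-assignment layer: the relaxation
# keeps the assignment differentiable during training. The encoder producing
# the hidden features and the temperature value are illustrative assumptions.
import torch
import torch.nn.functional as F

class ClusterAssignment(torch.nn.Module):
    def __init__(self, hidden_dim: int, n_clusters: int, temperature: float = 0.5):
        super().__init__()
        self.logits = torch.nn.Linear(hidden_dim, n_clusters)
        self.temperature = temperature   # tuned together with the other hyperparameters

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        cluster_logits = self.logits(h)
        # soft, differentiable membership probabilities; hard=True would return one-hot samples
        return F.gumbel_softmax(cluster_logits, tau=self.temperature, hard=False)

# usage: y = ClusterAssignment(hidden_dim=64, n_clusters=9)(encoder_features)
```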
3.4 GraphVAMPNet, using graph neural networks and variational approach to Markov processes for dynamical modeling of biomolecules¹

Finding low-dimensional representations of data from long-timescale trajectories of biomolecular processes such as protein folding or ligand-receptor binding is of fundamental importance, and kinetic models such as Markov models have proven useful in describing the kinetics of these systems. Recently, an unsupervised machine learning technique called VAMPNet was introduced to learn the low-dimensional representation and the linear dynamical model in an end-to-end manner. VAMPNet is based on the variational approach to Markov processes (VAMP) and relies on neural networks to learn the coarse-grained dynamics. In this contribution, we combine VAMPNet and graph neural networks to generate an end-to-end framework that efficiently learns high-level dynamics and metastable states from long-timescale molecular dynamics trajectories. This method has the advantages of graph representation learning and uses graph message passing operations to generate an embedding for each datapoint, which is used in the VAMPNet to generate a coarse-grained dynamical model. This type of molecular representation results in a higher-resolution and more interpretable Markov model than the standard VAMPNet, enabling a more detailed kinetic study of biomolecular processes. Our GraphVAMPNet approach is also enhanced with an attention mechanism to find the residues that are important for classification into the different metastable states.

¹Taken from a published paper: Ghorbani, M., Prasad, S., Klauda, J. B., Brooks, B. R. (2022). GraphVAMPNet, using graph neural networks and variational approach to Markov processes for dynamical modeling of biomolecules. The Journal of Chemical Physics, 156(18), 184103.

3.4.1 Introduction

Recent advances in computer hardware and software have enabled the generation of extensive, high-throughput molecular dynamics (MD) trajectories.[180, 272] These facilitate the thermodynamic and kinetic study of biomolecular processes such as protein folding, protein-ligand binding and conformational dynamics, to name a few. These simulations often produce large amounts of high-dimensional data, which require rigorous techniques for analyzing the thermodynamics and kinetics of the molecular processes. In recent years, the Markov state modeling approach [289, 139, 200, 237] has been greatly developed and used for understanding the long-timescale behavior of dynamical systems, and state-of-the-art software packages such as PyEMMA [263] and MSMBuilder [121] have been introduced. Markov state models provide a master equation that describes the dynamic evolution of the system using a simple transition matrix.[263] Markovianity in these systems means the kinetics are modeled by memoryless jumps between states in the state space. Combined with the advances in MD simulations, the framework for MSM construction has matured into a robust set of methods for analyzing a dynamical system.
In an MSM, the molecular conformation space is discretized into coarse-grained states, where the inter-conversion between microstates within a macrostate is fast compared to transitions between different macrostates.[274] Markov state models have previously been used to investigate the kinetics and thermodynamic properties of biophysical systems such as protein folding,[216, 29, 305, 17] protein-ligand binding [125, 236] and protein conformational changes.[154, 8, 253]

There are several steps in the pipeline of Markov model construction. The first step involves featurization, where relevant MD coordinates such as distances, contact maps or torsion angles are chosen.[199] This is followed by a dimension reduction step that retains the slow collective variables using methods such as time-lagged independent component analysis (TICA) [228, 269], dynamic mode decomposition (DMD) [202, 265, 296] or other variants of these techniques. The resulting low-dimensional space is then discretized into discrete states,[60, 330, 324] usually with K-means clustering.[263] A transition matrix is then built on the discretized trajectories, which describes the time evolution of the processes at a chosen lag time.[237, 29] This transition matrix can be further processed through its eigendecomposition to find the equilibrium and kinetic properties of the system.[237] Finally, fuzzy clustering methods such as PCCA are often used to produce a more interpretable coarse-grained model.[245]

As noted above, there are multiple steps where hyperparameters must be carefully chosen to construct the Markov model. The quality of the constructed MSM is highly dependent on these steps, which has motivated much research into optimizing the MSM pipeline using various techniques.[237, 215, 269, 138, 200, 137, 262] Moreover, complex dynamical systems require optimal choices of model parameters, which in turn require physical and chemical intuition about the system; suboptimal choices can lead to poor results in learning the dynamics from the trajectory. Recently, a variational approach for conformational dynamics (VAC) has been proposed, which helps in the selection of optimal Markov models by defining a score that measures how well a given Markov model captures the true kinetics.[332, 215, 262, 218, 200] VAC states that, given a set of n orthogonal functions of the state space, their time autocorrelations at lag time τ are lower bounds to the true eigenvalues of the Markov operator.[218] This is equivalent to underestimating the relaxation timescales and overestimating the relaxation rates.[332, 215, 218] Before VAC, the tools to diagnose the performance of MSMs were mainly visual, such as the implied timescale plot (ITS) [289] and the Chapman-Kolmogorov (CK) test.[139] The variational approach enabled the objective comparison of different model choices at the same lag time.[215]
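For concreteness, the conventional pipeline summarized above (featurization, TICA, K-means, MSM estimation and PCCA coarse-graining) can be sketched with PyEMMA; the file names and all hyperparameter values are illustrative placeholders, not settings used in this work.

```python
# Sketch of the conventional MSM pipeline: featurize -> TICA -> K-means -> MSM -> PCCA.
import pyemma

feat = pyemma.coordinates.featurizer("protein.pdb")
feat.add_distances_ca()                                    # pairwise Calpha distances
data = pyemma.coordinates.load("traj.xtc", features=feat)

tica = pyemma.coordinates.tica(data, lag=100, dim=5)       # keep slow collective variables
kmeans = pyemma.coordinates.cluster_kmeans(tica.get_output(), k=200)
msm = pyemma.msm.estimate_markov_model(kmeans.dtrajs, lag=100)
msm.pcca(4)                                                # fuzzy coarse-graining into 4 macrostates
print(msm.timescales()[:3])                                # leading implied timescales
```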
The VAC has recently been generalized to the variational approach for Markov processes (VAMP).[332, 331] VAMP was proposed for the general case of irreversible and non-stationary time series data and is based on a singular value decomposition of the Koopman operator.[331]

Using this variational principle, Mardt and coworkers introduced VAMPNets to replace the whole pipeline of Markov model construction with a deep learning framework.[194] A VAMPNet maps the configuration x to a low-dimensional fuzzy state space where each timepoint has a membership probability for each of the metastable states. VAMPNets are then optimized with a variational score (VAMP-2). This framework was further developed by directly learning the eigenfunctions of the spectral decomposition of the transfer operator that propagates the equilibrium probability distribution in time, with state-free reversible VAMPNets (SRVs).[49] Physical constraints of reversibility and stochasticity were later added to the VAMPNet model to obtain a valid transition matrix, enabling the computation of transition rates and other equilibrium properties from out-of-equilibrium simulations.[193, 192] However, a proper representation of the molecules is not discussed in these models, and traditional distance matrices or contact maps are often used.

Conformational heterogeneity of proteins during folding can complicate the selection of features for building a dynamical model; this is even more true for disordered proteins.[188] Obtaining an interpretable few-state kinetic model of protein folding from MD trajectories is still highly desirable and can be achieved with MSM approaches using carefully chosen features. For this, one would choose a set of features for the system such as distances, dihedral angles or the root-mean-square deviation with respect to some reference structure. Using more complicated feature functions, such as convolutional layers applied to distance matrices, has been proposed to enhance the kinetic resolution of the model.[188]

Graph neural networks have previously been used for molecular feature representation and are a promising tool in a variety of applications to predict properties, energies and forces of the system of interest.[268, 136] Battaglia first introduced graph neural networks with convolutional operations and graph message passing.[13, 152] Currently, there are various types of graph neural networks that differ in how the message passing operations are performed between nodes and edges and how the output of the network is generated. Traditionally, distance maps or contact maps were used to represent the structure of molecules. A more natural way of representing proteins is with graphs, where nodes represent atoms and edges represent the (real or virtual) bonds connecting them. This representation is rotationally invariant by construction. Recent advances in protein structure prediction have greatly exploited progress in geometric deep learning [36] and graph neural networks.[152, 37]

Combining the VAMPNet framework with a graph neural network improves the kinetic resolution of the resulting low-dimensional model, so that a smaller lag time can be chosen to build the transition matrix. In the original VAMPNet, the dynamics is directly coarse-grained into a few states without learning a low-dimensional latent-space representation. Using graph neural networks, however, we show that the learned graph embeddings represent useful information about the dynamical system.
Furthermore, using a graph attention network [304] gives useful insights into the importance of different nodes and edges for the different metastable states. An illustration of our GraphVAMPNet model is shown in figure 3.27.

Figure 3.27: Overview of the architecture of the GraphVAMPNet method. Given a molecular structure at time t and at a lag time later, t + τ, molecular graphs are built using the nearest neighbors of the chosen atoms. Several graph convolution operations are performed, resulting in a representation for each node. A hierarchical pooling is done to find a latent representation of the full graph, which is concatenated between times t and t + τ. The full network is then optimized by maximizing a VAMP-2 score.

3.4.2 Methods

A Markov model estimates the dynamics by a transition density, which is the probability density of a transition to state y at time t + τ given that the system was in state x at time t:

$$p_\tau(x, y) = P(x_{t+\tau} = y \mid x_t = x) \tag{3.16}$$

where x and y are two states of the system and τ is the lag time of the model from which the transition probability density P is built. Using this definition of the transition density, the time evolution of the ensemble of states in the system can be written as

$$p_{t+\tau}(y) = (\mathcal{P}_\tau \, p_t)(y) = \int p_\tau(x, y)\, p_t(x)\, dx \tag{3.17}$$

In this equation, $\mathcal{P}_\tau$ acts as a propagator that propagates the dynamics of the system in time. However, this definition of the propagator assumes a reversible and stationary dynamical system.[237] For the general case of non-reversible and non-stationary dynamics, the Koopman operator is used.[332] Koopman theory enables feature transformations into a space where the dynamics evolve, on average, linearly. The Koopman operator acts like a transition matrix for non-linear dynamics and describes the conditional future expectation values for a fixed lag time τ. In Koopman theory, the Markov dynamics at a lag time τ are approximated by a linear model of the following form:

$$\mathbb{E}[\chi_1(x_{t+\tau})] \approx K^T\, \mathbb{E}[\chi_0(x_t)] \tag{3.18}$$

In the equation above, $\chi_0(x) = (\chi_{01}(x), \ldots, \chi_{0m}(x))^T$ and $\chi_1(x) = (\chi_{11}(x), \ldots, \chi_{1m}(x))^T$ are feature transformations to a space where the dynamics evolve, on average, linearly. This approximation is exact in the limit of infinite-dimensional feature transformations; however, it was shown that, given a large enough lag time τ, low-dimensional feature transformations can become optimal.[332] Equation 3.18 can be interpreted as a finite-rank approximation of the so-called Koopman operator.[203] The optimal Koopman matrix that minimizes the regression error of equation 3.18 is

$$K = C_{00}^{-1} C_{0\tau} \tag{3.19}$$

where the mean-free covariance matrices of the transformed data are defined as

$$C_{00} = \mathbb{E}[\chi_0(x_t)\,\chi_0(x_t)^T], \quad C_{0\tau} = \mathbb{E}[\chi_0(x_t)\,\chi_1(x_{t+\tau})^T], \quad C_{\tau\tau} = \mathbb{E}[\chi_1(x_{t+\tau})\,\chi_1(x_{t+\tau})^T] \tag{3.20}$$

However, the regression error carries no information about the choice of feature transformations $\chi_0$ and $\chi_1$ and can lead to trivial solutions for these transformations.[194] VAMP, on the other hand, provides useful scoring functions that can be used to find optimal feature transformations. VAMP is based on a singular value decomposition of the Koopman operator, is used to optimize the feature functions, and is not limited to time-reversible and stationary dynamics. VAMP states that, given a set of orthogonal candidate functions, their time-autocorrelations are lower bounds to the true Koopman eigenvalues. This provides a variational score, such as the sum of estimated eigenvalues, that can be maximized to find the optimal kinetic model.
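Before turning to the VAMP score, a bare NumPy sketch of the Koopman-matrix estimate in equations 3.18-3.20 is given below; the chi_t and chi_tau arrays are hypothetical outputs of a feature transformation such as a network lobe.

```python
# Sketch of the Koopman-matrix estimate K = C00^{-1} C0tau from mean-free
# feature transformations of the trajectory at times t and t+tau.
import numpy as np

def estimate_koopman(chi_t: np.ndarray, chi_tau: np.ndarray) -> np.ndarray:
    """chi_t, chi_tau: (n_frames, m) feature matrices at times t and t+tau."""
    chi_t = chi_t - chi_t.mean(axis=0)
    chi_tau = chi_tau - chi_tau.mean(axis=0)
    n = chi_t.shape[0]
    c00 = chi_t.T @ chi_t / n
    c0t = chi_t.T @ chi_tau / n
    # a small ridge term keeps the inverse well conditioned
    return np.linalg.solve(c00 + 1e-10 * np.eye(c00.shape[0]), c0t)
```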
Wu and Noé showed that the optimal choices of $\chi_0$ and $\chi_1$ in equation 3.18 are obtained from the singular value decomposition of the Koopman matrix, setting $\chi_0$ and $\chi_1$ to its top left and right singular functions, respectively.[332, 331] The VAMP-2 score is then defined as

$$\mathrm{VAMP\text{-}2} = \sum_i \sigma_i^2 = \left\| C_{00}^{-1/2}\, C_{0\tau}\, C_{\tau\tau}^{-1/2} \right\|_F^2 + 1 \tag{3.21}$$

The leading left and right singular functions of the Koopman operator are always equal to the constant function 1; therefore, 1 is added to the basis functions. The maximum VAMP-2 score is achieved when the top m left and right Koopman singular functions are in the spans of $(\chi_{01}, \ldots, \chi_{0m})$ and $(\chi_{11}, \ldots, \chi_{1m})$, respectively. VAMP-2 also maximizes the kinetic variance captured by the model.

The feature transformations $\chi_0$ and $\chi_1$ can be learned with neural networks in the so-called VAMPNet, where there are two parallel lobes, each receiving the MD configurations $x_t$ and $x_{t+\tau}$. As done in the original VAMPNet, we assume the two lobes share parameters and use a unique basis set $\chi = \chi_0 = \chi_1$. Training is done by maximizing the VAMP-2 score to learn the low-dimensional state space produced by a softmax function.

Since K is a Markovian model, it is expected to fulfill the Chapman-Kolmogorov (CK) equation

$$K(n\tau) = K^n(\tau) \tag{3.22}$$

for any value $n \geq 1$, where $K(\tau)$ and $K(n\tau)$ indicate models estimated at lag times of τ and nτ, respectively. The implied timescales of the processes are computed as

$$t_i(\tau) = -\frac{\tau}{\ln|\lambda_i(\tau)|} \tag{3.23}$$

where $\lambda_i(\tau)$ is the i-th eigenvalue of the Koopman matrix built at a lag time τ. The smallest lag time τ is chosen for which the implied timescales $t_i(\tau)$ are approximately constant in τ. After choosing the lag time τ, we test whether the CK equation holds within statistical uncertainty.
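The VAMP-2 score (equation 3.21) and the implied timescales (equation 3.23) can likewise be sketched in NumPy from the same mean-free covariances; this is only an illustration, not the training implementation.

```python
# Sketch of the VAMP-2 score and implied timescales from mean-free covariances.
import numpy as np

def vamp2_score(chi_t, chi_tau, eps=1e-10):
    chi_t = chi_t - chi_t.mean(axis=0)
    chi_tau = chi_tau - chi_tau.mean(axis=0)
    n = chi_t.shape[0]
    c00 = chi_t.T @ chi_t / n + eps * np.eye(chi_t.shape[1])
    ctt = chi_tau.T @ chi_tau / n + eps * np.eye(chi_tau.shape[1])
    c0t = chi_t.T @ chi_tau / n
    def inv_sqrt(c):
        w, v = np.linalg.eigh(c)
        return v @ np.diag(w ** -0.5) @ v.T
    k_bar = inv_sqrt(c00) @ c0t @ inv_sqrt(ctt)
    return np.linalg.norm(k_bar, "fro") ** 2 + 1.0        # equation 3.21

def implied_timescales(koopman, lag_ns):
    """t_i(tau) = -tau / ln|lambda_i(tau)| for the non-trivial eigenvalues (equation 3.23)."""
    eigvals = np.sort(np.abs(np.linalg.eigvals(koopman)))[::-1]
    return -lag_ns / np.log(eigvals[1:])                  # skip the stationary eigenvalue ~1
```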
3.4.3 Protein graph representation

Each structure is represented as an attributed graph G = (V, E), where V = {v_1, ..., v_N} are the node features and E = {e_ij} are the edge features that capture the relations between nodes. We tested different graph neural networks (GNNs) for their ability to learn higher-resolution kinetic models from MD trajectories of protein folding simulations.[335] In all of these GNNs, the node embeddings v_i are initialized randomly and the edge embeddings e_ij are the Gaussian-expanded distances between adjacent nodes:

$$e_{ij}^t = \exp\!\left(-\frac{(d_{ij} - \mu_t)^2}{\sigma^2}\right) \tag{3.24}$$

where $d_{ij}$ is the distance between atoms i and j, $\mu_t = d_{\min} + t\,(d_{\max} - d_{\min})/K$ Å for $t = 0, 1, \ldots, K$, and $\sigma = (d_{\max} - d_{\min})/K$ Å. Here $d_{\max}$ and $d_{\min}$ are the maximum and minimum distances used to construct the Gaussian-expanded edge features. Unless noted otherwise, all graphs are built using the M nearest neighbors of the Cα atoms of the protein, with edges built from the Gaussian-expanded distances between Cα atoms.

Graph convolution layer

In this type of graph neural network, the protein graph is represented as G = (V, E), where V contains the node features and E contains the edge attributes of the graph. A separate graph is constructed for the configuration at each timestep of the simulation. The initial node representations are randomly initialized; however, a one-hot vector representation based on the atom type or amino acid type can also be used. During training, the node embedding $v_i^k$ of node i at layer k is updated using the following equations:[258, 336]

$$v_i^{k+1} = v_i^k + \sum_{j \in N_i} w_{i,j}^k \odot g\!\left(z_{i,j}^k W_c^k + b_c^k\right) \tag{3.25}$$

$$w_{i,j}^k = \sigma\!\left(z_{i,j}^k W_g^k + b_g^k\right) \tag{3.26}$$

$$z_{i,j}^k = v_i^k \oplus v_j^k \oplus e_{i,j} \tag{3.27}$$

where ⊙ denotes element-wise multiplication, ⊕ denotes concatenation, σ is the sigmoid non-linearity, and g() is the edge-gating mechanism introduced by Marcheggiani [191] to incorporate different interaction strengths among neighbors into the model. $W_g^k$, $W_c^k$, $b_g^k$ and $b_c^k$ are the gate weight matrix, convolution weight matrix, gate bias and convolution bias, respectively, for the k-th graph convolution layer. To capture the embedding of the whole graph, we use graph pooling, where the graph embedding is generated from the learned node embeddings.[119, 240] The embedding of the whole graph is obtained through a pooling function that averages over the embeddings of all nodes:

$$v_G = \frac{1}{N} \sum_{i=1}^{N} v_i \tag{3.28}$$

Other types of pooling, such as hierarchical pooling, can also be applied to build a more complicated model.[344]

SchNet

Another type of GNN is SchNet, which was introduced by Schütt and others to use continuous-filter convolutions for predicting forces and energies of small molecules according to quantum mechanical models.[268, 13] This was modified by Husic et al.[136] to learn a coarse-grained representation of molecules using a graph neural network. SchNet is employed here to learn feature representations of the nodes for learning the dynamics of protein folding; it is a subunit of our model for learning feature representations of molecules at the graph level. The initial node features (embeddings) are initialized randomly, but a one-hot encoding based on the node type (amino acid) can also be used. These embeddings are learned and updated during training of the network by a few rounds of message passing through the nodes and edges of the graph. Node embeddings are updated in multiple interaction blocks, as implemented in the original SchNet.[268] Each interaction layer contains a continuous convolution between nodes. The edge attributes are obtained using radial basis functions, e.g., Gaussians centered at different distances:

$$e_{ij} = \exp\!\left(-\gamma\,(d_{ij} - \mu)^2\right) \tag{3.29}$$

These edge attributes are then fed into a filter-generating network w that maps $e_{ij}$ to a $d_h$-dimensional filter. This filter is then applied to the node embeddings as a continuous-filter convolution:

$$z_i^k = \sum_{j \in N(i)} \alpha_{ij}^k\, w^k(e_{ij}) \odot b^k(h_j^k) \tag{3.30}$$

$$\alpha_{ij}^k = \frac{\exp\!\left(z_{ij}^k W_a^k\right)}{\sum_{j' \in N(i)} \exp\!\left(z_{ij'}^k W_a^k\right)} \tag{3.31}$$

$$z_{ij}^k = w^k(e_{ij}) \odot b^k(h_j^k) \tag{3.32}$$

Here w is a dense neural network and b is an atom-wise linear layer, as noted in the original paper.[268] Note that the sum runs over every atom j in the neighborhood of atom i. Multiple interaction blocks allow all atoms to interact with each other in the network and therefore allow the model to express complex multi-body interactions. We enhanced the standard SchNet architecture by adding an attention layer that learns the importance of the edge embeddings for updating the embedding of the receiving node in the next layer. The attention weight $\alpha_{ij}$ is learned using a softmax over the neighbors $j \in N(i)$ of the query node i, based on the node and edge embeddings. The node embeddings are updated in each interaction block, which can contain a residual connection to avoid vanishing gradients, as done in deep residual networks.[122] The residual connection is followed by a nonlinear activation applied to the output $z_i^k$ of the continuous-filter convolution:

$$h_i^{k+1} = h_i^k + g^k(z_i^k) \tag{3.33}$$

The trainable function g involves linear layers and a nonlinearity. We used a hyperbolic tangent as the activation, as proposed by Husic et al.[136] The output of the final SchNet interaction block is fed into an atom-wise dense network. The embeddings of the nodes learned after several SchNet layers are then fed into a pooling layer, as described previously, to produce a graph embedding for each timestep.
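A small sketch of the graph construction described in this section, with the M-nearest-neighbor Cα edges and the Gaussian-expanded distances of equation 3.24, is shown below; the parameter values follow the Trp-cage settings of table 3.3, and the coordinate array is a placeholder.

```python
# Sketch of the per-frame molecular graph: the M nearest Calpha neighbors
# define the edges, and each edge distance is expanded on a grid of Gaussians.
import numpy as np

def build_graph(ca_xyz, n_neighbors=7, d_min=2.0, d_max=8.0, n_gauss=12):
    """ca_xyz: (n_residues, 3) Calpha coordinates of one frame, in Angstrom."""
    dist = np.linalg.norm(ca_xyz[:, None, :] - ca_xyz[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :n_neighbors]       # (N, M) edge index

    centers = np.linspace(d_min, d_max, n_gauss)                 # grid of mu_t values
    sigma = (d_max - d_min) / n_gauss
    d_edges = np.take_along_axis(dist, neighbors, axis=1)        # (N, M) edge distances
    edge_feat = np.exp(-((d_edges[..., None] - centers) ** 2) / sigma ** 2)
    return neighbors, edge_feat                                  # (N, M), (N, M, n_gauss)
```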
3.4.4 Model selection and hyperparameters

In GraphVAMPNet, instead of traditional features such as dihedral angles, distance matrices and contact maps, we use a general graph neural network, a more natural representation of molecules and proteins. We implemented two different graph neural networks (GraphConvLayer and SchNet). The GraphVAMPNet built from each of these GNN layers has several model hyperparameters, including the dimension of the feature space (number of output states) and the lag time τ. To resolve k − 1 relaxation timescales, we need at least k output neurons in the last layer of the network, since the softmax function removes one degree of freedom. The models are trained by maximizing the VAMP-2 score on the training set, and hyperparameters are optimized using a cross-validated VAMP-2 score on the validation set, with a 0.7/0.3 split between training and validation. To have a fair comparison between different feature representations, we trained models with the same number of layers (4) and the same number of neurons per layer (16). In general, increasing the dimension of the feature space makes the dynamical model more accurate, but it may result in overfitting when the dimension is very large. A higher-dimensional feature space is also harder to interpret, since the model seeks a low-dimensional representation. Therefore, in this study, we use 5-state output models unless stated otherwise. There are multiple hyperparameters in the model that must be selected, including the GNN architecture, the number of clusters, the number of neighbors used to build the graph, and the time step for analyzing the simulation. Here we used 5 clusters for Trp-cage and NTL9 and 4 clusters for villin. In the case of villin, using a 5-state model led to finding only 4 states after applying a cutoff of 0.95 on the cluster probabilities, which is why we used a value of 4 for this protein. The hyperparameters chosen for each protein are shown in table 3.3.

Table 3.3: Hyperparameters for each system in this study. dmin, dmax and the number of Gaussians are the parameters for building the Gaussian-expanded distances (equation 3.24).

system   | graph layers | neurons | clusters | batch size | learning rate | residues | neighbors | dmin | dmax | Gaussians
TrpCage  | 4            | 16      | 5        | 1000       | 0.0005        | 20       | 7         | 2    | 8    | 12
Villin   | 4            | 16      | 4        | 1000       | 0.0005        | 35       | 10        | 2    | 10   | 16
NTL9     | 4            | 16      | 5        | 1000       | 0.0005        | 39       | 10        | 2    | 12   | 20

Table 3.4: Average VAMP-2 score for each system.

system   | Standard VAMPNet | GraphConvLayer | SchNet
TrpCage  | 4.68 ± 0.08      | 4.76 ± 0.03    | 4.79 ± 0.01
Villin   | 3.74 ± 0.02      | 3.74 ± 0.06    | 3.78 ± 0.02
NTL9     | 4.67 ± 0.03      | 4.50 ± 0.41    | 4.80 ± 0.03

3.4.5 Results

We tested the performance of our GraphVAMPNet method on three different protein folding systems: Trp-cage (PDB: 2JOF),[11] villin (PDB: 2F4K) [157] and NTL9 (PDB: 2HBA).[57] The graph neural network was implemented using PyTorch, and the deeptime [131] package was used for the VAMPNet. PyEMMA [263] was used for the free energy landscape plots. Adam was used as the optimizer in all models.
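As an illustration of the free-energy-landscape and state-assignment plots referred to throughout the results, a PyEMMA-based sketch might look as follows; the embedding and probability arrays are hypothetical outputs of the trained 2D model.

```python
# Sketch of the FEL and state-assignment plots: the 2D graph embedding is
# histogrammed into a free energy surface, and only frames with a confident
# state assignment are colored. The .npy files are placeholder model outputs.
import numpy as np
import pyemma.plots
import matplotlib.pyplot as plt

embedding_2d = np.load("graph_embedding_2d.npy")      # (n_frames, 2)
state_probs = np.load("state_probabilities.npy")      # (n_frames, n_states)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
pyemma.plots.plot_free_energy(embedding_2d[:, 0], embedding_2d[:, 1], ax=ax1)

confident = state_probs.max(axis=1) > 0.95             # color only confident frames
ax2.scatter(embedding_2d[confident, 0], embedding_2d[confident, 1],
            c=state_probs[confident].argmax(axis=1), s=1, cmap="tab10")
fig.savefig("fel_and_states.png", dpi=300)
```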
A GNN provides a framework for learning feature transformations in VAMPNets that are invariant to permutation, rotation and reflection. Moreover, graph embeddings that correspond to the different dynamic states sampled during the simulation can be learned within the GraphVAMPNet framework. To visualize the graph embeddings in 2D, we also transformed the graph embeddings in the last layer of GraphVAMPNet into 2D and trained the model by maximizing the VAMP-2 score. The free energy landscape on the graph embeddings shows highly separated metastable states divided by high-energy transition regions. The low-energy metastable states correspond to regions of high-fidelity metastable assignment, with probabilities higher than 0.95. It is important to note that this is done only for visualization purposes; higher-dimensional (16) embeddings are used for finding the metastable states in these complex protein folding systems. Furthermore, the present results do not depend on enforcing reversibility on the learned transition matrix. However, this can be done by Koopman re-weighting [332] or by learning the re-weighting vectors during training in the VAMPNet framework.[193]

Figure 3.28: Trp-cage system. A) Implied timescale (ITS) plot for SchNet as the feature transformation in VAMPNet (errors are taken from the 95% confidence interval over 10 different trainings). B) Free energy landscape (FEL) in a 2D graph embedding. C) CK-test for SchNet using a lag time of 20 ns. D) State assignment of the 2D graph embedding using a 0.95 cutoff.

Figure 3.29: Training and validation losses for the SchNet-based VAMPNet of Trp-cage.

Although we tried different graph convolution networks in the GraphVAMPNet approach, our results showed that SchNet has the best performance, with the highest VAMP-2 score among all of them. Therefore, we present the results of SchNet in the main text. The average VAMP-2 scores were calculated from the validation sets of 10 different trainings for each system and compared (table 3.4). The VAMP-2 scores for SchNet on the Trp-cage training and validation sets are plotted against the training epoch and show converging behavior after 100 epochs (figure 3.29). A comparison of implied timescales between the standard VAMPNet and the SchNet-based GraphVAMPNet is shown in figure 3.36 and table 3.5.

Table 3.5: Implied timescales calculated for TrpCage (at a lag time of 20 ns), Villin (at a lag time of 20 ns) and NTL9 (at a lag time of 200 ns) from the SchNet-based GraphVAMPNet and the standard VAMPNet.

ITS              | TrpCage    | Villin     | NTL9
1-GraphVAMPNet   | 1917 ± 28  | 1138 ± 23  | 14,682 ± 935
1-VAMPNet        | 1800 ± 101 | 697 ± 108  | 15,013 ± 738
2-GraphVAMPNet   | 419 ± 88   | 395 ± 10   | 1623 ± 93
2-VAMPNet        | 382 ± 51   | 381 ± 13   | 1285 ± 37
3-GraphVAMPNet   | 253 ± 85   | 82 ± 17    | 680 ± 76
3-VAMPNet        | 225 ± 57   | 65 ± 1     | 464 ± 47
4-GraphVAMPNet   | 179 ± 121  | -          | 409 ± 107
4-VAMPNet        | 186 ± 28   | -          | 288 ± 86

3.4.6 Trp-cage

We test our GraphVAMPNet model on an ultra-long 208 µs explicit-solvent simulation of the K8A mutant of the 20-residue Trp-cage TC10b at 290 K provided by the D. E. Shaw group.[180] The folded state of Trp-cage contains an α-helix (residues 2-8), a 3₁₀-helix and a polyproline II helix.[201] The tryptophan residue (Trp6) is caged at the center of the protein. A VAMPNet was built for each type of feature-learning neural network. The average VAMP-2 scores of the validation set over 10 training runs were compared between the different feature-learning neural networks (standard VAMPNet and the two graph layers) in table S1.
SchNet showed the highest average VAMP-2 score among different types of features leanings. The average VAMP-2 score for training and validation set of TrpCage for 10 different training examples is shown in figure 3.29. The VAMP-2 score shows a converging behavior after 100 epochs of training. Since SchNet showed the high- est VAMP-2 score we use this type of Graph Neural network for the rest of our analysis. The implied timescales learned using SchNet is shown in figure 3.28A. A comparison of the implied timescales between SchNet based GraphVAMPNet and standard VAMPNet is shown in figure 3.36A. Both Standard VAMPNet and SchNet show a fast convergence 128 of implied timescales. However, the implied timescales from SchNet have smaller error bars than the standard VAMPNet (table 3.4). Standard VAMPNet using distances shows higher variance in the implied timescales than the VAMPNet built using SchNet. All 4 timescales in SchNet converge after 20 ns which is the lagtime we choose to build the kinetic Koopman matrix. Moreover, a closer look at the implied timescales shows that the timescales learnt from standard VAMPNet layer are also smaller (table 3.4) than the implied timescales in SchNet. According to the variational approach for Markov pro- cesses, a model with longer implied timescale corresponds to less modeling error of the true dynamics of the system.[218] To validate the resulting GraphVAMPNet, we conduct a CK-test which compares the transition probability between pairs of states i? j at time k? predicted by a model at lagtime ? . CK-test shows (figure 3.28C) excellent prediction of transition probabilities even at large timescales k? = 200 ns at a lagtime of 20 ns. We next analyzed the resulting coarse-grained states built from VAMPNet using SchNet as feature transformation. The folded state (S4) possesses 18% of the total distribution and the un- folded state (S1) has 69.5% of the total distribution which is in great agreement with other studies on this dataset using Markov state models.[274, 104]. GraphVAMPNet produces an embedding for each timestep of the simulation which is then turned into a membership assignment using a softmax function. This higher dimensional embedding (16 dimension) can be visualized using dimensionality reduction methods such as t-SNE. To have a better visualization of the low-dimensional space learned by the model, we also trained a Graph- VAMPNet where in the last layer we linearly transformed the learned graph embedding into 2-D and trained the model by maximizing the VAMP-2 score. Other parameters were kept similar to the main SchNet. The 2-D free energy landscape (FEL) for this embedding is shown in figure 3.28B. This low energy states in the FEL correspond to the states with high cluster assignment probability. This low-dimensional FEL shows the ability of the GraphVAMPNet to produce an interpretable and highly clustered embedding of graphs for 129 Figure 3.30: A) Representative structure of each metastable state in TrpCage with their probabilities B) average attention score between C? atoms for each cluster C) averaged attention score for each residue of TrpCage in each cluster which is the scaled sum of rows. 130 simulation of proteins. The learned 2-D embedding of graphs during TrpCage Folding is shown in figure 3.28D where the states with more than 0.95 cluster assignment probability are colored. 
Enhancing SchNet with attention gives an interpretable model in which we can analyze the nodes and edges of the graph that are most important in each coarse-grained cluster. The scaled attention scores for Trp-cage are shown in figure 3.30. The cage residue Trp6 shows a high attention score in most clusters because it sits at the center of the protein and has a high number of connections in the graph. In the unfolded state (S1), most residues have high attention scores only with their close neighbors in the sequence, which reflects the high flexibility and lack of defined structure of the unfolded state. Other clusters, such as S2, show different hot-spot regions in their attention scores; in this hairpin-like structure, residues Ala4, Ser14 and Pro17, which form the groove, have high attention scores. A two-step folding mechanism has been proposed for Trp-cage that involves an intermediate state with a salt bridge between Asp9 and Arg16.[354] Breaking this salt bridge is thought to be a limiting step in the folding of Trp-cage. Interestingly, our model puts high attention scores on residues Arg16 and Asp9 in metastable state S3, which also has a 10% probability.

3.4.7 Villin

Villin is a 35-residue protein and is known as one of the smallest proteins that can fold autonomously. It is composed of three α-helices, denoted helix 1 (residues 4-8), helix 2 (residues 15-18) and helix 3 (residues 23-32), and a compact hydrophobic core. The double mutant of villin, with two Lys residues replaced by uncharged norleucine (Nle), was simulated by the D. E. Shaw group [180] and is studied here. Hernandez and coworkers [128] used a variational dynamics encoder to produce a low-dimensional embedding of villin folding trajectories using Cα distance maps. The optimized TICA for this protein used a lag time of 44 ns according to hyperparameter optimization.[137]

Figure 3.31: Villin system. A) Implied timescale (ITS) plot for SchNet as the feature transformation in VAMPNet (errors are taken from the 95% confidence interval over 10 different trainings). B) Free energy landscape (FEL) in a 2D graph embedding. C) CK-test for SchNet using a lag time of 20 ns. D) State assignment of the 2D graph embedding using a 0.75 cutoff.

We built VAMPNets with different types of feature functions and compared their average VAMP-2 scores over the validation sets of 10 different trainings. The VAMPNet based on SchNet showed the highest VAMP-2 score for the same number of states (4). The VAMPNet built using SchNet shows an extremely fast convergence of the implied timescales, even after 20 ns (figure 3.31A), which gives a high-resolution kinetic model for villin folding. A close comparison between the SchNet-based GraphVAMPNet and standard VAMPNet implied timescales is shown in figure 3.36B. The standard VAMPNet shows slow convergence of the implied timescales, where the first timescale only converges after a lag time of 40 ns (figure 3.35B). The timescales of the processes are also higher for SchNet than for the standard VAMPNet model, which again demonstrates the higher accuracy of GraphVAMPNet compared with a VAMPNet built on a simple distance matrix (table 3.5).

Figure 3.32: A) Representative structure of each metastable state in villin with their probabilities. B) Average attention score between Cα atoms for each cluster. C) Averaged attention score for each residue of villin in each cluster, which is the scaled sum of rows.
The CK-test for SchNet (figure 3.31C) shows excellent Markovian behavior at large timescales, kτ = 200 ns, for the model built using a lag time of 20 ns. GraphVAMPNet also provides a latent embedding of the graphs, which is another advantage of GNN features compared with the standard VAMPNet layer. To obtain an interpretable embedding, we trained a VAMPNet using SchNet in which the last embedding layer is linearly transformed into 2 dimensions, using parameters otherwise similar to those above. The 2D embedding of villin learned using GraphVAMPNet is shown in figure 3.31D, where datapoints with a cluster assignment probability higher than 0.75 are colored according to their corresponding state. The FEL on this 2D embedding is shown in figure 3.31B. This FEL features highly separated clusters, with low-energy minima corresponding to the centers of the clusters and transition regions having low membership assignment probabilities. The representative structures of each cluster of villin (misfolded: S0, unfolded: S1, partially folded: S2, folded: S3) are shown in figure 3.32 and are colored by the average attention score of each residue in that cluster. The folded state (S3) shows a high attention score for residue Arg13, in agreement with a previous study by Mardt et al.,[192] who used distance-map features and attention over neighboring residues. We found that residues Gln25 and Nle29 also have high attention scores in the folded state. These residues are in the central hydrophobic core of the protein and have a high number of connections in the graphs built for the folded state. The partially folded state (S2) has attention scores similar to the folded state, except for residue Lys7, which shows a high attention score in the partially folded state (S2) but not in the folded state (S3). The misfolded state (S0) has high attention scores for helix 2 residues, which is also in agreement with the work of Mardt et al.[192] In general, the N- and C-termini of the protein are given low attention scores due to their high flexibility. Hernandez and coworkers [128] used variational dynamics encoders to reduce the complex nonlinear folding of villin to a single embedding and used a saliency map to find important Cα contacts for the folding of villin. They found residues Lys29 and His27 to be important for the folding of villin; we found these residues to have high attention scores in our model for the partially folded and folded states.

3.4.8 NTL9

As our last example, we tested GraphVAMPNet on the NTL9 (residues 1-39) folding dataset from the D. E. Shaw group.[180] We uniformly sampled the 1.11 ms trajectory using a stride of 5 ns. Mardt et al.[194] previously used a 5-layer VAMPNet with contact maps between neighboring heavy atoms to coarse-grain the NTL9 simulation into metastable states. They showed that the relaxation timescales of a 5-state VAMPNet correspond to those of a 40-state MSM. Their implied timescales showed converging behavior after about 320 ns, which they chose as the lag time of the Koopman matrix.

Figure 3.33: NTL9 system. A) Implied timescale (ITS) plot for SchNet as the feature transformation in VAMPNet (errors are taken from the 95% confidence interval over 10 different trainings). B) Free energy landscape (FEL) in a 2D graph embedding. C) CK-test for SchNet using a lag time of 200 ns. D) State assignment of the 2D graph embedding using a 0.95 cutoff.

The comparison of VAMP-2 scores for VAMPNets built using different neural network feature transformations is shown in table S1.
The standard VAMPNet based on distance maps shows a lower VAMP-2 score than the SchNet-based VAMPNet. The implied timescales for the SchNet layer are shown in figure 3.33A, and a comparison of the implied timescales between the SchNet-based GraphVAMPNet and the standard VAMPNet for NTL9 is shown in figure 3.35C. SchNet shows better convergence behavior and also higher implied timescales than the standard VAMPNet (table 3.5), although the convergence and magnitude of the slowest implied timescale are similar for SchNet (figure 3.33) and the standard VAMPNet (figure 3.35C). A lag time of 200 ns was chosen to build the Koopman matrix. The CK-test for SchNet using a lag time of 200 ns (figure 3.33C) shows the Markovianity of the model even at long timescales of 2000 ns. As described for the other proteins, we trained a SchNet-based VAMPNet for NTL9 with a 2D embedding. Figure 3.33D shows the 2D embedding of NTL9, colored by the states with higher than 0.95 cluster membership probability. The FEL in this 2D embedding (figure 3.33B) shows low-energy metastable states separated by transition regions that correspond to points where the model is uncertain about their membership. Representative structures of each cluster in NTL9 are shown in figure 3.34 (colored by residue attention scores). The folded state (S1) and unfolded state (S3) possess 82.3% and 15.3% of the total probability distribution, respectively. Schwantes et al.[269] used a TICA-based MSM for NTL9 and showed that the slowest timescale (about 18 µs) corresponds to the folding process, while the faster timescales correspond to transitions between different register-shifted states. A register shift in each strand can also shift the hydrophobic core contacts; for instance, based on their study, a register shift in strand 3 produces a shift in the core packing in which Phe29 is packed. Interestingly, in our model high attention scores are given to β-strand residues such as Phe5, Phe29, Leu30 and Phe31. Ala36, which is part of strand 3, has a high attention score in the folded and near-folded states (S0, S1 and S2). A register shift for strand 3 was reported by Schwantes et al.[269] for NTL9.

Figure 3.34: A) Representative structure of each metastable state in NTL9 with their probabilities. B) Average attention score between Cα atoms for each cluster. C) Averaged attention score for each residue of NTL9 in each cluster, which is the scaled sum of rows.

3.4.9 Discussion and Conclusion

MSM construction has previously been a complex process involving multiple steps such as feature selection, dimension reduction, clustering, estimation of the transition matrix K, and coarse-graining of the dynamical model. Each of these steps requires choosing hyperparameters, and suboptimal choices can lead to a poor kinetic model of the system with lower kinetic resolution.[262] The variational approach for conformational dynamics (VAC) and its more general form, the variational approach for Markov processes (VAMP), have recently guided the optimal choice of hyperparameters.[332, 218, 262] A cross-validated variational score is usually used to find the set of features with the highest cross-validated VAMP-2 score.[262]

The end-to-end deep learning framework VAMPNet was proposed by Mardt et al.[194] to replace the MSM construction pipeline by training a neural network that maps the molecular configurations x to a fuzzy state space where each point has a membership probability for each of the metastable states.
A VAMPNet is trained by maximizing a VAMP score, allowing us to find the optimal state space that enables linear propagation of states through a transition matrix. VAMPNets are not restricted to stationary and equilibrium MD and can be used in the general case of non-stationary and non-equilibrium processes. In a VAMPNet, the few-state coarse-grained MSM is learned without the loss of model quality that occurs in standard pipelines such as PCCA.[245] Due to the end-to-end nature of deep neural networks, VAMPNets require less expertise to build an MSM. The VAMPNet framework was further developed into state-free reversible VAMPNets (SRVs), which do not approximate MSMs but rather directly learn nonlinear approximations to the slowest dynamical modes of an MD system obeying detailed balance.[49] In SRVs, the transfer operator, rather than a soft metastable state assignment, directly employs the variational approach under detailed balance to approximate the slow modes of the equilibrium dynamics. Ferguson and coworkers [274] showed that MSMs constructed from nonlinear SRV approximations permit the use of shorter lag times and therefore furnish the models with higher kinetic resolution. Hernandez et al.[128] introduced variational dynamics encoders (VDEs), which use a variational autoencoder to find a simple embedding of nonlinear dynamical processes by optimizing a loss function that is the sum of a trajectory reconstruction loss and an autocorrelation loss. However, both SRVs and VDEs produce an embedding of the dynamics and not the coarse-grained states; to build a Markov state model from the embedding, they rely on the traditional MSM construction pipeline (clustering, Markov model construction and coarse-graining). In VAMPNets and GraphVAMPNets, on the other hand, the entire mapping from features to Markov states is done in a single end-to-end network. Moreover, VDEs are limited to estimating a single leading eigenfunction of the dynamical propagator and fail to uncover the full spectrum of slow modes; the lack of an orthogonality constraint on the learned eigenfunctions can cause different slow modes to become entangled in systems with multiple metastable states. SRVs also use the variational approach to conformational dynamics (VAC) as their loss function and are only suitable for stationary and reversible processes, whereas VAMPNets can be applied to more general non-reversible and non-stationary processes.

Despite the success of VAMPNets, feature selection is still a process that must be done with caution. Traditionally, distance maps are used as a general feature representation of protein datasets. However, this representation does not preserve the graph-like structure of proteins, as it does not capture the 3D structure and models the protein as points on a regular grid. In this work, we have focused on representation learning for VAMPNets using a graph representation of the protein to obtain a higher-resolution kinetic model for which a smaller lag time can be chosen. Graph representations of molecules have been shown to be effective for extracting different properties with deep learning. Recently, there has been a large amount of work in the area of geometric deep learning [36] built on graph-based approaches for representing molecular structures. These methods enable automatic learning of the best representation (embedding) from the raw data of atoms and bonds for different types of predictions.[149, 191]
These methods have been applied to various tasks such as molecular feature extraction,[87, 149] protein function prediction [106] and protein design,[285] to name a few. Park et al.[225] proposed a machine learning framework (GNNFF), a graph neural network that predicts atomic forces from local environments, and showed its high accuracy and speed in force prediction. The introduction of graph message passing enhances the model's ability to recognize symmetries and invariances (permutation, rotation and translation) in the system. Hierarchical pooling from the atom level to the residue level and then to the protein level enables the model to learn global transitions between different metastable states that involve atomic-scale dynamics. Xie and coworkers [335] developed graph dynamical networks (GdyNets) to investigate atomic-scale dynamics in material systems, where each atom or node in the graph has a membership probability for each of the metastable states. The graph representation of materials in their model enabled an encoding of the local environment that is permutation, rotation and reflection invariant. The symmetry in materials facilitated identifying similar local environments throughout the material and learning the local dynamics. This type of approach can be used to learn local dynamics in biophysical problems such as nucleation and aggregation, where the local environment is important.

The introduction of graph neural networks into VAMPNets enables a higher-resolution and more interpretable kinetic model. A large increase in the VAMP-2 score is observed when switching from distance-based features to graph-based ones, which suggests the usefulness and representational capability of GNNs for further improving the kinetic embedding of MD simulations. We tested GraphVAMPNet with two different types of graph neural networks (graph convolution layer and SchNet) on three long-timescale protein folding trajectories. GraphVAMPNet showed a higher VAMP-2 score than the standard VAMPNet, and the implied timescales converged faster in GraphVAMPNet due to its more efficient representation learning. This enables choosing a smaller lag time for building the dynamical model and improves the kinetic resolution of the resulting Markov model. The timescales observed with our GraphVAMPNet are comparable to those of other methods that build Markov state models with hundreds of states on embeddings derived from state-free reversible VAMPNets.

Figure 3.35: Implied timescales for the standard VAMPNet for A) TrpCage, B) Villin, C) NTL9, and for the VAMPNet based on the graph convolution layer for D) TrpCage, E) Villin, F) NTL9.

Figure 3.36: Comparison of implied timescales from GraphVAMPNet and the standard VAMPNet for A) TrpCage, B) Villin, C) NTL9.

We should also note that the graph neural network approach for molecular representation could also be used in SRVs to learn a dynamical embedding based on graph representations. The graph embeddings resulting from GraphVAMPNet are highly interpretable and show clustered data in low-energy minima of a free energy landscape. Furthermore, the addition of an attention mechanism to SchNet enables us to decipher the residues and bonds that contribute most to each of the metastable states. However, care must be taken when interpreting the attention scores returned by the model.
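One way such per-residue attention profiles (figures 3.30, 3.32 and 3.34) can be aggregated is sketched below, assuming hypothetical arrays of per-frame attention matrices and state probabilities saved from the model; this illustrates the averaging only, not the exact analysis script.

```python
# Sketch of per-state attention aggregation: attention matrices are averaged
# over the frames assigned to each metastable state, summed over rows and
# rescaled. Both .npy inputs are hypothetical saved model outputs.
import numpy as np

attention = np.load("attention.npy")          # (n_frames, n_residues, n_residues)
state_probs = np.load("state_probs.npy")      # (n_frames, n_states) soft assignments

labels = state_probs.argmax(axis=1)
for s in range(state_probs.shape[1]):
    frames = labels == s
    mean_att = attention[frames].mean(axis=0)          # (n_residues, n_residues)
    per_residue = mean_att.sum(axis=1)                  # row sums
    per_residue /= per_residue.max()                    # scale to [0, 1]
    print(f"state {s}: top residues", np.argsort(per_residue)[::-1][:5])
```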
One main obstacle for GNNs is that they cannot go deeper than a few layers (3 or 4) and suffer from over-smoothing, in which all node representations tend to become similar to one another. An architecture that enables training deeper networks is the residual or skip connection, as deployed in the ResNet architecture, and it is used here to train a deep neural network.[122] Due to the flexibility of the graph representation of molecules, other physical properties of atoms or amino acids, such as electric charge or hydrophobicity, can be encoded into node or edge features in order to enhance the physical and chemical interpretability of the model. Moreover, hierarchical pooling layers can be applied to learn the dynamics at different resolutions of the molecule.[344] Our GraphVAMPNet is inherently transferable. This means that, theoretically, given a sufficient amount of dynamical data, transfer learning can be leveraged to reduce the number of trajectories needed for studying the dynamics of a particular system of interest.
Time-reversibility and stochasticity of the transition matrix are the two physical constraints needed to obtain a valid symmetric transition matrix for analyses such as transition path theory (TPT). These physical constraints were added to VAMPNet, and the resulting model, called revDMSM, was successfully applied to study the kinetics of disordered proteins. This allowed a valid transition matrix to be obtained, so that rates could be quantified for processes of interest. revDMSM was further extended by including experimental observables in the model as well as a novel hierarchical coarse-graining that gives different levels of detail. These physical constraints can be further added to GraphVAMPNet to obtain a valid and high-resolution transition matrix. In summary, our GraphVAMPNet automates the feature selection in VAMPNet so that features are learned by graph message passing on molecular graphs, which is a general approach for understanding coarse-grained dynamics.

Chapter 4: Molecular dynamics study of membrane proteins

4.1 Critical Sequence hotspots for binding of novel coronavirus to Angiotensin Converter Enzyme as Evaluated by Molecular Simulations

The novel coronavirus (nCOV-2019) outbreak has put the world on edge, causing millions of cases and hundreds of thousands of deaths all around the world as of June 2020, let alone the societal and economic impacts of the crisis. The spike protein of nCOV-2019 resides on the virion's surface, mediating coronavirus entry into host cells by binding its receptor binding domain (RBD) to the host cell surface receptor protein, angiotensin converter enzyme (ACE2). Our goal is to provide a detailed structural mechanism of how nCOV-2019 recognizes and establishes contacts with ACE2, and how this differs from the earlier coronavirus SARS-COV of 2002, via extensive molecular dynamics (MD) simulations. Numerous mutations have been identified in the RBD of nCOV-2019 strains isolated from humans in different parts of the world. In this study, we investigated the effect of these mutations, as well as other Ala-scanning mutations, on the stability of the RBD/ACE2 complex. It is found that most of the naturally occurring mutations to the RBD either slightly strengthen binding or have the same binding affinity to ACE2 as the wild-type nCOV-2019. This means the virus had sufficient binding affinity to its receptor at the beginning of the crisis.
This also has implications for any vaccine design endeavors, since these mutations could act as antibody escape mutants. Furthermore, in-silico Ala-scanning and long-timescale MD simulations highlight the crucial role of the residues at the interface of the RBD and ACE2, which may be used as potential pharmacophores for any drug development endeavors. From an evolutionary perspective, this study also identifies how the virus has evolved from its predecessor SARS-COV and how it could further evolve to become even more infectious.
Taken from the published paper: Ghorbani, M., Brooks, B. R., & Klauda, J. B. (2020). Critical sequence hotspots for binding of novel coronavirus to angiotensin converter enzyme as evaluated by molecular simulations. The Journal of Physical Chemistry B, 124(45), 10034-10047.

4.1.1 Introduction

The novel coronavirus (SARS-COV-2) outbreak emerging from China has become a global pandemic and a major threat to human public health. According to the World Health Organization (WHO), as of August 28th, 2020 (the time this research was done), there had been about 25 million confirmed cases and about 1 million deaths due to coronavirus in the world.[356] Coronaviruses are a family of single-stranded enveloped RNA viruses. Phylogenetic analysis of the coronavirus genome has shown that nCOV-2019 belongs to the beta-coronavirus family, which also includes MERS-COV, SARS-COV and bat-SARS-related coronaviruses.[176, 315] In all coronaviruses, a homotrimeric spike glycoprotein on the virion's envelope mediates coronavirus entry into host cells through a mechanism of receptor binding followed by fusion of the viral and host membranes.[309, 176] The spike protein in coronaviruses contains two functional subunits, S1 and S2. The S1 subunit is responsible for binding to the host cell receptor, and the S2 subunit is responsible for fusion of the viral and host cell membranes.[176, 329] The spike protein in nCOV-2019 exists in a metastable pre-fusion conformation that undergoes a substantial conformational rearrangement to fuse the viral membrane with the host cell membrane.[28, 329] nCOV-2019 is closely related to the bat coronavirus RaTG13, with about 93.1% sequence similarity in the spike protein genome. The sequence similarity of nCOV-2019 and SARS-COV is less than 80% in the spike protein sequence.[353]
Figure 4.1: A) Superposition of the RBD of SARS-COV (yellow) and nCOV-2019 (red). B) Different regions in the binding domain of nCOV-2019 defining the extended loop (non-yellow).
The S1 subunit of the spike protein includes a receptor binding domain (RBD) that recognizes and binds to the host cell receptor. The RBD of nCOV-2019 shares 72.8% sequence similarity with the SARS-COV RBD, and the root mean squared deviation (RMSD) between the two structures is 1.2 Å, which shows their high structural similarity.[315, 28] Experimental binding affinity measurements using surface plasmon resonance (SPR) have shown that the nCOV-2019 spike protein binds its receptor, human angiotensin converter enzyme (ACE2), with 10- to 20-fold higher affinity than SARS-COV binds ACE2.[329] Based on the sequence similarity between the RBDs of nCOV-2019 and SARS-COV, and also the tight binding between the RBD of nCOV-2019 and ACE2, it is most probable that SARS-COV-2 uses this receptor on human cells to gain entry into the body.[176, 309, 329, 307] The spike protein, and specifically the RBD, in coronaviruses has been a major target for therapeutic antibodies.
However, until August 28th, 2020, no monoclonal antibodies targeted to the RBD had been able to efficiently bind and neutralize nCOV-2019.[329, 292]
The core of the nCOV-2019 RBD is a five-stranded antiparallel β-sheet with connected short α-helices and loops (Figure 4.1). The binding interfaces of nCOV-2019 and SARS-COV with ACE2 are very similar, with less than 1.3 Å RMSD. An extended insertion in the core containing short strands, α-helices and loops, called the receptor binding motif (RBM), makes all the contacts with ACE2. In the nCOV-2019 RBD, the RBM forms a concave surface with a ridge loop on one side, and it binds to a convex exposed surface of ACE2. The overlay of the SARS-COV and nCOV-2019 RBD proteins is shown in Figure 4.1A. The binding interface of nCOV-2019 contains loops L1 to L4, short β-strands β5 and β6 and a short helix α5. The location of the RBM in the nCOV-2019 RBD, as well as the different helices, strands and loops, is shown in Figure 4.1B.
The sequence alignment between SARS-COV in humans, SARS-Civet, bat RaTG13 coronavirus and nCOV-2019 in the RBM is shown in Figure 4.2. There is 50% sequence similarity between the RBM of nCOV-2019 and SARS-COV. RBM mutations played an important role in the SARS epidemic in 2002.[175] Two mutations in the RBM of SARS-COV in 2002 relative to SARS-Civet were observed in strains of these viruses: K479N and S487T. These two residues are close to the virus binding hotspots on ACE2, hotspot-31 and hotspot-353. Hotspot-31 centers on the salt bridge between K31 and E35, and hotspot-353 is centered on the salt bridge between K353 and E358 on ACE2. Residues K479 and S487 in SARS-Civet are in close proximity to these hotspots, and mutations at these residues cause SARS to bind ACE2 with significantly higher affinity than SARS-Civet; they played a major role in the civet-to-human and human-to-human transmission of the SARS coronavirus in 2002.[176, 178, 174] Numerous mutations at the interface of the SARS-COV RBD and ACE2 from different strains of SARS isolated from humans in 2002 have been identified, and the effects of these mutations on binding to ACE2 have been investigated by surface plasmon resonance.[173, 333] Two identified RBD mutations (Y442F and L472F) increased the binding affinity of SARS-COV to ACE2, and two mutations (N479K, T487S) decreased the binding affinity. It was demonstrated that these mutations were viral adaptations to either human or civet ACE2.[173, 333] A pseudotyped viral infection assay of the interaction between different spike proteins and ACE2 confirmed the correlation between high-affinity mutants and their high infectivity.[16]
Figure 4.2: Sequence comparison of the receptor binding motif (RBM) in SARS-2002, SARS-Civet, bat RaTG13 and nCOV-2019. The mutations from SARS-2002 to nCOV-2019 are marked in blue. Important mutations in the RBM are marked in yellow. Red shows the 3-residue motif in SARS and Civet and the 4-residue motif in RaTG13 and nCOV-2019.
Further investigation of RBD residues in the binding of SARS-COV and ACE2 was performed through Ala-scanning mutagenesis, which resulted in the identification of residues that reduce the binding affinity to ACE2 upon mutation to alanine.[43] RBD mutations have also been identified in MERS-COV, which affected their affinity to the receptor (DPP4) on human cells.
Multiple monoclonal antibodies have been developed for SARS since 2002 that neutralized the spike glycoprotein on the SARS-COV surface.[65, 291, 76, 246] However, multiple escape mutations exist in the RBD of SARS-COV that affect neutralization by antibodies, which led to the use of a cocktail of antibodies as a more robust treatment.[247] Full genome analysis of nCOV-2019 in different countries and receptor binding surveillance have revealed multiple mutations in the RBD of the glycosylated spike. The GISAID database[91] (www.gisaid.org/) contains genomes of nCOV-2019 deposited by researchers across the world since December 2019. The latest report from the GISAID database in June 2020 showed 25 different variants of the RBD from strains of nCOV-2019 collected from different countries, along with the number of occurrences in these regions, which is listed below for the seven most frequent mutations: 213x N439K (211 Scotland, England, Romania), 65x T478I in England, 30x V483A (26 USA/WA, 2 USA/UN, USA/CT, England), 10x G476S (8 USA/WA, USA/OR, Belgium), 7x S494P (3 USA/MI, England, Spain, India, Sweden), 5x V483F (4x Spain, England), 4x A475V (2 USA/AZ, USA/NY, Australia/NSW). It is not known whether these mutations are linked to the severity of coronavirus in these regions.
Starr and coworkers[282] performed a deep mutational scanning of the nCOV-2019 RBD and used flow cytometry to measure the effect of single mutations on the expression of the folded protein as well as its binding affinity to ACE2. They showed that the RBD is very tolerant of these mutations, maintaining its expression level as well as its binding affinity to ACE2 in most cases. According to their results, most natural mutations exert binding affinities to ACE2 similar to that of wild-type nCOV-2019. Furthermore, they showed that mutations at critical positions at the RBD-ACE2 interface of nCOV-2019, such as residues Q493 and Q498, do not reduce the binding affinity to ACE2, which shows the substantial plasticity of the interface.[282]
Different groups have computationally studied the binding of the nCOV-2019 RBD with ACE2.[317, 190, 281] All these studies point to a higher binding affinity of the nCOV-2019 RBD than the SARS-COV RBD to ACE2. Interestingly, the role of water-mediated interactions has been pointed out as a driving force, which is shown to be similar for both the SARS-COV and nCOV-2019 RBDs.[190] Spinello and coworkers[281] studied the binding of the nCOV-2019 and SARS-COV RBDs to ACE2 and found that the former binds its receptor with about 30 kcal/mol higher affinity than the SARS-COV RBD. Gao et al.[317] used free energy perturbation (FEP) and showed that most amino acid mutations in the RBM from SARS-COV to nCOV-2019 increase the affinity of the RBD for ACE2. The focus of this article is to elucidate the differences between the interfaces of SARS-COV and nCOV-2019 with ACE2, to understand with atomic resolution the interaction mechanism and hotspot residues at the RBD/ACE2 interface using long-timescale molecular dynamics (MD) simulation. An alanine-scanning mutagenesis in the RBM of nCOV-2019 helped to identify the key residues in the interaction, which could be used as potential pharmacophores for future drug development. Furthermore, we performed molecular simulations on the seven most common mutations found from surveillance of RBD mutations: N439K, T478I, V483A, G476S, S494P, V483F and A475V. From an evolutionary perspective, this study shows the residues at which the virus might further evolve to become even more dangerous to human health.
4.1.2 Methods

Sequence comparison and mutant preparation

nCOV-2019 shares 76% sequence similarity with the SARS-2002 spike protein, 73% sequence identity for the RBD and 50% for the RBM.[356] Bat coronavirus RaTG13 seems to be the closest relative of nCOV-2019, sharing about 93% sequence identity in the spike protein.[309] The sequence alignment of SARS-2002, SARS-Civet, bat RaTG13 and SARS-COV-2 is shown in Figure 4.2.[309] To investigate the roles of critical mutations on the complex stability of nCOV-2019 with ACE2, the mCSM-PPI2 webserver [248] was used to find the residues in nCOV-2019 that are at the interface with ACE2. 21 different residues were identified to be in contact with ACE2 and were chosen for further MD simulation. In addition, mutations in the RBD have been observed in the full genome analysis of different nCOV-2019 variants collected from different countries and compiled in the GISAID database.[76] The selected positions are listed below along with their locations in the RBD: K417 (β3), N439 (β4), G446 (L1), G447 (L1), Y449 (L1), Y453 (β5), L455 (β5), F456 (L2), Y473 (β6), A475 (β6), G476 (L3), T478 (L3), V483 (L3), E484 (L3), F486 (L3), N487 (L3), Y489 (β6), Q493 (β5), S494 (β5), G496 (L4), Q498 (L4), T500 (L4), N501 (L4), G502 (L4), Y505 (α5).

Molecular dynamics simulations

The crystal structure of nCOV-2019 in complex with hACE2 (PDB ID: 6M0J)[159] as well as the SARS-COV complex with human ACE2 (PDB ID: 6ACJ)[279] were obtained from the Protein Data Bank. All initial structures were prepared in GROMACS.[2] The TIP3P water model was used for the solvent, and the AMBER Parm99SB-ILDN force field (FF)[181, 23] was used for all protein complexes. Neutralizing ions were added to all systems. It is important to note that none of the RBD/ACE2 complexes studied here were glycosylated. The glycosylation sites of the RBD are far from the binding interface and do not interfere with binding to ACE2. The dynamics of glycans in the spike protein and their effect on shielding are studied in the next section of this chapter. 500 steps of energy minimization were performed using the steepest descent algorithm. In all steps the LINCS algorithm was used to constrain all bonds containing hydrogen atoms. The systems were equilibrated using a velocity-rescaling thermostat to maintain the temperature at 310 K with a 0.1 ps coupling constant in the NVT ensemble under periodic boundary conditions, with harmonic restraints on the backbone and sidechain atoms of the complex.[39] A velocity-rescaling thermostat was used in all steps of the simulation. In the next step, further equilibration was done in the NPT ensemble at a pressure of 1 bar using the Berendsen barostat.[301] During the production run, the harmonic restraints were removed and all systems were simulated in the NPT ensemble, where the pressure was maintained at 1 bar using the Parrinello-Rahman barostat [226] with a compressibility of 4.5×10⁻⁵ bar⁻¹ and a coupling constant of 0.5 ps. The production run lasted 500 ns for the SARS-COV and nCOV-2019 complexes and 300 ns for all the mutants, with a 2 fs timestep and particle-mesh Ewald (PME)[69] for long-range electrostatic interactions, using the GROMACS 2018.3 package.[2] All mutant systems were constructed as described above and run for 300 ns of production. In addition, the simulation time for a few mutants (Y449A, T478I, Y489A and S494P) was extended to 500 ns.
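For readers who wish to reproduce a comparable protocol, the sketch below writes a minimal GROMACS production .mdp file reflecting the settings described above (2 fs timestep, LINCS constraints on hydrogen-containing bonds, PME electrostatics, a velocity-rescaling thermostat at 310 K and a Parrinello-Rahman barostat at 1 bar). This is a hedged approximation rather than the exact input used in this work; the run length, temperature-coupling groups and any omitted output or neighbor-list options are placeholders.

```python
# Sketch of a GROMACS production .mdp reflecting the protocol described above.
# nsteps and other unlisted options are illustrative placeholders.
mdp = """
integrator              = md
dt                      = 0.002        ; 2 fs timestep
nsteps                  = 250000000    ; 500 ns (placeholder)
constraints             = h-bonds      ; LINCS on bonds with hydrogens
constraint-algorithm    = lincs
cutoff-scheme           = Verlet
coulombtype             = PME          ; particle-mesh Ewald electrostatics
tcoupl                  = V-rescale    ; velocity-rescaling thermostat
tc-grps                 = System
tau-t                   = 0.1
ref-t                   = 310
pcoupl                  = Parrinello-Rahman
pcoupltype              = isotropic
tau-p                   = 0.5
ref-p                   = 1.0
compressibility         = 4.5e-5
"""

with open("production.mdp", "w") as handle:
    handle.write(mdp)
```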
Gibbs free energy and correlated motions

The last 400 ns of simulation was used to explore the dominant motions in SARS-COV, nCOV-2019 and the mutants with extended simulations, and the last 200 ns for all other mutants, using principal component analysis (PCA) as part of the quasiharmonic analysis method. For this method, the rotational and translational motions of the RBD of all systems were eliminated by fitting to a reference (crystal) structure. Next, 4,000 snapshots from the last 400 ns of SARS-COV, nCOV-2019 and the mutants with extended simulation time, and 2,000 snapshots from the last 200 ns of all other mutant systems, were taken to generate the covariance matrix of the Cα atoms of the RBD. For the mutant systems with extended production runs, the last 400 ns was used for this analysis. Diagonalization of this matrix resulted in a diagonal matrix of eigenvalues and their corresponding eigenvectors. The first eigenvector, which indicates the first principal component, was used to visualize the dominant global motions of all complexes through porcupine plots.
The principal components were used to calculate and plot the approximate free energy landscape (aFEL). We refer to the free energy landscape produced by this approach as approximate because the ensemble with respect to the first few PCs (the lowest-frequency quasiharmonic modes) is not close to convergence, but the analysis can still provide valuable information and insight. Hydrogen bonds were analyzed in VMD with a donor-acceptor distance cutoff of 3.2 Å and an angle cutoff of 30°. Dynamic cross-correlation maps (DCCM) were obtained using the MD TASK package to identify the correlated motions of RBD residues. In DCCM, the cross-correlation matrix C_ij is obtained from the displacements of backbone Cα atoms over a time interval Δt. The DCCM was constructed using the last 400 ns of SARS-COV, nCOV-2019 and the extended mutant systems and the last 200 ns of all other mutant systems, with a 100 ps time interval.

Binding free energy from the MMPBSA method

The Molecular Mechanics Poisson-Boltzmann Surface Area (MMPBSA) method was used to calculate the binding free energy between the RBD and ACE2 in all complexes.[310, 243] For SARS-COV and nCOV-2019, 200 snapshots from the last 400 ns, and for the mutant systems, 100 snapshots from the last 200 ns of simulation, were used for the calculation of binding free energies with an interval of 2 ns. The simulations of a few mutant systems (Y449A, T478I, Y489A and S494P) were extended, and their binding energies were calculated over the last 400 ns to assess the convergence of the free energies. The binding free energy of a ligand-receptor complex can be calculated as:

\Delta G_{bind,aq} = \Delta H - T\Delta S = \Delta G_{complex} - [\Delta G_{protein} + \Delta G_{ligand}]   (4.1)

\Delta G_{bind,aq} = \Delta E_{MM} + \Delta G_{bind,solv} - T\Delta S   (4.2)

\Delta E_{MM} = \Delta E_{covalent} + \Delta E_{elect} + \Delta E_{VDW}   (4.3)

\Delta G_{bind,solv} = \Delta G_{polar} + \Delta G_{non\text{-}polar}   (4.4)

\Delta G_{non\text{-}polar} = \gamma\,\mathrm{SASA} + b   (4.5)

In the above equations, ΔE_MM is the gas-phase molecular mechanics energy change, which includes covalent, electrostatic and van der Waals energies. Based on previous studies, the entropic change upon binding is neglected in these calculations.[338, 111, 243] ΔG_bind,solv is the solvation free energy, which comprises polar and non-polar components. The polar solvation term is calculated with the Poisson-Boltzmann method, setting the solvent and solute dielectric constants to 80 and 2, respectively. The non-polar free energy is simply estimated from the solvent-accessible surface area (SASA) of the solute using Equation 4.5.
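To make Equations 4.2-4.5 concrete, the sketch below assembles a binding free energy from per-frame component differences (complex minus receptor minus ligand), neglecting the entropic term as stated above. The variable names, the toy input values and the non-polar parameters γ and b are placeholders that depend on the specific MMPBSA implementation; this is not the script used for the results reported here.

```python
import numpy as np

def mmpbsa_binding_energy(e_covalent, e_elec, e_vdw, g_polar, sasa,
                          gamma=0.00542, b=0.92):
    """Assemble the MMPBSA binding free energy (Eqs. 4.2-4.5).

    Inputs are per-frame arrays (kcal/mol; SASA differences in A^2) of the
    complex-minus-receptor-minus-ligand quantities.  gamma and b are
    placeholder non-polar parameters; -T*dS is neglected.
    """
    dE_mm = e_covalent + e_elec + e_vdw          # Eq. 4.3
    dG_nonpolar = gamma * sasa + b               # Eq. 4.5
    dG_solv = g_polar + dG_nonpolar              # Eq. 4.4
    dG_bind = dE_mm + dG_solv                    # Eq. 4.2 with -T*dS = 0
    return dG_bind.mean(), dG_bind.std(ddof=1) / np.sqrt(len(dG_bind))

# Toy usage with 200 hypothetical snapshots (values loosely mimic the
# magnitudes reported in the text, purely for illustration).
n = 200
mean, sem = mmpbsa_binding_energy(
    e_covalent=np.zeros(n),
    e_elec=np.random.normal(-746.6, 30.0, n),
    e_vdw=np.random.normal(-89.9, 5.0, n),
    g_polar=np.random.normal(797.3, 35.0, n),
    sasa=np.random.normal(-1950.0, 50.0, n),
)
print(f"dG_bind = {mean:.2f} +/- {sem:.2f} kcal/mol")
```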
Figure 4.3: Cα RMSD plots for nCOV-2019, SARS-COV and mutants of SARS-COV-2.

4.1.3 Results

Structural dynamics

To compute the RMSD of the systems, the rotational and translational movements were removed by first fitting the Cα atoms of the RBD to the crystal structure and then computing the RMSD with respect to the Cα atoms of the RBD in each system. Figure 4.3 shows the RMSD plots for the RBD of SARS-COV, nCOV-2019 and some of its variants. Comparison of the RMSDs of SARS-COV and nCOV-2019 shows that SARS-COV has a larger RMSD throughout the 500 ns simulation. In nCOV-2019, the RMSD is very stable with a value of 1.5 Å, whereas in SARS-COV the RMSD increases up to 4 Å after 100 ns and then fluctuates between 3 and 4 Å. The change in RMSD of SARS-COV is partially related to the motion of the C-terminus, which is a flexible loop. In most variants of nCOV-2019 the RMSD is very stable during the 300 ns simulation, which shows the tolerance of the interface for mutations, although for some mutations (Y489A, Y505A and N487A) the RMSD slightly increases.
To characterize the dynamic behavior of each amino acid in the RBD, we analyzed the root mean square fluctuation (RMSF) of all systems.
Figure 4.4: RMSF plots for nCOV-2019-WT, SARS-COV, and the Y505A, N487A, G496A and E484A mutants. The red shaded region shows the fluctuations in L1 and the green shaded area shows the fluctuations in L3. The orange shaded region in SARS-COV shows the fluctuations in the C-terminus. For comparison, the RMSF of nCOV-2019 WT is shown in cyan in the other plots.
The RMSF plots for nCOV-2019, SARS-COV and four other mutants are shown in Figure 4.4. nCOV-2019 shows smaller fluctuations than SARS-COV. L3 in nCOV-2019, corresponding to residues 476 to 487 (shown in red in Figure 4.4), has a smaller RMSF (1.5 Å) than the SARS-COV L3 residues 463 to 474. L1 in both nCOV-2019 and SARS-COV (green) has small fluctuations (less than 1.5 Å). Moreover, the C-terminal residues of SARS-COV show high fluctuations (orange). A few mutants show higher fluctuations in L1: mutants Y505A and S494A had an RMSF of 2.5 Å and mutant N487A had an RMSF of about 4 Å in L1, while mutation Y449A had a higher RMSF of about 3 Å in L1. Mutants G496A, E484A and G447A show a high fluctuation of about 4.5 Å in L3.

PCA and approximate free energy landscape

Most of the combined motions were captured by the first ten eigenvectors generated from the last 400 ns for SARS-COV, nCOV-2019 and the extended mutant systems, and the last 200 ns for the other nCOV-2019 mutants. The percentage of the motions captured by the first three eigenvectors was 51% for nCOV-2019 and 68% for SARS-COV. In all mutants, more than 50% of the motions were captured by the first three eigenvectors. The first few PCs describe the largest motions in a protein, which are related to functional motions such as binding or unbinding of the protein from its receptor. The first three eigenvectors were used to calculate the approximate FEL (aFEL) using the last 400 ns of simulation for nCOV-2019 and SARS-COV, shown in Figure 4.5, which displays the variance in conformational motion. SARS-COV showed two distinct low free energy states (shown in blue) separated by a metastable state. There is a clear separation between the two regions by a free energy barrier of about 6-7.5 kcal/mol. These two states correspond to the loop motions in L3 as well as the motion of the C-terminal residues of SARS-COV.
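The approximate free energy landscape used here can be obtained by Boltzmann inversion of a two-dimensional histogram of the leading principal components. The sketch below is a generic version of that procedure; the bin count is illustrative, and the synthetic input arrays stand in for the actual PC projections from the quasiharmonic analysis.

```python
import numpy as np

def approximate_fel(pc1, pc2, temperature=310.0, bins=60):
    """Approximate free energy landscape from two principal components.

    F(pc1, pc2) = -kT ln P(pc1, pc2), shifted so the global minimum is zero.
    Returns the free energy surface in kcal/mol and the bin edges.
    """
    kT = 0.0019872041 * temperature          # Boltzmann constant in kcal/mol/K
    hist, xedges, yedges = np.histogram2d(pc1, pc2, bins=bins, density=True)
    prob = np.where(hist > 0, hist, np.nan)  # mask empty bins
    fel = -kT * np.log(prob)
    return fel - np.nanmin(fel), xedges, yedges

# Toy usage with synthetic projections standing in for the real PCs.
pc1 = np.random.normal(size=20000)
pc2 = np.random.normal(size=20000)
fel, xe, ye = approximate_fel(pc1, pc2)
print(np.nanmax(fel))                        # depth of the sampled landscape
```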
The L3 motion in nCOV-2019 is stabilized by an H-bond between N487 on the RBD and Y83 on ACE2, as well as a π-stacking interaction between F486 and Y83. It is evident that the nCOV-2019 RBD is more stable than the SARS-COV RBD and exists in one conformation, whereas the SARS-COV interface fluctuates and its aFEL is separated into two different regions. The first two eigenvectors were used to calculate and plot the aFEL as a function of the first two principal components using the last 200 ns of the simulation for the mutant systems.
Figure 4.5: Mapping the principal components of the RBD for the aFEL from the last 400 ns of simulations for SARS-COV (top row) and nCOV-2019-WT (bottom row). The color bar is relative to the lowest free energy state.

Dynamic Cross Correlation Maps (DCCM)

The correlated motions of the RBD atoms were analyzed with DCCM based on the Cα atoms of the RBD from the last 400 ns of simulation for nCOV-2019, SARS-COV and the extended mutant systems and the last 200 ns for the other mutants (Figure 4.6). The DCCM for nCOV-2019 showed a correlation of motions between residues 490-505 (containing the β5, L4 and α5 regions) and residues 440-455 (containing the β4, L1 and β5 regions), shown in the red rectangle in Figure 4.6. Another important correlation that appears in the DCCM of nCOV-2019 is between residues 473-481 and residues 482-491. These residues are in the L3 and β6 regions, and their correlation in nCOV-2019 is stronger than in SARS-COV. This is due to the presence of the β6 strand in nCOV-2019, whereas in SARS-COV these residues all belong to L3. This indicates that L3 in nCOV-2019 has evolved from SARS-COV to adopt a new secondary structure, which causes the strong correlation and makes the loop act as a recognition region for binding.
Some of the mutations disrupted the patterns of correlation and anti-correlation in the nCOV-2019 RBD. Mutation N487A showed a stronger correlation in L3 and the β6 strand than the wild-type RBD. For mutation E484A, the correlation in L3 is stronger than in the wild type. It is worth mentioning that mutation F486A disrupts the DCCM of nCOV-2019 by introducing strong correlations in the core region of the RBD as well as in the extended loop region. Residue F486 resides in L3 and plays a crucial role in stabilizing the recognition loop by making a π-stacking interaction with residue Y83 on ACE2.
Figure 4.6: Dynamic cross-correlation maps for nCOV-2019, SARS-COV and mutants, with residue numbers of the RBD domain. The red box shows the correlation between the β5, L4 and α5 regions and the β4, L1 and β5 regions. The blue box shows the correlation between the L3 and β6 regions.

Binding free energies

The binding energetics between ACE2 and the RBD of SARS-COV, nCOV-2019 and all its mutants were investigated by the MMPBSA method.[310] The binding energy was partitioned into its individual components, including electrostatic, van der Waals, polar solvation and solvent accessible surface area (SASA) terms, to identify important factors affecting the interface of the RBD and ACE2 in all complexes. nCOV-2019 has a total binding energy of -50.22±1.93 kcal/mol, whereas SARS-COV has a weaker binding energy of -18.79±1.53 kcal/mol. Decomposition of the binding energy into its components shows that the most striking difference between nCOV-2019 and SARS-COV is the electrostatic contribution, which is -746.69±2.66 kcal/mol for nCOV-2019 and -600.14±7.65 kcal/mol for SARS-COV. This high electrostatic contribution is compensated by a large polar solvation free energy, which is 797.30±3.12 kcal/mol for nCOV-2019 and 659.61±8.98 kcal/mol for SARS-COV.
nCOV-2019 also possesses a larger VDW contribution (-89.93±0.46 kcal/mol) than SARS-COV (-70.07±1.22 kcal/mol). Furthermore, the SASA contribution to binding was -8.30±0.15 kcal/mol for SARS-COV and -10.58 kcal/mol for nCOV-2019. Both hydrophobic and electrostatic interactions play major roles in the higher affinity of the nCOV-2019 RBD than the SARS-COV RBD for ACE2.
The binding free energies for nCOV-2019 and SARS-COV were decomposed into per-residue binding energies to find the residues that contribute strongly to the binding and are responsible for the higher binding affinity of nCOV-2019 than SARS-COV (Figure 4.7). Most of the residues in the RBM of nCOV-2019 had more favorable contributions to the total binding energy than those of SARS-COV. Residues Q498, Y505, N501, Q493 and K417 in the nCOV-2019 RBM each contributed more than 5 kcal/mol to the binding affinity and are crucial for complex formation. A few residues, such as E484 and S494, contributed unfavorably to the total binding energy. Among all the interface residues, K417 had the highest contribution to the total binding energy (-12.34±0.23 kcal/mol). The corresponding residue in SARS-COV, V404, had only a -0.02±0.01 kcal/mol contribution, which points to the importance of this residue for nCOV-2019 binding to ACE2. Residue Q498 contributed -6.72±0.18 kcal/mol, and its corresponding residue in SARS-COV, Y484, contributed only -1.83±0.06 kcal/mol to the total binding. Other important residues, Y505 and N501, have more negative contributions to the total binding energy than their counterparts in SARS-COV, residues Y491 and T487, respectively (Figure 4.7). Residue D480 in SARS-COV contributed unfavorably to the binding energy by 6.2±0.15 kcal/mol, and the corresponding residue in nCOV-2019, S494, lowers this unfavorable contribution to only 1.17±0.06 kcal/mol. The mutation D480A/G appeared to be a dominant mutation in SARS-COV in 2002-2003.[51] This mutation was reported to escape neutralization by antibody 80R.[52] To investigate the effect of this point mutation on the binding of the SARS-COV RBD to ACE2, we performed an additional simulation and calculated the binding affinity for this mutant in the SARS-COV RBD with the same approach used for the other mutations in this study. The D480A mutation showed a binding affinity of -23.46±3.07 kcal/mol, which is about 5 kcal/mol stronger than that of the wild-type SARS-COV RBD. In SARS-COV, residue R426 had the highest contribution to the total binding energy (-6.27±0.22 kcal/mol), although the corresponding residue in nCOV-2019 is N439, with a contribution of -0.32±0.02 kcal/mol. These important mutations in the RBM of nCOV-2019 relative to SARS-COV cause the RBD of nCOV-2019 to bind ACE2 with much stronger (about 30 kcal/mol) affinity.
Figure 4.7: Binding energy decomposition per residue for the RBD of nCOV-2019 and SARS-COV.
The binding free energy decomposition into individual components for all mutants is presented in Table 4.1. In all complexes, a large positive polar solvation free energy disfavors binding and complex formation, and this is compensated by a large negative electrostatic free energy of binding. All variants had similar solvent accessible surface area energies. The VDW free energy of binding ranged from -84.68±0.68 kcal/mol for mutant Q493A to -103.85±0.66 kcal/mol for Y489A. Mutant K417A had the lowest electrostatic contribution to binding (-415.67±5.07 kcal/mol), and mutants N439K and E484A had the highest electrostatic binding contributions, of -989.80±5.6 kcal/mol and -941.20±3.95 kcal/mol, respectively.
Most alanine substitutions exhibited total binding affinities similar to or weaker than that of nCOV-2019; however, a few mutants had a higher binding affinity than the wild type. Mutant Y489A had a total binding energy of -61.78±2.59 kcal/mol, which was about 11 kcal/mol more favorable than the wild-type binding energy. Mutants G446A, G447A and T478I also demonstrated higher total binding affinities than nCOV-2019. Other alanine substitutions had total binding energies similar to or weaker than nCOV-2019. Mutant G502A has the lowest binding affinity among all the mutants, with a binding energy of -24.31±2.98 kcal/mol. Mutants K417A, L455A, T500A and N501A are the other systems with total binding affinities significantly lower than that of the wild-type complex; the electrostatic component of binding contributes the most to the low binding affinities of these mutants. The contributions of RBM residues to binding with ACE2 for nCOV-2019 were mapped onto the RBD structure and are shown in Figure 4.9B.
Most natural mutants exhibited binding affinities similar to wild-type nCOV-2019, with a few exceptions. Mutation T478I, which is one of the most frequent mutations in the GISAID database, has a binding affinity about 6 kcal/mol stronger than the wild type. S494P and A475V showed slightly lower binding affinities than the wild-type complex. The other natural mutants showed binding affinities similar to the wild-type RBD. N439K demonstrated a high electrostatic energy, which is compensated by a large polar solvation energy, and this mutant has a total binding energy of -48.27±3.07 kcal/mol, similar to nCOV-2019.

Important interactions at the RBD-ACE2 interface

The important hydrogen bonds (H-bonds) and salt bridges between the nCOV-2019 RBD or SARS-COV RBD and ACE2 over the last 400 ns of the trajectories are shown in Table 4.2. The nCOV-2019 RBD makes 10 H-bonds and 1 salt bridge with ACE2, whereas SARS-COV makes only 5 H-bonds and 1 salt bridge with ACE2 with more than 30% persistence.
Figure 4.8: Total binding energy of SARS-COV, nCOV-2019 and mutants. Natural mutants are marked with an X at the base of the bar.
Table 4.1: Binding free energy decomposition in kcal/mol for nCOV-2019, SARS-COV and mutants of SARS-COV-2.

System       VDW             Electrostatic     Polar-solv        SASA            Total
SARS-COV     -70.07±1.22     -600.14±7.65      659.61±8.98       -8.39±0.15      -18.79±1.53
SARS-D480A   -88.30±0.69     -897.14±3.8       972.07±3.90       -10.14±0.07     -23.46±3.07
nCOV-2019    -89.93±0.46     -746.59±2.66      797.3±3.12        -10.58±0.05     -50.22±1.93
K417A        -88.23±0.58     -415.67±5.07      484.87±4.89       -10.29±0.09     -29.56±2.95
N439K        -95.4±0.63      -989.84±5.57      1047.70±5.08      -10.86±0.06     -48.27±3.07
G446A        -91.9±0.50      -730.12±3.68      774.7±4.18        -10.6±0.08      -57.79±2.92
G447A        -93.95±0.75     -756.61±4.63      803.48±5.02       -11.08±0.09     -58.37±2.32
Y449A        -97.98±0.56     -717.67±2.71      774.49±3.50       -10.80±0.05     -51.91±2.33
Y453A        -92.3±0.65      -712.76±3.61      765.63±3.98       -10.96±0.07     -49.98±2.92
L455A        -84.96±0.56     -734.72±4.44      795.41±4.06       -10.63±0.08     -33.47±2.93
F456A        -95.43±0.57     -770.12±3.38      832.56±4.14       -11.59±0.07     -44.84±3.38
Y473A        -90.48±0.54     -725.17±4.27      779.23±4.16       -10.61±0.06     -47.23±2.59
A475V        -93.12±0.49     -712.12±4.59      769.23±4.71       -10.68±0.08     -46.07±4.11
G476A        -92.84±0.57     -746.46±4.92      796.97±4.84       -10.99±0.08     -53.57±2.58
G476S        -92.25±0.64     -712.08±4.10      767.40±4.32       -10.77±0.08     -47.38±2.86
T478I        -88.95±0.69     -753.93±4.00      797.03±4.01       -10.33±0.08     -56.06±2.83
V483A        -90.74±0.60     -737.48±3.94      789.22±3.74       -10.76±0.07     -49.55±3.17
V483F        -87.23±0.6      -738.08±4.73      782.82±4.04       -10.59±0.09     -53.70±3.35
E484A        -95.7±0.62      -941.2±3.95       1002.21±4.92      -11.15±0.08     -46.67±2.89
F486A        -90.07±0.66     -724.64±4.39      779.73±5.01       -10.68±0.09     -45.23±3.66
N487A        -102.23±0.72    -724.44±3.69      791.41±4.3        -11.37±0.08     -46.33±2.73
Y489A        -103.85±0.66    -773.72±3.15      827.52±4.36       -11.59±0.05     -61.78±2.59
Q493A        -84.68±0.68     -713.28±3.67      758.56±3.63       -9.87±0.07      -48.19±2.64
S494A        -94.98±0.66     -736.93±4.05      793.94±3.84       -11.02±0.07     -49.09±3.26
S494P        -89.39±0.60     -737.54±5.25      789.35±5.74       -10.67±0.08     -47.90±2.59
G496A        -93.17±0.55     -728.93±4.39      784.81±4.93       -10.89±0.07     -48.38±2.67
Q498A        -90.48±0.61     -756.18±4.6       812.84±4.75       -11.02±0.09     -45.4±2.74
T500A        -93.64±0.62     -704.44±4.1       769.65±4.73       -10.86±0.08     -39.27±3.18
N501A        -88.59±0.66     -730.53±3.63      788.41±4.3        -10.75±0.08     -40.36±3.3
G502A        -87.61±0.63     -706.13±4.56      780.08±4.61       -10.51±0.07     -24.31±2.98
Y505A        -91.35±0.7      -746.12±4.31      802.16±5.29       -10.78±0.08     -46.49±2.92

The evolution of the coronavirus from SARS-COV to nCOV-2019 has reshaped the interfacial hydrogen bonds with ACE2. G502 in nCOV-2019 has a persistent H-bond with residue K353 on ACE2. This residue was G488 in SARS-COV, which also makes an H-bond with K353 on ACE2. Q493 in nCOV-2019 makes one H-bond with E35 and another with K31 on ACE2. This residue was N479 in SARS-COV, which makes only one H-bond, with K31 on ACE2. An important mutation from SARS-COV to nCOV-2019 is residue Q498, which was Y484 in SARS-COV. Q498 makes two H-bonds, with residues D38 and K353 on ACE2, whereas Y484 in SARS-COV does not make any H-bonds. Importantly, a salt bridge between K417 and D30 in the nCOV-2019/ACE2 complex contributes -12.34±0.23 kcal/mol to the total binding energy. This residue is V404 in SARS-COV, which is not able to make any salt bridge and does not make an H-bond with ACE2. Gao et al. used a free energy perturbation (FEP) approach and showed that the V404-to-K417 mutation changes the binding free energy of the nCOV-2019 RBD to ACE2 by -2.2±0.9 kcal/mol. A salt bridge between R426 on the RBD and E329 on ACE2 stabilizes the complex in SARS-COV/ACE2. This residue is N439 in nCOV-2019, which is unable to make a salt bridge with ACE2 residue E329.
One of the most frequently observed mutations in nCOV-2019 according to the GISAID database is N439K, which recovers some of the electrostatic interactions with ACE2 at this position. Y436 in SARS-COV and Y449 in nCOV-2019 both make H-bonds with D38 on ACE2. The unchanged T486 in SARS-COV corresponds to T500 in nCOV-2019, both of which make consistent H-bonds with ACE2 residue D355.

Table 4.2: H-bonds and salt bridges between the nCOV-2019 RBD/ACE2 and SARS-COV RBD/ACE2 complexes with their percent occupancies (the K417-D30 and R426-E329 pairs are salt bridges).

#    nCOV-2019   ACE2    % Occupancy    SARS-COV   ACE2    % Occupancy
1    G502        K353    89             Y436       D38     96
2    Q493        E35     83             R426       E329    87
3    N487        Y83     80             T486       D355    83
4    Q498        D38     73             G488       K353    80
5    K417        D30     55             N479       K31     52
6    T500        D355    53             Y440       H34     47
7    Y505        E37     52
8    Q498        K353    49
9    Y449        D38     45
10   G496        K353    37
11   Q493        K31     32

Hydrophobic interactions also play an important role in stabilizing the RBD/ACE2 complex in nCOV-2019. An important interaction between the nCOV-2019 RBD and ACE2 is the π-stacking interaction between F486 (RBD) and Y83 (ACE2). This interaction helps stabilize L3 in nCOV-2019 compared with SARS-COV, where this residue is L472. It was observed by Gao et al. that the mutation of L472 to F486 in nCOV-2019 results in a net change in binding free energy of -1.2±0.2 kcal/mol. Other interfacial residues in the nCOV-2019 RBD that participate in hydrophobic interactions with ACE2 are L455, F456, Y473, A475 and Y489. It is interesting to note that all these residues except Y489 have mutated from SARS-COV. Spinello and co-workers[281] performed long-timescale (1 μs) simulations of nCOV-2019/ACE2 and SARS-COV/ACE2 and found that L3 in nCOV-2019 is more stable due to the presence of the β6 strand and the existence of two H-bonds in L3 (G485-C488 and Q474-G476). Importantly, an amino acid insertion in L3 makes this loop longer than L3 in SARS-COV and enables it to act as a recognition loop and make more persistent H-bonds with ACE2. L455 in the nCOV-2019 RBD is important for hydrophobic interaction with ACE2, and mutation L455A lowers the VDW contribution to the binding affinity by about 5 kcal/mol. The H-bonds between the RBDs of nCOV-2019 and SARS-COV and ACE2 are shown in Figure 4.9A. The structural details discussed here are in agreement with other structural studies of the nCOV-2019 RBD/ACE2 complex.[315, 290]
H-bond analysis was also performed for the mutant systems, and the results are reported for H-bonds with more than 40% occupancy. A few of the alanine substitutions increase the number of interfacial H-bonds between the nCOV-2019 RBD and ACE2. Interestingly, the Ala substitution Y489A increased the number of H-bonds relative to the wild-type complex. Mutations of some residues that form consistent H-bonds in the wild-type complex, such as Q498A and Q493A, surprisingly maintain the number of H-bonds of the wild-type complex. This indicates the plasticity of the H-bond network in the RBM of nCOV-2019, which can reshape and strengthen other H-bonds upon mutation at these locations. However, a few mutations decrease the number of H-bonds relative to the wild-type complex. Alanine substitution at residue G502 has a significant effect on the network of H-bonds between nCOV-2019 and ACE2. This residue is located at the end of the L4 loop near two other important residues, Q498 and T500, and this mutation breaks the H-bonds at these residues. Mutation K417A decreases the number of H-bonds to only 5, where the H-bond at residue Q498 is broken. This indicates the fragility of the H-bond from residue Q498, which can easily be broken upon Ala substitution at other residues.
Furthermore, mutation N487A also decreases the number of H-bonds by breaking the H-bond at Q498.

4.1.4 Discussion and Conclusion

In this work, we performed MD simulations to unveil the detailed molecular mechanism of receptor binding by nCOV-2019 and to compare it with SARS-COV. The roles of key residues at the interface of nCOV-2019 with ACE2 were investigated by computational Ala-scanning. A rigorous 500 ns MD simulation was performed for nCOV-2019, SARS-COV and a few mutants (Y449A, T478I, Y489A and S494P), as well as 300 ns MD simulations for each of the other mutants. These simulations aided our understanding of the dynamic role of the RBD/ACE2 interface residues and allowed estimation of the binding free energies of these variants, shedding light on residues crucial for RBD/ACE2 complex stability. Moreover, numerous mutations have been identified in the RBD of different nCOV-2019 strains from all over the world that are not known to be critical for infection.[100] The effects of these mutations on the stability of the RBD/ACE2 complex were investigated to shed light on their role in the viral infection of coronavirus.
Figure 4.9: A) H-bonds between the RBD of nCOV-2019 and ACE2. B) Mapping of the contributions of interface residues onto the RBD structure of nCOV-2019. The RBD domain is purple and ACE2 is yellow. The RBD in contact with ACE2 is rendered in surface format, with red indicating a favorable contribution to binding (more negative) and blue an unfavorable one (more positive).
Changes in the RBD structure of nCOV-2019, SARS-COV and the mutants relative to their crystal structures were analyzed by RMSD and RMSF. nCOV-2019 showed a stable structure with an RMSD of 1.5 Å, whereas SARS-COV had a larger RMSD value between 3 and 4 Å during the simulation. Most mutants of nCOV-2019 maintained stability similar to the wild type. However, a few nCOV-2019 mutations resulted in larger deviations (>2 Å), i.e., Y489A, F456A, Y505A, N487A, K417A, Y473A and Y449A. We further investigated the structure of the extended loop domain (Figure 4.1B) and found that in nCOV-2019 it is stable, with an RMSD of less than 1 Å, whereas the extended loop in SARS-COV shows an RMSD of about 3 Å during the simulation. Some mutants showed high RMSD in this region. Alanine substitution at residue N487 increased the extended loop RMSD to 2.5 Å. Other mutations that increased the extended loop RMSD (>2 Å) include Y449A, G477A and E484A.
The dynamic behavior of the RBD was further investigated by analyzing the RMSF of all systems. As shown in Figure 4.4, nCOV-2019 shows less fluctuation in L3 than SARS-COV. This is due to the presence of a 4-residue motif (GQTQ) in nCOV-2019 L3, which forces the loop to adopt a compact structure by making two H-bonds (G485-C488 and Q474-G476), thereby reducing the fluctuations in the loop. Residues F486 and N487 play major roles in stabilizing the recognition loop by making π-stacking and H-bond interactions with residue Y83 on ACE2. Alanine substitution at N487 introduced a large RMSF in L1. Mutation of L472 to F486 in SARS-COV was shown to favor binding by -1.2±0.2 kcal/mol using FEP.[317] In addition, this mutation was shown to be among the five mutations that produce a super-affinity ACE2 binder based on the SARS-COV RBD.[309] Alanine mutations at residues Y449, G447 and E484 increased the motion in L3, characterized by a large RMSF in this region.
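The RMSD/RMSF analysis summarized above can be reproduced with standard trajectory-analysis libraries. The sketch below uses mdtraj with placeholder file names and an illustrative residue range; the analysis in this work was carried out with the GROMACS/VMD toolchain, so this is only an equivalent outline under those assumptions.

```python
import numpy as np
import mdtraj as md

# Placeholder file names; the reference PDB is assumed to share the
# same atom ordering as the trajectory topology.
traj = md.load("rbd_ace2.xtc", top="rbd_ace2.pdb")
ref = md.load("rbd_crystal.pdb")

# C-alpha atoms of the RBD (the residue range is illustrative).
ca = traj.topology.select("name CA and resid 0 to 193")

# Remove rotation/translation by fitting to the crystal structure.
traj.superpose(ref, atom_indices=ca)

# Per-frame RMSD with respect to the crystal structure (mdtraj uses nm; x10 -> Angstrom).
rmsd_A = 10.0 * md.rmsd(traj, ref, atom_indices=ca)

# Per-residue RMSF about the average structure of the aligned trajectory.
xyz = traj.xyz[:, ca, :]                      # (n_frames, n_CA, 3) in nm
mean_xyz = xyz.mean(axis=0)
rmsf_A = 10.0 * np.sqrt(((xyz - mean_xyz) ** 2).sum(axis=-1).mean(axis=0))

print(f"mean RMSD = {rmsd_A.mean():.2f} A, max RMSF = {rmsf_A.max():.2f} A")
```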
Using principal component analysis, the approximate free energy landscapes for nCOV-2019 and SARS-COV demonstrated that the former occupies only one low-energy state, whereas the latter forms two distinct low-energy basins separated by a metastable state with a barrier of about 6-7.5 kcal/mol. This confirms that the binding of the RBD domain is weaker in SARS-COV due to the presence of two basins. Similarly, alanine substitution at a few residues caused the free energy landscape to degenerate into multiple separate low-energy regions.
To better characterize the functional motions of the RBD, DCCMs for all systems were constructed and are shown in Figure 4.6. nCOV-2019 showed a large correlation between the β4-L1-β5 and β5-L4-α5 regions. This correlation was stronger in SARS-COV and in a few mutants such as Y449A, G447A and E484A. Another important correlation in nCOV-2019 is within L3 and β6. This correlation is stronger in nCOV-2019 than in SARS-COV due to the presence of β6, which makes the loop adopt correlated motions. A few mutants, such as N487A, impact the correlation in this region. Interestingly, mutant F486A, which is in L3 and participates in binding through a π-stacking interaction with Y83 on ACE2, disrupts the DCCM of wild-type nCOV-2019 and introduces strong correlations in the extended loop region as well as in the core structure of the RBD.
The details of the hydrogen bond and salt bridge patterns between nCOV-2019 or SARS-COV and ACE2 (Table 4.2) are key to the virus attachment to the host. nCOV-2019 residues participate in 10 H-bonds and 1 salt bridge with ACE2, whereas SARS-COV has only 5 H-bonds and 1 salt bridge with ACE2. This contributes significantly to the 30 kcal/mol difference in the total binding free energies of nCOV-2019 and SARS-COV. The binding energies calculated here for nCOV-2019 and SARS-COV (-50.22±1.93 and -18.79±1.53 kcal/mol, respectively) are in good agreement with the binding energies calculated using the Generalized Born (GB) method by Spinello et al.[281] Moreover, the pattern of H-bonds between nCOV-2019 and ACE2 has also been characterized by other groups,[317, 281] in agreement with our work. An important H-bond between nCOV-2019 and ACE2 is between G502 on the RBD and K353 on ACE2. G502 is in the L4 region, which is populated by 5 H-bonds between the RBD and ACE2. The contribution of this residue to the total binding energy is -2.03±0.04 kcal/mol, and the Ala substitution at G502 has the largest effect on the binding energy among all the residues, lowering the total binding affinity to -24.31±2.98 kcal/mol, the weakest among all mutants. This mutation breaks the other H-bonds in L4, such as those from residues Q498 and T500. This residue is preserved and corresponds to residue G488 in SARS-COV, which also makes an H-bond with residue K353 on ACE2. Residue Q493 in nCOV-2019 participates in binding ACE2 by making two H-bonds, with residues E35 and K31 on ACE2. Q493 corresponds to residue N479 in SARS-COV, which makes only one H-bond, with residue K31 on ACE2. This causes Q493 to have a larger contribution to the total binding than its counterpart N479. However, alanine substitution at Q493 did not affect the total binding energy, and this mutant had a total binding energy similar to the wild-type complex, as it maintains the number of H-bonds of the wild-type complex. Residues Q498 and T500 in nCOV-2019 are crucial for binding by making H-bonds with ACE2 residues D38, D355 and K353. Residue Q498 corresponds to residue Y484 in SARS-COV, which does not make any H-bond in the SARS-COV/ACE2 complex.
Q498 contributes -6.72±0.18 kcal/mol to binding, which is more than the contribution of Y484 in SARS-COV (-1.83±0.06 kcal/mol). Ala substitution at Q498 did not show a large impact on the total binding energy. Residue T500 is conserved and corresponds to residue T486, which also makes an H-bond with D355 on ACE2. Mutation of T500 to alanine lowers the binding affinity by about 10 kcal/mol. Residue N487 in nCOV-2019 is located in L3 and plays a crucial role in stabilizing the recognition loop by making an H-bond with Y83 on ACE2. This residue contributes -1.52±0.06 kcal/mol to the total binding energy of nCOV-2019, whereas its corresponding residue in SARS-COV does not show any contribution to the binding energy (-0.02±0.05 kcal/mol). This demonstrates that L3 has evolved from SARS-COV to become an important recognition loop in nCOV-2019, which participates in binding with ACE2. Residue K417 in nCOV-2019 has the largest contribution to the total binding energy (-12.34±0.23 kcal/mol) by making a salt bridge with residue D30 on ACE2. This residue is crucial for the binding of the RBD and ACE2, and alanine substitution lowers the total binding energy to -29.56±2.95 kcal/mol. This salt bridge is also found to be important for the stability of the crystal structure of the RBD/ACE2 complex in nCOV-2019. K417 is V404 in SARS-COV, which does not participate in binding ACE2.
Figure 4.10: Binding energy decomposition for the nCOV-2019, T478I and N439K systems.
Another important residue in nCOV-2019 is L455, which contributes -1.86±0.03 kcal/mol to binding. This residue is important for hydrophobic interaction with ACE2, and mutating it to alanine lowers the total binding affinity by about 17 kcal/mol. The hydrophobic residue F456 in nCOV-2019 also has a favorable contribution to the binding energy, and F456A lowers the binding affinity by 5 kcal/mol. These results are in fair agreement with experimental binding measurements from deep mutational scanning of the RBD of nCOV-2019, where flow cytometry at different ACE2 concentrations was used to measure the dissociation constant KD.[282] It was shown that mutations at K417, N487, T500 and G502 are detrimental for binding to ACE2, which agrees with the results here. These experiments also showed that mutations at Q493 and Q498 do not impact the binding affinity of the RBD to ACE2, which demonstrates the high plasticity of the network of H-bonds at the interface, where upon mutation at these residues the network can reshape to form new H-bonds. Mutations at the hydrophobic residues L455 and F456 were shown to reduce the binding affinity in these experiments.
Total binding energy calculations for all the variants showed that mutation Y489A has the highest binding affinity among all systems, about 11 kcal/mol stronger than that of the nCOV-2019 complex. This residue is located in β6, which is part of the recognition region of the RBD for binding to ACE2. Removal of this bulky hydrophobic residue at the interface with ACE2 caused the extended loop to move closer to the ACE2 interface and make more H-bonds with ACE2. A high electrostatic interaction energy is the reason for the stronger binding energy of mutant Y489A compared with the wild-type complex. It is interesting to note that among the five residues L455, F456, Y473, A475 and Y489 that make hydrophobic interactions with ACE2, Y489 is the only residue that is conserved from SARS-COV. However, the experimental binding affinity measurements using deep mutational scanning showed that mutations at this position lower the binding affinity to ACE2.
Other alanine substitutions that increase the binding affinity are G446A and G447A. Residues G446 and G447 reside in L1, and mutation to alanine can make L1 adopt a more rigid form. However, experiments showed that these mutations have binding affinities to ACE2 similar to or lower than the wild-type RBD, and care must be taken when interpreting these results. This discrepancy could be due to force field inaccuracy and deficiencies in the PBSA treatment of the solvent in the binding energy calculation. Further studies are needed to investigate whether these mutations increase the binding affinity to ACE2. Deep mutational scanning using flow cytometry is a qualitative method for measuring the impact of a large number of mutations on protein-protein interactions, and further experiments such as surface plasmon resonance (SPR) or isothermal titration calorimetry (ITC), which are conventional methods for measuring binding affinities, are needed to study the effect of these mutations in detail.
Important mutations found in naturally occurring nCOV-2019 appear to influence the binding to ACE2 to some extent. Mutation T478I, which is one of the most frequent mutations according to the GISAID database, increases the binding affinity of nCOV-2019 to ACE2 by about 6 kcal/mol. Mutation N439K has the highest occurrence among all strains of coronavirus in the GISAID database and demonstrated the highest electrostatic interaction among all studied systems. This residue corresponds to R426 in SARS-COV, which makes a salt bridge interaction with E329 on ACE2. Mutation N439K recovers some of this ACE2 interaction; however, it exhibits a binding affinity similar to that of the wild-type RBD. The contributions of important interface residues to the binding affinity were compared for mutations T478I, N439K and wild-type nCOV-2019 (Figure 4.10). The most striking differences between the wild-type RBD and mutation T478I are at residues Y449 and Q498, which have significantly larger contributions to binding than in the wild type. Most other residues at the interface have contributions similar to those in nCOV-2019. A higher H-bond persistence is also seen for these two residues, Y449 and Q498, compared with the wild-type RBD, which is the reason for their larger contribution to the total binding energy. Mutation N439K has a slightly lower binding affinity to ACE2 than the wild-type RBD. Per-residue binding energy decomposition showed that K439 in this system has a favorable contribution of -1.80±0.15 kcal/mol to the total binding energy, which is balanced by a lower contribution of K417, resulting in a binding affinity similar to the wild-type RBD. Mutant E484A, which is also one of the observed mutations in the GISAID database, demonstrates a high electrostatic interaction with ACE2. E484 contributes 3.56±0.15 kcal/mol to binding, whereas the corresponding residue in SARS-COV, P469, contributes -0.27±0.01 kcal/mol to the binding of SARS-COV to ACE2. This residue is close to D30 on ACE2 and has electrostatic repulsion with this residue. Most natural mutants, including N439K, A475V, G476S, V483A, V483F, E484A and S494P, showed binding affinities to ACE2 similar to or slightly lower than the wild-type complex, which agrees with experimental binding measurements.[282] However, the experimental binding affinity for T478I also showed a binding affinity similar to the wild-type complex.
This difference could be due to the use of the MMPBSA approach for the calculation of the polar solvation term, and further studies are needed to examine the effect of this mutation on the viral infectivity of coronavirus.
Additional sequence differences between nCOV-2019 and SARS-COV influence RBD/ACE2 binding. Residue D480 in SARS-COV contributes unfavorably to the total binding energy (6.25±0.14 kcal/mol), and mutating this residue to S494 in nCOV-2019 lowers this unfavorable contribution to 1.17±0.06 kcal/mol. D480 in SARS-COV is located in a region of high negative charge formed by residues E35, E37 and D38 on ACE2. Electrostatic repulsion between D480 on SARS-COV and these acidic residues on ACE2 is the reason for the highly unfavorable contribution of this residue to the binding of SARS-COV to ACE2. Mutation to S494 at this location removes this unfavorable contribution. Gao and coworkers[315] computed the relative binding free energies due to mutations from the RBD-ACE2 of SARS-COV to the corresponding residues in nCOV-2019. They used a free energy perturbation approach and showed that the mutation D480S in SARS-COV changed the binding free energy by -1.9±0.8 kcal/mol, which is consistent with our study. Furthermore, we performed an additional simulation on the D480A mutant in SARS-COV and found that this mutation has a binding affinity of -23.46±3.07 kcal/mol, which is about 5 kcal/mol stronger than the wild-type SARS-COV RBD. In addition, experimental binding affinity measurements showed that mutations of S494 to an acidic residue greatly reduce the binding affinity to ACE2, which supports this hypothesis.
Previous computational studies have found that nCOV-2019 binds to ACE2 with a total binding affinity about 30 kcal/mol stronger than SARS-COV, in fair agreement with the results here. The critical roles of interface residues were computationally investigated here and in other articles, and the results of all these studies indicate the importance of these residues for the stability of the complex and identify hotspot residues for the interaction with the receptor ACE2.[317, 281, 35] It is interesting to note the role of L3 in the stability of the RBD/ACE2 complex. The amino acid insertions in L3 in nCOV-2019 have converted an unessential part of the RBD in SARS-COV into a functional domain of the RBD. This loop participates in binding ACE2 by making H-bonds as well as a π-stacking interaction with ACE2, which makes this region act as a recognition loop. Previous studies on SARS-COV have shown that there is a correlation between higher binding affinity to the receptor and a higher infection rate by coronavirus.[309, 178] The higher binding affinity of nCOV-2019 for ACE2 compared with SARS-COV is suggested to be the reason for its higher infection rate. Most natural mutations showed binding affinities similar to the wild type, which indicates that the virus was already effective at binding ACE2 at the beginning of the crisis. A few mutations, such as Y489A and T478I, are shown to increase the binding affinity to ACE2; however, more studies are needed to investigate the effect of these mutations in detail. Mutations of the nCOV-2019 RBD that do not change the binding affinity and complex stability could have implications for antibody design, since they could act as antibody escape mutants. Escape from monoclonal antibodies was observed for mutations of SARS-COV in 2002, and such mutations should be considered in any antibody design endeavors.
175 4.2 Exploring dynamics and network analysis of spike glycoprotein of SARS-COV-2 2The ongoing pandemic caused by coronavirus SARS-COV-2 continues to rage with devastating consequences on human health and global economy. The spike glycoprotein on the surface of coronavirus mediates its entry into host cells and is the target of all cur- rent antibody design efforts to neutralize the virus. The glycan shield of the spike helps the virus to evade the human immune response by providing a thick sugar-coated barrier against any antibody. To study the dynamic motion of glycans in the spike protein, we performed microsecond-long MD simulation in two different states that correspond to the receptor binding domain in open or closed conformations. Analysis of this microsecond- long simulation revealed a scissoring motion on the N-terminal domain of neighboring monomers in the spike trimer. Role of multiple glycans in shielding of spike protein in different regions were uncovered by a network analysis, where the high betweenness cen- trality of glycans at the apex revealed their importance and function in the glycan shield. Microdomains of glycans were identified featuring a high degree of intra-communication in these microdomains. An antibody overlap analysis revealed the glycan microdomains as well as individual glycans that inhibit access to the antibody epitopes on the spike protein. Overall, the results of this study provide detailed understanding of the spike glycan shield, which may be utilized for therapeutic efforts against this crisis. 4.2.1 Introduction Severe acute respiratory syndrome coronavirus 2 (SARS-COV-2) has rapidly spread worldwide since early 2020 and has been considered one of the most challenging global 2Taken from published paper: Ghorbani, M., Brooks, B. R., Klauda, J. B. (2021). Exploring dynamics and network analysis of spike glycoprotein of SARS-COV-2. Biophysical Journal, 120(14), 2902-2913. 176 health crises within the century. SARS-COV-2 has caused more than 100 million cases and more than 2 million deaths worldwide as of February 2021.[353] Drug and vaccine development are underway and multiple vaccines have entered clinical trials and some of them are in the last stages of development.[82, 184, 224] SARS-COV-2 is a lipid-enveloped single stranded RNA virus belonging to the beta- coronavirus family, which also includes MERS, SARS and bat related coronaviruses.[315, 309, 45, 68] A major characteristic of all coronaviruses is the spike protein (S), which pro- trudes outward from the viral membrane and plays a key role in the entry of the pathogen into the host cells by binding to human angiotensin converting enzyme-2 (h-ACE2).[315, 329, 337] The structure of each monomer of the trimeric spike protein in SARS-COV- 2 (Figure 4.11) can be divided into two subunits (S1 and S2), which can be cleaved at residues 685-686 (furin cleavage site) by TMPRSS protease after binding to host cell re- ceptor h-ACE2.[73] S1 subunit in S trimer includes a N-terminal domain (NTD) and a receptor binding domain (RBD) that is responsible for binding to h-ACE2.[307] S2 subunit contains fusion peptides (FP), heptad repeats (HR1 and HR2), a transmembrane (TM) and a cytoplasmic domain (CP). 
The homotrimeric spike protein is highly glycosylated with 22 predicted N-linked glycosylation and 4 O glycosylation sites per monomer, most of which are confirmed by Cryo-EM studies.[328, 271, 350] Glycosylation of proteins plays a cru- cial role in numerous biological process such as protein folding and evasion of immune response.[242] The spike protein in SARS-COV-2 is highly glycosylated with 22 N-linked glycosyla- tion sites and 4 O-linked glycosylation sites.[322, 323] For example, the HIV-1 envelope glycoprotein (Env) features about 93 N-linked glycosylation sites with mostly high man- nose glycans (Man5-9), which covers most of the surface of the spike protein in HIV-1 and comprises over half its mass.[284, 355, 66] N-linked glycosylation starts with synthesis of precursor oligosaccharides, which are modified to high mannose forms by glucosidases 177 Figure 4.11: Structure of spike protein and its glycosylation pattern A) Different regions of spike protein including N-terminal domain (NTD), receptor binding domain (RBD), Furin cleavage site for cleaving between S1 and S2 subdomains (FS), Fusion peptides (FP), Heptad repeat (HR2), transmembrane (TM) and cytoplasmic (CP) regions. The spike protein is divided into a head and a stalk region B) Glycans on the spike protein color-coded based on their types C) Sequence of full-length spike protein with domain assignments. and then trimmed to complex forms in the Golgi by glucosyltransferases for signaling and other glycobiological functions.[164] A higher degree of processing is usually indicative of exposure or accessibility of glycans to enzymes.[284] Dense crowding glycan regions limit the activity of processing enzymes at these locations. In this study, we report the microsecond long MD simulation of all-atom solvated fully glycosylated spike protein embedded in a viral membrane model in both RBD-up or open state (PDB:6VSB)[329] and RBD-down (PDB:6VXX)[307] or closed state. The struc- ture of the two states for glycosylated S protein in viral membrane were taken from the 178 CHARMM-GUI website for this study.[328] Details of modeling different regions were presented in detail by Im and coworkers.[328] Structural changes in the spike protein hap- pens on the order of microsecond timescale, where a scissoring motion is observed between the NTDs in the RBD-up conformation. We have used network analysis and centrality measures in graph theory to pinpoint the structural features of the glycan shield in the con- text of glycan-glycan interactions as well as binding of antibodies to the spike protein. A modularity algorithm helped us to find glycan microdomains featuring high glycan-glycan interactions with breaches between microdomains for antibodies to bind and neutralize the virus. 4.2.2 Methods Molecular dynamcis simulations Structures for glycosylated spike protein in the viral membrane with RBD-up and RBD- down states were taken from the CHARMM-GUI website. Im et. al.[328] provided 8 dif- ferent structures of spike protein in open and closed conformation where two models were built for heptad repeat linker (HR2), two models for the HR2-TM domain and two mod- els for the cytoplasmic (CP) region. We used model 1 2 1 in which the missing loops for RBD were built by template-based modeling and N-terminal loop was constructed based on electron density map. 
Ab initio monomer structure prediction and ab initio trimer docking were used for the HR2 linker domain using PDB:5SZS[308] from SARS-COV-1 where two models were built (model 1 used here). Two models were constructed for HR2-TM junc- tion using PDB:5JYN[81] as template, a model using a template structure (model 2 used here) and another model with more structural differences. Finally, the Cys-rich CP domain was constructed using PDB:5L5K[155] as a template. Moreover, the palmitoylation sites for this model are at residues C1236 and C1241.CHARMM36 forcefield[23, 153, 116] was 179 used for protein, lipids and carbohydrates in this study. The glycan composition for each site in the spike protein represents the most abundant based on experimental mass spec- troscopy data. The selected glycan sequences include 22 N-linked and 1 O-linked for each monomer of trimeric spike.[321, 271] GROMACS[301] software was used for molecular dynamics (MD) simulation. Energy minimization was performed in 5000 steps using the steepest descent algorithm. A LINCS algorithm in all steps constrained the bonds con- taining hydrogen atoms. Equilibration was performed with a standard 6-step equilibration scripts from CHARMM-GUI with restraints on protein, lipid and glycan atoms.[143] For the first three steps of equilibration, each step included 125 ps using Berendsen thermostat at temperature 310.15 K and a coupling constant of 1 ps. In the last 4 equilibration steps a Berendsen barostat was used to maintain the pressure at 1 bar. For the production step, all restraints were removed, and the system was simulated under a NPT ensemble using the Parrinello-Rahman barostat[226] with a compressibility of 4.510?5 bar?1 and a coupling constant of 5 ps. The temperature was maintained at 310 K using a Nose?-Hoover ther- mostat with a temperature coupling constant of 1 ps.[96] The production run lasted 1s for each system (RBD-up and RBD-down) with a 2 fs timestep and the particle-mesh Ewald (PME)[69] for long range electrostatic interactions using GROMACS 2018.3 package.[2] Solvent accessible surface area (SASA) SASA was calculated using VMD[135] with a probe radius of 7.2 A? which represents the hypervariable region of antibodies.[41] Multiple regions were chosen for SASA calcu- lation: RBM (residues 440 to 508). RBD (residues of RBD away from RBM 330-440 and 509-520) and NTD (residues 13 to 310) 180 Network analysis In the glycan network, each glycan is represented with a node (69 nodes for 69 gly- cans). To assign edges between nodes in the graph we first calculated the distance between heavy atoms of different glycans in the starting structure and if the distance between two glycan heavy atoms is less than 50 A?, an edge is assigned between the two nodes. Next to incorporate simulation data and dynamics of glycans into the network, we calculated the absolute value of average non-bonded interaction energy between every two glycans and normalized the values to be between 0 and 1 and assigned the as the edge weights. To sim- plify the network, we removed the edges with weights less than 0.05. This further reduced the number of edges form the starting graph. These normalized interaction energies repre- sent the adjacency matrix for the graph from which different centrality measurements can be made. Two different centrality measurements were used to analyze the network of gly- cans. 
Betweenness centrality quantifies the number of times a node acts as a bridge along the shortest path connecting every two nodes in the graph and is an important indicator of the influence of the node within the network. Eigenvector centrality measures the node's importance in the network by considering the importance of the neighbors of that node. If the node is connected to many other nodes that are themselves well connected, that node is assigned a high eigenvector centrality score. For a graph with adjacency matrix A, the relative centrality score of node i is defined as:

x_i = \frac{1}{\lambda} \sum_{j \in M(i)} x_j \quad (4.6)

The sum is over all j such that nodes i and j are connected. To write this in matrix form, let x be the vector containing the centrality scores and A be the adjacency matrix. Then we can write:

A x = \lambda x \quad (4.7)

With the constraint that all components of the eigenvector x be positive, there is only one eigenvalue that satisfies the equation, and therefore a unique centrality score is assigned to each node in the graph. A modularity algorithm in Gephi was used to find glycan microdomains that have a high number of edges. All network analysis was performed in networkx and graph visualizations were done using Gephi.[25, 117, 12]

Glycan-antibody overlap analysis

Antibodies that bind and neutralize the spike protein are divided into three categories:[185] antibodies that bind to the exposed part of the RBD (RBM-binders), antibodies that bind to epitopes away from the RBM in the RBD (RBD-binders), and antibodies that bind to epitopes on the NTD (NTD-binders). Three different antibodies were used in this study: B38 as RBM-binder (PDB:7BZ5)[334], S309 as RBD-binder (PDB:6WPT for the open and PDB:6WPS for the closed state)[233] and the 4A8 antibody as NTD-binder (PDB:7C2L).[55]

4.2.3 Results

Dynamical motions of the spike protein

The root-mean-square deviation (RMSD) for each region of the spike protein in the RBD-up and RBD-down states is represented in Figure 4.12. The stalk region in both RBD-up and RBD-down states shows a fluctuating RMSD, which is due to a bending motion in this region of the spike protein. The tilting of the spike head is also observed in high-resolution cryo-ET images as well as other recent MD studies of the glycosylated spike in a viral membrane.[275, 298] The bending dynamics in the stalk region is suggested to assist the virus in scanning the cell surface for receptor proteins more efficiently.[298] The angle distributions for tilting of the head and stalk domains were calculated from the total 2 μs of simulation. The head region had an angle distribution of 18±10° and the stalk region angle distribution was 34.5±6°. The head of the spike protein was shown to bend up to 90° toward the membrane; however, observing larger tilting angles requires more sampling of the spike protein in the viral membrane.[340] A snapshot of the open system at 1 μs is represented in Figure 4.13, which shows the incline between the head and the stalk domains of the spike. Glycan root-mean-square fluctuations (RMSF) were calculated for the heavy atoms in each glycan, followed by computing the mean and standard deviation for each glycan (Figure 4.14). Consistently, in all chains, a few of the NTD glycans such as N74 show high RMSF, which is due to the high solvent exposure of this glycan in the NTD compared to other regions. The glycans near the RBD in chain A (RBD-up), such as N234 and T323, showed less fluctuation than in other chains. N234 and T323 are sandwiched between the RBD and NTD of neighboring monomers in the trimeric spike protein.
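To make the glycan-network construction and centrality measures described in the Methods above concrete, here is a minimal, hedged Python sketch using networkx. The interaction-energy and distance matrices are hypothetical placeholders, and the convention of converting weights to inverse "lengths" for betweenness is an assumption rather than the exact pipeline used in this work.

```python
import numpy as np
import networkx as nx

# Hypothetical inputs: 69 glycan labels, a symmetric matrix of absolute average
# non-bonded interaction energies, and minimum heavy-atom distances (Angstrom)
# between glycan pairs in the starting structure.
glycans = [f"G{i}" for i in range(69)]
energies = np.abs(np.load("glycan_energies.npy"))   # shape (69, 69), placeholder
min_dist = np.load("glycan_min_dist.npy")           # shape (69, 69), placeholder

weights = energies / energies.max()                 # normalize edge weights to [0, 1]

G = nx.Graph()
G.add_nodes_from(glycans)
for i in range(len(glycans)):
    for j in range(i + 1, len(glycans)):
        # Edge only if glycans are within 50 A in the starting structure
        # and the normalized interaction survives the 0.05 pruning threshold.
        if min_dist[i, j] < 50.0 and weights[i, j] >= 0.05:
            G.add_edge(glycans[i], glycans[j], weight=float(weights[i, j]))

# Eigenvector centrality treats the weight directly as connection strength.
eig = nx.eigenvector_centrality(G, weight="weight", max_iter=1000)

# Betweenness centrality interprets the weight as a distance, so strongly
# interacting pairs are given a short "length" (one common convention).
for u, v, d in G.edges(data=True):
    d["length"] = 1.0 / d["weight"]
btw = nx.betweenness_centrality(G, weight="length")

print(sorted(btw, key=btw.get, reverse=True)[:5])   # most "bridge-like" glycans
```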
Glycans in the stalk region showed high RMSF values demon- strating the effective shielding of the spike in this region for both RBD-up and RBD-down states. A principal component analysis (PCA) was performed on the head region (residues 1- 1140) of both RBD-up and RBD-down states to extract the fundamental motions of the trimeric protein. The first two PCs are visualized in 2D plot (Figure 4.15A and B). Both RBD-down and -up states feature similar behaviors in their PCA plot. In RBD up, PC-1 captures 56% of conformational motion whereas in RBD-down, PC-1 captures only 44% of all conformational motion. This shows the higher conformational change in RBD-up state compared to RBD-down state. The first eigenvector was used to construct the porcupine plots to visualize the most dominant motions in RBD up (Figure 4.16A) and RBD-down (Figure 4.16C) states. In the RBD-up state, a scissoring motion is observed between the 183 Figure 4.12: RMSD of different regions of the spike protein for A)open and B)closed conformation Figure 4.13: A) snapshot of spike in open state after 1000ns. Different monomers of the spike trimer are color coded with monomer A(up) in red, B in blue and C in purple B) RMSF of glycans in the open state of spike for different chains A to C from top to bottom. 184 Figure 4.14: RMSF of glycans in the open and closed state of spike ofr different chains A to C from top to bottom NTD of chain A and NTD of chain B. Based on the PCA for RBD-up state, the total simulation time was separated into three clusters; cluster-1:0-200ns, cluster-2: 200-600ns and cluster-3: 600-1000ns. The distribution of distance between center of mass of NTD of different chains are calculated for the three different clusters and shown in Figure 4.15B. In the open state, NTD of chain B goes toward the center of apex. The distribution of distance between the center of apex and each NTD is shown in Figure 4.16C. In the RBD-down state, PC-1 shows the motion in RBD of chain A toward the open conformation (Figure 4.16B). The simulation for RBD-down states was separated into two clusters and distribution of the distance between center of mass of RBD and the apex center was calculated for all the chains in both clusters. A clear separation is observed where RBD of chain A is more distant from the apex center in the second cluster (Figure 4.15D). 185 Figure 4.15: A) 2-dimensional PCA for the open state of spike protein head B) distribution of dis- tance between the center of masses of NTDs on different monomers in the spike trimer for clusters of simulation data 0-200ns, 200-600ns and 600-1000ns C) 2D PCA for the closed spike head D) distribution of distance between the center of mass of RBD of different monomer and the center of apex for 0-500ns and 500-1000ns of simulation 186 Figure 4.16: A) porcupine plot of first principal component (PC1) for RBD-up state B) distribution of distance between center of mass of NTD of each monomer in open state from the center of apex. From top to bottom corresponds to 0-200 ns, 200-600ns and 600-1000ns of trajectory C) porcupine plot of (PC1) for the closed state. 187 Occupancy of spike protein by glycans Despite the highly dense glycan shield in the spike protein, there are breaches within the shield that antibodies can bind and neutralize the virus.[41] A volume map of glycans in the spike protein in both up and down conformations are shown in Figure 4.17, where isosurfaces were visualized for glycans from the 1 ?s MD simulation trajectory. 
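As an illustration of the trajectory PCA described above (Figures 4.15 and 4.16), the following minimal sketch projects aligned Cα coordinates of the spike head onto the first two principal components. The file names, residue selection syntax, and the use of mdtraj plus scikit-learn are assumptions for illustration only, not the analysis scripts used in this work.

```python
import mdtraj as md
from sklearn.decomposition import PCA

# Load a trajectory of the spike head and keep only C-alpha atoms
# (head region, residues 1-1140 in the text); file names are placeholders.
traj = md.load("spike_head.xtc", top="spike_head.pdb")
ca = traj.topology.select("name CA and resid 0 to 1139")
traj = traj.atom_slice(ca)

# Remove overall rotation/translation by superposing on the first frame,
# then flatten coordinates into a (n_frames, 3*n_atoms) matrix.
traj.superpose(traj, frame=0)
X = traj.xyz.reshape(traj.n_frames, -1)

pca = PCA(n_components=2)
proj = pca.fit_transform(X)

# Fraction of conformational variance captured by PC-1 and PC-2
# (compare with the ~56% / ~44% PC-1 values quoted for the up/down states).
print("explained variance ratio:", pca.explained_variance_ratio_)
print("projection shape:", proj.shape)   # one (PC1, PC2) point per frame
```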
The stalk region in the spike protein is highly shielded by the glycans and it is unlikely for antibodies to bind to this region. The spike head on the other hand, shows glycan holes providing opportunities for antibodies to bind. Importantly, NTD of all chains show epitopes free from glycans. The RBM of chain A in the up conformation is completely free from glycans presents the least shielded domain of spike. Furthermore, there are regions on the RBD away from the RBM, that show epitopes for antibodies and not shielded by glycans. We have computed the solvent accessible surface area (SASA) of different regions of the spike for antibodies with a probe radius of 7.2 A? that represents the hypervariable domains of antibodies.[41] RBM of chain A in up conformation shows the highest SASA in all chains followed by RBM of chains B and C. On the other hand, RBD of chain A shows the lowest SASA among the three monomers. NTD of all chains show high SASA in all chains. In the closed state, RBM and NTD of chain A shows higher SASA than other chains which is due its conformational change toward the open state. In summary, epitopes on RBM, RBD and NTD of spike protein show high SASA for antibodies to bind and neutralize the spike. These findings are consistent with observations from Amaro et al. In their study RBM of the open conformation showed higher SASA than closed using different probe sizes. Moreover, using a probe radius of 7.2 A? they showed that RBD has a lower SASA than RBM and NTD which is consistent with the findings in our study. 188 Figure 4.17: A) Glycan occupacy (grey surface) of different regions in the spike head for RBD up state. Monomer A (up) shown as red, monomer B as blue and C as cyan. Glycan-Glycan interaction and network analysis of glycans A network analysis was carried out on the glycan shield of the spike protein in both RBD-up and RBD-down states to find the glycans that are most important for an effec- tive shield. This approach has recently been applied to the glycan shield of HIV-1 spike protein.[170, 44, 20] In the glycan network, each glycan represents a node in the graph and two nodes are connected by an edge if the glycans have a distance less than 50 A? in the starting structure. Edges are weighted by the normalized absolute value of non-bonded interaction energy (vdw+electrostatic) between each pair of glycans. A network was built for the whole simulation time in both RBD-up and -down states. All network analysis was performed in networkx and graph visualizations were done using Gephi. The adjacency matrix for these two networks are calculated. Two centrality measurements in graph the- ory were utilized to analyze the network of glycans. Betweenness centrality quantifies the 189 Figure 4.18: Centrality measurements (eigenvector and betweenness) for A) RBD-up and B) RBD- down states. number of times a node acts as a bridge along the shortest path connecting every two nodes in the graph and is an important indicator of the influence of the node within the network. Eigenvector centrality measures the node?s importance in the network by considering the importance of the neighbors of that node. If the node is connected to many other nodes that are themselves well connected, that node is assigned a high eigenvector centrality score (Figure 4.18). Glycans in the stalk region show high eigenvector centrality in both RBD-up and down states. 
This means that glycan-glycan interactions in the stalk region is strong and glycans in this region are well connected which result in effective shielding of the stalk against antibodies. In contrast, connections in the spike head and specially at the apex are sparser as the eigenvector centralities are small for this region. Two apex glycans (N234A and N165B) near RBD of chain A (up) in the open state, show a high betweenness centrality (BC). These two glycans show a low BC in the closed state. Glycan N616B in RBD-up and N603B in RBD-down show highest BC in their corresponding network which shows the great impact of this glycan in the proper shielding of the spike protein. Glycans N603 and 190 N616 connect lower head with upper head glycans and are highly central in the network. Glycans in head region are well separated from glycans that shield the stalk of spike protein. Consequently, we performed network analysis of glycans in the head region of spike protein for the total simulation time. Centrality measurements for open and closed conformations of spike head are shown in Figures 4.19A and B. A modularity maximiza- tion algorithm is used, which resulted in identifying 5 different glycan microdomains for RBD-up (Figure 4.20) and 4 glycan microdomains in RBD-down states (Figure 4.20B). Microdomains feature a high glycan-glycan interaction among them and lower number of edges between different microdomains.[170] The higher number of microdomains in RBD- up state shows that the spike protein is more vulnerable when RBD is in up conformation. Glycans in the lower head all belong to the same microdomain (Cyan-I) in both RBD- up and RBD-down states. This demonstrates the effective shielding of the lower head by glycans regardless of the RBD conformation as all these glycans belong to the same mi- crodomain. Glycan microdomains in RBD-up and -down conformations were mapped onto the spike head to visualize these clusters on the protein (figure 4.20). Overall connectiv- ity is lower near the RBD as this region is divided into three different microdomains in RBD-down and four microdomains in RBD-up state. In RBD-down state, most glycans have similar BC and only one glycan N603B shows a relatively higher BC in the network. This is a high mannose glycan, which connects the upper head with the lower head region in the glycan network and is crucial for effective shielding. In both RBD-up and -down states, glycans at the lower head also showed high eigenvector centrality, indicating the ef- fective shield of this region. When RBD is in up conformation, glycans near RBD of chain A (up) can interact with glycans from NTD of chain B. As a result, when RBD is open, glycans N234A, T323A, N331A and N343A all belong to the microdomain that comprises glycans of chain B (shown as Green). This leads to encompassing the RBD of chain (A) in RBD-up state by the same microdomain (Green-III), which enhances the shielding of RBD 191 Figure 4.19: A) network centralities for the head region in open state of spike and the total simula- tion time B) network centralities for the head in closed state and total simulation time. C) change in betweenness centrality of open state between first (0-200 ns) and second (200-600 ns) BC(1-2) and second and third (600 ? 1000 ns) BC(2-3). D) change in betweenness centrality of closed state from 0-500 ns to 500-1000 ns. 192 region away from the RBM. This is also demonstrated by the lower SASA of the region of RBD of chain A away from the binding interface with ACE2 (Figure 4.17B). 
However, in RBD-down state, the mentioned glycans of chain A are distant from glycans of chain B and therefore they belong to the microdomain that includes other glycans from chain A (shown as Orange-II). Furthermore, glycans N234A and N165B show high BC in RBD-up state, which is due to their interaction with other glycans from the space left open from RBD of chain A. Glycan N616, which is a fucosylated complex glycan, also shows a high BC in RBD-up state. Interestingly, most of the glycans in the lower head region are oligo- mannose, whereas the glycans at the upper head region (apex) are mostly complex glycans. Complex glycans have a higher degree of processing by glycan processing enzymes. This correlates with the higher number of microdomains in upper head region where connec- tions between glycans are sparser than the lower head region. Glycan sparsity was shown before to correlate with the degree of processing. Dynamic motions of the spike protein can affect the patterns of glycan-glycan interac- tions by bringing glycans of different domains in closer proximity. To study whether the conformational motions identified by PCA are coupled with any changes in centrality of glycans, we performed network analysis on the clusters of simulation found from PCA ( three cluster for open and two for closed conformation). BCs were calculated for glycans in these clusters and changes in betweenness centrality ?BC were measured between these networks of glycans built on different clusters (Figure 4.19C and D). For the open state simulation, the scissoring motion in NTD is coupled with increasing the BC of N234A, N603A and N165B. As the NTD of chains A and B come closer together glycans in these two chains make stronger connections especially near RBD of chain A where due to the open state, glycan N234A, which is inserted in the space left open by RBD in open con- formation, can freely interact with glycans of chain B. N603 is a high mannose glycan in the middle regions of spike and the NTD scissoring motion grants a higher BC for this 193 Figure 4.20: A) microdomains in the open state of spike head with each microdomain color-coded. Glycans are connected through and the thickness of the edge shows the edge weight. B) mi- crodomains in the closed state of spike head C) microdomains color coded on the spike protein in open state D) Microdomains in the closed state of spike head 194 glycan in RBD-up state. In the RBD-down state, the motion in RBD of chain A to the up conformation increases the BC of glycans such as T323A and N657A and does not affect the BC of most other glycans. Glycan T323A is located at the tail of RBD and N657A is in close proximity of RBD of chain A in the middle head region. The conformational change in RBD brings these glycans closer resulting in a more compact network of glycans in the middle head region and higher BC of these glycans and other neighboring glycans in chain A (Figure 4.19D). Antibody overlap analysis Neutralizing antibodies look for breaches in the glycan shield, where the glycan densi- ties are sparse.[170] Within a microdomain, glycans are highly connected by a high number of edges in the network and most antibodies bind the regions between these microdomains as comparatively sparse edges connect different microdomains.[284] Therefore, these mi- crodomains help identify susceptible regions of spike protein for immunological studies. 
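The microdomain identification described above relies on modularity maximization (performed in Gephi in this work). A roughly equivalent, hedged sketch using the Louvain implementation in networkx is given below; the weighted glycan graph is assumed to have been built as in the Methods, the edge-list file is a placeholder, and the resulting partition is only expected to be comparable, not identical, to the Gephi result.

```python
import networkx as nx

# Weighted glycan graph of the spike head, with normalized interaction
# energies stored in the 'weight' edge attribute (placeholder file; see the
# network construction sketch earlier in this chapter).
G = nx.read_weighted_edgelist("spike_head_glycans.edgelist")

# Louvain community detection maximizes modularity, analogous to the
# modularity algorithm used in Gephi (requires networkx >= 2.8).
communities = nx.community.louvain_communities(G, weight="weight", seed=0)

for k, domain in enumerate(communities, start=1):
    print(f"microdomain {k}: {sorted(domain)}")

# Modularity score of the partition, useful when comparing the RBD-up and
# RBD-down networks.
Q = nx.community.modularity(G, communities, weight="weight")
print("modularity Q =", round(Q, 3))
```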
Antibodies for spike protein were divided into three categories: RBM-binder, RBD-binder (region of RBD away from RBM) and NTD-binder. The antibodies chosen for each cate- gory are presented in the methods section. To investigate the relation between binding of antibodies to known epitopes of the spike protein and the identified glycan microdomains, we utilized an antibody overlap analysis.[284] Antibodies were first overlaid with the spike protein by fitting to their corresponding region in the spike protein. Next, we calculated the average number of clashes between each overlaid antibody and the glycan heavy atoms in each microdomain with a cutoff distance of 5 A? during 1 s simulation. Results of this analysis are shown in Figure 4.21A for RBD-up and 4.21B for RBD-down states. The RBM-binder antibody had the lowest number of clashes among all antibodies with microdomains. RBD-binder antibody of chain A had the highest number of clashes among 195 Figure 4.21: A) Antibody overlap analysis for the open state. Microdomains are representated by their group and their color in figure 6 with Yellow-V (Y), Purple-IV (P), Green-III (G), Orange-II (O) and Cyan-I (C). The character after each region in the x-axis specifies the chain on which the antibody was overlaid, and the number of clashes was calculated B) Antibody overlap analysis for closed state. Since all the RBD?s are in closed conformation we didn?t calculate clashes with RBM- A C) overlaid antibodies with spike protein. Top left RBD binder antibody (pink) with closed state spike. Top right NTD binder antibody (silver) with closed spike. Lower left RBD binder antibody (pink) with open state of spike and lower right RBM binder antibody (blue) with spike protein. all chains with glycans in microdomain III(G). RBD-binder antibodies of chains B and C have lower number of clashes with glycan microdomains than chain A, where RBD is in the up state. Antibody binding to NTD of chain C (NTD-C) in open state shows a high number of clashes with microdomain IV(P). Similarly, in the closed state, NTD-binder antibody in chain C (NTD-C) also shows a high number of clashes with cluster IV(P). NTD of chain C also showed a lower SASA than NTD of other chains. In the open state, microdomain III(G) comprises the high BC glycans N234A, N165B and T323B. The high BC of glycans in this microdomain correlates with its high number of clashes with antibodies of RBD-A and NTD-A. In RBD-down state, RBD-binder antibodies seem to have similar number of clashes with different glycan microdomains. To identify glycans that have most effect on antibody binding, we also quantified the number of clashes of antibodies with each glycan in different chains and averaged over the simulation time of 1?s (Figure 4.22). RBM 196 antibody has only a low number of clashes with N165A glycan. In open state, RBD-A antibody has a high number of clashes with multiple glycans (N122B, N149B, N331A and N343A). RBD antibodies bound to the other chains (RBD-B and RBD-C) show lower number of clashes with glycans N122 and N165. NTD antibody of chain C (NTD-C) in both open and closed states show a high number of clashes with glycans N74C and N149C. 4.2.4 Discussion and Conclusion Understanding the structure and dynamics of glycan shield in the spike protein of SARS-COV-2 is an indispensable requirement for any antibody and vaccine design endeavors.[323, 51] To this end, we have performed MD simulations of fully glycosylated spike protein of SARS-COV-2 in both open and closed states. 
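Before continuing the discussion, here is a minimal sketch of the antibody-glycan clash counting used in the overlap analysis above: for each frame, antibody heavy atoms within 5 Å of any glycan heavy atom are counted and the count is averaged over the trajectory. The array names, the use of scipy's KD-tree, and this particular definition of a "clash" are illustrative assumptions rather than the exact procedure used here.

```python
import numpy as np
from scipy.spatial import cKDTree

def average_clashes(antibody_xyz, glycan_xyz, cutoff=5.0):
    """Average number of antibody heavy atoms within `cutoff` (Angstrom) of any
    glycan heavy atom, averaged over frames.

    antibody_xyz, glycan_xyz: arrays of shape (n_frames, n_atoms, 3).
    """
    clashes_per_frame = []
    for ab, gl in zip(antibody_xyz, glycan_xyz):
        tree = cKDTree(gl)
        # number of glycan atoms within the cutoff of each antibody atom
        n_neighbors = tree.query_ball_point(ab, r=cutoff, return_length=True)
        # count antibody atoms with at least one nearby glycan atom
        clashes_per_frame.append(np.count_nonzero(n_neighbors))
    return float(np.mean(clashes_per_frame))

# Hypothetical coordinate arrays (e.g. extracted with an MD analysis library
# and converted to Angstrom); one call per antibody / microdomain combination.
antibody_xyz = np.load("b38_overlaid_xyz.npy")         # placeholder
microdomain_xyz = np.load("microdomain_III_xyz.npy")   # placeholder
print("avg clashes:", average_clashes(antibody_xyz, microdomain_xyz))
```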
Analysis of dynamics for 2 ?s trajectories showed a tilting motion in the stalk region, which was also demonstrated with experimental cryo-ET images[275, 298, 340] and suggested to aid the virus with screening the host cells for receptor proteins (Figure 4.13). Glycan motions were characterized by RMSF (Figures 4.14), which showed higher values for stalk glycans demonstrating the high shielding po- tential of this normally solvent-exposed region. PCA of the head region of open state of spike demonstrated a scissoring motion between NTDs of neighboring monomers A(up) and B. This scissoring motion resulted in trimer asymmetry with NTD of monomer B advancing toward the center of the apex region and NTD of monomer A showing an an- gular motion centering on the apex center and toward NTD of monomer B. Based on the PCA, the simulation was divided into three different clusters with distribution of distances between the center of masses of NTDs of different monomers showing different asymmet- rical trimers in each cluster (Figures 4.15 and 4.16). A scissoring motion in the trimeric spike of HIV-1 on the sub-microsecond timescale was observed by Leminn et. al[170] which was suggested to be essential for receptor binding. The NTD scissoring motion is 197 Figure 4.22: Avg number of clashes of each glycan with the overlaid antibody in A) open and B) closed states of the spike only observed in the open state and this could suggest, the scissoring motion is a means employed by the virus to camouflage parts of the spike protein (such as regions of RBD excluding the RBM) when RBD is in the up conformation. The first PC for RBD-down state was visualized (Figure 4.16C) and shows the conformational change of RBD in chain A from down toward up conformation. Distribution of distance between center of mass of RBD in each monomer from the center of apex region (Figure 4.15D) exhibited this motion for two clusters of data 0-500ns and 500-1000 ns separated based on the PCA of open state simulation. The abundance of information in MD simulations of glycosylated spike protein may hinder identifying important biological features of the glycan shield. Therefore, a net- work analysis approach is used to identify collective behavior of glycans. The most central region of spike based on eigenvector centrality of the network, is shown to be the stalk domain and the lower head region of spike where a dense array of glycans gives rise to 198 resilience to enzymatic actions. Most of the glycans at the lower head region and upper stalk domain are high mannose with the lower degree of processing which correlates with their high centralities in the graph. High eigenvector centrality of lower head and the stalk glycans also makes it hard for neutralizing enzymes to target and to date, no epitopes for antibodies have been found that target this region.[350, 322]Glycans on the head region demonstrated different behaviors depending on the RBD state (up or down conformation). Interestingly two glycan at the open state, N234A and N165B show a high BC. N234A occupies the volume left open by the RBD in the open state, whereas in the closed state it is directed toward the solvent. The highly conserved glycan N165B is inserted between RBD of chain A(up) and NTD of chain B and in the open state it occupies the volume of left vacant by RBD (A) in the up conformation. 
Amaro and coworkers[41] studied the fully glycosylated spike protein of SARS-COV-2 computationally and showed that N234A and N165B are crucial for stabilizing the RBD in the up conformation in the open state of spike. Their simulation showed that mutating N234 and N165 to Ala destabilized the RBD in the open state. Furthermore, experimental negative stain electron microscopy and single particle cryo-EM showed that the equilibrium population ratio between the open and closed state is 1:1. Deletion of N234 glycan shifts this ratio to 1:4 favoring the closed state and deletion of glycan at N165 increased the population of open state with a ratio of 2:1. It was shown that N234 glycan stabilizes the open state of the RBD and inhibits the up-to-down conformational change and N165 glycan sterically inhibits the down-to-up conformational change of RBD. Here we have shown that these two glycans in the open state exhibit a high BC, which is due to their interaction with each other as well as other neighboring glycans at RBD of chain A as well as NTD of chain B. We have further demonstrated that the BC of glycans in the head region is coupled with the scissoring motion between NTDs of monomers A and B. In addition, the scissoring motion gives rise to high BC for gly- cans in the middle region of spike for glycans N616A and N616B. This is caused by the 199 tighter packing of glycans in the asymmetric trimer. In the closed state, the BC of most glycans do not change, which correlates with lower fluctuation of RBD-down simulation. It was demonstrated for spike protein of HIV virus that glycans with high BC display a high degree of interaction with other neighboring glycans and are less accessible to glycan processing enzymes. These highly central glycans such as N603 and N234 are essential to maintaining the mannose character of the glycan shield. Regions with dense crowding glycans have steric constraints on glycans that limit their processing by carbohydrate pro- cessing enzymes. Experimental studies using mass spectroscopy along with site-directed glycan removal are needed to understand how deletion of these glycans such as N603, N616 and N234 can affect the glycan processing of the neighboring glycans on the spike protein. Modularity maximization in network analysis allowed us to find 5 microdomains of glycans in RBD-up and 4 microdomains in RBD-down states. The higher number of microdomains at the apex of RBD-up state shows the more vulnerability of spike protein to antibodies in the open state. Glycans at the lower head in both open and closed states belong to the same microdomain (Cyan-I), which shows large number of edges between glycans in this region and thereby effective shielding. Apex glycans are divided into three microdomains in closed and four microdomains in open state. Glycan microdomains was shown in HIV-1 Env glycan shield to have a broad implication for anticipating immune escape.[120] The antibody overlap analysis showed that RBM binder antibody in open state (up) shows the lowest number of clashes with glycan microdomains (Figure 4.20A). The RBM of chain A also showed the highest SASA among other epitopes on spike protein which shows its great potential for antibody design strategies. NTD-binder antibodies can also bind to epi- topes on the NTD of spike protein. These antibodies showed high number of clashes with glycans N74 and N149 of the respective monomer that they bind. 
Experimental studies are needed to explore the effect of deletion of these glycans on the neutralization effect by antibodies. The glycans on the surface of spike protein exert a collective behavior, which 200 is an important property that needs to be considered in the context of vaccine and antibody design. In this work, we have studied the microsecond time dynamics and network analysis of the glycosylated spike protein in SARS-COV-2. To answer the need for quantification of the glycan shield of the spike protein of SARS-COV-2, we have utilized MD simulation and network analysis to aid in understanding the collective behavior of glycans. The role of glycans N234A and N165B as the central glycans in the network of glycans in the RBD-up system is discussed. Glycan microdomains are identified featuring high interaction inside them and lower interaction of glycans between different microdomains, which indicates most neutralizing antibodies would bind to regions in between these microdomains. Higher number of microdomains in the open state suggest the higher vulnerability of spike protein in the open state. An antibody overlap analysis identified the microdomains of glycans with higher number of clashes with antibodies. Collectively, this work present insights to design antibodies and vaccines against coronavirus. 201 4.3 Molecular dynamics of ligand binding to PAS domain of EAG channel The KCNH voltage gated potassium channels (including ether-a-go-go EAG, ERG, ELK) are major regulators of cellular excitability and play important roles in diseases such as epilepsy, schizophrenia, cancer and cardiac long QT syndrome type 2.[320, 14, 256, 348, 40] KCNH channels are tetrameric proteins containing 6 transmembrane (TM) helices (S1- S6). The S1-S4 construct the voltage sensing domain (VS) whereas the S5 and S6 of all four subunits together with the pore-forming loops form the centrally located pore domain of the channel.[318] A PAS domain (Per-Arnt-Sim) exists in the N-terminal interacellular part of EAG channels which is structurally similar to the PAS domain of non-ion channels where they act as ligand binding domains. The interacellular C-terminal contains a cyclic nucleotide binding homology domain (CNBHD) which is connected to the channel pore via a C-linker domain. The CNBHD region in KCNH is structurally similar to the cyclic nucleotide binding domain of hyperpolarization-activated cyclic nucleotide-gated (HCN) channels and cyclic nucleotide gated (CNG) channels.[316, 103] However, CNBHD of KCNH channels are not directly modulated by binding of cyclic nucleotides but instead a short beta strand known as intrinsic ligand occupies the cavity where the cyclic nucleotide would bind to HCN and CNG channels.[34, 33] Structure of full-length EAG channel is shown in figure 4.23. Despite a highly variable amino acid sequence, the PAS domain fold is well conserved. [103] PAS domains in other proteins act as ligand binding domains. [127, 205]. However, the ability of PAS to regulate KCNH channel via ligand binding is not well studied. Bre- lidze et al. have previously shown that a small molecule drug chlorpromazine hydrochlo- ride (CPZ) binds to the PAS domain of EAG channel (figure 3.25) and inhibits the current through the channel in a concentration dependent manner.[318] CPZ was found as a strong binder of PAS domain with a binding affinity KD of 1?0.7?M. 
According to their study, deletion of the PAS domain significantly reduced the apparent affinity and channel inhibition by CPZ, which points to the fact that the PAS domain regulates EAG through binding to CPZ. CPZ, a widely used antipsychotic drug, could be repurposed for the treatment of cancer and of neurological disorders associated with increased EAG activity.[318] Deletion of the PAS domain significantly reduced the inhibition by CPZ in the mEAG1 channel in their study. The IC50 of CPZ was 29.7±0.7 μM for the WT channel and 53.6±8.2 μM for the ΔPAS mEAG channel. It is important to note that most functional mutations in EAG channels associated with epilepsy and with Zimmerman-Laband and Temple-Baraitser syndromes involve an increase in EAG channel activity.[351, 277] Therefore, inhibition of the EAG channel by small molecules has a high therapeutic potential for the treatment of cancer and different neurological disorders. Here, we study ligand binding to the PAS domain of the EAG channel through molecular dynamics simulations and electrophysiology measurements (experiments done at Georgetown University in the Brelidze lab). We performed docking and binding free energy calculations of CPZ and a few other ligands to the PAS domain of the EAG channel. Importantly, a residue, Tyr71, was found to block the entrance to the binding pocket of PAS. Replica exchange solute tempering (REST2) simulations were performed to find a structure of PAS in which the binding pocket is open for ligands. Finally, we studied the structural effects of ligand binding to the PAS domain on the full-length structure of EAG by molecular simulation and network analysis. Our analysis showed that there is allostery between the ligand binding site on the PAS and the channel pore: a network of residue-residue fluctuations causes the current inhibition in the channel upon ligand binding.

4.3.1 Methods

Figure 4.23: Structure of the full-length EAG channel embedded in a membrane. VSD stands for the voltage sensing domain, which includes the S1-S4 transmembrane helices. The pore domain (PD) includes transmembrane helices S5 and S6.

Replica exchange solute tempering: Initial docking of CPZ to the PAS domain using AutoDock Vina[295] led to a binding pose outside the binding cavity. A Tyr71 residue was found to block the entrance of the cavity in the PAS crystal structure. To sample conformations of the PAS domain where the cavity is accessible for ligands, we used replica exchange solute tempering (REST2).[186] The details of this method are given in the introduction section of the dissertation. In summary, the initial structure of the protein (PDB ID: 4hoi)[3] was prepared in CHARMM-GUI.[143] Na+ and Cl- ions were added to the system up to a buffer concentration of 0.15 M. MD simulations were performed using the CHARMM36[23] forcefield for the protein and the TIP3P model for the waters.[144] Simulations were performed using the NAMD program with REST2 support.[230] A force-switching function was used for van der Waals and electrostatic interactions between 10 and 12 Å.[283] Long-range interactions were computed with the particle mesh Ewald (PME) method.[69] A Langevin piston was used to maintain the pressure at 1 bar.[97] An integration timestep of 2 fs was used for the equilibration and all production simulations, with the SHAKE algorithm constraining bonds involving hydrogen atoms.[252] We first ran a 10 ns standard MD simulation to equilibrate the PAS domain. REST simulations ran with 20 replicas between effective temperatures of 310 and 610 K.
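As an illustration of how a REST2 effective-temperature ladder and the corresponding solute scaling factors can be generated, here is a minimal numpy sketch. A geometric spacing between 310 K and 610 K is assumed for the 20 replicas, since the exact spacing scheme is not stated here; the λ definitions follow the standard REST2 scaling of solute-solute and solute-solvent interactions.

```python
import numpy as np

T0, T_max, n_replicas = 310.0, 610.0, 20   # effective temperature range (K)

# Assumed geometric spacing of effective temperatures, a common choice for
# replica exchange ladders (the spacing actually used may differ).
temps = T0 * (T_max / T0) ** (np.arange(n_replicas) / (n_replicas - 1))

# In REST2 only the solute (here, the protein) Hamiltonian is scaled:
# solute-solute interactions by lambda_m = T0 / T_m and
# solute-solvent interactions by sqrt(lambda_m).
lambdas = T0 / temps

for m, (T, lam) in enumerate(zip(temps, lambdas)):
    print(f"replica {m:2d}: T_eff = {T:6.1f} K, lambda = {lam:.3f}, "
          f"sqrt(lambda) = {np.sqrt(lam):.3f}")
```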
Replica exchanges were attempted every 2 ps between neighboring replicas along the temperature scale. Each replica ran for 50 ns and the total accumulated simulation time was 1 μs. The protein was chosen as the hot region in the REST2 simulations.

PAS-ligand simulations and binding free energy calculations: To simulate the ligand-bound PAS with different ligands, we selected a snapshot of the replica exchange simulation in which the binding pocket was readily open and Tyr71 was no longer blocking the entrance (shown in Figure 4.24B). Next, we used AutoDock Vina to dock 5 small molecules to the binding pocket: chlorpromazine (CPZ), imipramine (IMP), promazine (PRZ), cyamemazine (CMZ) and 2-chlorophenothiazine (CFT). These dockings placed the ligands inside the cavity of PAS. MD simulations were performed with the GROMACS 2018 software.[301] The AMBER99SB-ILDN[23] forcefield was used for the protein, the TIP3P model for the waters, and the ligands were parameterized with the general AMBER forcefield (GAFF).[312] Na+ and Cl- ions were added to the system to a final concentration of 0.15 M. Simulations were performed with a 2 fs timestep at 310 K and a pressure of 1 bar. A velocity-rescaling thermostat was used to maintain the temperature at 310 K. During equilibration, the pressure was maintained at 2 bar using a Berendsen barostat. During the production run, the system was simulated under an NPT ensemble with a Parrinello-Rahman barostat to maintain the pressure at 1 bar, using a compressibility of 4.5x10^-5 bar^-1 and a coupling constant of 0.5 ps. The simulations lasted 100 ns for each complex. Binding free energies were calculated using the MMPBSA method. A description of MMPBSA is given in chapter 3.1 for the RBD-ACE2 free energy calculation in SARS-COV-2. Here we used 80 and 2 as the solvent and solute dielectric constants, respectively.

Full-length EAG channel simulation: The full-length EAG channel (PDB ID: 5K7L)[327] was simulated in both the apo and ligand-bound states to study the effect of ligand binding on the structure of the channel. The apo state was prepared in CHARMM-GUI with 500 POPC lipids per leaflet. AMBER99SB-ILDN[181] parameters were used for the protein and AMBER parameters for the lipids. The protein-membrane system was solvated with TIP3P water, and Na+ and Cl- ions were added to a final concentration of 0.15 M. To simulate the bound state, we aligned the PAS domain of each chain in the tetramer with the ligand-bound PAS from docking and used the coordinates of the ligand-bound PAS domain. This was done because the Tyr71 residue blocks the binding pocket in the crystal structure of EAG, and we had performed replica exchange simulations in order to dock the ligand (Figure 4.24B). The system was then prepared as for the apo state. Both the apo and bound states embedded in the membrane were equilibrated according to the CHARMM-GUI 6-step procedure. During the equilibration phase, the temperature was controlled with a Berendsen thermostat and the pressure was maintained at 1 bar using a Berendsen barostat. For the production run, we used a Nose-Hoover thermostat with a coupling constant of 1 ps to maintain the temperature at 310 K and a Parrinello-Rahman barostat with a compressibility of 4.5x10^-5 bar^-1 for the pressure. The simulations ran for 1 μs for each of the apo and ligand-bound systems.

Current flow analysis from MD simulation: We followed the method laid out by Delemotte and coworkers[147] to perform current flow analysis through the channel.
In this approach, a continuous contact map is first calculated from the distance d_{ij}(t) between atoms i and j at time t:

K(d_{ij}(t)) = \begin{cases} 1 & d_{ij}(t) \le c \\ \exp\!\left[-d_{ij}(t)^2/2\sigma^2\right] \big/ \exp\!\left[-c^2/2\sigma^2\right] & \text{otherwise} \end{cases} \quad (4.8)

The cutoff c = 4.5 Å was used for the heavy atoms in the simulation, as suggested. We also used K(d_{cut}) = 10^{-5}, where d_{cut} = 0.8 nm, leading to σ ≈ 0.138 nm. The final contact map was then averaged over frames. Correlation of residue movements was calculated through mutual information (MI). The MI between residues s_i and s_j was estimated based on the distances from their equilibrium positions, where the position of each residue was defined as the centroid of its heavy atoms. Thus the MI is calculated as:

MI_{ij} = H_i + H_j - H_{ij} \quad (4.9)

where H_i is the entropy of residue s_i, defined as:

H_i = -\int_X \rho_i(x) \ln \rho_i(x)\, dx \quad (4.10)

where the density ρ_i(x) was estimated using a Gaussian mixture model as proposed by Delemotte et al.[147] Bootstrapping was performed 10 times and the final MI matrix was averaged over all bootstrap samples. The MI matrix and the semi-binary contact map were used to build the full adjacency matrix A_{ij} = C_{ij} I_{ij}. Using this adjacency matrix, the current flow (or information flow), which measures the flow of information from a set of source (S0) nodes to a set of sink (S1) nodes, is calculated. The results highlight the nodes that carry the most information from source to sink and give valuable information about allosteric pathways in the protein. For a tetrameric protein, the current flow is computed with each subunit replicated, summed over the structure and then averaged. This approach was used by Delemotte et al. to study allostery in KCNQ potassium channels and other membrane proteins. The source nodes in our analysis were the PAS domain residues (13-138), and as sink nodes we used the residues lining the channel pore (residue Gln503 on each monomer).

4.3.2 Results

Initially, we attempted to dock the CPZ ligand to the binding pocket of the PAS domain. This led to a binding pose outside the cavity of the PAS domain. We then performed three replicas of MD simulation for the docked pose. After 500 ns of simulation, the ligand drifted away from the binding pocket. After careful examination of the structure of the PAS domain, we found that the entrance to the binding pocket was blocked by the Tyr71 residue. Next, we performed replica exchange solute tempering (REST2) simulations to sample different conformations of the PAS domain in which the cavity is open for ligand binding. In this type of enhanced sampling, the conformational exchanges are done for the hot region (the solute, i.e., the protein) while the solvent remains cold. Figure 4.24A shows the distribution of enthalpies P(ΔH, T) of the replicas at different effective temperatures. The significant overlap between replicas leads to frequent exchanges between neighboring replicas, and the average exchange rate was calculated to be 20%. Figure 4.24C shows the random walk of replicas over temperatures for the first three replicas at the lowest temperatures. The frequent exchange of these replicas with replicas at higher temperatures shows that sampling is effective and that REST is able to sample conformations at higher temperatures. After the replica exchange simulations, we found a conformation of the PAS domain in which the Tyr71 residue had drifted away from the binding pocket and was no longer blocking the cavity. The conformational change of the PAS domain and the Tyr71 residue from the crystal structure to the state after replica exchange is shown in Figure 4.24B.
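The current-flow calculation described in the Methods above can be approximated with networkx's current-flow betweenness restricted to chosen source and sink nodes. The sketch below uses a hypothetical adjacency matrix A = C*I and illustrative 0-based residue indices; it is not the exact pipeline of Delemotte and coworkers, only a hedged stand-in for the same idea.

```python
import numpy as np
import networkx as nx

# Hypothetical adjacency matrix A_ij = C_ij * I_ij built from the averaged
# contact map C and the mutual-information matrix I (placeholder file).
A = np.load("adjacency.npy")                 # shape (n_residues, n_residues)
G = nx.from_numpy_array(A)                   # weighted, undirected graph

# Source: PAS domain residues; sink: a pore-lining residue. Indices are
# illustrative placeholders, not the channel's actual residue numbering.
sources = list(range(13, 139))               # PAS domain (residues 13-138)
sinks = [503]                                # e.g. the Gln503 pore residue of one subunit

# Current-flow betweenness restricted to these sources and sinks highlights
# residues carrying the most "information current" between PAS and the pore.
cf = nx.current_flow_betweenness_centrality_subset(
    G, sources=sources, targets=sinks, weight="weight")

top = sorted(cf, key=cf.get, reverse=True)[:10]
print("Residues with highest current flow:", top)
```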
The ligand bound to the PAS domain is shown in Figure 4.24D. Next, after the REST2 simulations, we docked to the PAS domain five small-molecule ligands that were experimentally shown to regulate the EAG channel. The docked poses of these ligands are shown in Figure 4.25. In most of the binding poses, the ligand faces the binding pocket with its tricyclic ring system rather than its tail, except for the PRZ ligand, which docked into the binding pocket from the tail. All 5 complexes were then subjected to a 100 ns MD simulation using GROMACS. In these simulations, the ligands remained in the binding pocket. However, in the PRZ-PAS complex, the ligand drifted away from the binding pocket. This is most probably due to the initial binding pose for this ligand, in which the lid was inside the pocket and the hydrophobic rings were outside, contrary to the other ligands where the rings are inside the binding pocket. This pose led to a lower binding affinity for PRZ.

Figure 4.24: Replica exchange solute tempering. A) Distribution of enthalpies for different replicas showing the high overlap B) Transition of the Tyr71 residue from the crystal structure to a state after REST simulations where it no longer blocks the binding pocket C) Random walk of replicas 1, 2 and 3 in the replica space with their neighbors over the course of the simulations D) Docking of CPZ to the binding pocket of PAS after the REST simulation.

Binding free energies were calculated using MMPBSA.[310] A detailed description of this approach to binding free energy calculation is given in the methods section of the dissertation. The different components of the binding free energy, including van der Waals (vdW), electrostatic, polar solvation and solvent accessible surface area (SASA), are given in Table 4.3. The breakdown of the binding free energies into their components shows that binding is driven by hydrophobic interactions between hydrophobic residues in the binding pocket and the hydrophobic rings of the ligands. Electrostatic interactions play a negligible role in the binding for all molecules.

Figure 4.25: Docked poses of the 5 ligands

Table 4.3: Binding free energies (kcal/mol)
Ligand   VDW            Elec          Polar Solv    SASA          Total
CPZ      -43.34±0.67    -1.0±0.1      12.33±0.45    -4.07±0.05    -36.32±0.79
IMP      -42.73±0.23    -0.2±0.1      12.59±0.17    -4.40±0.02    -34.76±0.25
PRZ      -27.26±0.32    -0.33±0.05     9.08±0.17    -3.05±0.02    -21.51±0.25
CMZ      -48.27±0.22    -2.45±0.07    14.73±0.16    -4.55±0.02    -40.55±0.28
CFT      -33.34±0.23    -0.45±0.05     9.60±0.10    -3.07±0.01    -27.25±0.25

Figure 4.26: Binding free energy decomposition for residues with a higher than 0.5 kcal/mol contribution to the binding free energy of the different ligands

PRZ has the lowest binding affinity, which is due to its initial binding pose facing the binding pocket with its lid rather than its hydrophobic rings. CFT had the next lowest binding affinity among all tested ligands; in experiments, this ligand was also shown not to bind the PAS domain and to increase rather than decrease the current through the channel. Unlike the other ligands, CFT does not have a lid in this structure and has only the tricyclic ring. We found that during the MD simulation this ligand can penetrate deeper into the cavity of the PAS domain. On the other hand, for the other ligands the lid makes the binding more stable. We also decomposed the binding free energies of the ligands on a per-residue basis to find the residues that contribute most to the binding affinity. Most of these residues are hydrophobic. Residues with more than 0.5 kcal/mol contribution to binding are shown in Figure 4.26.
Figure 4.27 shows the electrostatic character of the residues in the binding pocket. Most residues interacting with the ligands in the binding pocket are hydrophobic, which drives the binding, as also shown in the binding free energy calculations (high vdW component).

Figure 4.27: Electrostatic nature of the binding pocket. Acidic residues are shown in red, basic residues in blue, hydrophobic residues in white and polar residues in green.

Tyr71 acts as a gatekeeper of the PAS domain, suggesting its important role in ligand entry into the binding cavity. The role of this residue in ligand binding has also been confirmed in other computational and experimental studies.[74] Phe87 in the CPZ-PAS complex has a high contribution to the binding affinity, which is mainly driven by non-polar interactions. Other residues that contribute to the binding affinity are Val31, Trp40, Cys67, Val80, Ile113, Phe126 and Cys128.

Full-length EAG simulation

It is unknown how conformational changes in PAS correlate with the channel's voltage-dependent activation process. To investigate how the conformational changes upon ligand binding at the PAS domain affect gate opening in the EAG channel, we performed MD simulations of the mEAG channel at physiological temperature in both the apo state and the state with CPZ bound to all 4 PAS domains. Each of the apo and bound states ran for 1 μs. The RMSD of each region of the EAG channel during the simulation for both the apo and bound states is shown in Figure 4.28. The PAS domain in the bound state shows a higher RMSD than in the apo state, while the other domains have similar RMSDs between the apo and bound states.

Figure 4.28: RMSD of different regions of the EAG channel for the apo and bound states

Since most of the difference in RMSD is in the PAS domain, we next compared the root mean squared fluctuations (RMSF) of the PAS domain in the apo and bound states. This is shown in Figure 4.29, where each region of the PAS domain is colored to show the different locations. Most regions had a higher RMSF in the bound state than in the apo state. The PAS-cap residues have a similar or even lower fluctuation in the bound state than in the apo state. Residues near the binding pocket in the αC and αD helices have a high fluctuation in the bound state. The βA and βB residues, which are at the interface with CNBHD, also show a higher fluctuation in the bound state. These high fluctuations could affect the interface of PAS with CNBHD. We also computed the H-bonds and salt bridges between PAS and CNBHD during the simulation for the apo and bound states. The results are shown in Figure 4.30.

Figure 4.29: RMSF of different regions of the PAS

Figure 4.30: H-bonds and salt bridges between PAS and CNBHD for the apo and bound states

The hydrogen bonds and salt bridges are mostly similar between the apo and bound states. While some of the H-bonds such as N34-E633 and Q62-V635 are weakened after ligand binding, other H-bonds or electrostatic interactions, such as Y198-E627, Q62-T698 and K122-E633, are enhanced. The binding free energy is expected not to change considerably after ligand binding. It was shown by Brelidze et al. using SPR that ligand binding slightly enhances the affinity of the PAS-CNBHD complex. We can reason that this is due to the formation of new electrostatic interactions that were not present in the apo state. The conformational changes are communicated to the pore domain via a network of residue-residue interactions;[67, 346] network analysis was used to identify the allosteric pathways between PAS and the pore in the bound and apo conformations.
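Before turning to the network analysis, here is a minimal, hedged sketch of the PAS-CNBHD hydrogen-bond comparison above, using the Baker-Hubbard criterion in mdtraj to list H-bonds present in a given fraction of frames. The file names and the 50% persistence threshold are assumptions, and a real analysis would restrict the selection to PAS and CNBHD atoms rather than the whole system.

```python
import mdtraj as md

def persistent_hbonds(traj_file, top_file, freq=0.5):
    """Return donor-acceptor residue pairs with H-bonds present in at least
    `freq` of frames, using the Baker-Hubbard geometric criterion."""
    traj = md.load(traj_file, top=top_file)
    # For a focused PAS-CNBHD comparison one would first atom_slice the
    # trajectory to those two domains; here the whole system is scanned.
    hbonds = md.baker_hubbard(traj, freq=freq, exclude_water=True)
    labels = set()
    for donor, _hydrogen, acceptor in hbonds:
        d = traj.topology.atom(donor).residue
        a = traj.topology.atom(acceptor).residue
        labels.add(f"{d}-{a}")
    return labels

# Placeholder trajectory files for the apo and CPZ-bound full-length channel.
apo = persistent_hbonds("eag_apo.xtc", "eag_apo.pdb")
bound = persistent_hbonds("eag_bound.xtc", "eag_bound.pdb")

print("weakened or lost after binding:", sorted(apo - bound))
print("formed or enhanced after binding:", sorted(bound - apo))
```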
Network analysis has been previously used by Delemotte and coworkers[147] to shed light on the VSD-pore coupling pathway. In this approach, the MD trajectories are first converted to a residue interaction network, where each node corresponds to an individual amino acid residue within the full-length EAG channel. The weights on the edges are defined by the spatial proximity and correlated motions of residues in the MD trajectories. The final network encodes all residues (nodes) and interactions (edges) in the channel. The allosteric pathway is measured by calculating the flow of information through the network, defining the PAS domain (binding pocket) as the source of information and the channel pore as the sink. The underlying idea is that a perturbation of residue interactions, such as those induced by movements in the PAS domain, spreads to other residues (the pore) via diffusion in the network of residue-residue interactions. A current-flow betweenness analysis is performed to account for all pathways between source and sink. Key residues and pathways for the allostery are identified by this method. PAS domain residues were chosen as the source and gate residues as the sink for the current-flow analysis.

Figure 4.31 shows the current flow projected onto the structure of the full-length EAG channel such that a darker color indicates higher current flow. The CNBHD region, owing to its large interface with the PAS domain, carries most of the information flow to the pore domain. To study how ligand binding influences the allosteric network, we calculated the difference in the current-flow profiles of the apo and bound states of the channel. The resulting difference in information flow (Delta-Information) is projected onto the full-length structure and is shown in figure 4.32. This shows the difference at different regions of the EAG channel due to the conformational changes upon ligand binding. The apo state showed higher peaks in information flow at the PAS-CNBHD interface compared to the bound state, while the post-CNBHD region showed a higher value of information flow in the bound state. The post-CNBHD-PAS interaction is attributed to the ligand-bound state.

Figure 4.31: Current flow analysis for the bound state and the current flow plots.

Some of the residues at the PAS-CNBHD interface with higher current flow (CF) in the bound state are Val43, His56, Phe17 and Gln14 on the PAS domain and Tyr666, Val634, Ser625 and Gly639 on the CNBHD. Residues interacting with the intrinsic ligand have a lower CF in the bound state, whereas residues on the post-CNBHD have a high CF in the bound state. Some notable residues with higher CF in the bound state than in the apo state are Ile37, Arg24, Phe17 and Tyr44 on PAS and Gln598, Val600, Ala603, Gly624, Gly639 and Cys667 on the CNBHD. The peaks at the S1-S4/PAS interface are weaker than those at the PAS-CNBHD interface, showing that the allosteric coupling between PAS and S1-S4 is weaker than that between PAS and the CNBHD. The concurrent reduction of flow at the pore residues and increase at the PAS domain hints that these two regions are allosterically anti-coupled. The reduction in current flow at the pore coincides with the inactivation of the channel in the bound state. The critical coupling motifs on the CNBHD, PAS and C-linker are closely correlated with the locations that feature enhanced information flow in the apo state. This means that the bound conformation transmits the allosteric signal less efficiently through these important structural regions.
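A minimal sketch of this current-flow betweenness calculation is given below, using NetworkX [117]. It assumes the residue-interaction network has already been built from the trajectories, with edge weights that grow with contact frequency and correlated motion; the toy edge list, the source (PAS) and sink (gate) residue sets, and the function names are placeholders, not the actual model used in this work.

# Minimal sketch of the current-flow (information-flow) analysis: nodes are
# residues, edge weights encode communication strength, PAS residues are the
# sources and pore-gate residues are the sinks. Toy numbers throughout.
import networkx as nx

def build_example_network():
    G = nx.Graph()
    # (residue_i, residue_j, weight ~ contact_freq * |correlation|) -- toy values
    example_edges = [(1, 2, 0.9), (2, 3, 0.7), (3, 4, 0.8),
                     (2, 4, 0.3), (4, 5, 0.6), (5, 6, 0.9)]
    G.add_weighted_edges_from(example_edges)
    return G

def delta_information(G_apo, G_bound, sources, sinks):
    """Per-residue current-flow betweenness restricted to source->sink paths,
    and its apo-minus-bound difference (the Delta-Information profile)."""
    cf_apo = nx.current_flow_betweenness_centrality_subset(
        G_apo, sources=sources, targets=sinks, weight="weight")
    cf_bound = nx.current_flow_betweenness_centrality_subset(
        G_bound, sources=sources, targets=sinks, weight="weight")
    return {r: cf_apo.get(r, 0.0) - cf_bound.get(r, 0.0) for r in G_apo.nodes}

G_apo, G_bound = build_example_network(), build_example_network()
G_bound[4][5]["weight"] = 0.2            # pretend ligand binding weakens one coupling
pas_residues, gate_residues = [1, 2], [5, 6]   # placeholder source and sink sets
print(delta_information(G_apo, G_bound, pas_residues, gate_residues))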
Figure 4.32: Current flow difference.

The decrease in flow strength at the channel pore is accompanied by a flow reduction at the CNBHD and a flow increase at the PAS domain. The PAS-cap region was previously shown to undergo large conformational changes when the CNBHD is not bound. The current-flow analysis thus identifies key structural motifs that are allosterically downstream of the PAS-CNBHD interaction and indicates that the perturbation in PAS is propagated to the pore through interactions with the CNBHD.

4.3.3 Discussion and Conclusion

In this study, we investigated ligand binding to the PAS domain of the EAG potassium channel and its effect on channel activity using MD simulations, free energy calculations and network analysis. Experiments on the electrophysiology of the EAG channel were performed by the Brelidze lab at Georgetown University. The PAS domain in the N-terminal intracellular part of the EAG channel is known to bind ligands in other proteins. However, its ligand binding properties had not been studied in detail for the EAG channel, which could have major implications for pharmaceuticals targeting the EAG channel. Brelidze et al. previously showed that a small-molecule ligand, CPZ, binds to the PAS domain of EAG and inhibits the current through the channel.[318] Inhibition of the EAG channel by CPZ or other small-molecule ligands has a high potential for therapeutic use in the treatment of cancer and other neurological disorders. We performed docking of ligands to the PAS domain. The residue Tyr71 was found to block the entrance to the binding pocket. We performed a replica exchange solute tempering simulation to sample conformations of PAS in which the binding pocket is open, which led to Tyr71 drifting away from the binding pocket entrance. Next, we docked 5 different small-molecule ligands provided to us by our experimental collaborators to the binding pocket and performed 100 ns MD simulations. Binding free energy calculations using MMPBSA showed favorable binding of these ligands to the PAS domain, driven mostly by hydrophobic interactions. The binding poses of all ligands faced the binding pocket from the three-membered ring, with the tail outside.

The PAS domain of the EAG channel has been investigated in several studies. Some evidence suggests that the PAS domain interacts with the S1-S4 linker and directly regulates the voltage-sensing domain movements.[177] Other evidence points to the interaction of PAS with the CNBHD.[114, 211] Zagotta and coworkers used fluorescence anisotropy and found that the PAS domain in the EAG channel directly interacts with the CNBHD with an affinity of 13.2±2.3 μM.[118] The interface of PAS and CNBHD is shown in figure 4.30. By mapping disease-related mutations in KCNH channels onto the structure of the PAS-CNBHD complex, it was shown that most LQT2 and cancer-related mutations are located at the interface of PAS with the CNBHD. For example, N34 in the PAS domain forms a hydrogen bond with V634 on the CNBHD, and its mutation (N33T in hERG1) was shown to cause LQT2. Y44 of PAS interacts with I637 and G639; the mutation Y44H in hEAG1 correlates with large intestine carcinoma[7] and the corresponding mutation in hERG1 (Y43C) causes LQT2.[214] The interface between PAS and the CNBHD can be divided into three subregions: 1) the intrinsic ligand of the CNBHD interacts with the αB helix of PAS; 2) the βA and βB strands of PAS interact with the post-CNBHD helix; 3) the N-terminus of PAS contains an amphipathic helix, the α-cap, which interacts with the β-roll of the CNBHD.[118]
Mutations in the intrinsic ligand were shown to regulate channel activity.[114, 7] The post-CNBHD region, which immediately follows the intrinsic ligand of the CNBHD, also interacts with the PAS domain. It was shown that this region regulates KCNH channels through a variety of cellular signaling events such as phosphorylation and interaction with Ca2+-calmodulin.[54] The interface also includes a salt bridge between R57 on the αB helix of PAS and D642 on β6 of the CNBHD. This salt bridge is conserved throughout the KCNH family, and mutations at this site cause LQT2 in hERG1.[118] The PAS-cap comprises the first 25 residues of the PAS domain and was shown to be critical for activation and inactivation of KCNH channels.[114, 118] The PAS-cap helix interacts with the CNBHD, and the α-cap is positioned near the β4-β5 strands and the β8-β9 loop of the CNBHD. Alignment of the PAS-cap region with other NMR structures shows that it takes very different orientations in isolated EAG domains from the hERG1 and ELK channels.[3] It was therefore proposed that the PAS-cap exerts its function through interaction with the CNBHD. The surroundings of the PAS-cap are rich in cancer-related mutations (hEAG1 E19D) and other hERG1 LQT2 mutations such as E788D (E627 in mEAG1) in the β4 strand of the CNBHD.[118, 114] These mutations can change the gating properties of the channel by destabilizing the interaction of the PAS-cap with the CNBHD. These studies examined the gating properties of PAS-cap mutations (R7A-R8A and R7E-R8E) and CNBHD mutations (E727A, E627R), which involve highly conserved residues in the KCNH family. Activation of the PAS-cap mutants was shifted to more depolarized potentials compared to wild-type channels; similarly, the CNBHD mutants demonstrated a robust depolarizing shift in potential.

Potential interactions between residues were computed from the distances between their non-hydrogen (heavy) atoms (a short sketch of this criterion is given at the end of this section). Our results showed that there is an allosteric coupling between conformational fluctuations in PAS and the CNBHD and the pore residues. Movements of residues in the PAS domain are transmitted to the pore through interactions with the CNBHD, which is connected to the C-linker region in direct contact with the pore. This chain of interactions constitutes the coupling pathway between the ligand binding site and the channel pore, which leads to the inactivation. We simulated the full-length EAG channel both in the apo state and with CPZ bound to all PAS domains to study the effect of ligand binding on the interface of PAS and the CNBHD and on the channel pore. H-bonds were computed between PAS and the CNBHD. Some hydrogen bonds, such as N34-E633 and Q62-V635, are destabilized by ligand binding, while other hydrogen bonds, such as Q62-T698, Y198-E627 and K122-E633, appear. The binding affinity between PAS and the CNBHD is therefore expected to be slightly higher in the bound state. Experimental measurements using SPR in the Brelidze lab indeed showed that the binding affinity between PAS and the CNBHD increases upon ligand binding. We used information flow analysis to determine whether there is an allosteric pathway between the ligand binding site and the channel pore, and to identify the regions and residues along the pathway that carry most of the information flow. Our analysis showed that the CNBHD carries most of the information flow from PAS to the channel pore. Moreover, the pore residues had a lower current flow in the bound state. Since the simulated protein is in the inactive state, this implies a further stabilization of the closed state of the EAG channel.
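The heavy-atom distance criterion mentioned above, used to flag potential residue-residue interactions such as the PAS-CNBHD contacts, can be sketched as follows with MDTraj's closest-heavy contact scheme. The file names, the mapping of residue numbers to topology indices, and the 4.5 Å cutoff are assumptions for illustration only.

# Minimal sketch (assumed file names, residue indices and cutoff) of the
# heavy-atom distance criterion for residue-residue interactions.
import mdtraj as md
import numpy as np

traj = md.load("eag_bound.xtc", top="eag_bound.gro")    # placeholder trajectory

# Candidate interface pairs as 0-based residue indices; e.g. N34-E633 would be
# (33, 632) only if the topology numbers residues from 1 -- an assumption here.
pairs = [(33, 632), (61, 634), (197, 626), (121, 632)]

# 'closest-heavy' uses the minimum distance between non-hydrogen atoms.
distances, _ = md.compute_contacts(traj, contacts=pairs, scheme="closest-heavy")
cutoff_nm = 0.45                                        # 4.5 Angstrom
occupancy = (distances < cutoff_nm).mean(axis=0)        # fraction of frames in contact
for (i, j), occ in zip(pairs, occupancy):
    print(f"residue {i + 1} - residue {j + 1}: contact occupancy {occ:.2f}")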
Chapter 5: Conclusions and open problems

In this dissertation, I have explored various computational techniques such as molecular dynamics, Markov state modeling and machine learning to study biomolecular processes, for example protein folding, protein-ligand binding and protein-membrane interactions. In the second chapter, I studied membrane active peptides (MAPs). Cell penetrating peptides (CPPs) are a class of MAPs with the ability to cross the cell membrane and deliver biomolecular cargo into the cell. The secondary structure of CPPs during their interaction with the cell membrane affects their translocation efficacy. We studied two cell penetrating peptides, MPG and Hst5. Our results showed that MPG enters the membrane via its hydrophobic N-terminus, whereas Hst5 remains attached to the phosphate plane. Further simulations of MPG showed that this peptide forms a β-sheet conformation early during its interaction with the membrane but adopts an α-helical conformation upon deeper insertion into the membrane core. This structural polymorphism is important for the internalization route of CPPs. Antimicrobial peptides (AMPs) are another class of MAPs that have been proposed as a potential solution against multi-drug-resistant pathogens. Designing AMPs requires exploration of a vast chemical space, making it a challenging problem. In the second part of chapter 2, we developed a machine learning model, a variational-attention-based variational autoencoder, to generate novel, diverse and high-quality antimicrobial peptides. This model learns a latent space of real AMPs, which are represented as sequences of unique numbers. The attention mechanism helps with the diversity and quality of the generated AMPs. A future direction of this project will be post-generation evaluation of the generated peptides, checking the minimum inhibitory concentration (MIC), toxicity and other biological properties of the in silico generated AMPs. This will give candidate sequences that could be synthesized and experimentally evaluated.

In the third chapter, I studied the kinetics and thermodynamics of protein folding using Markov state models and molecular dynamics. In the first part of this chapter, I studied an amyloid-forming protein, β2m. Conformational fluctuations in the monomeric form of this protein could lead to aggregation-prone intermediate states. I performed MD simulations of this protein for 250 μs and then applied an MSM to the trajectories. Transitions between folded and misfolded/partially folded states happen on timescales of tens of μs. The intermediate states have unfolded outer strands, which leads to exposure of the hydrophobic core of the protein to the solvent, an important factor in the formation of higher-order oligomers and eventually aggregates. It will be interesting to compare the kinetics of misfolding of native-state β2m with that of natural variants such as the D76N and ΔN6 mutants. In the second part of chapter 3, I developed a machine learning model, GMVAE, for simultaneous dimensionality reduction and clustering of protein folding trajectories. GMVAE can learn a reduced representation of the free energy landscape of protein folding in which metastable states form well-separated clusters. We showed that the GMVAE embedding resembles the folding funnel for protein folding trajectories. A future direction for this section would be to include a lag time in the GMVAE to allow it to learn kinetically metastable states.
Another direction would be to use graph neural networks as the features instead of the pairwise distances used here. In the last section of this chapter, I developed a novel method, GraphVAMPNet, to learn a low-dimensional representation and a linear dynamical model of simulation trajectories in an end-to-end manner. We combined VAMPNet, which is based on the variational approach for Markov processes and uses a neural network to learn a coarse-grained dynamical model, with graph neural networks as the feature representation of the molecule. This gives the model the advantages of a graph representation and uses graph message passing to generate the embedding of each data point in the VAMPNet. We showed that this type of representation results in higher-resolution and more interpretable Markov models than the standard VAMPNet. Moreover, the attention over the neighbors of each residue in the graph gave us insight into the importance of residues for each metastable state. It would be interesting to further develop GraphVAMPNet with different pooling mechanisms. For example, one could use graph pooling such that a VAMP score is maximized for each domain of the protein, which amounts to learning the local dynamics of different domains or different parts of a protein. On the other hand, since we have not used any hand-crafted features, the GraphVAMPNet model is transferable, and it would be interesting to study the transferability of the learned embeddings in GraphVAMPNet.

In chapter 4, I studied two membrane proteins: the spike protein of SARS-COV-2 and the EAG potassium channel. The spike protein of SARS-COV-2 makes contact with the cell receptor protein hACE2 through its receptor binding domain (RBD). The RBD of SARS-COV-2 is rife with mutations relative to the earlier SARS-COV of 2002. We showed that these mutations have given the RBD of SARS-COV-2 a higher affinity for hACE2 than SARS-COV, which is one of the reasons for its higher infection rate. The important residues at the interface of the RBD and hACE2 in SARS-COV-2 were identified by decomposing the binding free energies per residue. We found residues whose mutation strongly enhanced the binding affinity (such as V404 to K417) and mutations that lowered the binding affinity (such as R426 to N439). Furthermore, we simulated mutants of the RBD (either natural mutants or alanine scanning) and found residues that are crucial for binding between the RBD and hACE2. In the second part of our SARS-COV-2 work, we investigated the dynamics of glycans in the spike protein using MD simulation and network analysis. The glycan shield on the spike protein provides a barrier against antibodies. Our network analysis unraveled the role of different glycans in providing an effective shield using betweenness centrality measurements. We uncovered microdomains of glycans that feature a high degree of intra-communication and used antibody overlap analysis to find microdomains that inhibit access to antibody epitopes. In the last section of chapter 4, I studied ligand binding to the PAS domain of the EAG channel. We showed that the residue Tyr71 blocks the entrance of the binding pocket. Using replica exchange solute tempering, we found structures of PAS in which Tyr71 drifted away from the binding pocket, which allowed us to dock ligands into the binding pocket. Binding free energy computations using MMPBSA showed the binding affinities of the different ligands and the residues contributing most to the binding affinity.
Using mutual information and information flow analysis on the MD simulations of the full-length EAG channel in the bound state, we studied the allosteric pathways that lead to current inhibition in the channel as well as the residues that are important along the pathway. Interestingly, we found that ligand binding in the PAS domain reduces the information flow at the sink residues (the channel pore), which coincides with the current inhibition through the channel. For future work, it will be interesting to study the effect of ligand binding on channel opening using enhanced sampling methods such as metadynamics.

Bibliography

[1] Hervé Abdi and Lynne J Williams. Principal component analysis. Wiley interdisciplinary reviews: computational statistics, 2(4):433–459, 2010. [2] Mark James Abraham, Teemu Murtola, Roland Schulz, Szilárd Páll, Jeremy C Smith, Berk Hess, and Erik Lindahl. Gromacs: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX, 1:19–25, 2015. [3] Ricardo Adaixo, Carol A Harley, Artur F Castro-Rodrigues, and João H Morais-Cabral. Structural properties of pas domains from the kcnh potassium channels. PloS one, 8(3):e59265, 2013. [4] Diletta Ami, Stefano Ricagno, Martino Bolognesi, Vittorio Bellotti, Silvia Maria Doglia, and Antonino Natalello. Structure, stability, and aggregation of β-2 microglobulin mutants: insights from a fourier transform infrared study in solution and in the crystalline state. Biophysical journal, 102(7):1676–1684, 2012. [5] Hareesh Bahuleyan. Natural language generation with neural variational models. arXiv preprint arXiv:1808.09012, 2018. [6] Mukund Balasubramanian, Eric L Schwartz, Joshua B Tenenbaum, Vin de Silva, and John C Langford. The isomap algorithm and topological stability. Science, 295(5552):7–7, 2002. [7] Sally Bamford, Emily Dawson, Simon Forbes, Jody Clements, Roger Pettett, Ahmet Dogan, A Flanagan, Jon Teague, P Andrew Futreal, Michael R Stratton, et al. The cosmic (catalogue of somatic mutations in cancer) database and website. British journal of cancer, 91(2):355–358, 2004. [8] Rahul Banerjee, Honggao Yan, and Robert I Cukier. Conformational transition in signal transduction: metastable states and transition pathways in the activation of a signaling protein. The Journal of Physical Chemistry B, 119(22):6591–6602, 2015. [9] Alessandro Barducci, Massimiliano Bonomi, and Michele Parrinello. Metadynamics. Wiley Interdisciplinary Reviews: Computational Molecular Science, 1(5):826–843, 2011. [10] Alessandro Barducci, Giovanni Bussi, and Michele Parrinello. Well-tempered metadynamics: a smoothly converging and tunable free-energy method. Physical review letters, 100(2):020603, 2008. [11] Bipasha Barua, Jasper C Lin, Victoria D Williams, Phillip Kummler, Jonathan W Neidigh, and Niels H Andersen. The trp-cage: optimizing the stability of a globular miniprotein. Protein Engineering, Design & Selection, 21(3):171–185, 2008. [12] Mathieu Bastian, Sebastien Heymann, and Mathieu Jacomy. Gephi: an open source software for exploring and manipulating networks. In Proceedings of the international AAAI conference on web and social media, volume 3, pages 361–362, 2009. [13] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
[14] CK Bauer and JR Schwarz. Physiology of eag k+ channels. The Journal of mem- brane biology, 182(1):1?15, 2001. [15] Javier L Baylon, Josh V Vermaas, Melanie P Muller, Mark J Arcario, Taras V Pogorelov, and Emad Tajkhorshid. Atomic-level description of protein?lipid in- teractions using an accelerated membrane model. Biochimica et Biophysica Acta (BBA)-Biomembranes, 1858(7):1573?1583, 2016. [16] Kyle A Beauchamp, Daniel L Ensign, Rhiju Das, and Vijay S Pande. Quan- titative comparison of villin headpiece subdomain simulations and triplet?triplet energy transfer experiments. Proceedings of the National Academy of Sciences, 108(31):12734?12739, 2011. [17] Kyle A Beauchamp, Robert McGibbon, Yu-Shan Lin, and Vijay S Pande. Simple few-state models reveal hidden complexity in protein folding. Proceedings of the National Academy of Sciences, 109(44):17807?17813, 2012. [18] Vittorio Bellotti, Maurizio Gallieni, Sofia Giorgetti, and Diego Brancaccio. Dynamic of ?2-microglobulin fibril formation and reabsorption: The role of proteolysis. In Seminars in dialysis, volume 14, pages 117?122. Wiley Online Library, 2001. [19] Rafael C Bernardi, Marcelo CR Melo, and Klaus Schulten. Enhanced sampling techniques in molecular dynamics simulations of biological systems. Biochimica et Biophysica Acta (BBA)-General Subjects, 1850(5):872?877, 2015. 226 [20] Zachary T Berndsen, Srirupa Chakraborty, Xiaoning Wang, Christopher A Cottrell, Jonathan L Torres, Jolene K Diedrich, Cesar A Lo?pez, John R Yates, Marit J van Gils, James C Paulson, et al. Visualization of the hiv-1 env glycan shield across scales. Proceedings of the National Academy of Sciences, 117(45):28014?28025, 2020. [21] Martina Bertazzo, Dorothea Gobbo, Sergio Decherchi, and Andrea Cavalli. Machine learning and enhanced sampling simulations for computing the potential of mean force and standard binding free energy. Journal of chemical theory and computation, 17(8):5287?5300, 2021. [22] Robert B Best, Gerhard Hummer, and William A Eaton. Native contacts determine protein folding mechanisms in atomistic simulations. Proceedings of the National Academy of Sciences, 110(44):17874?17879, 2013. [23] Robert B Best, Xiao Zhu, Jihyun Shim, Pedro EM Lopes, Jeetain Mittal, Michael Feig, and Alexander D MacKerell Jr. Optimization of the additive charmm all-atom protein force field targeting improved sampling of the backbone ? , ? and side-chain ?1 and ?2 dihedral angles. Journal of chemical theory and computation, 8(9):3257? 3273, 2012. [24] Debsindhu Bhowmik, Shang Gao, Michael T Young, and Arvind Ramanathan. Deep clustering of protein folding simulations. BMC bioinformatics, 19(18):47?58, 2018. [25] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008, 2008. [26] Luigi Bonati, Yue-Yu Zhang, and Michele Parrinello. Neural networks-based vari- ationally enhanced sampling. Proceedings of the National Academy of Sciences, 116(36):17641?17647, 2019. [27] Massimiliano Bonomi and Michele Parrinello. Enhanced sampling in the well- tempered ensemble. Physical review letters, 104(19):190601, 2010. [28] Berend Jan Bosch, Ruurd Van der Zee, Cornelis AM De Haan, and Peter JM Rottier. The coronavirus spike protein is a class i virus fusion protein: structural and func- tional characterization of the fusion core complex. Journal of virology, 77(16):8801? 8811, 2003. 
[29] Gregory R Bowman, Kyle A Beauchamp, George Boxer, and Vijay S Pande. Progress and challenges in the automated construction of markov state models for full protein systems. The Journal of chemical physics, 131(12):124101, 2009. 227 [30] Gregory R Bowman, Vijay S Pande, and Frank Noe?. An introduction to Markov state models and their application to long timescale molecular simulation, volume 797. Springer Science & Business Media, 2013. [31] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015. [32] GD Brand, MHS Ramada, TC Genaro-Mattos, and C Bloch. Towards an experimen- tal classification system for membrane active peptides. Scientific reports, 8(1):1?11, 2018. [33] Tinatin I Brelidze, Anne E Carlson, Banumathi Sankaran, and William N Zagotta. Structure of the carboxy-terminal region of a kcnh channel. Nature, 481(7382):530? 533, 2012. [34] Tinatin I Brelidze, Anne E Carlson, and William N Zagotta. Absence of direct cyclic nucleotide modulation of meag1 and herg1 channels revealed with fluorescence and electrophysiological methods. Journal of Biological Chemistry, 284(41):27989? 27997, 2009. [35] Esther S Brielle, Dina Schneidman-Duhovny, and Michal Linial. The sars-cov-2 exerts a distinctive strategy for interacting with the ace2 human receptor. Viruses, 12(5):497, 2020. [36] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Van- dergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18?42, 2017. [37] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013. [38] Nicolae-Viorel Buchete and Gerhard Hummer. Peptide folding kinetics from replica exchange molecular dynamics. Physical Review E, 77(3):030902, 2008. [39] Giovanni Bussi, Davide Donadio, and Michele Parrinello. Canonical sampling through velocity rescaling. The Journal of chemical physics, 126(1):014101, 2007. [40] Javier Camacho. Ether a go-go potassium channels and cancer. Cancer letters, 233(1):1?9, 2006. [41] Lorenzo Casalino, Zied Gaieb, Jory A Goldsmith, Christy K Hjorth, Abigail C Dom- mer, Aoife M Harbison, Carl A Fogarty, Emilia P Barros, Bryn C Taylor, Jason S McLellan, et al. Beyond shielding: the roles of glycans in the sars-cov-2 spike pro- tein. ACS central science, 6(10):1722?1734, 2020. 228 [42] Michele Ceriotti, Gareth A Tribello, and Michele Parrinello. Simplifying the repre- sentation of complex free-energy landscapes using sketch-map. Proceedings of the National Academy of Sciences, 108(32):13023?13028, 2011. [43] Samitabh Chakraborti, Ponraj Prabakaran, Xiaodong Xiao, and Dimiter S Dimitrov. The sars coronavirus s glycoprotein receptor binding domain: fine mapping and functional characterization. Virology journal, 2(1):1?10, 2005. [44] Srirupa Chakraborty, Zachary T Berndsen, Nicolas W Hengartner, Bette T Korber, Andrew B Ward, and S Gnanakaran. Quantification of the resilience and vulner- ability of hiv-1 native glycan shield at atomistic detail. Iscience, 23(12):101836, 2020. [45] Jasper Fuk-Woo Chan, Kin-Hang Kok, Zheng Zhu, Hin Chu, Kelvin Kai-Wang To, Shuofeng Yuan, and Kwok-Yung Yuen. Genomic characterization of the 2019 novel human-pathogenic coronavirus isolated from a patient with atypical pneumonia after visiting wuhan. Emerging microbes & infections, 9(1):221?236, 2020. 
[46] Charles H Chen, Charles G Starr, Evan Troendle, Gregory Wiedman, William C Wimley, Jakob P Ulmschneider, and Martin B Ulmschneider. Simulation-guided rational de novo design of a small pore-forming antimicrobial peptide. Journal of the American Chemical Society, 141(12):4839?4848, 2019. [47] Wei Chen and Andrew L Ferguson. Molecular enhanced sampling with autoen- coders: On-the-fly collective variable discovery and accelerated free energy land- scape exploration. Journal of computational chemistry, 39(25):2079?2102, 2018. [48] Wei Chen, Hythem Sidky, and Andrew L Ferguson. Capabilities and limitations of time-lagged autoencoders for slow mode discovery in dynamical systems. The Journal of Chemical Physics, 151(6):064123, 2019. [49] Wei Chen, Hythem Sidky, and Andrew L Ferguson. Nonlinear discovery of slow molecular modes using state-free reversible vampnets. The Journal of chemical physics, 150(21):214114, 2019. [50] Wei Chen, Aik Rui Tan, and Andrew L Ferguson. Collective variable discovery and enhanced sampling using autoencoders: Innovations in network architecture and error function design. The Journal of chemical physics, 149(7):072312, 2018. [51] Xiangyu Chen, Ren Li, Zhiwei Pan, Chunfang Qian, Yang Yang, Renrong You, Jing Zhao, Pinghuang Liu, Leiqiong Gao, Zhirong Li, et al. Human monoclonal antibodies block the binding of sars-cov-2 spike protein to angiotensin converting enzyme 2 receptor. Cellular & molecular immunology, 17(6):647?649, 2020. 229 [52] Xumin Chen, Chen Li, Matthew T Bernards, Yao Shi, Qing Shao, and Yi He. Sequence-based peptide identification, generation, and property prediction with deep learning: a review. Molecular Systems Design & Engineering, 6(6):406?428, 2021. [53] John TJ Cheng, John D Hale, Melissa Elliot, Robert EW Hancock, and Suzana K Straus. Effect of membrane composition on antimicrobial peptides aurein 2.2 and 2.3 from australian southern bell frogs. Biophysical Journal, 96(2):552?565, 2009. [54] Alessia Cherubini, Giovanna Hofmann, Serena Pillozzi, Leonardo Guasti, Olivia Crociani, Emanuele Cilia, Paola Di Stefano, Simona Degani, Manuela Balzi, Mas- simo Olivotto, et al. Human ether-a-go-go-related gene 1 channels are physically linked to ?1 integrins and modulate adhesion-dependent signaling. Molecular biol- ogy of the cell, 16(6):2972?2983, 2005. [55] Xiangyang Chi, Renhong Yan, Jun Zhang, Guanying Zhang, Yuanyuan Zhang, Meng Hao, Zhe Zhang, Pengfei Fan, Yunzhu Dong, Yilong Yang, et al. A neu- tralizing human antibody binds to the n-terminal domain of the spike protein of sars-cov-2. Science, 369(6504):650?655, 2020. [56] Fabrizio Chiti, Palma Mangione, Alessia Andreola, Sofia Giorgetti, Massimo Ste- fani, Christopher M Dobson, Vittorio Bellotti, and Niccolo? Taddei. Detection of two partially structured species in the folding process of the amyloidogenic protein ?2-microglobulin. Journal of molecular biology, 307(1):379?391, 2001. [57] Jae-Hyun Cho, Wenli Meng, Satoshi Sato, Eun Young Kim, Hermann Schindelin, and Daniel P Raleigh. Energetically significant networks of coupled interactions within an unfolded protein. Proceedings of the National Academy of Sciences, 111(33):12079?12084, 2014. [58] Kyunghyun Cho, Bart Van Merrie?nboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014. [59] John D Chodera and Frank Noe?. Markov state models of biomolecular conforma- tional dynamics. Current opinion in structural biology, 25:135?144, 2014. 
[60] John D Chodera, Nina Singhal, Vijay S Pande, Ken A Dill, and William C Swope. Automatic discovery of metastable states for the construction of markov models of macromolecular conformational dynamics. The Journal of chemical physics, 126(15):04B616, 2007. [61] John D Chodera, William C Swope, Jed W Pitera, and Ken A Dill. Long-time pro- tein folding dynamics from short-time molecular dynamics simulations. Multiscale Modeling & Simulation, 5(4):1214?1226, 2006. 230 [62] Song-Ho Chong and Sihyun Ham. Examining a thermodynamic order parameter of protein folding. Scientific reports, 8(1):1?9, 2018. [63] Song-Ho Chong, Jooyeon Hong, Sulgi Lim, Sunhee Cho, Jinkeong Lee, and Sihyun Ham. Structural and thermodynamic characteristics of amyloidogenic intermediates of ? -2-microglobulin. Scientific reports, 5(1):1?9, 2015. [64] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. Advances in neural information processing systems, 28, 2015. [65] Melissa M Coughlin and Bellur S Prabhakar. Neutralizing human monoclonal an- tibodies to severe acute respiratory syndrome coronavirus: target, mechanism of action, and therapeutic potential. Reviews in medical virology, 22(1):2?17, 2012. [66] Max Crispin, Andrew B Ward, and Ian A Wilson. Structure and immune recognition of the hiv glycan shield. Annual review of biophysics, 47:499?523, 2018. [67] Jianmin Cui. Voltage-dependent gating: novel insights from kcnq1 channels. Bio- physical journal, 110(1):14?25, 2016. [68] Jie Cui, Fang Li, and Zheng-Li Shi. Origin and evolution of pathogenic coron- aviruses. Nature Reviews Microbiology, 17(3):181?192, 2019. [69] Tom Darden, Darrin York, and Lee Pedersen. Particle mesh ewald: An n log (n) method for ewald sums in large systems. The Journal of chemical physics, 98(12):10089?10092, 1993. [70] Payel Das, Tom Sercu, Kahini Wadhawan, Inkit Padhi, Sebastian Gehrmann, Flaviu Cipcigan, Vijil Chenthamarakshan, Hendrik Strobelt, Cicero Dos Santos, Pin-Yu Chen, et al. Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations. Nature Biomedical Engineering, 5(6):613?623, 2021. [71] Payel Das, Kahini Wadhawan, Oscar Chang, Tom Sercu, Cicero Dos Santos, Matthew Riemer, Vijil Chenthamarakshan, Inkit Padhi, and Aleksandra Mojsilovic. Pepcvae: Semi-supervised targeted design of antimicrobial peptide sequences. arXiv preprint arXiv:1810.07743, 2018. [72] Ruslan L Davidchack, Richard Handel, and MV Tretyakov. Langevin thermostat for rigid body dynamics. The Journal of chemical physics, 130(23):234101, 2009. [73] Andrew D Davidson, Maia Kavanagh Williamson, Sebastian Lewis, Deborah Shoe- mark, Miles W Carroll, Kate J Heesom, Maria Zambon, Joanna Ellis, Philip A Lewis, Julian A Hiscox, et al. Characterisation of the transcriptome and proteome of sars-cov-2 reveals a cell passage induced in-frame deletion of the furin-like cleavage site from the spike glycoprotein. Genome medicine, 12(1):1?15, 2020. 231 [74] Joa?o V de Souza, Sylvia Reznikov, Ruidi Zhu, and Agnieszka K Bronowska. Drug- gability assessment of mammalian per?arnt?sim [pas] domains using computational approaches. Medchemcomm, 10(7):1126?1137, 2019. [75] Scott N Dean, Barney M Bishop, and Monique L Van Hoek. Natural and synthetic cathelicidin peptides with anti-microbial and anti-biofilm activity against staphylo- coccus aureus. BMC microbiology, 11(1):1?13, 2011. 
[76] Damon Deming, Timothy Sheahan, Mark Heise, Boyd Yount, Nancy Davis, Amy Sims, Mehul Suthar, Jack Harkema, Alan Whitmore, Raymond Pickles, et al. Vac- cine efficacy in senescent mice challenged with recombinant sars-cov bearing epi- demic and zoonotic spike variants. PLoS medicine, 3(12):e525, 2006. [77] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1?22, 1977. [78] Nan-jie Deng, Wei Dai, and Ronald M Levy. How kinetics within the unfolded state affects protein folding: An analysis based on markov state models and an ultra-long md trajectory. The Journal of Physical Chemistry B, 117(42):12787?12799, 2013. [79] Daniele Derossi, Alain H Joliot, Gerard Chassaing, and Alain Prochiantz. The third helix of the antennapedia homeodomain translocates through biological membranes. Journal of Biological Chemistry, 269(14):10444?10450, 1994. [80] Se?bastien Deshayes, Marc Decaffmeyer, Robert Brasseur, and Annick Thomas. Structural polymorphism of two cpp: an important parameter of activity. Biochimica et Biophysica Acta (BBA)-Biomembranes, 1778(5):1197?1205, 2008. [81] Jyoti Dev, Donghyun Park, Qingshan Fu, Jia Chen, Heather Jiwon Ha, Fadi Ghan- tous, Tobias Herrmann, Weiting Chang, Zhijun Liu, Gary Frey, et al. Structural ba- sis for membrane anchoring of hiv-1 envelope spike. Science, 353(6295):172?175, 2016. [82] Kuldeep Dhama, Khan Sharun, Ruchi Tiwari, Maryam Dadar, Yashpal Singh Malik, Karam Pal Singh, and Wanpen Chaicumpa. Covid-19, an emerging coronavirus in- fection: advances and prospects in designing and developing vaccines, immunother- apeutics, and therapeutics. Human vaccines & immunotherapeutics, 16(6):1232? 1238, 2020. [83] Ken A Dill, S Banu Ozkan, M Scott Shell, and Thomas R Weikl. The protein folding problem. Annu. Rev. Biophys., 37:289?316, 2008. [84] Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. Deep unsupervised clustering 232 with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016. [85] Christopher M Dobson. Protein folding and misfolding. Nature, 426(6968):884? 890, 2003. [86] Han Du, Sumant Puri, Andrew McCall, Hannah L Norris, Thomas Russo, and Mira Edgerton. Human salivary protein histatin 5 has potent bactericidal activity against eskape pathogens. Frontiers in Cellular and Infection Microbiology, 7:41, 2017. [87] David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Go?mez- Bombarelli, Timothy Hirzel, Ala?n Aspuru-Guzik, and Ryan P Adams. Convo- lutional networks on graphs for learning molecular fingerprints. arXiv preprint arXiv:1509.09292, 2015. [88] Timo Eichner, Arnout P Kalverda, Gary S Thompson, Steve W Homans, and Sheena E Radford. Conformational conversion during amyloid formation at atomic resolution. Molecular cell, 41(2):161?172, 2011. [89] Timo Eichner and Sheena E Radford. A generic mechanism of ?2-microglobulin amyloid assembly at neutral ph involving a specific proline switch. Journal of molec- ular biology, 386(5):1312?1326, 2009. [90] Emel??a Eir??ksdo?ttir, Karidia Konate, U?lo Langel, Gilles Divita, and Se?bastien De- shayes. Secondary structure of cell-penetrating peptides controls membrane in- teraction and insertion. Biochimica et Biophysica Acta (BBA)-Biomembranes, 1798(6):1119?1128, 2010. [91] Stefan Elbe and Gemma Buckland-Merrett. 
Data, disease and diplomacy: Gisaid?s innovative contribution to global health. Global challenges, 1(1):33?46, 2017. [92] Charles A English and Angel E Garc??a. Charged termini on the trp-cage roughen the folding energy landscape. The Journal of Physical Chemistry B, 119(25):7874? 7881, 2015. [93] G Esposito, R Michelutti, G Verdone, P Viglino, H Hernandez, CV Robinson, A Amoresano, F Dal Piaz, M Monti, P Pucci, et al. Removal of the n-terminal hexapeptide from human ?2-microglobulin facilitates protein aggregation and fibril formation. Protein Science, 9(5):831?845, 2000. [94] Ulrich Essmann, Lalith Perera, Max L Berkowitz, Tom Darden, Hsing Lee, and Lee G Pedersen. A smooth particle mesh ewald method. The Journal of chemical physics, 103(19):8577?8593, 1995. 233 [95] S??lvia G Esta?cio, Heinrich Krobath, Diogo Vila-Vic?osa, Miguel Machuqueiro, Eu- gene I Shakhnovich, and Patr??cia FN Fa??sca. A simulated intermediate state for folding and aggregation provides insights into ?n6 ?2-microglobulin amyloidogenic behavior. PLoS computational biology, 10(5):e1003606, 2014. [96] Denis J Evans and Brad Lee Holian. The nose?hoover thermostat. The Journal of chemical physics, 83(8):4069?4074, 1985. [97] Scott E Feller, Yuhong Zhang, Richard W Pastor, and Bernard R Brooks. Constant pressure molecular dynamics simulation: The langevin piston method. The Journal of chemical physics, 103(11):4613?4621, 1995. [98] Jhosimar Arias Figueroa and Ad??n Ram??rez Rivera. Is simple better?: Revisit- ing simple generative models for unsupervised clustering. In Second workshop on Bayesian Deep Learning (NIPS), 2017. [99] Centers for Disease Control, Prevention, et al. Antibiotic resistance threats in the United States, 2019. US Department of Health and Human Services, Centres for Disease Control and . . . , 2019. [100] Peter Forster, Lucy Forster, Colin Renfrew, and Michael Forster. Phylogenetic net- work analysis of sars-cov-2 genomes. Proceedings of the National Academy of Sci- ences, 117(17):9241?9243, 2020. [101] Alan D Frankel and Carl O Pabo. Cellular uptake of the tat protein from human immunodeficiency virus. Cell, 55(6):1189?1193, 1988. [102] Haoyi Fu, Zicheng Cao, Mingyuan Li, and Shunfang Wang. Acep: improving an- timicrobial peptides recognition through automatic feature fusion and amino acid embedding. BMC genomics, 21(1):1?14, 2020. [103] Barry Ganetzky, Gail A Robertson, Gisela F Wilson, Matthew C Trudeau, and Steven A Titus. The eag family of k+ channels in drosophila and mammals. An- nals of the New York Academy of Sciences, 868(1):356?369, 1999. [104] Mahdi Ghorbani, Samarjeet Prasad, Jeffery B Klauda, and Bernard R Brooks. Vari- ational embedding of protein folding simulations using gaussian mixture variational autoencoders. The Journal of Chemical Physics, 155(19):194108, 2021. [105] Aldo Glielmo, Brooke E Husic, Alex Rodriguez, Cecilia Clementi, Frank Noe?, and Alessandro Laio. Unsupervised learning methods for molecular simulation data. Chemical Reviews, 2021. [106] Vladimir Gligorijevic?, P Douglas Renfrew, Tomasz Kosciolek, Julia Koehler Le- man, Daniel Berenberg, Tommi Vatanen, Chris Chandler, Bryn C Taylor, Ian M Fisk, Hera Vlamakis, et al. Structure-based protein function prediction using graph convolutional networks. Nature communications, 12(1):1?14, 2021. 234 [107] Ba?rbara Gomes, Marcelo T Augusto, Ma?rio R Fel??cio, Axel Hollmann, Octa?vio L Franco, So?nia Gonc?alves, and Nuno C Santos. Designing improved active peptides for therapeutic approaches against infectious diseases. 
Biotechnology advances, 36(2):415?429, 2018. [108] Zifan Gong, Svetlana P Ikonomova, and Amy J Karlsson. Secondary structure of cell-penetrating peptides during interaction with fungal cells. Protein Science, 27(3):702?713, 2018. [109] Helmut Grubmu?ller, Helmut Heller, Andreas Windemuth, and Klaus Schulten. Gen- eralized verlet algorithm for efficient molecular dynamics simulations with long- range interactions. Molecular Simulation, 6(1-3):121?142, 1991. [110] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Gang Wang, Jianfei Cai, et al. Recent advances in convolutional neural networks. Pattern Recognition, 77:354?377, 2018. [111] Shan-shan Guan, Wei-wei Han, Hao Zhang, Song Wang, and Ya-ming Shan. Insight into the interactive residues between two domains of human somatic angiotensin- converting enzyme and angiotensin ii by mm-pbsa calculation and steered molecular dynamics simulation. Journal of Biomolecular Structure and Dynamics, 34(1):15? 28, 2016. [112] Chunsheng Guo, Jialuo Zhou, Huahua Chen, Na Ying, Jianwu Zhang, and Di Zhou. Variational autoencoder with optimizing gaussian mixture model priors. IEEE Ac- cess, 8:43992?44005, 2020. [113] Anvita Gupta and James Zou. Feedback gan for dna optimizes protein functions. Nature Machine Intelligence, 1(2):105?111, 2019. [114] Ahleah S Gustina and Matthew C Trudeau. Herg potassium channel regulation by the n-terminal eag domain. Cellular signalling, 24(8):1592?1598, 2012. [115] Olgun Guvench and Alexander D MacKerell. Comparison of protein force fields for molecular dynamics simulations. Molecular modeling of proteins, pages 63?88, 2008. [116] Olgun Guvench, Sairam S Mallajosyula, E Prabhu Raman, Elizabeth Hatcher, Kenno Vanommeslaeghe, Theresa J Foster, Francis W Jamison, and Alexander D MacK- erell Jr. Charmm additive all-atom force field for carbohydrate derivatives and its utility in polysaccharide and carbohydrate?protein modeling. Journal of chemical theory and computation, 7(10):3162?3180, 2011. [117] Aric Hagberg, Pieter Swart, and Daniel S Chult. Exploring network structure, dynamics, and function using networkx. Technical report, Los Alamos National Lab.(LANL), Los Alamos, NM (United States), 2008. 235 [118] Yoni Haitin, Anne E Carlson, and William N Zagotta. The structural mechanism of kcnh-channel regulation by the eag domain. Nature, 501(7467):444?448, 2013. [119] William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 1025?1035, 2017. [120] Audra A Hargett, Qing Wei, Barbora Knoppova, Stacy Hall, Zhi-Qiang Huang, Amol Prakash, Todd J Green, Zina Moldoveanu, Milan Raska, Jan Novak, et al. Defining hiv-1 envelope n-glycan microdomains through site-specific heterogeneity profiles. Journal of virology, 93(1):e01177?18, 2019. [121] Matthew P Harrigan, Mohammad M Sultan, Carlos X Herna?ndez, Brooke E Husic, Peter Eastman, Christian R Schwantes, Kyle A Beauchamp, Robert T McGibbon, and Vijay S Pande. Msmbuilder: statistical models for biomolecular dynamics. Bio- physical journal, 112(1):10?15, 2017. [122] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770?778, 2016. [123] Niels HH Heegaard. ?2-microglobulin: from physiology to amyloidosis. Amyloid, 16(3):151?173, 2009. 
[124] Rainer Hegger, Alexandros Altis, Phuong H Nguyen, and Gerhard Stock. How complex is the dynamics of peptide folding? Physical review letters, 98(2):028102, 2007. [125] Martin Held, Philipp Metzner, Jan-Hendrik Prinz, and Frank Noe?. Mechanisms of protein-ligand association and its modulation by protein mutations. Biophysical journal, 100(3):701?710, 2011. [126] Iddo Heller, Gerrit Sitters, Onno D Broekmans, Ge?raldine Farge, Carolin Menges, Wolfgang Wende, Stefan W Hell, Erwin JG Peterman, and Gijs JL Wuite. Sted nanoscopy combined with optical tweezers reveals protein dynamics on densely cov- ered dna. Nature methods, 10(9):910?916, 2013. [127] Jonathan T Henry and Sean Crosson. Ligand-binding pas domains in a genomic, cellular, and structural context. Annual review of microbiology, 65:261?286, 2011. [128] Carlos X Herna?ndez, Hannah K Wayment-Steele, Mohammad M Sultan, Brooke E Husic, and Vijay S Pande. Variational encoding of complex dynamics. Physical Review E, 97(6):062412, 2018. [129] Ren Higashida and Yasuhiro Matsunaga. Enhanced conformational sampling of nanobody cdr h3 loop by generalized replica-exchange with solute tempering. Life, 11(12):1428, 2021. 236 [130] Sepp Hochreiter and Ju?rgen Schmidhuber. Long short-term memory. Neural com- putation, 9(8):1735?1780, 1997. [131] Moritz Hoffmann, Martin Konrad Scherer, Tim Hempel, Andreas Mardt, Brian de Silva, Brooke Elena Husic, Stefan Klus, Hao Wu, J Nathan Kutz, Steven Brunton, and Frank Noe?. Deeptime: a python library for machine learning dynamical models from time series data. Machine Learning: Science and Technology, 2021. [132] William G Hoover. Canonical dynamics: Equilibrium phase-space distributions. Physical review A, 31(3):1695, 1985. [133] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359?366, 1989. [134] Adam Hospital, Josep Ramon Gon?i, Modesto Orozco, and Josep L Gelp??. Molecular dynamics simulations: advances and applications. Advances and applications in bioinformatics and chemistry: AABC, 8:37, 2015. [135] William Humphrey, Andrew Dalke, and Klaus Schulten. Vmd: visual molecular dynamics. Journal of molecular graphics, 14(1):33?38, 1996. [136] Brooke E Husic, Nicholas E Charron, Dominik Lemm, Jiang Wang, Adria? Pe?rez, Maciej Majewski, Andreas Kra?mer, Yaoyi Chen, Simon Olsson, Gianni de Fabritiis, et al. Coarse graining molecular dynamics with graph neural networks. The Journal of Chemical Physics, 153(19):194101, 2020. [137] Brooke E Husic, Robert T McGibbon, Mohammad M Sultan, and Vijay S Pande. Optimized parameter selection reveals trends in markov state models for protein folding. The Journal of chemical physics, 145(19):194103, 2016. [138] Brooke E Husic and Vijay S Pande. Ward clustering improves cross-validated markov state models of protein folding. Journal of chemical theory and compu- tation, 13(3):963?967, 2017. [139] Brooke E Husic and Vijay S Pande. Markov state models: From an art to a science. Journal of the American Chemical Society, 140(7):2386?2396, 2018. [140] Piet Hut, Jun Makino, and Steve McMillan. Building a better leapfrog. The Astro- physical Journal, 443:L93?L96, 1995. [141] Rieko Ishima and Dennis A Torchia. Protein dynamics from nmr. Nature structural biology, 7(9):740?743, 2000. [142] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparametrization with gumble- softmax. In International Conference on Learning Representations (ICLR 2017). OpenReview. net, 2017. 
237 [143] Sunhwan Jo, Taehoon Kim, Vidyashankara G Iyer, and Wonpil Im. Charmm-gui: a web-based graphical user interface for charmm. Journal of computational chemistry, 29(11):1859?1865, 2008. [144] William L Jorgensen, Jayaraman Chandrasekhar, Jeffry D Madura, Roger W Impey, and Michael L Klein. Comparison of simple potential functions for simulating liquid water. The Journal of chemical physics, 79(2):926?935, 1983. [145] Atsushi Kameda, Masaru Hoshino, Takashi Higurashi, Satoshi Takahashi, Hironobu Naiki, and Yuji Goto. Nuclear magnetic resonance characterization of the refolding intermediate of ?2-microglobulin trapped by non-native prolyl peptide bond. Jour- nal of molecular biology, 348(2):383?397, 2005. [146] Motoshi Kamiya and Yuji Sugita. Flexible selection of the solute region in replica exchange with solute tempering: Application to protein-folding simulations. The Journal of chemical physics, 149(7):072304, 2018. [147] Po Wei Kang, Annie M Westerlund, Jingyi Shi, Kelli McFarland White, Alex K Dou, Amy H Cui, Jonathan R Silva, Lucie Delemotte, and Jianmin Cui. Calmodulin acts as a state-dependent switch to control a cardiac potassium channel opening. Science advances, 6(50):eabd6798, 2020. [148] Xinyue Kang, Fanyi Dong, Cheng Shi, Shicai Liu, Jian Sun, Jiaxin Chen, Haiqi Li, Hanmei Xu, Xingzhen Lao, and Heng Zheng. Dramp 2.0, an updated data repository of antimicrobial peptides. Scientific data, 6(1):1?10, 2019. [149] Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of computer- aided molecular design, 30(8):595?608, 2016. [150] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [151] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. [152] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convo- lutional networks. arXiv preprint arXiv:1609.02907, 2016. [153] Jeffery B Klauda, Richard M Venable, J Alfredo Freites, Joseph W O?Connor, Dou- glas J Tobias, Carlos Mondragon-Ramirez, Igor Vorobyov, Alexander D MacK- erell Jr, and Richard W Pastor. Update of the charmm all-atom additive force field for lipids: validation on six lipid types. The journal of physical chemistry B, 114(23):7830?7843, 2010. 238 [154] Kai J Kohlhoff, Diwakar Shukla, Morgan Lawrenz, Gregory R Bowman, David E Konerding, Dan Belov, Russ B Altman, and Vijay S Pande. Cloud-based simulations on google exacycle reveal ligand modulation of gpcr activation pathways. Nature chemistry, 6(1):15?21, 2014. [155] Youxin Kong, Bert JC Janssen, Tomas Malinauskas, Vamshidhar R Vangoor, Char- lotte H Coles, Rainer Kaufmann, Tao Ni, Robert JC Gilbert, Sergi Padilla-Parra, R Jeroen Pasterkamp, et al. Structural basis for plexin activation and regulation. Neuron, 91(3):548?560, 2016. [156] Susanna Kube and Marcus Weber. A coarse graining method for the identification of transition rates between molecular conformations. The Journal of chemical physics, 126(2):024103, 2007. [157] Jan Kubelka, Thang K Chiu, David R Davies, William A Eaton, and James Hofrichter. Sub-microsecond protein folding. Journal of molecular biology, 359(3):546?553, 2006. [158] Jan Kubelka, William A Eaton, and James Hofrichter. Experimental tests of villin subdomain folding simulations. Journal of molecular biology, 329(4):625?630, 2003. 
List of publications

1. Mahdi Ghorbani, Bernard R Brooks, and Jeffery B Klauda. Critical sequence hotspots for binding of novel coronavirus to angiotensin converter enzyme as evaluated by molecular simulations. The Journal of Physical Chemistry B, 124(45):10034-10047, 2020.

2. Mahdi Ghorbani, Bernard R Brooks, and Jeffery B Klauda. Exploring dynamics and network analysis of spike glycoprotein of SARS-CoV-2. Biophysical Journal, 120(14):2902-2913, 2021.

3. Mahdi Ghorbani, Phillip S Hudson, Michael R Jones, Félix Aviat, Rubén Meana-Pañeda, Jeffery B Klauda, and Bernard R Brooks. A replica exchange umbrella sampling (REUS) approach to predict host-guest binding free energies in the SAMPL8 challenge. Journal of Computer-Aided Molecular Design, 35(5):667-677, 2021.

4. Mahdi Ghorbani, Samarjeet Prasad, Bernard R Brooks, and Jeffery B Klauda. Deep attention based variational autoencoder for antimicrobial peptide discovery. bioRxiv, 2022.

5. Mahdi Ghorbani, Samarjeet Prasad, Jeffery B Klauda, and Bernard R Brooks. Variational embedding of protein folding simulations using Gaussian mixture variational autoencoders. The Journal of Chemical Physics, 155(19):194108, 2021.

6. Mahdi Ghorbani, Samarjeet Prasad, Jeffery B Klauda, and Bernard R Brooks. GraphVAMPNet, using graph neural networks and variational approach to Markov processes for dynamical modeling of biomolecules. The Journal of Chemical Physics, 156(18):184103, 2022.

7. Mahdi Ghorbani, Eric Wang, Andreas Krämer, and Jeffery B Klauda. Molecular dynamics simulations of ethanol permeation through single and double-lipid bilayers. The Journal of Chemical Physics, 153(12):125101, 2020.

8. S Nikfarjam, M Ghorbani, S Adhikari, AJ Karlsson, EV Jouravleva, TJ Woehl, and MA Anisimov. Irreversible nature of mesoscopic aggregates in lysozyme solutions. Colloid Journal, 81(5):546-554, 2019.

Talks and Presentations

1. Ghorbani M. "GraphVAMPNet, using graph neural networks and variational approach to Markov processes for dynamical modeling of biomolecules" (Conference talk, ACS 2022, San Diego, US)

2. Ghorbani M. "Unraveling the allosteric activation of GPCRs using metadynamics and deep learning" (Conference talk, BPS 2022, San Francisco, US)

3. Ghorbani M. "Dynamical coarse graining of molecular systems using GraphVAMPNets" (LCB seminar series, 2021, NIH, Bethesda)

4. Ghorbani M.; Brooks B. R.; Klauda J. B. "An integrative MD simulation and network analysis approach to study glycosylation of spike in SARS-CoV-2" (Virtual poster presentation, BPS 2021)

5. Ghorbani M. "Gaussian mixture variational autoencoders for dimensionality reduction and clustering of protein folding simulations" (LCB seminar series, 2020, NIH, Bethesda)

6. Ghorbani M. "Investigating dynamics and network analysis of spike protein in SARS-CoV-2" (LCB seminar series, 2020, NIH)

7. Ghorbani M., Harron M., Wang E., Klauda J. B. "Mechanism of permeability and toxicity of alcohols to cell membranes by MD simulations" (Poster presentation, ACS 2019, San Diego, US)

8. Ghorbani M., Wang E., Klauda J. B. "Calculating ethanol permeability of membranes through molecular dynamics simulations" (Poster presentation, BPS 2019, Baltimore, US)