ABSTRACT

Title of dissertation: MOLECULAR DYNAMICS SIMULATION AND MACHINE LEARNING STUDY OF BIOLOGICAL PROCESSES
Mahdi Ghorbani, Doctor of Philosophy, 2022
Dissertation directed by: Professor Jeffery B. Klauda, University of Maryland, Chemical Engineering

In this dissertation, I use computational techniques, especially molecular dynamics (MD) and machine learning, to study important biological processes. MD simulations can be used effectively to investigate biologically relevant systems at length scales and timescales that are otherwise inaccessible to experimental techniques. These include, but are not limited to, the thermodynamics and kinetics of protein folding, protein-ligand binding free energies, the interaction of proteins with membranes, and the rational design of new therapeutics. The first chapter gives a detailed description of the computational methods, including MD, Markov state modeling, and deep learning. In the second chapter, we studied membrane-active peptides using MD simulation and machine learning. Two cell-penetrating peptides, MPG and Hst5, were simulated in the presence of a membrane. We showed that MPG enters the model membrane through its N-terminal hydrophobic residues, while Hst5 remains attached to the phosphate layer; formation of a helical conformation helps MPG insert more deeply into the membrane. Natural language processing (NLP) and deep generative modeling with a variational-attention-based variational autoencoder (VAE) were used to generate novel antimicrobial peptides. These in silico generated peptides are of high quality, with physicochemical properties similar to those of real antimicrobial peptides. In the third chapter, we studied the kinetics of protein folding using Markov state models (MSMs) and machine learning. MSM analysis of misfolding in β2-microglobulin (β2m) revealed metastable states in which the outer strands are unfolded and the hydrophobic core is exposed to solvent and highly amyloidogenic. In the next part of this chapter, we propose a machine learning model, the Gaussian mixture variational autoencoder (GMVAE), for simultaneous dimensionality reduction and clustering of MD simulations. The last part of this chapter describes a novel machine learning model, GraphVAMPNet, which combines graph neural networks with the variational approach to Markov processes for kinetic modeling of protein folding. In the last chapter, we study two membrane proteins, the spike protein of SARS-COV-2 and the EAG potassium channel, using MD simulations. Binding free energy calculations using MMPBSA showed a higher binding affinity of the receptor binding domain (RBD) of SARS-COV-2 for its receptor ACE2 than that of SARS-COV, which is one of the major reasons for its higher infection rate; hotspots of interaction were also identified at the interface. Glycans on the spike protein shield the spike from antibodies. Our MD simulations of the full-length spike showed that glycan dynamics give the spike protein an effective shield; however, network analysis revealed breaches in this shield at the RBD in the open state that could be exploited by therapeutics. In the last section, we study ligand binding to the PAS domain of the EAG potassium channel and show that residue Tyr71 blocks the binding pocket; ligand binding inhibits the current through the EAG channel.
MOLECULAR DYNAMICS SIMULATION AND MACHINE LEARNING STUDY OF BIOLOGICAL PROCESSES by Mahdi Ghorbani Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2022 Advisory Committee: Professor Jeffery B. Klauda, Chair/Advisor Dr. Bernard R. Brooks, Co-Advisor Prof. Srinivasa R. Raghavan Prof. Taylor J. Woehl Prof. Pratyush Tiwary ?c Copyright by Mahdi Ghorbani 2022 Dedication I would like to dedicate this dissertation to my family and friends especially my mother who has always been a source of inspiration and hard work to me. They have taught me to have a callous mind, never give up when confronting daunting challenges and know that hard work always pays off. I also dedicate this dissertation to all my friends who supported me throughout the process. I appreciate what they have done for me especially my dear friends Hamid Doosthosseini and Ehsan Faegh who always encouraged me to pursue science. ii Acknowledgments It is impossible to acknowledge everyone that supported me during my PhD. Firstly, I would like to thank my advisor Prof. Klauda. He has been a wonderful mentor who has always helped me in successfully completing my PhD and I am proud of working with him. I would also like to thank Dr. Bernard Brooks my Co-advisor at NIH. I consider myself lucky to work with such intelligent and caring people. Working alongside them and group members at NIH and UMD gave me the expertise I needed to complete my degree and without their help this wouldn?t have been possible. I would like to thank my collaborators Dr. Karlsson at UMD and Dr. Brelidze at Georgetown university. I would like to thank the computing resources at UMD (deepthought2) and NIH (biowulf and lobos) for providing the computational time for my projects. iii List of Figures 1.1 Periodic boundary conditions . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2 Illustration of replica exchange molecular dynamics (REMD) . . . . . . . . 10 1.3 An illustration of metadynamics simulation technique . . . . . . . . . . . . 12 1.4 Representation of a perceptron . . . . . . . . . . . . . . . . . . . . . . . . 22 1.5 multilayer perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 1.6 Gradient descent algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 24 1.7 Convolutional neural networks . . . . . . . . . . . . . . . . . . . . . . . . 25 1.8 Unrolling a RNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 1.9 Illustration of Long short term memory unit . . . . . . . . . . . . . . . . . 27 1.10 Variational autoencoder with encoder, decoder and a gaussian latent space . 30 2.1 Secondary structure of Hst5 and MPG . . . . . . . . . . . . . . . . . . . . 34 2.2 Starting conformations of peptides . . . . . . . . . . . . . . . . . . . . . . 39 2.3 Translocation of CPPs using HMMM . . . . . . . . . . . . . . . . . . . . . 40 2.4 Heatmaps for insertion of peptides into the model membrane . . . . . . . . 41 2.5 Heatmap plots for Hst5 and MPG starting as helices . . . . . . . . . . . . . 42 2.6 Heatmaps for MPG and Hst5 starting from unfolded . . . . . . . . . . . . . 44 2.7 Heatmap for Hst5 for orientation 2 and 3 . . . . . . . . . . . . . . . . . . . 45 2.8 Heatmap for MPG for orientation 2 and 3 . . . . . . . . . . . . . . . . . . 46 2.9 Cellular uptake studies done in Karlsson lab . . . . . . . . . . . . . . . . . 
46 iv 2.10 Effect of time and cargo orientation on translocation . . . . . . . . . . . . . 47 2.11 snapshots of MPG with DOPC(80%)-DOPG(20%) . . . . . . . . . . . . . 49 2.12 insertion depth and secondary structure of MPG at 100 and 80 DOPC . . . 49 2.13 insertion depth and secondary structure of MPG at 60 and 40 DOPC . . . . 50 2.14 classification network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 2.15 Training and validation accuracy . . . . . . . . . . . . . . . . . . . . . . . 62 2.16 AMP generative model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 2.17 Evaluation of the AMP-generative model over different values of ?KL and ?a 64 2.18 Physico-Chemical properties of the generated 10,000 AMP sequences . . . 66 3.1 feature selection with VAMP-2 score over 3 lagtimes (10,20,50 ns) . . . . . 74 3.2 Optimal choice of hyperparameters for MSM . . . . . . . . . . . . . . . . 74 3.3 Free energy landscape in the space of 4 TICs . . . . . . . . . . . . . . . . . 75 3.4 A) implied timescales B) CK test . . . . . . . . . . . . . . . . . . . . . . . 76 3.5 Eigenvectors of the transition matrix . . . . . . . . . . . . . . . . . . . . . 77 3.6 fraction of native contact for different states . . . . . . . . . . . . . . . . . 77 3.7 metastable state assignment according to PCCA++ over the TICA space . . 78 3.8 hydrophobic SASA over the TICA landscape . . . . . . . . . . . . . . . . 79 3.9 hydrophobic SASA and flux over different states . . . . . . . . . . . . . . . 81 3.10 representative structures and the timescale of transition between different states . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 3.11 RMSF of different metastable states from sampled snapshots. . . . . . . . . 82 3.12 ? -sheet content of different states A) S1 B) S2) C) S3 and D) S4 . The grey area in each figure shows the ? -sheet content of state S1 . . . . . . . . . . 83 3.13 Representative structures of states S1, S2 and S3 . . . . . . . . . . . . . . . 86 3.14 graphical model for inference and generative parts of GMVAE . . . . . . . 94 v 3.15 Native folded structure of studied proteins. . . . . . . . . . . . . . . . . . . 94 3.16 Training and validation loss for Trp-Cage example . . . . . . . . . . . . . . 96 3.17 Reconstruction loss vs latent space dimension . . . . . . . . . . . . . . . . 97 3.18 Results of GMVAE for Trp-cage. . . . . . . . . . . . . . . . . . . . . . . . 98 3.19 rp-cage folding transitions . . . . . . . . . . . . . . . . . . . . . . . . . . 100 3.20 clusters of Trp-cage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 3.21 Results of GMVAE for BBA . . . . . . . . . . . . . . . . . . . . . . . . . 102 3.22 BBA folding and unfolding transitions . . . . . . . . . . . . . . . . . . . . 103 3.23 RMSD of different parts of BBA . . . . . . . . . . . . . . . . . . . . . . . 104 3.24 GMVAE embedding results for villin . . . . . . . . . . . . . . . . . . . . . 106 3.25 Transitions between different states in villin . . . . . . . . . . . . . . . . . 106 3.26 RMSD distribution of helices in Villin . . . . . . . . . . . . . . . . . . . . 108 3.27 Overview of the architecture of GraphVAMPNet . . . . . . . . . . . . . . . 117 3.28 results of GraphVAMPNet for TrpCage . . . . . . . . . . . . . . . . . . . . 126 3.29 training and validation losses from SchNet based VAMPNet TrpCage . . . . 127 3.30 Representative structure of each metastable state in TrpCage . . . . . . . . 130 3.31 Results of GraphVampNet for Villin . . . . . . . . . . . . . . . . . 
. . . . 132 3.32 Representative structure of each metastable state in Villin . . . . . . . . . . 133 3.33 Results of GraphVAMPNet for NTL9 . . . . . . . . . . . . . . . . . . . . 135 3.34 Representative structure of each metastable state in NTL9 . . . . . . . . . . 137 3.35 Comparison of implied timescales . . . . . . . . . . . . . . . . . . . . . . 141 3.36 Comparison of implied timescales from GraphVAMPNet and standard VAMP- Net for A)TrpCage B)Villin C)NTL9 . . . . . . . . . . . . . . . . . . . . . 142 4.1 superposition of RBD of SARS-COV . . . . . . . . . . . . . . . . . . . . . 147 4.2 Sequence comparison of the receptor binding motif (RBM) . . . . . . . . . 149 vi 4.3 C? RMSD plots for nCOV-2019 and SARS-COV and mutants of SARS- COV-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 4.4 RMSF plots for nCOV-2019-WT and mutants . . . . . . . . . . . . . . . . 156 4.5 Mapping the principal components of RBD . . . . . . . . . . . . . . . . . 158 4.6 Dynamic cross correlation maps for nCOV-2019, SARS-COV and mutants . 159 4.7 Binding energy decomposition per residue for RBD of nCOV-2019 and SARS-COV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 4.8 Total binding energy of SARS-COV, nCOV-2019 and mutants . . . . . . . 163 4.9 H bonds between RBD of nCOV-2019 and ACE2 . . . . . . . . . . . . . . 167 4.10 Binding energy decomposition for systems: nCOV-2019, T478I and N439K. 171 4.11 Structure of spike protein and its glycosylation pattern . . . . . . . . . . . . 178 4.12 RMSD of different regions of the spike protein . . . . . . . . . . . . . . . . 184 4.13 Snapshot of spike in open state after 1000ns . . . . . . . . . . . . . . . . . 184 4.14 RMSF of glycans in the open and closed state . . . . . . . . . . . . . . . . 185 4.15 2-dimensional PCA for the open state of spike protein head . . . . . . . . . 186 4.16 porcupine plot for RBD-up and distribution of distances between NTD of each monomer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187 4.17 Glycan occupacy of different regions . . . . . . . . . . . . . . . . . . . . . 189 4.18 Centrality measurements (eigenvector and betweenness) for A) RBD-up and B) RBD-down states. . . . . . . . . . . . . . . . . . . . . . . . . . . . 190 4.19 network centralities for the head region . . . . . . . . . . . . . . . . . . . . 192 4.20 microdomains in the open state of spike head . . . . . . . . . . . . . . . . 194 4.21 Antibody overlap analysis for the open state. . . . . . . . . . . . . . . . . . 196 4.22 Avg number of clashes of each glycan with the overlaid antibody . . . . . . 198 vii 4.23 Structure of full-length EAG channel embedded in membrane. VSD stands for voltage sensing domain which includes S1-S4 transmembrane helices. Pore domain (PD) includes transmembranes S1 and S2 . . . . . . . . . . . 204 4.24 Replica exchange solute tempering. A) distribution of enthalpies for differ- ent replicas showing the high overlap B) transition of Tyr71 residue from the crystal structure to a state after REST simulations where it no longer blocks the binding pocket C) Random walk of resplicas 1,2 and 3 in the replica space with their neighbors over the course of simulations D) Dock- ing CPZ to the binding pocket of PAS after REST simulation. . . . . . . . 209 4.25 Docked poses of the 5 ligands . . . . . . . . . . . . . . . . . . . . . . . . . 
210 4.26 Binding free energy decomposition for residues with higher than 0.5 kcal/mol contribution to binding free energy of different ligands . . . . . . . . . . . 211 4.27 electrostatic nature of binding pocket . . . . . . . . . . . . . . . . . . . . . 212 4.28 RMSD of different regions of the EAG channel for apo and bound states . . 213 4.29 RMSF of different regions of the PAS . . . . . . . . . . . . . . . . . . . . 214 4.30 H-bonds and salt-bridges between PAS and CNBHD for apo and bound state 214 4.31 Current flow analysis for the bound state and the current flow plots . . . . . 216 4.32 current flow difference . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217 viii List of Tables 2.1 Membrane lipid components for yeast membrane . . . . . . . . . . . . . . 36 2.2 BLEU, accuracy and perplexity of a few selected models . . . . . . . . . . 65 2.3 self-BLEU (sBLEU) for 3,4 and 5-grams and KL divergence . . . . . . . . 68 3.1 stationary probability and free energy of different metastable states . . . . . 78 3.2 Chosen hyperparameters for each protein . . . . . . . . . . . . . . . . . . . 93 3.3 Hyperparameters for each system in GraphVAMPNet . . . . . . . . . . . . 125 3.4 Average VAMP-2 score for each system . . . . . . . . . . . . . . . . . . . 125 3.5 Implied timescales calculated for TrpCage (at lagtime of 20ns), Villin (at lagtime of 20ns) and NTL9 (lagtime of 200ns) from SchNet based Graph- VAMPNet and standard VAMPNet . . . . . . . . . . . . . . . . . . . . . . 128 4.1 Binding free energy decomposition in kcal/mol for nCOV-2019, SARS- COV and mutants of SASR-COV-2 . . . . . . . . . . . . . . . . . . . . . . 163 4.2 H-bonds and salt bridges between nCOV-2019(salt bridges are shown as bold)166 4.3 Binding free energies kcal/mol . . . . . . . . . . . . . . . . . . . . . . . . 210 ix Table of Contents Dedication ii Acknowledgements iii 1 Introduction 1 1.1 Molecular dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Force Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.1.2 Hamiltonian equations of motion . . . . . . . . . . . . . . . . . . . 3 1.1.3 Integrators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.1.4 Periodic boundary conditions . . . . . . . . . . . . . . . . . . . . . 5 1.1.5 Cutoff methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.1.6 Thermostat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.1.7 Barostats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.1.8 Enhanced sampling methods . . . . . . . . . . . . . . . . . . . . . 9 1.2 Kinetic modeling and Markov State Models . . . . . . . . . . . . . . . . . 14 1.2.1 MSM construction from MD simulations . . . . . . . . . . . . . . 18 1.3 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 1.3.1 Artificial Neural network . . . . . . . . . . . . . . . . . . . . . . . 21 1.3.2 Convolutional neural networks . . . . . . . . . . . . . . . . . . . . 24 1.3.3 Recurrent neural networks . . . . . . . . . . . . . . . . . . . . . . 25 1.3.4 Variational Autoencoders . . . . . . . . . . . . . . . . . . . . . . . 28 1.3.5 Graph neural networks . . . . . . . . . . . . . . . . . . . . . . . . 31 1.4 Dissertation overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2 Machine learning and MD for membrane active peptides 33 2.1 Molecular dynamics of cell penetrating peptide interaction with model mem- branes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . 33 2.1.1 Simulation methods . . . . . . . . . . . . . . . . . . . . . . . . . . 36 2.1.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 2.1.3 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . 48 2.2 Deep generative models for Antimicrobial peptide discovery . . . . . . . . 53 x 2.2.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 2.2.2 AMP prediction model . . . . . . . . . . . . . . . . . . . . . . . . 55 2.2.3 Variational autoencoder . . . . . . . . . . . . . . . . . . . . . . . . 57 Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 Variational Attention . . . . . . . . . . . . . . . . . . . . . . . . . 59 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 2.2.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . 60 Antimicrobial prediction . . . . . . . . . . . . . . . . . . . . . . . 60 Training the generating network . . . . . . . . . . . . . . . . . . . 61 2.2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 3 Markov modeling and machine learning 70 3.1 Markov modeling of conformational fluctuations in ?2-microglobulin . . . . 70 3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 3.2.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 3.2.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 MSM construction and validation . . . . . . . . . . . . . . . . . . 73 3.2.3 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . 82 3.3 Variational embedding of protein folding simulations using Gaussian mix- ture variational autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . 87 3.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 3.3.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 3.3.3 Model parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 3.3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 3.3.5 Trp-cage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 3.3.6 BBA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 3.3.7 Villin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 3.3.8 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . 108 3.4 GraphVAMPNet, using graph neural networks and variational approach to markov processes for dynamical modeling of biomolecules . . . . . . . . . 112 3.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 3.4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 3.4.3 Protein Graph representation . . . . . . . . . . . . . . . . . . . . . 120 Graph Convolution layer . . . . . . . . . . . . . . . . . . . . . . . 121 SchNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 3.4.4 Model selection and Hyperparameters . . . . . . . . . . . . . . . . 124 3.4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 3.4.6 Trpcage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 3.4.7 Villin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 3.4.8 NTL9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 3.4.9 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . 
136 xi 4 Molecular dynamics study of membrane proteins 145 4.1 Critical Sequence hotspots for binding of novel coronavirus to Angiotensin Converter Enzyme as Evaluated by Molecular Simulations . . . . . . . . . 145 4.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 4.1.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151 Sequence comparison and mutant preparation . . . . . . . . . . . . 151 Molecular dynamics simulations . . . . . . . . . . . . . . . . . . . 152 Gibbs free energy and correlated motions . . . . . . . . . . . . . . 153 Binding free energy from MMPBSA method . . . . . . . . . . . . 154 4.1.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 Structural dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . 155 PCA and approximate free energy landscape . . . . . . . . . . . . . 156 Dynamic Cross Correlation Maps (DCCM) . . . . . . . . . . . . . 157 Binding free energies . . . . . . . . . . . . . . . . . . . . . . . . . 159 Important interactions at the RBD-ACE2 interface . . . . . . . . . . 162 4.1.4 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . 166 4.2 Exploring dynamics and network analysis of spike glycoprotein of SARS- COV-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 4.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 4.2.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 Molecular dynamcis simulations . . . . . . . . . . . . . . . . . . . 179 Solvent accessible surface area (SASA) . . . . . . . . . . . . . . . 180 Network analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 Glycan-antibody overlap analysis . . . . . . . . . . . . . . . . . . 182 4.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182 Dynamical motions of the spike protein . . . . . . . . . . . . . . . 182 Occupancy of spike protein by glycans . . . . . . . . . . . . . . . . 188 Glycan-Glycan interaction and network analysis of glycans . . . . . 189 Antibody overlap analysis . . . . . . . . . . . . . . . . . . . . . . 195 4.2.4 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . 197 4.3 Molecular dynamics of ligand binding to PAS domain of EAG channel . . . 202 4.3.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 4.3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207 Full length EAG simulation . . . . . . . . . . . . . . . . . . . . . . 212 4.3.3 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . 217 5 Conclusions and open problems 221 Bibliography 225 xii Chapter 1: Introduction 1.1 Molecular dynamics Proteins are responsible for nearly all biological processes that is essential for life. They metabolize nutrients, regulate genetics, recognize pathogens and sense the outside world. This is remarkable considering proteins are linear polymers of 20 building blocks called amino acids. Functionality of proteins depends on spatial and temporal structure of the protein. However, a 3D shape of protein is not the only determinant of protein function. 
Conformational flexibility is an inherent property of all proteins, and it is essential for the function of many of them, such as transport proteins, signal transduction proteins, proteins involved in cellular recognition, and numerous enzymes.[345] Allosteric proteins such as GPCRs undergo large-scale conformational changes upon ligand binding to their binding site, which in turn triggers a cascade of intracellular responses.[219] There are numerous experimental techniques for studying protein dynamics, such as nuclear magnetic resonance (NMR)[141], fluorescence resonance energy transfer (FRET)[183], atomic force microscopy, and optical tweezers[126]. Despite the number of experimental techniques available, there are spatio-temporal limitations on the time and length scales of the conformational space these methods can access. Moreover, details about the pathways connecting different conformations remain unknown. Molecular dynamics (MD), on the other hand, has been called a computational microscope,[163] revealing the detailed microscopic interactions that play major roles in folding, ligand binding, and other biological problems. In fact, the Nobel Prize in Chemistry in 2013 was awarded to Martin Karplus, Michael Levitt, and Arieh Warshel for their pioneering work on MD methodology for biomolecular systems. In short, MD computes the interactions between particles whose positions are given by the 3D Cartesian coordinates of individual atoms. The motion of each particle is governed by the potential energy, whose derivatives are calculated to obtain the forces between particles, which are then used to solve Newton's equations of motion. Solving these equations in consecutive steps generates a trajectory of the dynamics of the system under study. All-atom MD using classical force fields has allowed the study of dynamics in systems ranging from small molecules and short peptides to large assemblies such as virus capsids.[38, 187] The accuracy of an MD simulation depends on two factors. The first is the empirical force field that provides the parameters for the interactions between particles in the system. The other is the simulation time, which should be long enough to overcome the local energy barriers that otherwise lead to quasi-ergodicity. One straightforward approach is to run long simulations on supercomputers.[273] Enhanced sampling approaches have also been developed to sample conformational space more efficiently. The following subsections give a broad overview of the basic concepts and techniques in MD simulation.

1.1.1 Force Fields

A force field describes the parameters of interaction between particles in the system. Empirical force fields represent biomolecules at atomistic resolution. These additive potential energy functions consist of a large number of force-field parameters obtained from empirical and quantum mechanical studies of small molecules. Some commonly used force fields are CHARMM, AMBER, and GROMOS.[115] These force fields may involve different terms and definitions of the potential energy function. For example, the CHARMM force field takes the form:

V(R) = \sum_{\mathrm{bonds}} K_b (b - b_0)^2 + \sum_{\mathrm{angles}} K_\theta (\theta - \theta_0)^2 + \sum_{\mathrm{dihedrals}} K_\phi [1 + \cos(n\phi - \delta)] + \sum_{\mathrm{Urey\text{-}Bradley}} K_{UB} (S - S_0)^2 + \sum_{\mathrm{impropers}} K_{\mathrm{imp}} (\omega - \omega_0)^2 + \sum_{\mathrm{non\text{-}bonded}} \left[ \epsilon_{ij} \left( \left(\frac{R_{\min,ij}}{r_{ij}}\right)^{12} - 2\left(\frac{R_{\min,ij}}{r_{ij}}\right)^{6} \right) + \frac{q_i q_j}{\epsilon_1 r_{ij}} \right]   (1.1)

In the above equation, V(R) is the total energy in the CHARMM force field. The first three terms are the potential energies for bonds, angles, and dihedral torsions, respectively. The remaining terms are the Urey-Bradley, improper dihedral, non-bonded van der Waals (Lennard-Jones), and electrostatic contributions.
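To make the functional form concrete, the short sketch below evaluates a toy version of a few of these terms (one bond, one angle, and a single Lennard-Jones/Coulomb pair) with NumPy. The parameter values and coordinates are arbitrary illustrative numbers chosen for this example, not values from any actual CHARMM parameter file.

```python
import numpy as np

# Illustrative parameters only -- not taken from a real CHARMM parameter file.
K_b, b0 = 300.0, 1.53               # bond force constant (kcal/mol/A^2), equilibrium length (A)
K_theta, theta0 = 50.0, 1.911       # angle force constant, equilibrium angle (rad)
eps_ij, rmin_ij = 0.15, 3.8         # LJ well depth (kcal/mol) and distance at the minimum (A)
q_i, q_j, eps_1 = 0.25, -0.25, 1.0  # partial charges (e) and dielectric constant

def bond_energy(r1, r2):
    b = np.linalg.norm(r1 - r2)
    return K_b * (b - b0) ** 2

def angle_energy(r1, r2, r3):
    v1, v2 = r1 - r2, r3 - r2
    theta = np.arccos(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    return K_theta * (theta - theta0) ** 2

def nonbonded_energy(r1, r2):
    r = np.linalg.norm(r1 - r2)
    lj = eps_ij * ((rmin_ij / r) ** 12 - 2.0 * (rmin_ij / r) ** 6)
    coulomb = 332.0636 * q_i * q_j / (eps_1 * r)  # 332.0636 converts e^2/A to kcal/mol
    return lj + coulomb

atoms = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [2.1, 1.3, 0.0], [6.0, 0.0, 0.0]])
total = (bond_energy(atoms[0], atoms[1])
         + angle_energy(atoms[0], atoms[1], atoms[2])
         + nonbonded_energy(atoms[0], atoms[3]))
print(f"toy potential energy: {total:.2f} kcal/mol")
```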
1.1.2 Hamiltonian equations of motion

For a system of N particles, the Hamiltonian of the system is the sum of the potential and kinetic energies:

H(q_1, \dots, q_N; p_1, \dots, p_N) = \sum_{i=1}^{N} \frac{p_i^2}{2 m_i} + U(q_1, \dots, q_N)   (1.2)

where m_i is the mass, q_i the coordinates, and p_i the momenta of particle i. The Hamiltonian equations of motion are given by:

\dot{q}_i = \frac{\partial H}{\partial p_i} = \frac{p_i}{m_i}   (1.3)

\dot{p}_i = -\frac{\partial H}{\partial q_i} = -\frac{\partial U}{\partial q_i} = F_i   (1.4)

where \dot{q}_i and \dot{p}_i are time derivatives and F_i is the net force on particle i. Solving these equations yields the trajectory of all particles in the system. If the system is isolated, the total energy is conserved, \partial H/\partial t = 0, and all microstates are visited with equal probability. If, however, the system is coupled to an external heat bath, energy is exchanged between the system and the bath, which corresponds to a canonical ensemble. In this situation, the phase space is explored with probability

P(q, p) \propto e^{-H(q,p)/k_B T}   (1.5)

In the canonical (NVT) ensemble, the number of particles, the volume, and the temperature are held constant, while in the isothermal-isobaric (NPT) ensemble the pressure, rather than the volume, is kept constant.

1.1.3 Integrators

Numerical integration methods are needed to find an approximate solution of the ordinary differential equations given a timestep and the initial positions and velocities of the atoms in the system. Two of the most widely used numerical integrators are the Verlet algorithm[109] and leapfrog.[140] The Verlet integrator uses two Taylor series expansions to derive the positions:

q(t + \Delta t) = 2 q(t) - q(t - \Delta t) + \Delta t^2 \ddot{q}(t) + O(\Delta t^4)   (1.6)

Velocities are calculated by a first-order central difference:

\dot{q}(t) = \frac{1}{2\Delta t}\left[q(t + \Delta t) - q(t - \Delta t)\right] + O(\Delta t^2)   (1.7)

The leapfrog integrator improves on the Verlet integrator by computing velocities at time t + \Delta t/2 and positions at time t + \Delta t:

\dot{q}(t + \tfrac{1}{2}\Delta t) = \dot{q}(t - \tfrac{1}{2}\Delta t) + \frac{\Delta t}{m} F(t)   (1.8)

q(t + \Delta t) = q(t) + \Delta t\, \dot{q}(t + \tfrac{1}{2}\Delta t)   (1.9)

In this way, the velocities and positions are updated with an offset of half a timestep. Velocities at time t are computed as:

\dot{q}(t) = \frac{1}{2}\left[\dot{q}(t + \tfrac{1}{2}\Delta t) + \dot{q}(t - \tfrac{1}{2}\Delta t)\right]   (1.10)

This algorithm is more efficient and is time-reversible.

1.1.4 Periodic boundary conditions

It is not computationally feasible to simulate a real system containing on the order of a mole of molecules (10^23 atoms), but periodic boundary conditions (PBC) allow us to extend the simulation box so that the unit cell is effectively embedded in an infinite space.[343] An illustration of PBC is shown in Figure 1.1.

Figure 1.1: Periodic boundary conditions.

1.1.5 Cutoff methods

The calculation of non-bonded forces is usually the most time-consuming part of an MD simulation. Most interactions, such as van der Waals forces, decay with increasing distance r, so a spherical cutoff r_c can be used and forces computed only within this cutoff. Three different cutoff methods are used in MD simulations: truncation, shifting, and switching.[232, 227] In the truncation method, if the distance is greater than the cutoff r_c, the forces are simply set to zero. This scheme is problematic, however, because it causes a discontinuity at r_c. In the shifting method, the potential is shifted linearly such that the force is zero at the cutoff r_c:

U_{SF}(r) = \begin{cases} U_{vdw}(r) - (r - r_c)\, U'_{vdw}(r_c) - U_{vdw}(r_c) & r \le r_c \\ 0 & r > r_c \end{cases}   (1.11)

Another approach is to switch off the potential within a distance cutoff by applying a switching function to the potential function. This method can also be applied to the electrostatic potential and forces. Long-range electrostatic interactions are treated using the particle mesh Ewald (PME) summation scheme.[94]
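A minimal sketch of the shifted-force truncation in eq. (1.11), applied to a Lennard-Jones potential, is given below; the ε, σ, and cutoff values are toy numbers (roughly argon-like) chosen only for illustration. The check at the end confirms that both the potential and the force go to zero continuously at the cutoff.

```python
import numpy as np

eps, sigma, r_c = 0.238, 3.4, 10.0   # toy LJ parameters and cutoff (kcal/mol, Angstrom)

def u_lj(r):
    return 4.0 * eps * ((sigma / r) ** 12 - (sigma / r) ** 6)

def du_lj(r):
    # analytical derivative dU/dr of the Lennard-Jones potential
    return 4.0 * eps * (-12.0 * sigma**12 / r**13 + 6.0 * sigma**6 / r**7)

def u_shifted_force(r):
    """Shifted-force potential of eq. (1.11): zero value and zero slope at r_c."""
    r = np.asarray(r, dtype=float)
    u = u_lj(r) - (r - r_c) * du_lj(r_c) - u_lj(r_c)
    return np.where(r <= r_c, u, 0.0)

# potential vanishes smoothly at the cutoff: value ~ 0 and numerical force ~ 0
print("U_SF at cutoff:", u_shifted_force(r_c))
print("force near cutoff:",
      -(u_shifted_force(r_c) - u_shifted_force(r_c - 1e-4)) / 1e-4)
```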
1.1.6 Thermostat

In an MD simulation, the temperature of the system is held constant in the NPT or NVT ensemble by coupling the system to a thermostat at a fixed temperature. Several thermostats can be employed:

I) Berendsen thermostat: The Berendsen algorithm[168] (known as the weak-coupling scheme) controls the temperature by multiplying the velocities at each step by a factor λ of the form:

\lambda = \left[1 + \frac{\Delta t}{\tau_T}\left(\frac{T_0}{T(t)} - 1\right)\right]^{1/2}   (1.12)

where τ_T is the coupling parameter, which determines the degree of coupling between the system and the bath, and Δt is the timestep of the simulation. If τ_T ≫ Δt, the system is weakly coupled to the thermostat. This method is used for systems that are far from equilibrium (the equilibration step). It suffers, however, from suppression of the fluctuations of the kinetic energy, which is not consistent with the canonical ensemble.

II) Velocity-rescaling thermostat: This thermostat is similar to the Berendsen thermostat but produces the correct canonical ensemble.[39] Here the velocities of each particle at each timestep, or every n_TC steps, are scaled by a time-dependent factor λ:

\lambda = \left[1 + \frac{n_{TC}\,\Delta t}{\tau_T}\left(\frac{T_0}{T(t - \tfrac{1}{2}\Delta t)} - 1\right)\right]^{1/2}   (1.13)

The parameter τ_T is related to the time constant of the temperature coupling τ as:

\tau_T = \frac{2 C_V \tau}{N_{df}\, k}   (1.14)

where C_V is the total heat capacity of the system, k is Boltzmann's constant, and N_df is the total number of degrees of freedom of the system. This thermostat modifies the kinetic energy by ΔE_k = (λ² − 1)E_k. The velocity-rescaling thermostat can be viewed as a Berendsen thermostat with an additional stochastic term that ensures the correct kinetic energy distribution.[39]

III) Langevin thermostat: This thermostat can be viewed as a heat bath of small fluid particles undergoing Brownian motion that affect the diffusive behavior of the molecules in the system. In this method, two terms are added to the Hamiltonian equations of motion: the first is a viscous drag force −γp_i, which acts opposite to the direction of the momentum,[72] and the second is a random noise R_i(t) due to stochastic collisions with solvent molecules:

dp_i = -\frac{\partial H}{\partial q_i}\, dt - \gamma p_i\, dt + R_i(t)\, dt   (1.15)

Here γ is the friction coefficient, or coupling constant, which measures the degree of coupling. R_i is the random collision force with amplitude \sqrt{2\gamma m_i k_B T}\, dW_i/dt, where W_i is a vector of independent Wiener processes (Brownian motions) with zero mean and covariance \langle R_i(t) R_j(t') \rangle = 2\gamma m_i k_B T\, \delta(t - t')\, \delta_{ij}. The random forces are uncorrelated in time and between particles. With this thermostat, the system is not only coupled globally to a heat bath but is also subject to random noise. The Langevin thermostat produces a canonical ensemble when converged, but the parameter γ considerably affects the diffusive behavior.
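A minimal sketch of the Berendsen weak-coupling rescaling of eq. (1.12) is shown below; the target temperature, coupling time, masses, and initial velocities are arbitrary toy values in self-consistent units, and constraints are ignored when counting degrees of freedom.

```python
import numpy as np

kB = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def berendsen_rescale(velocities, masses, T0=300.0, dt=0.002, tau_T=0.1):
    """Rescale velocities toward the target temperature T0 (eq. 1.12)."""
    n_df = 3 * len(masses)                       # ignoring constraints for simplicity
    kinetic = 0.5 * np.sum(masses[:, None] * velocities**2)
    T_inst = 2.0 * kinetic / (n_df * kB)         # instantaneous temperature
    lam = np.sqrt(1.0 + (dt / tau_T) * (T0 / T_inst - 1.0))
    return lam * velocities

rng = np.random.default_rng(0)
masses = np.full(100, 12.0)
vel = rng.normal(scale=0.5, size=(100, 3))
for _ in range(500):                             # repeated coupling drives T toward T0
    vel = berendsen_rescale(vel, masses)
print("temperature after coupling:",
      np.sum(masses[:, None] * vel**2) / (3 * len(masses) * kB))
```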
1.1.7 Barostats

In an MD simulation in the NPT ensemble, the pressure of the system is maintained constant by coupling the system to a barostat. The pressure of a system in a cubic box of finite size can be derived from the virial theorem:

W_{tot} = -3 N k_B T = -3 P V + \Xi   (1.16)

where W_tot is the total work done by the system, V is the box volume, −3PV is the external virial due to the interactions between the particles and the wall, and Ξ is the inner virial due to particle-particle interactions, \Xi = \sum_{i}^{N} F_i \cdot q_i. The pressure is obtained from:

P = \frac{1}{3V}\left(3 N k_B T + \sum_{i}^{N} F_i \cdot q_i\right)   (1.17)

The Berendsen barostat controls the pressure of the system by scaling the inter-particle distances. In this scheme, the volume of the simulation cell is scaled by a rescaling factor μ:

\mu(t) = 1 - \frac{\Delta t}{\tau_P}\,\kappa\,\left(P_0 - P(t)\right)   (1.18)

where Δt is the timestep, τ_P is the relaxation time constant of the barostat, κ is the isothermal compressibility, and P_0 is the reference pressure of the barostat. In a simulation with isotropic scaling, the coordinates and box vectors are scaled by μ^{1/3}. The Berendsen barostat is useful for equilibrating systems to the desired pressure, but it does not generate the correct thermodynamic ensemble. The Parrinello-Rahman barostat is an extended-ensemble pressure-coupling algorithm[196] in which the simulation can be carried out under both isotropic and anisotropic conditions and the pressure fluctuations are captured correctly. This barostat is preferable for production runs.

1.1.8 Enhanced sampling methods

Despite the success of MD in studying biological processes, there are still limitations on the timescales that can be reached. Inadequate sampling of conformational states in turn limits a full understanding of the functional properties of the system under study. Large-scale conformational changes are usually complicated, slow processes that are commonly beyond the capabilities of standard MD, and enhanced sampling techniques are often required. Several enhanced sampling techniques, specifically replica exchange molecular dynamics and metadynamics, are used in this dissertation and are briefly described below.[19]

Replica exchange molecular dynamics (REMD): In REMD, simulations are performed at different temperatures, and exchanges of coordinates between simulations at different temperatures are attempted to enhance the sampling of configurational space. An illustration of REMD is shown in Figure 1.2.

Figure 1.2: Illustration of replica exchange molecular dynamics (REMD), where multiple simulations at different temperatures are run in parallel and exchanges are attempted between these replicas.
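The sketch below illustrates the Metropolis exchange step used in standard temperature REMD; the temperature ladder and potential energies are toy values, and the criterion shown is the plain temperature-REMD form rather than the REST2 expression derived in the next section.

```python
import numpy as np

kB = 0.0019872041  # kcal/(mol*K)

def remd_swap_probability(E_i, E_j, T_i, T_j):
    """Acceptance probability for swapping configurations between two replicas:
    min(1, exp[(beta_i - beta_j) * (E_i - E_j)])."""
    delta = (1.0 / (kB * T_i) - 1.0 / (kB * T_j)) * (E_i - E_j)
    return min(1.0, np.exp(delta))

rng = np.random.default_rng(1)
temperatures = [300.0, 310.0, 320.5, 331.5]        # toy temperature ladder (K)
energies = [-5200.0, -5185.0, -5170.0, -5140.0]    # toy potential energies (kcal/mol)

# attempt swaps between neighboring replicas
for i in range(len(temperatures) - 1):
    p = remd_swap_probability(energies[i], energies[i + 1],
                              temperatures[i], temperatures[i + 1])
    accepted = rng.random() < p
    print(f"swap {i}<->{i+1}: p = {p:.3f}, accepted = {accepted}")
```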
Replica exchange solute tempering (REST): In REMD, the number of required replicas scales with the square root of the number of degrees of freedom in the system. This means that even for a small system, tens of replicas are needed to maintain an acceptable acceptance ratio.[287] One can instead redefine the Hamiltonian in each replica so that only a small subset of the system is subject to parallel tempering. In REST1,[186] the potential energy of the system is decomposed into three parts: the protein intramolecular interactions (E_pp), the interactions between the protein and the water molecules in the solvation shell (E_pw), and the interactions between water molecules (E_ww). The potential energy E_m of the replica at temperature T_m is defined as:

E_m(X) = E_{pp}(X) + \frac{\beta_0 + \beta_m}{2\beta_m} E_{pw}(X) + \frac{\beta_0}{\beta_m} E_{ww}(X)   (1.19)

where X denotes the coordinates of the entire system, β_m = 1/k_B T_m, and β_0 = 1/k_B T_0. In REST2, the potential energy is defined as:

E_m(X) = \frac{\beta_m}{\beta_0} E_{pp}(X) + \sqrt{\frac{\beta_m}{\beta_0}}\, E_{pw}(X) + E_{ww}(X)   (1.20)

In REST1, different replicas run at different temperatures; in REST2,[314] however, the temperature is the same for all replicas, and all simulations sample the final ensemble distribution at temperature T_m. The scaling is performed on the bonded and non-bonded interactions in the potential function. For a system of M replicas with an effective temperature T_m in replica m, the equilibrium distribution is:

P_m(X_m) = \frac{e^{-\beta_m E_m(X_m)}}{Z}   (1.21)

where Z is the partition function. The acceptance ratio for an exchange between replicas m and n in REST2 is based on the ratio of transition probabilities that satisfies detailed balance:

\frac{P_{i \to f}}{P_{f \to i}} = \frac{P_n(X_m)\, P_m(X_n)}{P_m(X_m)\, P_n(X_n)} = e^{-\Delta_{nm}}   (1.22)

\Delta_{nm} = (\beta_m - \beta_n)\left[E_{pp}(X_n) - E_{pp}(X_m) + \frac{\sqrt{\beta_0}}{\sqrt{\beta_n} + \sqrt{\beta_m}}\left(E_{pw}(X_n) - E_{pw}(X_m)\right)\right]   (1.23)

Applying a Metropolis criterion for the exchange results in:

P_{i \to f} = \begin{cases} 1 & \Delta_{nm} \le 0 \\ \exp(-\Delta_{nm}) & \Delta_{nm} > 0 \end{cases}   (1.24)

REST2 has been used for conformational sampling of proteins and of protein-membrane interactions in multiple applications.[146, 129, 313]

Metadynamics: In a metadynamics simulation, an external history-dependent bias potential in the space of a few collective variables (CVs) that capture the slowest motions of the system is added to the Hamiltonian, which allows the system to escape from its current conformation and sample other parts of the conformational space.[9, 10] An illustration of metadynamics is shown in Figure 1.3.

Figure 1.3: An illustration of the metadynamics simulation technique.

The biasing potential is constructed as a sum of Gaussian kernels deposited along the collective variables s(q):

V(s, t) = \sum_{k'} W(k')\, \exp\left(-\sum_{i=1}^{d} \frac{\left(s_i - s_i(q(k'))\right)^2}{2\sigma_i^2}\right)   (1.25)

where W(k') and σ_i are the height and width of the deposited Gaussians.

1.2 Kinetic modeling and Markov State Models

In a Markov state model, the t_i are the relaxation times of the system in decreasing order, and the ψ_i are the eigenfunctions of the transition matrix, with eigenvalues λ_i(τ) = e^{−τ/t_i}. The m dominant eigenfunctions ψ_1, ..., ψ_m are the slowest collective variables that characterize the dynamics on timescales t ≫ t_{m+1}. The eigenvalues and eigenfunctions of the Markov model can be computed by a constrained optimization problem known as the variational approach to conformational dynamics,[215] which maximizes

R_m = \max_{f_1, \dots, f_m} \sum_{i=1}^{m} E_\mu\left[f_i(x_t)\, f_i(x_{t+\tau})\right], \quad E_\mu[f_i(x_t)^2] = 1, \quad E_\mu[f_i(x_t) f_j(x_t)] = 0 \;\; (\text{for } i \ne j)   (1.36)

In the equations above, E_μ is the expectation over x_t sampled from the stationary distribution, and R_m is the Rayleigh trace. The variational principle states that eigenvalues estimated from trial functions are always underestimated, so it can be used to maximize the estimated eigenvalues of the Markov operator. The eigenfunctions of the Markov operator are approximated by a linear combination of basis (feature) functions χ = (χ_1, ..., χ_m)^T:

f_i(x) = \sum_{j=1}^{m} b_{ij}\, \chi_j(x) = b_i^T \chi(x)   (1.37)

The expansion coefficients b_i and the eigenvalues of the Markov model can be computed by solving a generalized eigenvalue problem:

C(\tau)\, B = C(0)\, B\, \hat{\Lambda}   (1.38)

where C(0) and C(τ) are the instantaneous and time-lagged covariance matrices of the basis functions, \hat{\Lambda} = \mathrm{diag}(\hat{\lambda}_1, \dots, \hat{\lambda}_m) is the diagonal matrix of eigenvalues, and B = (b_1, ..., b_m). Inserting these coefficients yields an approximation of the eigenfunctions, and P(τ) = C(0)^{-1} C(τ) is the transition matrix of the MSM. The lag time τ should be long enough to ensure that the dynamics is Markovian, yet short enough to resolve the dynamics we are interested in. The implied timescale t_i approximates the decorrelation time of the i-th process and is computed from the eigenvalues of the MSM transition matrix as:

t_i = -\frac{\tau}{\ln|\lambda_i(\tau)|}   (1.39)

The implied timescales (ITS) can be used to choose a lag time, namely one beyond which the ITS become approximately constant. Once the lag time is chosen, we can check whether the transition probability matrix P(τ) is Markovian with the Chapman-Kolmogorov test:

P(k\tau) = P^k(\tau)   (1.40)

The validated transition matrix is then decomposed into eigenvalues and eigenvectors. The largest eigenvalue is always λ_1(τ) = 1, corresponding to the eigenvector of the stationary distribution π with the property:

\pi^T P(\tau) = \pi^T   (1.41)

All other eigenvalues λ_{i>1} are real with norm less than one and are related to the characteristic implied timescales of dynamical processes within the system.
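As a small numerical illustration of these spectral quantities, the sketch below builds a toy three-state transition matrix, extracts the stationary distribution from its leading left eigenvector, and converts the remaining eigenvalues into implied timescales via eq. (1.39); the matrix entries and lag time are arbitrary.

```python
import numpy as np

# Toy 3-state row-stochastic transition matrix (arbitrary values) and lag time (ns).
P = np.array([[0.90, 0.08, 0.02],
              [0.10, 0.85, 0.05],
              [0.02, 0.08, 0.90]])
lag = 10.0

eigvals, left_vecs = np.linalg.eig(P.T)          # left eigenvectors of P = eigenvectors of P.T
order = np.argsort(-eigvals.real)
eigvals, left_vecs = eigvals.real[order], left_vecs[:, order].real

pi = left_vecs[:, 0] / left_vecs[:, 0].sum()     # stationary distribution, pi^T P = pi^T
timescales = -lag / np.log(np.abs(eigvals[1:]))  # implied timescales (eq. 1.39)

print("stationary distribution:", np.round(pi, 3))
print("implied timescales (ns):", np.round(timescales, 1))
```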
1.2.1 MSM construction from MD simulations

Over the last 20 years, researchers have developed a pipeline for the construction and validation of a Markov state model from molecular dynamics trajectories, which involves the steps described below.

I) Feature selection: In order to build a Markov model of long-timescale kinetics, one must first choose a few features or collective variables that are important for the system under study. These features can include, but are not limited to, distances, torsions, and Cartesian coordinates. The variational approach to Markov processes provides a score (VAMP-2) that allows comparison of different featurizations and selection of the best set of features based on a cross-validated VAMP-2 score.[262]

II) Dimensionality reduction: Featurization of the molecular system leads to a high-dimensional space. Discretizing a very high-dimensional space by clustering is inefficient and can lead to low-quality discretizations that do not accurately describe the dynamics of the system. Therefore, we usually first reduce the dimensionality with a linear coordinate transformation. In this transformation, we look for a set of basis vectors U = [u_1, ..., u_m], where u_i is a collective coordinate; after the transformation, y(t) = U^T x(t) are the new coordinates. Common linear dimensionality reduction methods are PCA and TICA. PCA transforms the data into an orthogonal, uncorrelated basis that retains the variance of the dataset. However, this does not describe the molecular kinetics: we are interested in preserving the slow motions rather than the high-amplitude motions emphasized by PCA. For example, consider a small peptide that is highly flexible at its termini and undergoes a rare, concerted torsional transition at its center. We are interested in the kinetics of this rare event rather than in the high-variance fluctuations at the termini. TICA is a form of the variational approach to conformational kinetics (VAC) and is the optimal linear method for finding the slow reaction coordinates and the relaxation timescales. TICA is similar to PCA but uses a time-lagged correlation matrix C(τ), where C_{ij}(\tau) = \langle \tilde{x}_i(t)\, \tilde{x}_j(t+\tau) \rangle_t. A generalized eigenvalue problem,

C(\tau)\, u_i = C(0)\, \lambda_i(\tau)\, u_i   (1.42)

is solved, and the new coordinates are uncorrelated at the lag time τ. The kinetic variance retained by the transformation can be defined as KV = \sum_{i=1}^{m} \lambda_i^2 / TKV, where TKV = \sum_i \lambda_i^2 is the total kinetic variance, which roughly measures the total number of slow processes.

III) Clustering: State decomposition happens at the clustering step. The features in the TICA space are grouped into a set of clusters using a clustering algorithm; k-means clustering is usually the method of choice in this step.

IV) Building the transition matrix: After clustering, the state space is discretized into discrete trajectories s(t) jumping between n microstates, where n is the number of clusters. The conditional transition probability between microstates at lag time τ is defined as:

P_{ij}(\tau) = P\left(s(t+\tau) = j \,|\, s(t) = i\right)   (1.43)

A Markov model predicts the kinetics at longer timescales using the Markov property:

P\left(s(t+k\tau) = j \,|\, s(t) = i\right) = \left[P^k(\tau)\right]_{ij}   (1.44)

The MSM also predicts the equilibrium probabilities π_i through the stationary vector satisfying π^T = π^T P(τ).
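To make step IV concrete, the sketch below estimates a (non-reversible) maximum-likelihood transition matrix from a toy discrete trajectory by counting transitions at a chosen lag time and row-normalizing; in practice, packages such as PyEMMA or deeptime perform this estimation with reversibility constraints and the error analysis described next.

```python
import numpy as np

def estimate_transition_matrix(dtraj, n_states, lag):
    """Count transitions i -> j separated by `lag` steps and row-normalize."""
    counts = np.zeros((n_states, n_states))
    for t in range(len(dtraj) - lag):
        counts[dtraj[t], dtraj[t + lag]] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0          # avoid division by zero for unvisited states
    return counts / row_sums

# toy discrete trajectory hopping among 3 microstates
rng = np.random.default_rng(2)
true_P = np.array([[0.95, 0.04, 0.01],
                   [0.05, 0.90, 0.05],
                   [0.01, 0.04, 0.95]])
dtraj = [0]
for _ in range(20000):
    dtraj.append(rng.choice(3, p=true_P[dtraj[-1]]))
dtraj = np.array(dtraj)

P_hat = estimate_transition_matrix(dtraj, n_states=3, lag=1)
print(np.round(P_hat, 3))
```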
In order to compute error bars on the timescales, a Bayesian MSM samples from the posterior distribution P(P(\tau)\,|\,C(\tau)) \propto \prod_{i,j=1}^{n} P_{ij}^{\,c_{ij}(\tau)}, where c_{ij}(τ) is the number of transitions observed between states i and j at lag time τ. The lag time τ must be chosen such that the relaxation times t_i(\tau) = -\tau/\ln \lambda_i(\tau) are approximately constant, within statistical error, for longer lag times. The model is validated using the Chapman-Kolmogorov (CK) test, in which the model estimated at lag time τ must be able to predict estimates performed at longer timescales kτ, which can be written as:

P(k\tau) = P^k(\tau)   (1.45)

V) Coarse-graining the MSM: It is often desirable to describe the molecular process with a few states that contain the essential structural, thermodynamic, and kinetic information. However, the number of microstates generated during the clustering step of building an MSM is usually in the range of hundreds to thousands. On the other hand, a coarse-grained model is important for computing quantities such as mean first passage times (MFPT) from one set of states to another. A fuzzy assignment, in which each microstate i has an assignment probability to macrostate I, has been proposed[156, 341] to preserve the slow kinetics in the coarse-grained kinetic model, where m_{iI} = P(macro I | micro i). The membership probabilities are computed from a linear combination of the first m eigenvectors of the transition matrix by the PCCA++ method,[245] which exactly preserves the relaxation kinetics of the m slowest processes.

1.3 Machine Learning

Introduction to neural networks: Machine learning (ML) involves using computational methods to learn from data without explicit programming. ML is being used in nearly all fields of science and has enabled important and notable breakthroughs. Deep learning is a subfield of ML concerned with algorithms that loosely mimic the human brain, called deep neural networks (DNNs). A classical application of ML is image classification, where the model tries to associate a label with each image using the features in the pixel data. The underlying idea is that there is an explicit relation between the set of pixels and the associated label, which the model tries to learn. The same idea can be extended to the molecular space, where a full description of atomic or molecular features dictates the chemical properties. ML techniques have been used in many aspects of molecular simulation, such as enhanced sampling,[21, 244] force-field optimization,[179] and kinetic modeling.[194] Machine learning techniques can be grouped into several categories:

Supervised learning: In supervised learning, the model is provided with the inputs and the labels for all input samples, and the task is usually to predict the desired property (label) of a given input. After proper training, the model is able to predict the target (label) of unseen samples. The most common supervised learning tasks are regression and classification.

Unsupervised learning: Unlike supervised learning, in unsupervised learning the data do not contain labels, and the task is to identify patterns or similarities and differences within the data. Clustering, for example, is an unsupervised learning problem; methods such as k-means clustering fall into this category. Another important task in unsupervised learning is dimensionality reduction, where the goal is to find a reduced representation of the data that carries most of the important or relevant information.
Reinforcement learning: This type of machine learning is different from the other categories: an agent takes actions in an environment, and the goal is to maximize a reward function. The agent learns to take actions that yield a reward and to avoid those that produce a negative reward, or punishment. In this type of machine learning there is no training data (labeled or unlabeled), and the agent improves itself through trial and error.

1.3.1 Artificial Neural network

The structure of artificial neural networks is inspired by the neural connections in the brain. Neural networks consist of multiple layers of simple units. A "perceptron" is the simplest neural network, with a single layer.[249] In a perceptron, the model takes one or more inputs, computes a weighted sum of the inputs, and finally applies a non-linear activation function to compute a single output. This is shown in Figure 1.4.

Figure 1.4: Representation of a perceptron.

y = f\left(\sum_i x_i w_i + b\right)   (1.46)

In the above equation, the x_i are the inputs, the w_i are the weights, and b is a bias term. f is a nonlinearity, also called the activation function. Many different activation functions are commonly used in machine learning, such as ReLU, tanh, and sigmoid, and the choice depends on the task and the type of network.

A feed-forward neural network (FFNN) (Figure 1.5) is a collection of multiple perceptrons stacked together; an FFNN is also called a multi-layer perceptron (MLP). The universal approximation theorem[133] states that an MLP containing as little as one hidden layer with a finite number of neurons can approximate any continuous function under mild assumptions on the activation functions used.

Figure 1.5: Illustration of a multilayer perceptron with 4 hidden layers and 1 output.

Cost functions: Training a neural network involves minimizing a loss or cost function, which usually measures the discrepancy between the actual values and the values output by the network. Typical cost functions are the mean squared error for regression and the categorical cross-entropy loss for classification tasks.

Backward propagation: When training a deep neural network, we need to update the weights and biases of the network (w, b). The question is how to compute the gradient of the loss function L with respect to the parameters of the inner layers of the network. This is done with the backpropagation algorithm, in which the chain rule is used to compute the gradients of the inner layers. In this approach, the derivative of the loss L is first taken with respect to the net input u_i of the output node and then, using the chain rule, with respect to the parameters we want to optimize. For a weight w_ij in layer l this can be written as:

\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial u_i}\,\frac{\partial u_i}{\partial w_{ij}} = \delta_i\, \frac{\partial}{\partial w_{ij}}\left[b_i + \sum_{j'} w_{ij'}\, y_{j'}\right] = \delta_i\, y_j   (1.47)

where δ_i = ∂L/∂u_i. The same can be written for the bias parameters.

Gradient descent: Once we have the gradients of the loss with respect to the model parameters, we can minimize the loss. If θ denotes all the parameters of the neural network, then, given the initial parameters, the most basic gradient descent scheme updates them as:

\theta_{k+1} = \theta_k - \eta\, \nabla_{\theta} L(\theta_k)   (1.48)

Figure 1.6: Gradient descent algorithm. Image credit: Science magazine.

where η is the learning rate, which controls the size of the training step and is treated as a hyperparameter. The minus sign in the equation ensures that the parameters are updated so as to minimize the loss. Other variants of the gradient descent algorithm have been proposed, such as stochastic gradient descent (SGD), in which a mini-batch of data is used for the error computation at each training iteration. Other variants include Adam and RMSProp, which extend SGD with momentum techniques to speed up training and avoid getting stuck in local minima.
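The following self-contained toy example assembles these ingredients: a one-hidden-layer perceptron trained by gradient descent with manually derived backpropagation on a small regression task. The layer sizes, learning rate, and data are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# toy regression data: y = sin(x) on [-pi, pi]
X = rng.uniform(-np.pi, np.pi, size=(256, 1))
y = np.sin(X)

# one hidden layer with tanh activation (a minimal multilayer perceptron)
W1, b1 = rng.normal(scale=0.5, size=(1, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.5, size=(16, 1)), np.zeros(1)
eta = 0.05  # learning rate

for epoch in range(2000):
    # forward pass (eq. 1.46 applied layer by layer)
    h = np.tanh(X @ W1 + b1)           # hidden activations
    y_hat = h @ W2 + b2                # network output
    loss = np.mean((y_hat - y) ** 2)   # mean squared error

    # backward pass (chain rule, eq. 1.47)
    d_out = 2.0 * (y_hat - y) / len(X)         # dL/dy_hat
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * (1.0 - h ** 2)      # tanh'(u) = 1 - tanh(u)^2
    dW1 = X.T @ d_h
    db1 = d_h.sum(axis=0)

    # gradient descent update (eq. 1.48)
    W1, b1 = W1 - eta * dW1, b1 - eta * db1
    W2, b2 = W2 - eta * dW2, b2 - eta * db2

print(f"final training MSE: {loss:.4f}")
```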
1.3.2 Convolutional neural networks

This is a specialized neural network architecture for grid-structured data with strong spatial dependencies. The architecture is widely used for one-dimensional time-series data, two-dimensional images, and three-dimensional video data.[110] In convolutional neural networks (CNNs), we apply different kernels (or filters) to the data; these are simply matrices that learn features such as edges or lines from the data. These basic features are then combined to build more complicated shapes and patterns. The convolution operation is a simple dot product in which a filter (kernel) is moved across the image. A 2D convolution involves length, width, and depth parameters, where the length and width describe the convolutional kernel and the depth corresponds to the number of input channels; for example, images with RGB color values have a depth of 3. The use of convolutional kernels also gives the model translational invariance, so that features are detected regardless of their location in the input image.

Figure 1.7: Convolutional neural networks for handwritten digits.

1.3.3 Recurrent neural networks

Feed-forward neural networks fail to capture the sequential behavior of data in which the order is important, such as protein sequences or the time series of MD simulation trajectories. Recurrent neural networks (RNNs) are particularly useful when dealing with sequential data. In an RNN, the data are provided in a sequential manner, and the network uses the inputs from the previous timesteps to make a prediction or decision at the current timestep. An RNN is shown in Figure 1.8, where a single RNN is unrolled to show the information processing at each timestep through multiple copies of the network.[267]

Figure 1.8: Unrolling an RNN.

In the RNN in Figure 1.8, x_t is the input at timestep t, y_t is the output at each timestep, and h_t is the hidden state at time t, which is calculated as:

h_t = f(U x_t + W h_{t-1})   (1.49)

where f is a non-linear activation function such as tanh or ReLU, and U and W are weight matrices learned during training. In an RNN, the weights are shared across all timesteps, which greatly reduces the model complexity. Training an RNN involves a special type of backpropagation called backpropagation through time.
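The recurrence in eq. (1.49) is written out directly in NumPy below for a toy input sequence; the dimensions and the added output projection y_t = V h_t are illustrative choices, and all weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(4)
n_in, n_hidden, n_out, T = 8, 32, 2, 50   # toy dimensions and sequence length

U = rng.normal(scale=0.1, size=(n_hidden, n_in))
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.1, size=(n_out, n_hidden))

x_seq = rng.normal(size=(T, n_in))        # toy input sequence
h = np.zeros(n_hidden)                    # initial hidden state

outputs = []
for t in range(T):
    h = np.tanh(U @ x_seq[t] + W @ h)     # eq. (1.49): the same U, W are shared across timesteps
    outputs.append(V @ h)                 # per-timestep output

print("last hidden state norm:", np.linalg.norm(h).round(3))
print("output at final step:", np.round(outputs[-1], 3))
```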
LSTM networks: Vanilla RNNs suffer from the vanishing gradient problem, which causes the model to forget long-term dependencies in the data. To circumvent this issue, several extensions to RNNs have been proposed, such as long short-term memory (LSTM)[130] and gated recurrent units (GRU).[58] Long short-term memory (Figure 1.9) addresses some of the problems of RNNs, namely 1) long-term dependencies and 2) vanishing and exploding gradients. An LSTM[130] consists of a cell, an input gate, an output gate, and a forget gate. The cell stores information, whereas the gates manipulate it. In an LSTM, information is selectively allowed through each gate unit using a sigmoid function.

Figure 1.9: Illustration of a long short-term memory unit.

Forget gate: The first step is deciding what information to discard from the cell state. The forget gate takes h_{t-1} and x_t as input and outputs a value between 0 and 1 for each entry of the cell state C_{t-1}:

f_t = \sigma\left(W_f [h_{t-1}, x_t] + b_f\right)   (1.50)

where h_{t-1} is the output of the previous cell, x_t is the input of the current cell, and σ is the sigmoid function.

Input gate: This step decides how much new information will be added to the cell state. First a sigmoid layer determines which information needs to be updated, and then a tanh layer is applied:

i_t = \sigma\left(W_i [h_{t-1}, x_t] + b_i\right)   (1.51)

\tilde{C}_t = \tanh\left(W_C [h_{t-1}, x_t] + b_C\right)   (1.52)

C_t = f_t \circ C_{t-1} + i_t \circ \tilde{C}_t   (1.53)

Output gate: Here we first use a sigmoid to determine which part of the cell state will be output and then process the cell state with a tanh function:

o_t = \sigma\left(W_o [h_{t-1}, x_t] + b_o\right)   (1.54)

h_t = o_t \circ \tanh(C_t)   (1.55)
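Equations (1.50)-(1.55) are assembled into a single LSTM cell step in the NumPy sketch below; the weight shapes follow the concatenated [h_{t-1}, x_t] convention used above, and all values are randomly initialized for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM step following eqs. (1.50)-(1.55); z = [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(params["Wf"] @ z + params["bf"])        # forget gate
    i_t = sigmoid(params["Wi"] @ z + params["bi"])        # input gate
    C_tilde = np.tanh(params["Wc"] @ z + params["bc"])    # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde                    # new cell state
    o_t = sigmoid(params["Wo"] @ z + params["bo"])        # output gate
    h_t = o_t * np.tanh(C_t)                              # new hidden state
    return h_t, C_t

rng = np.random.default_rng(5)
n_in, n_hidden = 10, 20
params = {name: rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in))
          for name in ("Wf", "Wi", "Wc", "Wo")}
params.update({name: np.zeros(n_hidden) for name in ("bf", "bi", "bc", "bo")})

h, C = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(30, n_in)):   # toy sequence of 30 timesteps
    h, C = lstm_step(x_t, h, C, params)
print("hidden state after sequence:", np.round(h[:5], 3))
```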
1.3.5 Graph neural networks

Many kinds of data can be represented as graphs, such as the structures of molecules and proteins or social networks. Graph neural networks are increasingly being used in areas such as protein structure prediction and drug design. A graph G is defined as G = (V, E, A), where V is the set of nodes, E is the set of edges and A is the adjacency matrix. Graph convolutional networks (GCN), introduced by Kipf et al.,[152] rely on message passing between neighbors in a graph. Each node has a feature vector that represents its message, and messages are passed between neighbors during each graph convolution (message passing) layer. Multiple types of GCN can be formulated based on how the messages are passed between nodes and edges. In a simple GCN, where only the nodes carry feature vectors, a GCN layer can be defined as:

H^{(l+1)} = σ( D̂^{−1/2} Â D̂^{−1/2} H^{(l)} W^{(l)} )    (1.64)

where H^{(l)} contains the node features from the previous layer and W^{(l)} are the weight parameters. Â is the adjacency matrix including self-connections, Â = A + I. The messages are averaged over neighbors using the diagonal degree matrix D̂, where D̂_ii is the number of connections of node i, and σ denotes a nonlinear activation function.

1.4 Dissertation overview

In chapter 2, I study membrane active peptides using molecular dynamics and machine learning. In the first section of this chapter, I study two different cell penetrating peptides (MPG and Hst5) and their interaction with a membrane using MD simulations. In the second part of this chapter, I develop a deep learning model, an attention-based variational autoencoder, to generate new antimicrobial peptides and test the efficiency of the model for generating effective peptide sequences. The third chapter deals with Markov state models and kinetic modeling of biomolecules and is divided into three sections. In the first part, I study the conformational dynamics of β2-microglobulin using MD simulations and Markov state modeling to find metastable states that contribute to amyloid formation.
In the second section, I explore a machine learning model called gaussian mix- ture variational autoencoder (GMVAE) for dimensionality reduction and clustering of MD trajectories of protein folding and show that the latent space from GMVAE can be used for building a Markov model. The last part of this chapter introduces a novel neural network approach to replace the pipeline of building a Markov model called GraphVAMPNet where we used graph neural network as feature representation for protein folding trajectories. Chapter 4 of this dissertation is about membrane proteins spike protein of SARS-COV-2 and the EAG potassium channel. In the first section of this chapter, I study the interaction of spike protein with its receptor and the hot spots of interaction. In the second section I study the dynamics of glycans in the spike protein of SARS-COV-2 and its impact on shielding the protein from antibodies. In the last section, I study the inhibition of EAG channel by small molecule drugs through MD simulation, docking and binding free energy calculations. Chapter 5 concludes the work presented in the dissertation and gives future directions for continuing some of the projects. 32 Chapter 2: Machine learning and MD for membrane active peptides 2.1 Molecular dynamics of cell penetrating peptide interaction with model membranes Membrane active peptides (MAPs) are peptides with activity toward membrane either by translocation (cell penetrating peptides, CPPs) or even by disruption of the membrane (Antimicrobial peptides, AMPs).[32] CPPs which are mostly short cationic or amphipathic, have the ability to enter cells without large extent disruption of the membrane. [241] How effective a pharmaceutical treatment is, often is related to its membrane permeability which prevents biomolecules from reaching their specific intracellular targets. In this context, CPPs are often used to carry biomolecular cargos with various sizes and shapes inside the cell.[79, 101] Most CPPs are primary or secondary aphipathic in nature depending on the position of hydrophobic residues in their sequence. These usually possess a sufficient num- ber of positively charged amino acid residues necessary for their adsorption onto the neg- atively charged lipid membranes. Among various CPPs, MPG and Histatin 5 (Hst5) have been utilized to deliver proteins, fluorescein labels, and siRNAs into cell and are the focus of this study.[204, 209] MPG is a short amphipathic peptide with cationic residues in its C-terminal half and hydrophobic residues in the N-terminal half and was shown to strongly interact with cell membranes and spontaneously insert into natural membranes.[208] Hst5 is shown to have both antifungal activity and bacterial effects and possess a cell penetration ability.[86] 33 Figure 2.1: Secondary structure of A) MPG B) Hst5 predicted from PEPFOLD3. Orange and red spheres show the N-terminus and C-terminus respectively. Peptide secondary structure is colored based on hydrophobic (white) hydrophilic (green) acidic (red) and basic (blue) residues Candida albicans are opportunistic pathogens which cause infections for immunocom- promised patients. Current drugs for Candida can lead to toxicity [261] or cells can develop resistance to drugs on excessive use.[99] An essential feature of an effective therapeutic is the ability to successfully deliver across cell membranes to intracellular targets. CPPs have been proposed as an alternative treatment strategy. 
MPG and Histatin 5 (Hst5) have been used to deliver the fluorescent protein cargo GFP into fungal cells, and both have previously been used to deliver cargo into cells.[209, 276] MPG has been shown in the Karlsson lab to deliver large biomolecules such as GFP into Candida albicans cells through recombinant production of CPP-GFP. The predicted structure (using PEP-FOLD3) and the sequences of MPG and Hst5 are shown in figure 2.1.

Translocation of CPPs through cell membranes depends on several factors, including the specific sequence, the concentration of CPP, the cell type, the secondary structure of the CPP and the cargo being translocated. Understanding the folding of CPPs is a major step toward characterizing their mode of action and is important for their membrane interaction and efficacy of internalization. For instance, the secondary structure of penetratin is highly dependent on experimental conditions such as lipid type, buffer conditions and the technique used, and it can adopt α-helical or β-strand conformations.[90, 189] This structural polymorphism has been suggested as an important factor in the internalization route of CPPs, as one route might be preferred by the peptide over others depending on its secondary structure.[80] Understanding how CPPs interact with cell membranes and insert into cells is crucial for designing new CPPs or utilizing CPPs for intracellular delivery of molecular cargos. Obtaining atomic-level, nanosecond-timescale information about the interaction of peptides with membranes is difficult and costly with experimental techniques. In contrast, MD provides detailed structural and dynamic information on peptide-membrane systems. Ulmschneider and coworkers [46, 299] used unbiased MD to rationally tune the functional properties of pore-forming AMPs. Based on their results, AMPs can assemble in multiple architectures near the membrane, and their relative populations can provide insights into their mechanism of action. Structural properties, and specifically the conformation of CPPs when interacting with cell membranes, play a major role in their cellular uptake mechanisms.[108] Previous studies have shown a direct impact of peptide conformation, which modulates the amphipathicity and membrane insertion of CPPs.[90, 347]

To better understand the details of the interaction between CPPs and the cell membrane, multiple MD simulations were run for the peptides MPG and Hst5. Initially, we used the highly mobile membrane mimetic (HMMM),[220] in which lipid dynamics are enhanced by replacing the acyl tails in the bilayer center with an organic solvent designed to mimic the membrane interior. This accelerates the membrane insertion process while maintaining the detailed energetics of peptide-membrane interactions.[15, 220] Simulations included the peptides without the fusion construct. The membrane model closely mimicked a yeast plasma membrane with the composition shown in table 2.1.[206] To avoid bias from the starting secondary structure, Hst5 and MPG were also started from an extended conformation in the solvent phase using the HMMM model.

Table 2.1: Membrane lipid components for the yeast membrane
Lipid type    # lipids per leaflet
ERG           60
YOPA          7
DYPC          18
POPE          20
POPI          18
POPS          27
Total         150

The concentration of the different lipid types can affect the secondary structure of peptides during their interaction with the membrane.
In order to prevent the artifact of hydrophobic solvent at the center of HMMM membrane, we also studied the interaction of MPG with several concentrations of DOPC/DOPG membrane using long- timescale molecular dynamics. 2.1.1 Simulation methods Simulations starting from the predicteed secondary structure: In the HMMM sim- ulations in this study, 1,1-dichloroethane (DCLE) was used as the hydrophobic core of the membrane. Short tailed lipids were used as headgroups where the composition of the membrane closely mimicked a Baker?s yeast model membrane which is the closest to C. albicans membrane[206] and consists of 150 lipids per leaflet. The lipid composi- tion of the membrane is described in Table 1 along the corresponding number of lipids in each leaflet. In HMMM, a scaling factor of 1.2 and an acyl tail carbon length of 8 was used. The scaling factor has the effect of increasing the area per lipid than the fully de- tailed atomistic model and the shortened acyl chain increases lipid diffusion by exposing more of the hydrophobic core. All systems were build using CHARMM-GUI HMMM Builder.[238, 165, 143] The simulations were performed using NAMD simulation engine and Charmm36 for lipid and membrane parameters.[153] TIP3P model was used for water 36 and Na+ and Cl? ions were added to system to neutralize the charges on protein and lipids. [144] In the HMMM systems, upon insertion of peptide, the system ran under NPAT en- semble using Langevin thermostat to maintain the temperature at 298K and a constrain to maintain a constant lateral surface area. Langevin piston was used to maintain the pres- sure at 1bar.[97] A force switching function of 10 to 12 A? was used for van der Waals and electrostatic interaction.[283] Long range electrostatic interactions were computed with Particle Mesh Ewald (PME) method.[69] An integration time-step of 2 fs was used for all simulations using SHAKE algorithm to constrain hydrogen atoms.[252] The HMMM sys- tems were equilibrated using standard 6 step CHARMM-GUI input parameters for 225 ps. For the first series of simulations, PEPFOLD3 was used to predict the secondary structure of both peptides which were predicted to be ?-helical for both MPG and Hst5. Peptides were inserted into aqueous phase, with at least 12 A? distance relative to the phosphate plane of nearest leaflet with three different orientations of the peptide with respect to mem- brane to avoid bias. The main axis of the peptide was aligned either perpendicular or at a 45 ? tilt with respect to membrane plane with either N or C-terminus being closer to the membrane. The production run lasted for 300 ns at NPAT ensemble after which they were converted to full-membrane models using CHARMM-GUI HMMM Builder.[238] Furthermore, MPG systems were simulated for an extra 100 ns under NPT ensemble at 298K. MPG-membrane systems were converted to full-membrane model and equilibrated with six-step CHARMM-GUI protocol for 225 ps except for the last step which lasted 10 ns. An extra 200 ns full-atomic membrane simulation was performed after conversion for MPG-membrane systems. Simulations from extended peptide conformation: The simulations with full mem- brane model started with an extended conformation of peptide randomly placed at least 12 A? above the nearest leaflet with three different replicates. The systems were equilibrated using the typical six-step CHARMM-GUI protocol for 225 ps. The production run lasted 37 for 1 ?s for all systems. 
The membrane had the same concentration of lipids as described previously and the simulation parameters are as before for the full-membrane model. Long timescale simulation of MPG using Anton: In order to study the effect of mem- brane composition on the secondary structure of MPG during interaction, we simulated membranes with different compositions of DOPC and DOPG and their interaction with MPG using Anton2 supercomputer. Production run using Anton2 ran under NVT ensemble and semi-isotropic pressure. The Anton2 multigrator [182] framework and Nose-Hoover thermostat [217, 132] and MTK barostat [197] were used with a timestep of 2.5 fs. Short range electrostatic interactions were calculated with a cutoff of 9 A? and long-range elec- trostatic were computed using u-series approach.[273] The membrane had 100 lipids per leaflet with different concentrations of DOPC And DOPG. 1)DOPC(100%)-DOPG(0%) 2)DOPC(80%)-DOPG(20%) 3)DOPC(60%)-DOPG(40%) 4)DOPC(40%)-DOPG(60%). An- ton simulations ran for 11 ?s for each system. 2.1.2 Results Simulations starting from predicted peptide secondary structure: For our initial studies of MPG and Hst5, we used a PEPFOLD3 which predicted both peptides to be ?- helical with a small turn in N-term of MPG. The peptides were placed at least 12 A? above the membrane phosphate plane in 3 different orientations with respect to membrane to avoid bias. These conformations were parallel and tilted 45? with respect to the membrane plane. The starting orientations of MPG and HST5 are shown in figure 2.2. The HMMM simulations were performed for 300ns. Figure 2.3 shows snapshots of the first replicates of Hst5 and MPG at four different timepoints (0, 50, 100, 300ns) during the simulation with HMMM model. As shown Hst5 binds to the membrane after 50ns however it fails to enter the membrane after 300 ns of simulation. On the other hand, MPG inserts into membrane 38 Figure 2.2: starting conformations of peptides MPG and Hst-5 with respect to the membrane. A,B,C for Hst5 and C,D,E for MPG through its hydrophobic N-terminus after about 50ns of simulation and adopts a vertical conformation in the membrane after 300ns. Since MPG showed penetration into membrane, we converted the HMMM to full- membrane model and simulated for additional 100ns to study its interaction with a full atomic membrane. After conversion to full-membrane model, MPG maintained its deep insertion below the phosphate plane which was consistent throughout the additional 100ns all-atom simulation.To study the translocation of MPG and HST5, we calculated the dis- tance of each residue during simulation to the phosphate plane of the closest leaflet. (fig- ure 2.4) MPG shows insertion into membrane after 50 ns through its N-term hydrophobic residues. On the other hand, HST5 which is highly charged, does not show penetration into membrane. Heatmap plots of distance for other starting orientations of MPG and Hst5 are in figure 2.5. Other initial orientations do not show deep insertion into membrane. In the second orientation of MPG which faces the membrane from C-terminal charged residues, the N-term binds to the membrane after 100ns (figure 2.5C). However, it does not enter the membrane during the 400ns of simulation. MPG with parallel orientation to membrane 39 Figure 2.3: Translocation of CPPs using HMMM model at various timepoints. Phosphate head- groups of the membrane are represented as tan spheres and the membrane acyl chains are cyan lines. Residues on the peptide are colored according to their charge. 
MPG inserts into membrane after 300ns failed to enter the membrane and binds to membrane through its N-term. The interac- tion energies of all the residues in MPG with membrane were calculated for the last 10ns of full-membrane model (figure 2.4) C-term residues have high interaction energies with membrane which is due to electrostatic interaction of these charged residues with phosphate headgroups of the membrane. N-term hydrophobic residues such as L3, F4, F7, L8 have fa- vorable interaction energies with membrane which is due to the hydrophobic interaction of these residues with membrane core which drives the peptide to penetrate into membrane. In summary, the preliminary simulations starting from ?-helical MPG and Hst5 showed that MPG can enter the membrane through its hydrophobic N-term, however, HST5 remains attached to the membrane surface and does not show insertion into membrane. Simulations starting from extended conformation with HMMM membrane:Structural properties and specifically the conformation of CPPs when interacting with cell membranes play a major role in their cellular uptake mechanisms.[108] Previous studies have shown that there is a direct impact of peptide conformation that modulates the amphipathicity and 40 Figure 2.4: Insertion of peptides into the model membrane A) Heatmap of distance of residues in MPG with respect to phosphate plane of hte closest leaflet with respect to simulation time for orientation 1. B) Heatmap of distance for Hst5 orientation 1. C) Interaction energies of residues in MPG with membrane from last 10ns of simulation with full-membrane model D) Snapshot of simulation for MPG at 300ns (red spheres show phosphate groups and grey lines show acyl tail of the membrane E) Snapshot of simulation for Hst5 at 300ns (showed as sticks are the K and R residues on Hst5 which are interacting with phosphate plane of the membrane 41 Figure 2.5: Heatmap plots for Hst5 and MPG starting with helical conformation predicted by PEP- FOLD3 a) Hst-5 in orientation 2 b) Hst-5 in orientation 3 c) MPG in orientation 2 d) MPG in orientation 3 membrane interaction of CPPs.[347] Hst5 is known to be disordered in aqueous solution and adopts a helical conformation in trifluoroethanol and DMSO solution.[239] MPG has been seen to be unstructured in aqueous solution and adopts a partially ? -sheet confor- mation upon interaction with vesicles made of phospholipids of DOPC/DOPG. It is worth mentioning that the CD experiments for MPG was done using vesicles of DOPC and DOPG which does not represent the full structure of fungal cell membranes.[276] Other stud- ies demonstrated that MPG remains random coil upon interaction with fungal cells.[108] Swiecicki et al. studied the effect of membrane composition on the internalization of sev- eral cationic CPPs using fluorescence quenching and showed that the internalization effi- cacy of CPPs such as penetratin and Tat greatly depend on the membrane composition.[288] The membrane composition of the cell can affect the conformation of cell penetrating pep- tides as well as their internalization efficacy.[53, 306] To this end, we investigated the effect 42 of conformation on peptide-membrane interaction of MPG and Hst5 using MD simulation with HMMM model. Since these initial simulations were biased to the initial structures being helical, we started new simulations from extended conformations of the peptides in the water phase. 
To this end, we investigated the effect of conformation on peptide- membrane interaction of MPG and Hst5 using MD with the HMMM membrane. 1 ?s simulations were performed in three replicates of extend conformations of MPG and Hst5 using HMMM model for the fungal membrane composition studied in the first part. Simi- lar to simulations starting from helical peptides, these results showed that MPG entered the membrane from its N-terminal hydrophobic residues, while Hst5 binds to the membrane surface and does not show penetration into the membrane. The simulations for MPG were converted to full-membrane model and simulated for an extended 200 ns to study its inter- action with a full-membrane model. The heatmap for distances of every residue in MPG to the phosphate plane of the entering leaflet are shown in Figure 2.6. Secondary structures of MPG during interaction with membrane was analyzed for the 1 ?s trajectories and is shown in Figure 2.6C. MPG adopts a random coil conformation in the C-terminal domain and interaction with membrane induces conformational change to ? -sheet in few residues such as A2, L3, F4, L8, G9, A10 from 100 to 600 ns of simulation. After 600 ns of simula- tion and deeper insertion to the membrane, most of the N-terminal residues adopt a helical conformation. Upon formation of ?-helical conformation, the peptide inserts deeper into membrane, as shown in the heatmap plot for MPG after 900 ns (Figure 2.6A), and most of the peptide residues are below the phosphate headgroup plane of the membrane. MPG adopts a helical conformation in the N-terminal region at 200 ns. Secondary structure and penetration of other orientations of MPG and Hst5 (extended conformation with HMMM membrane) are shown in figure 2.7 (MPG) and 2.8 (Hst5). In the second orientation of MPG after 400 ns the peptide loses its secondary structure in N-terminal region and adopts a helical structure in the C-terminal region (Figure 2.8C). Replicate 3 of MPG shows bind- 43 Figure 2.6: Results for MPG (A,C) and Hst5 (B,D) starting from extended peptide conformations in solvent A) Heatmap of distances of MPG first replica with respect to phosphate plane of hte nearest leaflet B) heatmap of distances of Hst5 first replicate C) secondary structure of MPG during simulation D) secondary structure of Hst5 during simulation E) Snapshots of MPG during all-atom simulation ing to the phosphate plane of membrane after 250 ns as shown in the heatmap plot (Figure 2.8B). Interestingly, it adopts a ? -sheet conformation in residues F7, L8, G9, S13, T14, and M15, and loses this ? -sheet conformation after about 600 ns of simulation which coincides with deeper insertion of MPG into membrane from the N-terminal region (figure 2.8D). In contrast to MPG which shows entry into membrane, Hst5 only binds to the phosphate plane of the membrane and fails to insert below the headgroup region of membrane. How- ever, a few hydrophobic residues such as Y10 and F14 insert below the headgroup region of membrane as shown in figure 2.6B,D and figure 2.7. Hst5 does not form any confor- 44 Figure 2.7: a) heatmap plot for Hst-5 (orientation-2) showing distance of every residue with respect to phosphate plane of nearest leaflet b) heatmap plot for Hst-5 (orientation-3) c) secondary structure of Hst-5 in orientation 2 during simulation d) secondary structure for Hst-5 orientation 3 mation when interacting with membrane during the 1 ?s simulation which is consistent in all replicas of Hst5-membrane systems. 
The Karlsson lab recombinantly expressed CPP-GFP fusions and tested cellular uptake in Candida albicans cells using flow cytometry. Based on their results, MPG-GFP significantly improved the translocation of GFP into the cells, while Hst5-GFP had no significant effect on GFP translocation into Candida albicans (figure 2.9). Interestingly, they showed that the orientation of MPG in the MPG-GFP construct affects the translocation efficacy: MPG attached to the N-terminus significantly improves cargo translocation, whereas in the construct with MPG at the C-terminus the translocation of GFP was insignificant compared to the control with no CPP attached. The effect of orientation on translocation of the MPG-GFP constructs is shown in figure 2.10.

Figure 2.8: a) Heatmap plot for MPG (orientation 2) showing the distance of every residue with respect to the phosphate plane of the nearest leaflet b) heatmap plot for MPG (orientation 3) c) secondary structure of MPG in orientation 2 during simulation d) secondary structure of MPG in orientation 3.

Figure 2.9: Cellular uptake studies done in the Karlsson lab. The flow cytometry data for 24 h incubation of samples were analyzed for 7 replicates to quantify translocation and membrane permeabilization in C. albicans. The percentage of fluorescence-positive cells was used to evaluate GFP delivery efficacy. The permeability of the cells was evaluated after treatment with the fusion protein using propidium iodide (PI). A) Translocation data at 24 h for 7 replicates showed significantly higher uptake of MPG-GFP compared to both GFP and Hst5-GFP. B) Propidium iodide uptake recorded at the same times showed no significant uptake of PI. Error bars represent the standard error of the mean for 7 replicates in panels (A, B). Image credit: Karlsson lab.

Figure 2.10: Effect of time and cargo orientation on MPG-mediated delivery of GFP to C. albicans. Purified protein (100 μM) with GFP attached to MPG at either the N-terminus or C-terminus, and controls with no CPP, were incubated with cells. Translocation was quantified using flow cytometry. Image credit: Karlsson lab.

Effect of membrane composition on internalization and secondary structure of MPG: Various experimental and simulation techniques have shown that the structural state of most CPPs is highly dependent on the lipid types. For instance, penetratin adopts a variety of conformations, from β-sheet to α-helical and unstructured, in the presence of different concentrations of charged lipids.[53] Swiecicki et al. studied the effect of membrane composition on the internalization of several cationic CPPs using fluorescence quenching and showed that the internalization efficacy of CPPs such as penetratin and Tat depends greatly on the membrane composition.[288] The membrane composition of the cell can thus affect the conformation of cell penetrating peptides as well as their internalization efficacy.[306] This is also true for MPG, where CD experiments with vesicles of DOPC and DOPG showed the presence of a β-sheet conformation.[208] However, experiments with fungal cells showed a helical conformation for MPG.[108] Furthermore, the results from the HMMM model could be biased by the highly hydrophobic solvent in the core of the membrane, which can induce the formation of α-helical structure.
In the next step of this study, we investigated the interaction and secondary structure of MPG in the presence 47 full membrane model of various compositions of the negatively-charged phosphatidylglyc- erol (PG). In this part of our study, we investigated the structure and interaction of MPG with 4 different membrane compositions 1) DOPC/DOPG(1:0) 2) DOPC/DOPG(4:1) 3) DOPC/DOPG(3:2) 4) DOPC/DOPG (2:3). We simulated all these system for 11 ?s each with a special-purpose supercomputer called Anton2 and using CHARMM36 forcefield for both lipids and protein. The results were consistent with our earlier study with HMMM. We have observed that the presence of charged DOPG lipids in the membrane induces per- sistent and long ?-helical conformation in the N-terminal of MPG. Formation of a helical conformation is also concurrent with deeper insertion of MPG into membrane. Although the peptide is mostly surface bound for all the charged lipids, we think that the occurrence of TM state for MPG requires longer simulation time or a higher temperature which was shown by Ulmschneider et. al. [299] Moreover, we also observed short ? -sheet forma- tion in N-terminal of MPG which was transient and replaced with ?-helical conformation upon deeper insertion of MPG into membrane. Snapshots of MPG in DOPC/DOPG(4:1) is shown in figure 2.11. Initially a ? -sheet structure forms when MPG contacts the phosphate plane of the membrane. This is transient and after about 2 ?s the peptide becomes unfolded on top of the membrane and then forms a helical structure at about 3 ?s. Formation of a fully helical structure at the N-terminal region coincides with deeper penetration of MPG into membrane. Heatmaps of distances of MPG residues to the phosphate plane and pep- tide secondary structure are shown for two membrane compositions DOPC/DOPG(1:0) or DOPC-100 and DOPC/DOPG(4:1) or DOPC-80 in figure 2.12 and 2.13. 2.1.3 Discussion and Conclusion Here we studied the interaction of CPPs MPG and Hst5 with model membranes using MD simulations. HMMM model was used for the membrane in our initial study to accel- 48 Figure 2.11: snapshots of MPG with DOPC(80%)-DOPG(20%) Figure 2.12: Heat map of insertion depth and secondary structure of MPG for A, C) 100% DOPC membrane B,D) 80% DOPC-20%DOPG membrane E) snapshot of MPG interaction with 100% DOPC F) snapshot of interaction of MPG with 80% DOPC 20% DOPG membrane 49 Figure 2.13: Heat map of insertion depth and secondary structure of MPG for A, C) 60%DOPC- 40%DOPG membrane B,D) 40% DOPC-60%DOPG membrane E) snapshot of MPG interaction with 60%DOPC-40%DOPG F) snapshot of interaction of MPG with 40% DOPC-60% DOPG mem- brane 50 erate the membrane insertion of peptides. MPG and Hst5 were predicted to adopt a helical conformation using PEPFOLD3. It was shown that during the simulation MPG inserts into the membrane from its hydrophobic N-terminus. However, Hst5 fails to insert into membrane and remains attached to the phosphate plane. Experiments done in Karlsson lab confirms the simulation results. The translocation of MPG with GFP as the cargo protein was significantly higher than GFP alone whereas Hst5 made no significant improvement of the GFP translocation. On the other hand, they showed that the orientation of MPG in the MPG-GFP constructs affects the translocation where MPG at the N-terminal had a sig- nificantly higher translocation than MPG at the C-terminal of the constructs. This is also in line with our simulation results where MPG enters the membrane from its hydrophobic N-terminus. 
It is therefore reasonable to assume that placing MPG at the C-terminus of the construct, prevents effective interaction of hydrophobic residues at the N-terminus of MPG with membrane and lowers its translocation. Secondary structure of CPPs, plays a crucial role in their uptake mechanism as well as their translocation efficacy. Structural studies have been performed before for MPG and Hst5. As discussed before, MPG has been shown to adopt a partially ? -sheet confor- mation upon interaction with vesicles made of phospholipids of DOPC/DOPG in exper- iments. Other studies have shown that MPG remains unstructured upon interaction with fungal cells. Membrane composition of cells can affect the conformation of cell penetrat- ing peptides as well as their internalization efficacy.[53] Simulations starting from extended peptide conformation and HMMM fungal cell membrane showed that MPG has a partially folded ? -sheet conformation when interacting with the phosphate plane of the membrane but upon deeper insertion into the membrane core, it adopts a helical conformation. This shows that the conformational change of peptide from ? -sheet which is mostly at the in- terface with phosphate headgroups to ?-helical when inside the membrane facilitates the translocation and deeper insertion of MPG. Hst5 on the other hand does not form any 51 secondary structure and remains unstructured during the simulation in all replicas. The hy- drophobic solvent used in HMMM model is likely to affect the secondary structure of pro- teins inserted into membrane. In order to avoid the artifact of HMMM model and also study the effect of different concentrations of charged lipids (DOPC/DOPG ratio) we ran long- timescale simulations of MPG interaction with different concentrations of DOPC/DOPG using Anton2 supercomputer. These simulations showed that MPG has a ? -sheet con- formation upon making contact with the membrane but after deeper insertion it adopts a helical conformation. Moreover, the helical conformation was maximum at 20% DOPG concentration which is the natural concentration of negatively charged lipids. The 100% DOPC concentration of membrane showed the smallest helical conformation and also the slowest penetration of peptide into membrane which points to the importance of charged lipids for helical conformation of MPG and its insertion into membrane. With the com- bined knowledge gained from experimental and simulation studies we are better equipped to design better CPPs and study their translocation into the fungal pathogen. This study will further motivate the use of both experiments and simulations to design better CPPs to enable cargo delivery. 52 2.2 Deep generative models for Antimicrobial peptide discovery Antimicrobial resistance causes ?2.8 million resistant infections yearly which leads to more than 700,000 deaths globally. This is expected to rise to 10 million deaths per year by 2050 if the current trend continues.[70, 161, 221] Of particular importance is Multi- drug resistant Gram-negative bacteria. Naturally occurring antimicrobial peptides (AMPs) have remained effective to combat pathogens despite their ancient origins and continuous contact with pathogens. Therefore, AMPs are deemed as ?drugs of last resort? for their ability to combat multi-drug resitant bacteria. AMPs are usually 12-50 amino acids long and are typically rich in cationic residues (R and K) as well as hydrophobic (A, C and L) amino acids. 
The mechanism of action of AMPs depends on their sequence but they generally act by disrupting the membrane or through other routes such as binding to DNA and essential cytoplasmic protein and inhibiting their function.[171]. There have been numerous studies on generating new AMPs and/or improving their activity which resulted in some successful AMPs.[75, 293] These have been generated usu- ally through expert knowledge and rational design approaches which could be very costly due to vast space of peptides. There are some limitations in using current AMPs such as their relatively low half-lives, unknown toxicity to human cells, and relatively high produc- tion costs.[195, 107, 198] On the other hand, due to the vast space of peptide sequences, computational techniques are necessary for discovery of novel AMPs with desired prop- erties. Generative models in artificial intelligence have previously shown great promise in material and drug discovery.[270, 255, 352] Deep learning have been previously used in peptide identification, property prediction and peptide generation.[52] Specifically, deep generative models have been used for generating antimicrobial, anticancer, immunogenic and signal peptides to name a few. Computational methods using recurrent neural networks (RNNs)[210], VAEs[71] and generative adversarial networks (GANs) [113, 297] showed 53 the promise of these methods for AMP discovery in silico. In this study, we use variational autoencoders (VAEs) to learn a meaningful latent space of AMPs and generate novel AMPs from this latent space. A variational autoencoder[151] encodes data into a latent space and decodes it back to the original data and optimizes a vari- ational lower bound of the log-likelihood of the data. Since we are dealing with sequences as done in natural language processing (NLP), we use recurrent neural networks (RNNs) as both encoders and decoders in what is known as sequence-to-sequence models. Due to complexity of natural language and sequential nature of data, these models are harder to train than other types of neural networks. However, Bowman et al.[31] showed that a seq- to-seq VAE is able to generate meaningful and novel sentences from the learnt continuous latent space. Attention mechanism proposed for translation originally has made a great leap in NLP tasks.[303] In attention mechanism, source information is summarized into a context vector using a weighted sum where weights are learned probabilistic distributions. This context vector is used during the decoding process to guide the decoder into what word in the sequence was most important during decoding. Attention was shown to signifi- cantly improve almost every task in seq2seq models such as translation [64], summarization [251], etc. However, Bahuleyan et al.[5] showed that using a deterministic attention where the source information is directly provided during decoding can lead to a phenomenon called the ?bypassing? where the variational latent space is not meaningful since the atten- tion mechanism is too powerful. Thus, they proposed a variational attention mechanism to address this problem where the attention vector (context vector) is modeled as a random variable by imposing a prior Gaussian distribution. They evaluated this model on question generation and dialog systems and showed that the variational attention achieves a higher diversity than deterministic attention while retaining high quality of generated sentences. 
In this study, we used a variational-attention variational autoencoder in a seq2seq approach to generate novel, high-quality and diverse AMPs. Moreover, we trained a binary classifier network using an attention mechanism to evaluate the peptides produced by the generative model. The generated peptides were also analyzed for their physicochemical properties and compared with real antimicrobial peptides.

2.2.1 Methods

The training data for the AMP prediction model were assembled by combining AMPs from multiple databases, including DRAMP [148], LAMP2 [342], DBAASP [234] and APD3 [311]. All AMP sequences had a length of 5-30 amino acids. To exclude repetitive sequences from our dataset, we used CD-HIT with a cutoff of 0.35, which resulted in 16,808 AMP sequences. Since there is no established dataset of non-AMPs, we built a non-antimicrobial dataset from UniProt by excluding the keywords antimicrobial, antibiotic, antibacterial, antiviral, antifungal, antimalarial, antiparasitic, anti-protist, anticancer, defense, defensin, cathelicidin, histatin, bacteriocin, microbicidal and fungicide. The final non-AMP dataset had 16,808 examples. We also made sure the positive and negative datasets have similar length distributions to avoid bias. The code for AMP prediction and generation can be found at https://github.com/ghorbanimahdi73/AMPGen.

2.2.2 AMP prediction model

We trained a model on both the AMP and non-AMP datasets for antimicrobial prediction. The architecture of the model is shown in figure 2.14. The antimicrobial classification network contains an embedding layer, a 2D convolution and a bidirectional LSTM with a context attention layer, followed by a sigmoid activation for binary classification of peptide sequences. The dataset was split into a training set (70%) and a validation set (30%). The output of the prediction model is a probability score for each sequence: sequences with a score > 0.5 are considered AMP and those with a score < 0.5 are considered non-AMP.

Figure 2.14: An illustration of the classification network used for evaluating the generated AMPs.

We used a binary cross-entropy loss and an Adam optimizer for training the network. Early stopping was applied if the validation loss did not improve for 5 consecutive epochs during training. The weights of the model with the best validation accuracy were selected as the optimal model, and a 10-fold cross-validation was applied to tune the hyperparameters. In the AMP prediction network, the peptide sequences are first transformed into sequences of integers from 1 to 20, which are then embedded into 2D matrices in the embedding layer of the network. The optimal embedding size was found to be 64 dimensions. A 2D convolution is then applied to the embedded sequences using 64 convolutional filters of size 3. The output of the convolutional layer goes into a bidirectional LSTM, which processes the matrices for each residue in both the forward and backward directions, and the output is the sum of the two directions. The tuned bi-LSTM hidden dimension was 64. The attention layer then gathers the hidden states of the bi-LSTM and computes a weighted sum of all hidden states as:

α_j = exp(a_j) / Σ_{j′} exp(a_{j′})    (2.1)

where a_j is the attention score obtained by applying a linear transformation to the bi-LSTM outputs followed by a ReLU activation, and α_j is the attention weight. The output of the attention layer is then the weighted sum Σ_j α_j h_j, where h_j is the j-th hidden state of the bi-LSTM output. The output of the attention layer finally goes into the sigmoid activation for antimicrobial prediction.
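A minimal PyTorch sketch of this classifier is given below. The layer sizes follow the text (embedding 64, 64 filters of width 3, bi-LSTM hidden size 64); the use of a width-3 Conv1d in place of the 2D convolution, and all remaining details, are simplifying assumptions rather than the exact implementation.

import torch
import torch.nn as nn

class AMPClassifier(nn.Module):
    """Sketch of the embedding -> convolution -> bi-LSTM -> attention -> sigmoid
    classifier described above (sizes follow the text, details are assumed)."""

    def __init__(self, vocab_size=21, embed_dim=64, n_filters=64, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(n_filters, hidden, batch_first=True, bidirectional=True)
        self.att_score = nn.Linear(hidden, 1)      # linear transform for scores a_j
        self.out = nn.Linear(hidden, 1)
        self.hidden = hidden

    def forward(self, x):                          # x: (batch, seq_len) token ids 1-20
        e = self.embed(x)                          # (batch, seq_len, embed_dim)
        c = torch.relu(self.conv(e.transpose(1, 2))).transpose(1, 2)
        h, _ = self.lstm(c)                        # (batch, seq_len, 2*hidden)
        h = h[..., :self.hidden] + h[..., self.hidden:]   # sum the two directions
        a = torch.relu(self.att_score(h))          # pre-normalized attention scores
        alpha = torch.softmax(a, dim=1)            # attention weights, eq. (2.1)
        context = (alpha * h).sum(dim=1)           # weighted sum of hidden states
        return torch.sigmoid(self.out(context)).squeeze(-1)

Training then pairs this module with nn.BCELoss and the Adam optimizer, with early stopping on the validation loss as described above.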
The final model had 93% accuracy under 10-fold cross-validation, which is comparable to other AMP-prediction models such as AMPlify and ACEP with 93.7% and 92.6% accuracy, respectively. However, the goal of this study is not antimicrobial prediction; this network was trained in order to evaluate the peptides generated by our generative model.

2.2.3 Variational autoencoder

A traditional VAE, as proposed by Kingma and Welling, encodes data into a latent space and then decodes it to reconstruct the input data. The network is trained to optimize the variational lower bound of the log-likelihood of the data. Since we are dealing with sequences, as in natural language processing (NLP), recurrent neural networks (RNNs) are typically used as encoders and decoders in what are known as sequence-to-sequence (seq2seq) models. Bowman et al.[31] trained a seq2seq VAE and used the continuous latent space to generate new text. A model with useful information in the latent space will have a non-zero KL term and a relatively small cross-entropy term. However, in a standard VAE the KL term becomes vanishingly small and the model reduces to an RNN language model: the decoder learns to ignore the latent vector z and to rely only on the input data provided at each step of decoding. Two techniques proposed by Bowman et al. to mitigate this issue are both used here: 1) KL annealing and 2) word dropout. For KL annealing, we add a variable weight to the KL term in the loss function during training. This weight is set near zero at the beginning of training and then increases to a maximum value toward the end of the training process, which ensures that at the beginning the model learns enough information from the latent space. Word dropout weakens the decoder by removing some of the conditional input information during training, forcing the model to rely more on the latent code.

The attention mechanism has transformed natural language processing, enabling the training of enormous models and achieving high accuracies. In the attention mechanism, the source information is summarized in an attention vector using a weighted sum of the hidden states of the source sentence, where the weights are learned probabilistic distributions. This attention vector is then fed directly to the decoder at each step during decoding. The attention mechanism has been shown to improve the performance of models in translation,[64] summarization [251] and other NLP tasks. However, it was also shown that deterministic attention can serve as a bypassing mechanism, so that the latent space fails to learn the distribution of the data because the attention is too powerful.[5] Here we use a variational autoencoder with variational attention for the generation of novel antimicrobial peptides. The different parts of the network are described in detail below.

Encoder

In our model, the encoder is a GRU parameterized by θ_E. The encoder network takes the input sequence x = x_1, ..., x_n and outputs the hidden representations of the sequence h = h_1, ..., h_n, where n is the length of the sequence:

h_i = GRU(h_{i−1}, x_i; θ_E)    (2.2)

Two dense layers are then used to learn the mean vector μ_z and the standard deviation vector σ_z, and the latent variable z is sampled from the Gaussian distribution N(μ_z, diag(σ_z²)).
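A sketch of this encoder with its Gaussian latent code is shown below. The hidden and latent sizes (128 and 32) follow the text, the reparameterization trick is used for sampling z, and the remaining details are assumptions rather than the exact implementation.

import torch
import torch.nn as nn

class PeptideEncoder(nn.Module):
    """Sketch of the recurrent encoder with a Gaussian latent code (eq. 2.2).
    Hidden size 128 and latent size 32 follow the text; the rest is assumed."""

    def __init__(self, vocab_size=24, embed_dim=64, hidden=128, latent=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent)        # mean vector mu_z
        self.to_logvar = nn.Linear(hidden, latent)    # log of sigma_z^2

    def forward(self, x):                             # x: (batch, seq_len) token ids
        h_all, h_last = self.gru(self.embed(x))       # h_all holds h_1 ... h_n
        mu = self.to_mu(h_last[-1])
        logvar = self.to_logvar(h_last[-1])
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
        return h_all, z, mu, logvar

The full set of encoder hidden states h_all is returned because the variational attention described next needs them.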
Variational Attention

The attention mechanism tries to dynamically align the output sequence x̃ = x̃_1, ..., x̃_n during generation. During decoding, the attention weight at step j of the decoder is computed between all hidden states of the encoder and the hidden state at step j of the decoder as:

α_{ji} = exp(e_{ji}) / Σ_{i′=1}^{n} exp(e_{ji′})    (2.3)

In the above equation, e_{ji} is the pre-normalized score calculated as e_{ji} = (h_j^{x̃})^T W h_i^{x}, where h_j^{x̃} and h_i^{x} are the j-th and i-th hidden representations of the decoder and encoder, respectively, and W is a bilinear term that captures their specific relation. The attention vector is then calculated as a weighted sum:

a_j = Σ_{i=1}^{n} α_{ji} h_i    (2.4)

The posterior q_A(a_j|x) is modeled as another Gaussian distribution:

a_j ∼ q_A(a_j|x) = N(μ_{a_j}, diag(σ_{a_j}²))    (2.5)

Decoder

The decoder is a single-layer GRU. At each step the decoder is provided with the latent code of the encoder, the attention vector computed at that step and the true input sequence:

h̃_j = GRU(h̃_{j−1}, y_{j−1}, a_j, z)

A softmax function at the end is used to predict the next token x̃_j in the sequence given the hidden representation h̃_j:

p(x̃_t) = softmax(W_h h̃_t + b_h)    (2.6)

The loss function of the model with attention at each step of the decoder can be written as:

L_j(θ_D, θ_E, x) = −KL(q_E(z, a|x) || p(z, a)) + E_{q_E(z,a|x)}[log p_D(x̃|z, a)]    (2.7)
                = −KL(q_E(z|x_i) || p(z)) − KL(q_E(a|x_i) || p(a)) + E_{q_E(z|x_i) q_E(a|x_i)}[log p_D(x̃_i|z, a)]

In the above equations, KL is the Kullback-Leibler divergence between two distributions. The posterior q_E(z, a|x) = q_E(z|x) q_E(a|x) is factorized into two distributions since a and z are conditionally independent given x, so sampling can be performed separately for a and z. The overall objective of the VAE with variational attention can then be written as:

L(θ_D, θ_E) = L_rec(θ_D, θ_E, x̃) + β_KL KL[q_E(z|x) || p(z)] + β_a Σ_{j=1}^{n} KL[q_E(a_j|x) || p(a_j)]    (2.8)

The hyperparameters β_KL and β_a are the weights on the KL and attention terms of the loss function. Annealing is applied to the β_KL weight while the β_a weight is kept constant; we used a monotonic annealing schedule from 0 to the maximum weight of the KL term.
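The combined objective of eq. (2.8) is just the reconstruction loss plus the two weighted KL terms, both of which have the closed Gaussian form; a minimal sketch follows, in which the tensor shapes and the averaging convention are assumptions.

import torch
import torch.nn.functional as F

def vae_attention_loss(logits, targets, mu_z, logvar_z, mu_a, logvar_a,
                       beta_kl=0.08, beta_a=0.5, pad_idx=0):
    """Sketch of eq. (2.8): token reconstruction loss plus weighted KL terms for
    the latent code z and the attention vectors a_j (standard-normal priors).
    mu_a/logvar_a are assumed to have shape (batch, seq_len, latent)."""
    rec = F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=pad_idx)
    kl_z = -0.5 * torch.mean(torch.sum(1 + logvar_z - mu_z**2 - logvar_z.exp(), dim=-1))
    kl_a = -0.5 * torch.mean(torch.sum(1 + logvar_a - mu_a**2 - logvar_a.exp(), dim=(-2, -1)))
    return rec + beta_kl * kl_z + beta_a * kl_a

During training, beta_kl is annealed from near zero to its maximum value while beta_a stays fixed, matching the schedule described above.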
The number of convolutional kernels in the con- volution layer were 64 with a size of 3. We also employed dropout with a dropout rate of 0.3 to avoid overfitting. The training and validation accuracy during training is shown in figure 2.13. This model achieved an accuracy of 93.5% under a 10-fold cross validation. The accuracy of our model is comparable to other AMP prediction models such as AMPlify [172] and ACEP [102] which use deep learning models and report accuracies of 92.79% and 91.16 (for sequences less than 30), respectively. Training the generating network For the generative model, we only used the AMP dataset consisting of 16,808 known AMP sequence. For training the generative VAE with variational attention, our architec- 61 Figure 2.15: Training and validation accuracy of the AMP-prediction over the training epochs. Figure 2.16: Illustration of AMP generative model with Encoder, Decoder and variational attention parts ture consisted of 128 hidden units for the Encoder and Decoder which are both single directional LSTM networks, a latent space dimension of 32. The model was trained for 100 epochs. The training process takes about 3 hours on a Tesla V100 GPU. During gener- ation we experimented with different sampling methods such as Beam-search, Temperature sampling, Top-K sampling and Top-P sampling. Top-K sampling gave better results than other sampling methods with K = 5. During the training process, we tokenize the sequences of amino acids into all the twenty natural amino acids and three additional tokens representing the start of the se- quence ??, the end of a sequence ??, the padding ??. In the word- 62 dropout technique, we randomly replace an amino acid with ?? token during train- ing. Since the peptides have different length, we added a padding token to the sequences so that they all have a fixed length of 30 during training and evaluation. The standard VAE where the weight on the KL term is 1, suffers from a KL vanishing problem which leads to i) an encoder that produces posteriors almost identical to Gaussian prior for all sam- ples and ii) Decoder ignores the latent variable and the model reduces to a simple RNN Encoder-Decoder. Bowman et al.[31] proposed two approaches to deal with this problem. A word-dropout which randomly replaces the words in sequences with an unknown token ?? during training to avoid overfitting in the model. And the second approach is KL-annealing where at the start of the training process, the weight of the KL term is small where z is learned to capture useful information for reconstructing x during training. Then the KL-weight increases monotonically to a maximum value. During training, we used a monotonic annealing where the weight increases from a small value (0.001) to a maxi- mum value. Not annealing the KL-term led to a posterior collapse and an uninformative latent space. We experimented with different weights on the KL and attention KL term and the results of generative model evaluation are shown in figure 2.15. For each set of hyperparameters we generated 10,000 peptide sequences from the generative model. The AMP-prediction model was used to predict what fraction of the generated AMPs are in fact antimicrobial. As shown in figure 2.15, increasing the ?KL increases the accuracy of the generative model in generating antimicrobial peptides. In most ?KL values, increasing ?a also increases the accuracy of AMP prediction. BLEU is another metric used here to evaluate the generated peptides. 
BLEU is another metric used here to evaluate the generated peptides. It was originally proposed for evaluating machine translation by comparing the similarity between sentences translated by the model and true human references. Here we used the BLEU score to compare the generated AMPs with the training data of real AMP sequences. The BLEU score is calculated for every sample in the generated dataset S_gen against all the real AMP references S_ref as:

BLEU(S_gen, S_ref) = (1 / |S_gen|) Σ_{s ∈ S_gen} BLEU(s, S_ref)    (2.9)

A higher BLEU score implies more overlap of n-grams between the generated data and the real AMPs. BLEU scores for the VAE without annealing, the VAE with annealing and β_KL = 0.08, and the VAE-attn model with different combinations of β_KL and β_a are reported in table 2.2.

Figure 2.17: Evaluation of the AMP generative model over different values of β_KL and β_a. A) Accuracy of the generative model over 10,000 generated sequences using the trained AMP-prediction model. B) Average perplexity of the generated sequences using an external language model. C) Average BLEU score (BLEU-2 to BLEU-5) of generated sequences. D) Average self-BLEU score (BLEU-2 to BLEU-5) of generated sequences.

External language model: In NLP, given a sequence of words x = (w_1, ..., w_n), a language model estimates the probability distribution P(x) over it. A popular choice is autoregressive language modeling, where:

P(x) = P(w_1, ..., w_n) = Π_{i=1}^{n} p(w_i | w_1, ..., w_{i−1})    (2.10)

The likelihood P(x) of a sequence of words can be used as a proxy for its quality. RNNs and LSTMs are common architectures for autoregressive modeling: they are trained to predict the next word given the current word w_i and the hidden state of the previous step h_{i−1}, which is equivalent to maximizing the marginal likelihood P(x) of the sequences in the training data. RNN language models have been used to compute the perplexity of generated text, which is a measure of the fluency of machine-generated text.[349] We trained an LSTM language model on the AMP dataset: the data were split into a training (70%) and a held-out test (30%) set, a character-level LSTM was trained on the training set, and the perplexity was calculated on the held-out set. Our best model achieved a perplexity of 6.0 on the held-out set. Figure 2.17B shows the change in perplexity as β_KL and β_a are varied in the VAE-attn model; a higher β_KL gives a higher perplexity.

Table 2.2: BLEU, accuracy and perplexity of a few selected models
Model                              BLEU-3   BLEU-4   BLEU-5   BLEU    ACC     PPL
VAE (no anneal)                    0.999    0.984    0.911    0.974   99.5    5.3
VAE (β_KL=0.08)                    0.998    0.962    0.826    0.947   96.2    8.09
VAE-attn (β_KL=0.04, β_a=0.5)      0.997    0.921    0.711    0.907   88.7    11.82
VAE-attn (β_KL=0.04, β_a=1.5)      0.998    0.928    0.722    0.912   89.8    11.48
VAE-attn (β_KL=0.06, β_a=0.5)      0.998    0.943    0.769    0.927   93.6    10.26
VAE-attn (β_KL=0.06, β_a=1.5)      0.998    0.944    0.773    0.929   92.84   10.13
VAE-attn (β_KL=0.08, β_a=0.5)      0.998    0.957    0.818    0.943   95.6    8.8

Table 2.2 shows the perplexity of a standard VAE (without annealing), a β-VAE with β_KL = 0.08, and other selected models.
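Given a trained autoregressive language model, the perplexity reported above is simply the exponential of the average per-token negative log-likelihood. A minimal sketch follows; the interface of the lm object (token ids in, next-token logits out) and the padding convention are assumptions.

import math
import torch
import torch.nn.functional as F

def perplexity(lm, sequences, pad_idx=0):
    """Perplexity = exp(average per-token negative log-likelihood) under an
    external character-level language model `lm` (assumed interface)."""
    nll, count = 0.0, 0
    with torch.no_grad():
        for seq in sequences:                      # seq: 1D tensor of token ids
            logits = lm(seq[:-1].unsqueeze(0))     # predict each next token
            loss = F.cross_entropy(logits.squeeze(0), seq[1:],
                                   ignore_index=pad_idx, reduction='sum')
            nll += loss.item()
            count += int((seq[1:] != pad_idx).sum())
    return math.exp(nll / count)

Lower perplexity means the generated peptides look more fluent to the external model, which is why the posterior-collapsed standard VAE scores deceptively well on this metric.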
This shows that the standard VAE has collapsed to a denoising autoencoder which is just reconstructing back the original data. In order to evaluate the diversity of the generated AMPs we used self-BLEU score which assesses the similarity between every generated sequence and the rest of the gener- ated dataset. Lower self-BLEU score implied higher diversity of the generated text. The BLEU and self-BLEU scores for different ?KL and ?a weights are shown in figure 2.15C,D. Increasing ?KL increases the BLEU and self-BLEU scores. We noticed that at higher ?KL 66 values the difference between different ?a values are more noticeable. The comparison of self-BLEU for real AMPS, standard VAE (no annealing), ? -VAE (?=0.08) and other selected models is shown in table 2.3. The self-BLEU for real-AMPs is 0.968. This high value is due to the choosing a small cutoff (0.35) for removing repetitive AMPs in the data curation process which was chosen to maximize the number of selected AMPs for the generative model. Standard VAE without annealing shows a higher self-BLEU than the real AMPS which shows a very low diversity of generated sequences. ?VAE also shows a higher self-BLEU than other VAE-attn models. The KL divergence for each model was also calculated against the validation AMP dataset. A higher KL implies a higher differ- ence between real and generated AMP distributions. However, a very high KL could lead to the model only generating random sequences. A comparison of KL for different models is shown in table 2.3. Standard VAE have a KL of 0 which shows the posterior collapse and the model becoming a RNN language model. The KL for ?VAE is 1.1 which is also lower than VAE-attn models which points to lower divergence more similarity of ?VAE generated sequences to the real AMPs. We also investigated the generated antimicrobial peptides through their physicochem- ical properties such as length, charge, hydrophobicity and hydrophobic moment. Specifi- cally the generated sequences are rich in amino acids such as Lys, Leu, Arg, Ala, Ile, Gly, Val, Phe and Trp in that order. (figure 2.16) As shown in the distribution of charges for generated peptides, most of them have a positive net charge due to presence of Lys and Arg residues. Furthermore, the hydrophobic moment shows that most of these are ?-helix peptides. These observations highlight the close properties of the generated peptides with the dataset of real antimicrobial peptides which point to the fact that the generated peptides have antimicrobial activities. 67 Table 2.3: self-BLEU (sBLEU) for 3,4 and 5-grams and KL divergence Model sBLEU-3 sBLEU-4 sBLEU-5 sBLEU KL real AMPS 0.998 0.967 0.909 0.968 - VAE (no anneal) 0.998 0.984 0.911 0.973 0 VAE (?KL=0.08) 0.998 0.962 0.826 0.946 1.1 VAE-attn (?KL=0.04, ? =0.5) 0.999 0.929 0.764 0.923 13.1a VAE-attn (?KL=0.04, ? =1.5) 0.994 0.9333 0.771 0.924 12.4a VAE-attn (?KL=0.06, ? =0.5) 0.993 0.941 0.819 0.938 7.0a VAE-attn (?KL=0.06, ? 0.994 0.944 0.828 0.942 6.4a=1.5) VAE-attn (?KL=0.08, ? =0.5) 0.998 0.957 0.818 0.943 3.2a 2.2.5 Conclusion Antimicrobial peptides have shown great potential as alternative therapeutics for bacte- rial resistance. In this study, we use deep learning generative model attention based vari- ational autoencoder to generate novel and high quality sequences of AMPs. A bypassing phenomena has been observed when using deterministic attention in a VAE framework. 
2.2.5 Conclusion

Antimicrobial peptides have shown great potential as alternative therapeutics against bacterial resistance. In this study, we used a deep generative model, an attention-based variational autoencoder, to generate novel, high-quality AMP sequences. A bypassing phenomenon is observed when deterministic attention is used in a VAE framework: the bypassing makes the latent space uninformative, and sampling from this space gives essentially random results during generation. Since attention has been shown to improve many NLP tasks, it is tempting to include an attention mechanism in our model, but deterministic attention is not suitable for the sequence generation task on the same dataset. We therefore opted for the variational attention approach of Bahuleyan et al.,[5] in which the attention vector is modeled as a random variable with an imposed Gaussian prior, and used this variational-attention VAE to generate novel AMPs. The generated AMPs from our best model were evaluated with an antimicrobial prediction network, which assigned them a probability of being antimicrobial above 95%. We also used evaluation metrics such as BLEU, self-BLEU and perplexity, which indicated the high quality of the generated sequences. Moreover, we compared the physicochemical properties of the generated peptides with those of real AMPs and found close agreement. Future directions of this work include post-generation evaluation models, such as regression models predicting antimicrobial (MIC) activity and a toxicity predictor, to select a few promising peptides, followed by further evaluation with MD simulations and experimental validation.

Chapter 3: Markov modeling and machine learning

3.1 Markov modeling of conformational fluctuations in β2-microglobulin

3.2 Introduction

β2-microglobulin (β2m) is a 99-residue protein subunit of the major histocompatibility complex I.[229] Upon renal failure, the concentration of β2m in the serum increases by 60-fold, which causes fibril formation. Individuals with kidney impairment undergoing haemodialysis have a high serum concentration of β2m, about 30–50 mg/mL compared with the normal level of 0.3–30 mg/mL, because the dialysis membrane cannot effectively remove the protein.[18] The high concentration of β2m is known to be the major cause of the fibrillogenesis associated with dialysis-related amyloidosis (DRA).[123, 63] One of the first steps in protein aggregation involves partial unfolding or misfolding of monomeric species to initiate aggregation and amyloid formation.[85] Moreover, monomers of β2m are highly stable under physiological conditions, even at high concentrations in vitro.[89] Intermediate states in the folding of β2m have been identified that adopt a non-native trans conformation at Pro32.[4, 278] This intermediate is a known precursor of enhanced β2m fibrillogenesis and is known to form stable isomers.[56, 166] However, the trans conformation alone is not sufficient to induce amyloid formation, as mutants of β2m have been identified in which the trans conformation is dominant yet spontaneous amyloid formation is not observed.[235] This indicates that other structural changes are involved in the misfolding of β2m, and that the structural, thermodynamic and kinetic properties of β2m conformational changes need to be investigated to unveil its amyloid propensity.[63] Here we set out to study the folding landscape of β2m to identify metastable misfolded states with potential aggregation propensity. Experimental techniques such as NMR and cryo-EM only provide static snapshots of the most populated conformational states, while other techniques such as FRET are limited in resolution and cannot give atomic-level detail of the dynamics.
On the other hand, molecular dynamics simulation has proven to be a useful tool for providing atomic-level detail of biological processes such as protein folding and protein conformational heterogeneity. This usually requires generating large amounts of data on fast supercomputers or distributed computing platforms such as Folding@home,[160] which makes interpretation and extraction of biologically relevant information a challenging task. Recently, Markov state models (MSMs) have been increasingly adopted for analyzing the high-dimensional data from MD simulations. In this framework, the dynamics of the biological system is described by memoryless jumps between discrete conformational states. In this study we characterize the dynamic conformational landscape of β2m to identify aggregation-prone intermediate metastable states using molecular dynamics simulations. MSM analysis was applied to obtain the thermodynamics and kinetics of β2m misfolding, which gives important insights into the first stage of β2m aggregation. Metadynamics simulations were performed to sample misfolded and near-folded conformations of β2m and to seed the MSM simulations. We then accumulated 250 μs of MD simulation trajectories of β2m to perform the MSM analysis.

3.2.1 Methods

Metadynamics and conventional simulations of β2m: Metadynamics simulation was used to sample different conformations of β2m close to the native state. In metadynamics,[9] one picks a few relevant (slow) collective variables (CVs), and an external history-dependent bias potential, constructed as a sum of Gaussian kernels, is added to the simulation in the space of these CVs. The idea of metadynamics is to push the system away from local minima and to visit new states in collective-variable space. More details on metadynamics are given in the enhanced sampling section of chapter 1 of this dissertation. Three collective variables were chosen following a previous study on β2m:[162] 1) the β-sheet content of the protein, 2) a phipsi collective variable containing all φ and ψ backbone torsion angles and 3) the RMSD with respect to the folded state. We performed metadynamics simulations with two collective variables at a time, combining every pair of the CVs above, for a total of three metadynamics simulations of 500 ns each: metadynamics-1 (β-sheet content and phipsi CVs), metadynamics-2 (β-sheet content and RMSD CVs) and metadynamics-3 (RMSD and phipsi CVs). In each metadynamics simulation, Gaussians were deposited every 2 ps with a height of 2 kJ/mol and a bias factor of 10. All simulations were run at a temperature of 340 K to further enhance the conformational transitions of β2m. This temperature is below the experimental melting point of β2m (357.6 K), which avoids complete unfolding of the protein.[257] Metadynamics simulations were performed using GROMACS and PLUMED.[294]
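The bias-deposition protocol above can be illustrated with a minimal, self-contained sketch of well-tempered metadynamics on a single toy collective variable. The Gaussian height (2 kJ/mol), deposition pace (2 ps at a 2 fs timestep) and bias factor (10) mirror the values quoted in the text, while the double-well potential, the Gaussian width and the overdamped toy dynamics are illustrative assumptions, not the actual GROMACS/PLUMED setup.

```python
# Toy well-tempered metadynamics on one 1D collective variable (sketch only).
import numpy as np

kB = 0.008314               # kJ/(mol K)
T, gamma = 340.0, 10.0      # temperature and bias factor, as in the protocol
height0, sigma = 2.0, 0.1   # initial Gaussian height (kJ/mol) and assumed width
dt, pace = 0.002, 1000      # 2 fs timestep; deposit every 1000 steps = 2 ps

def dU(s):                  # gradient of a toy double-well potential U = s^4 - 2 s^2
    return 4.0 * s * (s * s - 1.0)

centers, heights = [], []

def bias(s):
    if not centers:
        return 0.0
    c, h = np.array(centers), np.array(heights)
    return np.sum(h * np.exp(-0.5 * ((s - c) / sigma) ** 2))

def dbias(s):
    if not centers:
        return 0.0
    c, h = np.array(centers), np.array(heights)
    g = h * np.exp(-0.5 * ((s - c) / sigma) ** 2)
    return np.sum(g * -(s - c) / sigma ** 2)

rng = np.random.default_rng(0)
s = -1.0
for step in range(200_000):
    # overdamped Langevin step on the biased potential U(s) + V_bias(s)
    force = -(dU(s) + dbias(s))
    s += force * dt + np.sqrt(2.0 * kB * T * dt) * rng.normal()
    if step % pace == 0:
        # well-tempered scaling: deposited height shrinks as the bias accumulates
        h = height0 * np.exp(-bias(s) / (kB * T * (gamma - 1.0)))
        centers.append(s)
        heights.append(h)

print(f"deposited {len(centers)} Gaussians; final height {heights[-1]:.3f} kJ/mol")
```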
The CHARMM36m[23] force field was used for the protein and the TIP3P[144] water model for the solvent. Na+ and Cl− ions were added to neutralize the system. Prior to running metadynamics, we minimized the system with the steepest descent algorithm, followed by 0.25 ns of equilibration with a 1 fs timestep and a further 20 ns of equilibration with a 2 fs timestep in the NPT ensemble. For all simulations, we used a velocity-rescaling thermostat to maintain the temperature at 340 K with a coupling constant of 0.1 ps⁻¹. Pressure was maintained at 1 bar using the Parrinello-Rahman barostat with a coupling constant of 5 and a compressibility of 4.5 × 10⁻⁵ bar⁻¹.[226] The resulting 500 ns simulations for each CV pair were combined into a 1.5 μs trajectory and clustered with K-means into 300 structures to seed the Markov-model simulations. The seed structures were solvated and each simulated for 500 ns to build an MSM. These simulations were performed at 340 K and 1 bar using the velocity-rescaling thermostat, the Parrinello-Rahman barostat and a 2 fs integration timestep. The initial Markov model resulted in disconnected states in TICA space, so we seeded further states from the low-populated intermediates in the TICA space and simulated another 200 structures. In total, we accumulated 500 trajectories with a total simulation time of 250 μs for wild-type β2m. Snapshots were saved every 200 ps and every 1 ns snapshot was used for building the MSM.

3.2.2 Results

MSM construction and validation

In an MSM, our purpose is to model the slow dynamics of the system.[215] The variational principle of conformational dynamics provides a scalar score (VAMP-2)[218] for comparing featurizations in order to find the Markov-model hyperparameters that yield a kinetic model with the highest kinetic variance. To find the optimal hyperparameters for featurization and TICA, we used this variational scoring with cross-validation to evaluate model quality.[139] The following trajectory featurizations were considered for the optimization: 1) Cartesian coordinates of the Cα atoms, 2) pairwise Cα–Cα distances, 3) dihedral angles, 4) transformed pairwise distances f(d_ij) = exp(−d_ij) and 5) inverse pairwise distances between Cα atoms. A 50:50 train-test shuffle-split cross-validation scheme was used to evaluate the MSM hyperparameters and avoid overfitting: the model is fitted to the training data and the test set is transformed according to the fitted model. We repeated the shuffle split five times to obtain standard deviations of the out-of-sample model performance. Because the MSM and its VAMP-2 score depend strongly on the chosen lag time, we repeated the process for three different lag times (10, 20 and 50 ns).

Figure 3.1: Feature selection with the VAMP-2 score over three lag times (10, 20, 50 ns)

Figure 3.2: Optimal choice of hyperparameters for the MSM

The result of the feature optimization is shown in figure 3.1. Based on this analysis, Cα Cartesian coordinates consistently outperform the other feature types at all lag times. After selecting Cα coordinates as the featurization, we need to select the remaining MSM hyperparameters: the number of TICA components, the number of microstates for clustering and the lag time for building the MSM. The optimal hyperparameters were obtained using a cross-validated VAMP-2 score. Based on this analysis (figure 3.2), the VAMP-2 score is maximized at a lag time of 20 ns with 80 microstates and 4 TICs. Featurization and TICA were performed with the PyEMMA software.[263] After finding the optimal hyperparameters for the kinetic model, we transformed the data into the 4D TIC space to reduce the dimensionality of the feature space. TICA projects the dynamics onto a few components (TICs) while preserving the long-timescale dynamics of the system. The TICA lag time, number of components (TICs) and number of cluster centers were optimized using the 5-fold cross-validated VAMP-2 score (20 ns lag time, 4 TICs and 80 clusters).
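A sketch of the cross-validated VAMP-2 feature comparison described above is given below using PyEMMA (version 2.5 or later). File names, the number of shuffle splits and the two candidate feature sets shown are placeholders rather than the exact inputs of this study.

```python
# Sketch of shuffle-split cross-validated VAMP-2 scoring of featurizations.
import numpy as np
import pyemma

def score_cv(data, lag, dim, n_splits=5, val_fraction=0.5):
    """Cross-validated VAMP-2 score for one featurization."""
    scores = []
    n_trajs = len(data)
    n_val = max(1, int(n_trajs * val_fraction))
    for _ in range(n_splits):
        val = set(np.random.choice(n_trajs, size=n_val, replace=False))
        train = [d for i, d in enumerate(data) if i not in val]
        test = [d for i, d in enumerate(data) if i in val]
        v = pyemma.coordinates.vamp(train, lag=lag, dim=dim)   # fit on training half
        scores.append(v.score(test))                           # score on held-out half
    return np.mean(scores), np.std(scores)

top = "b2m.pdb"                                      # placeholder topology file
trajs = ["traj_%03d.xtc" % i for i in range(500)]    # placeholder trajectory list

# two of the candidate featurizations compared in the text
feat_ca = pyemma.coordinates.featurizer(top)
feat_ca.add_selection(feat_ca.select_Ca())            # Ca Cartesian coordinates

feat_dist = pyemma.coordinates.featurizer(top)
feat_dist.add_distances_ca()                          # pairwise Ca-Ca distances

for name, feat in [("Ca coords", feat_ca), ("Ca distances", feat_dist)]:
    data = pyemma.coordinates.load(trajs, features=feat)
    for lag_ns in (10, 20, 50):
        lag = lag_ns                                  # in frames, assuming 1 ns per frame
        mean, std = score_cv(data, lag=lag, dim=4, n_splits=5)
        print(f"{name:12s} lag={lag_ns:3d} ns  VAMP-2 = {mean:.2f} +/- {std:.2f}")
```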
Figure 3.3: Free energy landscape in the space of the 4 TICs

The free energy landscape (FEL) in the space of the TICs was obtained by histogram analysis over the TIC dimensions, as shown in figure 3.3 for different combinations of TIC components. Multiple low-energy basins are found in the FEL, with transition regions between the metastable states. The 4D TICA space was then clustered into 80 microstates with K-means. All trajectories were discretized onto these 80 microstates and an MSM transition matrix was built using a Bayesian Markov-modeling scheme. Thermodynamic and kinetic properties of the system can then be extracted from the eigendecomposition of the transition matrix. To choose a proper lag time for the final MSM, we plotted the implied timescales (ITS) as a function of lag time, as shown in figure 3.4A. We select the smallest lag time at which the implied timescales have converged; here they converge after about 75 ns, which is the lag time used to build the final MSM. Diagonalization of the 80-microstate transition matrix yields 4 leading timescales (eigenvalues) followed by a spectral gap, which motivated a 5-macrostate MSM for further analysis. The Chapman-Kolmogorov (CK) test was performed on the diagonal of the MSM transition matrix to check the self-consistency of the constructed MSM (figure 3.4B).

Figure 3.4: A) Implied timescales B) CK test

Figure 3.5 shows the top four eigenvectors of the transition matrix projected onto the TIC space. The first eigenvector corresponds to the stationary distribution and gives the free energy landscape (FEL). The timescale of each subsequent eigenvector corresponds to a timescale in the ITS plot at a lag time of 75 ns.

Figure 3.5: Eigenvectors of the transition matrix projected onto the TIC space. The timescales t1 to t4 correspond to the timescale of each eigenvector, where t1 is the timescale of the second eigenvector (the first eigenvector is the stationary distribution)

To identify the folded and unfolded regions of the TICA landscape, we computed the fraction of native contacts for each snapshot and colored the TICA landscape with this quantity. Figure 3.6 shows the fraction of native contacts over four combinations of TICs.[22] This shows that the first implied timescale of 8.2 ± 0.9 μs, seen in the eigenvector visualization (figure 3.5), corresponds to going from the globally folded state of the protein to an unfolded (misfolded) region. The second implied timescale of 3.0 ± 0.9 μs corresponds to transitions between two different misfolded states.

Figure 3.6: Fraction of native contacts for the different states

The 80-state transition matrix was further coarse-grained into 5 states using PCCA++ clustering over the first 4 eigenvectors of the transition matrix. PCCA++ is a fuzzy clustering algorithm that gives the probability of each microstate belonging to each of the 5 macrostates; we use the maximum assignment probability to assign the 80 microstates to the 5 macrostates. A visualization of the macrostate assignment over the TICA space is given in figure 3.7 for a few TIC dimensions. The stationary probability and free energy of each macrostate are given in Table 3.1.

Figure 3.7: Metastable-state assignment according to PCCA++ over the TICA space

Table 3.1: Stationary probability and free energy of the metastable states

State   Probability   Free energy / kT
S1      0.027         3.62
S2      0.079         2.53
S3      0.192         1.65
S4      0.231         1.47
S5      0.471         0.75
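The TICA, clustering and MSM-validation steps above can be sketched with PyEMMA as follows. The `data` object is assumed to be the Cα-coordinate features loaded as in the previous snippet, with one frame per nanosecond, so lag times are given in frames.

```python
# Sketch of TICA -> K-means -> Bayesian MSM -> ITS/CK validation with PyEMMA.
import pyemma

tica = pyemma.coordinates.tica(data, lag=20, dim=4)        # 20 ns TICA lag, 4 TICs
tica_output = tica.get_output()                            # list of per-trajectory arrays

cluster = pyemma.coordinates.cluster_kmeans(tica_output, k=80, max_iter=100)
dtrajs = cluster.dtrajs                                    # discretized trajectories

# implied timescales with Bayesian error bars, to pick the MSM lag time
its = pyemma.msm.its(dtrajs, lags=[10, 20, 50, 75, 100, 150], errors='bayes')
pyemma.plots.plot_implied_timescales(its, units='ns')

# final Bayesian MSM at the converged lag time (75 ns) and its validation
msm = pyemma.msm.bayesian_markov_model(dtrajs, lag=75)
print("slowest timescales (ns):", msm.timescales(4))
cktest = msm.cktest(5)                                     # Chapman-Kolmogorov test, 5 sets
pyemma.plots.plot_cktest(cktest)
```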
State S1 has the smallest population, only 0.027, with a free energy of 3.62 kBT, while the folded state S5 has a population of 0.471 and a free energy of 0.75 kBT. Exposure of hydrophobic residues in the misfolded states of monomeric proteins is an important factor for aggregation and amyloid formation. We therefore computed the solvent accessible surface area (SASA) of the hydrophobic residues of β2m over the TICA landscape (figure 3.8).

Figure 3.8: Hydrophobic SASA over the TICA landscape

Figure 3.9A shows the average hydrophobic SASA of each metastable state in β2m folding. State S1, the unfolded (misfolded) state, has the highest hydrophobic SASA, and S5, the folded state, has the lowest. We applied transition path theory (TPT) to analyze the flux from the unfolded S1 state to the folded S5 state; the results are shown in figure 3.9B. The most probable folding pathways run from the unfolded states S1 and S3 to the folded S5 state, while the other pathways carry smaller fluxes. To visualize the metastable-state structures, we sampled from the center of each metastable state to generate a representative structure. Figure 3.10 shows these structures along with the mean first passage times (MFPTs) between states. In the misfolded state S1, the outer strands A and D are unfolded. This unfolding of the outer strands exposes the hydrophobic core of the protein to the solvent, which makes the protein prone to aggregation. The hydrophobic residues Leu54, Phe56, Trp60 and Phe62 on the DE loop, as well as residue Phe30, are the dimerization hotspots of the protein.[95] We also computed the root mean square fluctuation (RMSF) relative to the folded state for 10,000 structures sampled from each metastable state; the results are shown in figure 3.11. State S1 has the highest fluctuation in strand A, which completely detaches from the core of the protein. Detachment of strand A is a hallmark of aggregation for the ΔN6 variant of β2m. The other metastable states show lower RMSF in strand A, so an unfolded strand A is a structural characteristic of state S1. The other misfolded states show high RMSF in strand D, which is also unfolded. This strand is in fact the first to unfold and has high RMSF in all metastable states, even in the folded state S5. This causes larger fluctuations in the DE loop, which exposes its hydrophobic residues to the solvent. We computed the β-sheet content of the protein in each metastable state (figure 3.12). Strand A in state S1 is unstructured in more than half of this structural ensemble, and the second half of strand D is also unstructured; strand G likewise has a lower sheet content than in the folded ensemble S5. State S2 has an unstructured strand D, with an even lower sheet probability than in state S1; an unstructured strand D is thus characteristic of this state. States S3 and S4 also have a partially unfolded strand D and a fully folded strand A.

Figure 3.9: A) Hydrophobic SASA of the different metastable states B) Network representing the flux from the misfolded S1 to the folded S5 state; arrows show the flux between states and the size of each state corresponds to its stationary probability

Figure 3.10: Representative structures and the timescales of the transitions between states. Strand A is shown in blue and strand D in red. The thickness of each arrow is proportional to the transition rate between states and the diameter of each circle corresponds to the population of the metastable state.

Figure 3.11: RMSF of the different metastable states from sampled snapshots.
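The coarse-graining and transition-path-theory analysis behind figures 3.9 and 3.10 can be sketched as below, assuming `msm` is the Bayesian MSM estimated in the earlier snippet. The choice of which metastable set is treated as "misfolded" and which as "folded" is a placeholder; in practice it is assigned from native contacts and RMSD.

```python
# Sketch of PCCA++ coarse-graining, state populations, MFPTs and TPT flux.
import numpy as np
import pyemma

msm.pcca(5)                                   # PCCA++ into 5 macrostates
assignments = msm.metastable_assignments      # microstate -> macrostate (crisp)
sets = msm.metastable_sets                    # list of microstate index arrays

# stationary probability and free energy (in kT) of each macrostate
for i, s in enumerate(sets):
    pi = msm.stationary_distribution[s].sum()
    print(f"S{i + 1}: population {pi:.3f}, free energy {-np.log(pi):.2f} kT")

misfolded, folded = sets[0], sets[4]          # placeholder choice of source/sink sets
flux = pyemma.msm.tpt(msm, misfolded, folded) # reactive flux from S1 to S5

# MFPTs between the two sets (in frames; multiply by the frame spacing for ns)
print("MFPT misfolded -> folded:", msm.mfpt(misfolded, folded))
print("MFPT folded -> misfolded:", msm.mfpt(folded, misfolded))

# dominant folding pathways carrying 95% of the reactive flux
paths, capacities = flux.pathways(fraction=0.95)
for p, c in zip(paths, capacities):
    print("pathway", p, "carries flux", c)
```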
3.2.3 Discussion and Conclusion

High concentrations of β2m are suggested to be the major cause of fibrillogenesis in dialysis patients undergoing haemodialysis.[18, 89] Monomers of wild-type β2m are highly stable under physiological conditions, with almost no tendency to form aggregates even at elevated concentrations in vitro.[89] Partial unfolding or misfolding of monomeric species is widely believed to be the first stage of aggregation for globular proteins.[4, 278] In this regard, a long-lived metastable intermediate in the folding of β2m has been identified with a non-native trans-Pro32 conformation.[145] Although this folding intermediate has been recognized as an important amyloidogenic precursor, it is not the only factor driving the amyloidogenic properties of β2m. Structural and dynamical studies are needed to investigate the misfolded states of β2m with highly amyloidogenic character. Marchand et al.[162] studied the aggregation properties of the D76N mutant of β2m by combining ssNMR and ensemble-modeling molecular dynamics. Their results pointed to major conformational exchanges occurring on the μs–ms timescale. The metastable states in their study are characterized by the loss of β-strand structure in the outer strands of D76N. They proposed that destabilization of the outer strands of D76N β2m increases its aggregation propensity, as it impairs the protection of the hydrophobic core of the protein and exposes it to solvent. In the excited state of D76N, the D and A strands were unstructured and the C-terminus was partially detached. This exposed the aggregation-prone core strands (B, E and F), which lose the protection of the aggregation-resistant edge strands A, D and G. Despite numerous structural and mutational studies on β2m conformational dynamics, the structural details that dictate the formation of amyloidogenic species of β2m remain elusive.

Figure 3.12: β-sheet content of the different states: A) S1, B) S2, C) S3 and D) S4. The grey area in each panel shows the β-sheet content of state S1

Markov state models are statistical models that estimate conformational changes as Markovian transitions in a discrete state space. They can overcome the timescale limitations inherent to long unbiased MD, because an MSM can be estimated from multiple short MD trajectories, allowing the sampling of different states to be conducted in parallel, which is highly efficient on modern supercomputers. To our knowledge, this is the first study applying MSMs to the folding landscape of β2m, using 250 μs of accumulated MD trajectories. Initially, we ran three metadynamics simulations, each with two collective variables: metadynamics-1 (β-sheet content and phipsi CVs), metadynamics-2 (β-sheet content and RMSD CVs) and metadynamics-3 (RMSD and phipsi CVs). The data from all metadynamics simulations were combined and clustered to generate 300 seeds for conventional MD simulations of 500 ns each. The initial MSM built from these trajectories resulted in a free energy landscape with disconnected states, so we adaptively sampled further from the low-energy states and ran 200 additional simulations. The total simulation time of 250 μs, although perhaps too short to describe the full folding landscape of β2m, is probably sufficient to characterize the near-folded and misfolded states and the transitions between them.
Constructing a Markov model of the folding and misfolding trajectories involves selecting multiple hyperparameters, such as the number of TICs and the number of discrete states. The variational approach to conformational dynamics (VAC) allows different models to be compared and the optimal featurization and hyperparameters for the MSM to be chosen. We therefore computed a 5-fold cross-validated VAMP-2 score for different feature types as well as for the number of TICs and the number of clusters. This led to choosing Cα coordinates as features, 4 TICs and 80 clusters for the MSM construction. Projection of the data onto the 4D TICA space (figure 3.3) shows multiple low-energy states with transition regions between them. To choose a proper lag time for the final MSM and check its Markovian properties, we conducted implied-timescale and CK tests. At the chosen lag time of 75 ns the implied timescales converge, and the CK test shows that the model can predict multiple lag times into the future, confirming the Markovian property of the model. The eigenvectors of the MSM transition matrix projected onto the TICA space show that the transition from the unfolded to the folded state is the slowest process, with a timescale of 8.2 ± 0.9 μs, and that the second slowest process is the transition between different misfolded states (figure 3.5). The implied timescales show a gap between the 4th and 5th timescales, which led us to construct a 5-state coarse-grained MSM. The 80-state MSM was coarse-grained into 5 metastable states using the PCCA++ algorithm. The metastable states projected onto the TICA space are shown in figure 3.7, with the probability and free energy of each state given in table 3.1. We investigated the aggregation propensity of the different states by computing the hydrophobic SASA of samples in each state. The computed hydrophobic SASA projected onto the TICA space is shown in figure 3.8, and the average SASA of each state in figure 3.9A. The unfolded state S1 has the highest hydrophobic SASA and the folded state S5 the lowest. The transition from the unfolded S1 to the folded S5 occurs either directly or through the intermediate S3 state, as shown by the TPT analysis in figure 3.9B. We investigated the structural properties of the intermediate states by sampling a conformation from the center of each metastable state. Figure 3.10 shows the structures of the metastable states as well as the MFPTs of the transitions between them. State S1 has an unfolded D strand and an unstructured, detached A strand, reminiscent of the ΔN6 intermediate state. The detachment of strand A in S1 is the slowest process in the misfolding landscape of β2m. The RMSF of the different states, shown in figure 3.11, confirms that a high RMSF of strand A is characteristic only of state S1. The secondary-structure analysis of the different states, given as the percentage of β-sheet content (figure 3.12), shows that state S1 has an unfolded strand A and a partially folded strand D, while state S2 has an unfolded strand D. The other states have a low β-sheet content in strand D, which indicates that strand D is highly flexible. Representative structures of states S1, S2 and S3 are shown in figure 3.13. The unfolding of strand A is reminiscent of the ΔN6 variant of β2m.[95, 93] Strand A plays a major role in aggregation by acting as a hook in dimer assembly.[88] Unfolding or detachment of strand A from the core exposes the hydrophobic residues Pro5, Leu7 and Val9 on this strand to the solvent (figure 3.13).
It has been proposed that the high aggregation potential of ΔN6 is due to its ability to populate one or more aggregation-prone intermediate states.[254] Estacio et al.[95] performed a computational study of the dimerization of the intermediate state using Monte-Carlo ensemble docking (MC-ED) and constructed contact maps of the dimer interface.

Figure 3.13: Representative structures of states S1, S2 and S3

For wild-type β2m, dimerization was driven mainly by the DE loop, and other studies have also shown the importance of the DE loop for dimerization and aggregation of β2m.[257] Hotspot residues Phe56, Trp60, Phe62, Tyr63 and Leu65, on or near the DE loop, were identified, which also assist docking of hβ2m to the MHC-I heavy chain.[235] The aromatic residues Phe56, Trp60, Phe62 and Tyr63 all lie in an aggregation-prone sequence and make contact with the MHC-I heavy chain.[259] Residues Phe62, Tyr63 and Leu65 were shown to play major roles in fibril nucleation,[250] and Trp60 in that study had the largest number of intermolecular contacts between β2m monomers. Phe30 also belongs to the same hydrophobic cluster near the DE loop and is important for aggregation. Structural mapping of the dimerization interface using computational techniques suggested residues Tyr10, His13, Phe30 and His84 as hotspots for ΔN6 amyloidosis.[95] Our Markov state model also shows the involvement of hydrophobic residues, especially Phe56, Trp60, Phe62 and Tyr63, in the misfolded states of the β2m folding landscape, with μs transition times for unfolding and folding. Since the simulation temperature has a direct impact on the transition rates, the folding and unfolding MFPTs, which are on the order of tens of μs, are expected to be underestimated. Nevertheless, this study gives important insights into the misfolding pathway and the metastable states of misfolding β2m.

3.3 Variational embedding of protein folding simulations using Gaussian mixture variational autoencoders

Conformational sampling of biomolecules using molecular dynamics simulations often produces a large amount of high-dimensional data that is difficult to interpret using conventional analysis techniques. Dimensionality reduction methods are thus required to extract useful and relevant information. Here, we devise a machine learning method, the Gaussian mixture variational autoencoder (GMVAE), that can simultaneously perform dimensionality reduction and clustering of biomolecular conformations in an unsupervised way. We show that the GMVAE can learn a reduced representation of the free energy landscape of protein folding, with highly separated clusters that correspond to the metastable states visited during folding. Since the GMVAE uses a mixture of Gaussians as its prior, it directly acknowledges the multi-basin nature of the protein-folding free energy landscape. To make the model end-to-end differentiable, we use a Gumbel-softmax distribution. We test the model on three long-timescale protein-folding trajectories and show that the GMVAE embedding resembles the folding funnel, with the folded states at the bottom of the funnel and the unfolded states outside the funnel path. Additionally, we show that the latent space of the GMVAE can be used for kinetic analysis, and Markov state models built on this embedding produce folding and unfolding timescales in close agreement with other rigorous dynamical embeddings such as time-lagged independent component analysis (TICA).¹

¹Taken from a published paper: Ghorbani, M., Prasad, S., Klauda, J. B., and Brooks, B. R. (2021).
Variational embedding of protein folding simulations using Gaussian mixture variational autoencoders. The Journal of Chemical Physics, 155(19), 194108.

3.3.1 Introduction

In recent years, computer simulations of biomolecular systems have gained enormous attention due to advances in theoretical methods, algorithms and computer hardware, enabling efficient exploration of atomic-scale processes using molecular dynamics (MD) simulations.[134] In an MD simulation, one integrates Newton's equations of motion, where the forces between atoms are described by a parameterized force field. Exploration of the high-dimensional configuration space typically requires long-timescale simulations or enhanced sampling techniques.[339, 19] These simulations usually generate a large amount of high-dimensional data, making the analysis of important features of protein folding, such as the free energy landscape (FEL) and the identification of metastable states, a challenging task.[105] Therefore, dimensionality reduction techniques are often used to describe processes such as folding and conformational transitions of proteins.[169] The ideal FEL should consist of heavily clustered data points, where each cluster is positioned in a local free energy minimum and corresponds to a long-lived metastable state separated from the others by kinetic bottlenecks (i.e. free energy barriers).[124] This ideal FEL is the cornerstone of many kinetic models that describe the dynamics of the system, for example Markov state models (MSMs).[61, 60, 59] Traditional methods for capturing the FEL rely on identifying relevant collective variables (CVs) that are well suited to describe the physical process or to distinguish different states. However, finding the right collective variables for a system of interest requires physical/chemical intuition about the process.[85, 222] This makes it necessary to define a low-dimensional representation of the system that can capture the essential degrees of freedom or the important CVs of the system of interest. There are various methods for dimensionality reduction and for finding optimal representations of complex FELs, such as PCA,[1] TICA,[269, 228] Isomap,[6] sketch-map[42] and diffusion maps.[212, 213] PCA-based methods assume an underlying linear manifold, which is generally not correct. Some nonlinear manifold methods like Isomap assume the data to be isomorphic to a hyperplane, which leads to topological instabilities. Moreover, these methods involve computing distances (geodesic or other kernel-based) between all pairs of points, which makes them unscalable to larger MD simulation trajectories. In diffusion maps, one needs to calculate Gaussian kernels, which can be computationally expensive and does not scale to large MD datasets. Machine learning (ML) has recently emerged as a powerful alternative for learning informative representations, and variational autoencoders (VAEs) in particular have shown great potential for unsupervised representation learning.[151] An autoencoder has two parts: an encoder and a decoder. The encoder network reduces the input data to a low-dimensional latent space and the decoder maps the latent representation back to the original data. In the VAE framework, a regularization is added to the model by forcing the latent space to be similar to a pre-defined probability distribution (e.g. a Gaussian), which is called the prior.
VAEs have recently been used for CV discovery in MD simulations,[47, 266, 50] enhanced sampling[244, 26] and dimensionality reduction.[24, 302] In a simple VAE, the prior is a standard distribution, which can lead to over-regularization of the posterior distribution and result in posterior collapse.[112] This makes the output of the decoder almost independent of the latent embedding and can result in poor reconstruction and highly overlapping clusters in the latent space.[24] On the other hand, a Gaussian prior is limited because the learnt representation can only be unimodal and cannot capture the multimodal nature of data such as protein folding simulations, where multiple metastable states exist during the folding process.[83] In this work, we employ a Gaussian mixture variational autoencoder (GMVAE) that directly acknowledges the multimodal nature of protein folding simulations and can construct the ideal multi-basin FEL. This is achieved by modeling the latent space as a mixture of Gaussians, using a categorical variable that identifies which mode each data point comes from. The GMVAE model therefore simultaneously performs dimensionality reduction and clustering.[84] The features in our model are the normalized distance maps between the Cα atoms of the protein. We test our model on three long-timescale protein folding simulations from the D. E. Shaw group:[180] Trp-cage (208 μs), BBA (325 μs) and Villin (125 μs). We show that the model can learn the funnel-shaped landscape of protein folding and cluster the conformational space with high accuracy into states that correspond to different structural features of the protein. Furthermore, we show that even though the GMVAE embedding does not use any dynamical information, it is able to describe the kinetics of protein folding, and the folding and unfolding timescales obtained by building a Markov model on this embedding are in close agreement with other works that use a rigorous dynamical model to describe the kinetics.

3.3.2 Methods

Variational inference methods convert an intractable inference problem into an optimization problem. While classical variational methods are limited to conjugate priors and likelihoods, VAEs allow the use of arbitrary function approximators (i.e. neural networks) for the conditional posterior.[151] VAEs can be approached from two perspectives: variational inference and neural networks. In variational inference, the main idea is to learn a distribution in the latent space that truly captures the distribution of the dataset. In particular, given a dataset x, the goal of variational inference is to infer the latent representation z, i.e. to accurately model p(z|x). Bayes' theorem gives the relation between the posterior p(z|x), the prior p(z) and the likelihood p(x|z) as:

p(z|x) = \frac{p(x|z)\, p(z)}{p(x)}    (3.1)

The denominator p(x) is called the evidence; it requires marginalization over all latent variables and is therefore intractable. In variational inference one thus seeks an approximate posterior q_\phi(z|x) with learnable parameters \phi and minimizes the Kullback-Leibler (KL) divergence between the approximate and true posteriors. The KL divergence quantifies the difference between two probability distributions and is defined as:

D_{KL}\left(q_\phi(z|x) \,\|\, p(z|x)\right) = \mathbb{E}_{q}\left[\log \frac{q_\phi(z|x)}{p(z|x)}\right]    (3.2)

Rewriting this equation using Bayes' rule gives:

\log p(x) = D_{KL}\left(q_\phi(z|x) \,\|\, p(z|x)\right) - \mathbb{E}_{q}\left[\log \frac{q_\phi(z|x)}{p(x,z)}\right]    (3.3)
Due to Jensen's inequality, the KL divergence is non-negative, which makes the last term in this equation, called the evidence lower bound (ELBO), a lower bound on the log-likelihood of the evidence:

\mathrm{ELBO} = \mathbb{E}_{q}\left[\log \frac{p(x,z)}{q_\phi(z|x)}\right]    (3.4)

Therefore, equation 3.3 can be written as:

\log p(x) = D_{KL}\left(q_\phi(z|x) \,\|\, p(z|x)\right) + \mathrm{ELBO}    (3.5)

This implies that minimizing the KL divergence, or maximizing the log-likelihood of the evidence, can be achieved by maximizing the ELBO. The graphical model of the GMVAE is shown in figure 3.14A. In the generative part (decoder) of the network, a sample z is drawn from the latent distribution p_\beta(z|y) of cluster y, parameterized by \beta through the decoder part of the neural network. This sample is then used to generate the conditional distribution p_\theta(x|z), parameterized by another neural network \theta. The generative process for the GMVAE can be written as:

p_{\theta,\beta}(x,z,y) = p_\theta(x|z)\, p_\beta(z|y)\, p(y)    (3.6)
p_\beta(z|y) = N\left(z \mid \mu_\beta(y), \sigma^2_\beta(y)\right)    (3.7)
p_\theta(x|z) = N\left(x \mid \mu_\theta(z), \sigma^2_\theta(z)\right)    (3.8)
p(y) = \mathrm{Cat}(\pi)    (3.9)

In these equations, \pi = 1/K is the uniform categorical distribution, where K is the number of clusters, and Cat(\pi) is the categorical distribution of the discrete variable y. N(\cdot) denotes a normal distribution, where \mu_\beta, \mu_\theta, \sigma^2_\beta and \sigma^2_\theta are the means and variances learned by the neural networks parameterized by \beta and \theta. Variational inference for the GMVAE is performed by maximizing the ELBO, which can be written as:

\mathrm{ELBO} = \mathbb{E}_{q}\left[\log \frac{p_{\theta,\beta}(x,z,y)}{q_{\phi,\psi}(z,y|x)}\right]    (3.10)

The approximate posterior of the inference model, q_{\phi,\psi}(z,y|x), can be factorized into two distributions:

q_{\phi,\psi}(z,y|x) = q_\phi(y|x)\, q_\psi(z|x,y)    (3.11)

q_\phi(y|x) gives the cluster assignment probabilities, so \sum_{k=1}^{K} q_\phi(y=k|x) = 1. q_\psi(z|x,y) is a Gaussian mixture whose parameters (\mu_\psi, \sigma^2_\psi) are learned by the encoder part of the neural network. In this model, the categorical variable y is a discrete node, which cannot be backpropagated through and is therefore substituted with a Gumbel-softmax distribution, which approximates the categorical distribution with a continuous one:

y_i = \frac{\exp\left((\log \pi_i + g_i)/\tau\right)}{\sum_{j=1}^{K} \exp\left((\log \pi_j + g_j)/\tau\right)} \quad \text{for } i = 1, \ldots, K    (3.12)

Here the temperature parameter \tau controls the smoothness of the distribution: at small temperatures samples are close to one-hot encoded, while at large temperatures the distribution is smoother. The g_i are samples drawn from a Gumbel(0,1) distribution. Using the generative and inference models, the ELBO can be written as:

\mathrm{ELBO} = \mathbb{E}_{q}\left[\log \frac{p_\theta(x|z)\, p_\beta(z|y)\, p(y)}{q_\phi(y|x)\, q_\psi(z|x,y)}\right]    (3.13)

\mathrm{ELBO} = \mathbb{E}_{q}\left[\log p(y) - \log q_\phi(y|x) + \log \frac{p_\beta(z|y)}{q_\psi(z|x,y)} + \log p_\theta(x|z)\right]    (3.14)

The second term in the loss is the cross-entropy term and the last term is the mean squared error between the true and the reconstructed data.
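The Gumbel-softmax relaxation of Eq. 3.12 can be illustrated with the small numpy sketch below, which shows how the temperature τ controls how close a sample is to a one-hot vector. This is a didactic example in plain numpy, not the TensorFlow implementation used in the model; the toy cluster probabilities are arbitrary.

```python
# Minimal illustration of Gumbel-softmax sampling (Eq. 3.12).
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_sample(logits, tau):
    """Draw one relaxed categorical sample y from class log-probabilities."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))   # Gumbel(0,1) noise
    z = (logits + g) / tau
    z -= z.max()                                            # numerical stability
    e = np.exp(z)
    return e / e.sum()

pi = np.array([0.2, 0.5, 0.3])          # toy cluster probabilities (K = 3)
logits = np.log(pi)

for tau in (5.0, 1.0, 0.1):
    y = gumbel_softmax_sample(logits, tau)
    print(f"tau={tau:3.1f} -> y = {np.round(y, 3)}")
# high tau: y is nearly uniform and smooth; low tau: y approaches a one-hot vector,
# while remaining differentiable with respect to the logits.
```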
3.3.3 Model parameters

The model architecture is shown in figure 3.14B. The GMVAE model was implemented in TensorFlow. Convolutional layers were applied along with pooling for their ability to recognize features in images.

Figure 3.14: A) Graphical model for the inference and generative parts of the GMVAE; grey circles represent the observed data. B) Schematic of the GMVAE architecture. In this architecture, q(y|x) refers to the cluster assignment probabilities, q(z|x,y) is the approximate posterior, μ and σ are the mean and variance of each Gaussian in the approximate posterior of the encoder network, p(z|y) is the prior Gaussian, and μp and σp are the means and variances of the prior Gaussians in the decoder network

Figure 3.15: Native folded structures of the studied proteins: A) Trp-cage, B) BBA, C) Villin headpiece

An exponential linear unit (ELU) activation function was used in each layer, and a softmax activation was used for the cluster assignment probability. The means and variances of the distributions were obtained using no activation and a softplus activation, respectively. Adam was used as the optimizer in all models.[150] We optimized the hyperparameters of the model based on the reconstruction loss; the chosen hyperparameters for each protein are shown in Table 3.2.

Table 3.2: Chosen hyperparameters for each protein

System    Layers  Neurons  Latent dim  Clusters  Batch size  Temperature  Kernel size  Learning rate  Filters      Pooling
Trp-cage  2       64       5           8         5000        0.1          [3,3]        0.001          [64,64]      [1,1]
BBA       2       64       6           9         5000        0.1          [3,3]        0.001          [64,64]      [2,2]
Villin    3       64       5           6         2500        0.05         [3,3,3]      0.001          [64,64,32]   [2,2,1]

During training, we split the data into a training set (fraction 0.8) and a validation set (0.2). The latent space dimension was chosen by grid search to minimize the reconstruction loss on the validation set for each protein. The number of clusters is another hyperparameter that must be specified before training. Varolgüneş et al.[302] used a thresholding scheme to keep only the clusters whose class probabilities exceed a pre-defined cutoff, and we adopted a similar procedure here. To select this hyperparameter, we first start with an arbitrary number of clusters (e.g. 10) and compute the membership probability of each input point. We then use a cutoff value (0.95) to count the number of clusters with membership probabilities above the cutoff, and retrain the model with the recovered number of clusters. We found that this number is highly robust to the other hyperparameters of the model, and that after the first round of training the number of recovered clusters does not change when the same probability cutoff is applied. Each model was trained for 100 epochs. The temperature parameter in the Gumbel-softmax controls the smoothness of the distribution. We also tried annealing the temperature, starting from a high value (5) and lowering it to 0.1 during the first 40 epochs of training and then keeping it fixed for the rest of training; however, the model diverged after a few epochs, and a fixed, small temperature gave the best results. Since the GMVAE gives a probabilistic cluster assignment, i.e. the probability of each data point belonging to each cluster (fuzzy clustering), we used a k-nearest-neighbors method to compute a hard cluster assignment from the neighborhood of each point in the embedding. For the kinetic analysis, we used the PyEMMA package[264] to build the transition matrix. In each case, the embedding was discretized using 500 K-means cluster points and the transition probability matrix was built by counting the transitions between states at lag time τ. The implied timescales are computed from the eigenvalues of the transition probability matrix:

t_i(\tau) = -\frac{\tau}{\ln |\lambda_i(\tau)|}    (3.15)

To test the Markovianity of the transition matrix, the implied timescales are plotted against the lag time and the smallest τ is chosen such that the implied timescales have converged. A coarse-grained transition matrix is later built by assigning the K-means points to the closest GMVAE clusters, yielding a coarse-grained view of the dynamics. The folding and unfolding timescales are obtained from this coarse-grained matrix.
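The kinetic analysis on the GMVAE embedding described in this paragraph can be sketched as follows. Here `latent` is assumed to be a list of per-trajectory arrays of latent coordinates and `soft_labels` the corresponding GMVAE cluster probabilities; both names, the choice of which GMVAE cluster is "folded", and the assumption that all 500 microstates end up in the MSM active set are placeholders.

```python
# Sketch of kNN hard assignment, MSM construction on the embedding and MFPTs.
import numpy as np
import pyemma
from sklearn.neighbors import KNeighborsClassifier

# hard cluster assignment from the fuzzy GMVAE output via k-nearest neighbors
X = np.concatenate(latent)
argmax_labels = np.concatenate([p.argmax(axis=1) for p in soft_labels])
knn = KNeighborsClassifier(n_neighbors=500).fit(X, argmax_labels)
hard_labels = knn.predict(X)

# discretize the embedding with 500 K-means centers and build an MSM
kmeans = pyemma.coordinates.cluster_kmeans(latent, k=500, max_iter=100)
dtrajs = kmeans.dtrajs
its = pyemma.msm.its(dtrajs, lags=[50, 100, 160, 220, 300])   # pick a converged lag
msm = pyemma.msm.estimate_markov_model(dtrajs, lag=160)       # lag in frames

# map each K-means microstate to its nearest GMVAE cluster and compute MFPTs
micro_to_gmvae = knn.predict(kmeans.clustercenters)
folded = np.where(micro_to_gmvae == 5)[0]       # placeholder: GMVAE cluster 5 = folded
unfolded = np.where(micro_to_gmvae != 5)[0]
print("folding MFPT  :", msm.mfpt(unfolded, folded))
print("unfolding MFPT:", msm.mfpt(folded, unfolded))
```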
3.3.4 Results

Here we test the performance of the GMVAE model for dimensionality reduction and clustering of three protein folding systems: Trp-cage (PDB: 2JOF),[11] BBA (PDB: 1FME)[260] and Villin (PDB: 2F4K).[157] The native folded structures of these proteins are shown in figure 3.15. We show that the GMVAE embedding captures the free energy landscape of these proteins with well-separated clusters, and we analyze the structural properties of each cluster to show that each cluster corresponds to a different structural feature of the protein. The total loss, cross-entropy loss and reconstruction loss all decrease for both the training and validation sets for all three proteins; the loss for Trp-cage is shown as an example in figure 3.16.

Figure 3.16: Training and validation loss for the Trp-cage example

For visualizing the latent space of the GMVAE, we used a low-dimensional latent space (2 or 3) and show that this embedding mimics the funnel-shaped landscape of protein folding, where the folded state resides at the bottom of the funnel and the unfolded states lie outside it. For the rest of the analysis of each protein, we used an optimized latent-space dimension based on a cross-validated reconstruction loss. Figure 3.17 shows the cross-validated reconstruction loss as a function of latent space dimension for each protein. Higher-dimensional embeddings give a better reconstruction loss for all proteins, meaning that a high-dimensional latent space is needed in our GMVAE model to capture the complex protein-folding landscape.

Figure 3.17: Reconstruction loss vs latent space dimension for A) Trp-cage, B) BBA, C) Villin headpiece

To test whether the GMVAE clusters give meaningful structural information, we sampled 5000 data points from the center of each cluster and compared the RMSD distributions of the whole protein and of specific domains, relative to the folded state, for each cluster. Moreover, we show that building a Markov model on the GMVAE embedding produces folding and unfolding timescales in close agreement with those obtained from a Markov model constructed on a dynamical embedding such as TICA.

3.3.5 Trp-cage

As the first example, we test our GMVAE model on an ultra-long 208 μs explicit-solvent simulation of the K8A mutant of the 20-residue Trp-cage TC10b at 290 K from D. E. Shaw Research.[180] Numerous experimental and computational studies have been performed on Trp-cage.[201, 92, 274]

Figure 3.18: Results of the GMVAE for Trp-cage. A) 3D embedding (zdim=3) colored by RMSD with respect to the folded state. B) First two dimensions of the latent space (zdim=3) colored by RMSD. C) Free energy landscape of the first two dimensions of the embedding (zdim=3). D) t-SNE visualization of the 5D latent space colored by the argmax of the cluster assignment probabilities (only points with more than 0.75 membership probability are shown). E) RMSD distribution of Trp-cage in the different clusters. F) Implied timescale (ITS) plot for the MSM construction

The folded state of Trp-cage, shown in figure 3.18A, contains an α-helix (residues 2-8), a 3₁₀-helix and a polyproline II helix, and the tryptophan residue is caged at the center of the protein.
Two different folding mechanisms have been identified for Trp-cage to date:[78] in the first, Trp-cage goes through a hydrophobic collapse into a molten globule, followed by formation of the N-terminal helix and the native core (nucleation-condensation); in the second, the helix pre-forms from the extended conformation, followed by the joint formation of the 3₁₀-helix and the hydrophobic core (diffusion-collision). The second mechanism has been identified as the dominant folding pathway for Trp-cage. Here we investigated the Trp-cage folding trajectories using the GMVAE model for embedding and clustering. The features are the normalized distances between the Cα atoms of Trp-cage in the trajectories. The hyperparameter K, which sets the number of clusters, is unknown a priori. To choose a reasonable number for each protein, we started from a higher estimate of the number of clusters (e.g. 10) and trained the model. We then used a cutoff (0.95) to find the number of clusters with membership probability above the cutoff value; only 8 out of 10 clusters exceeded 0.95. Next, we trained the model again with 8 clusters, and at this stage all clusters had membership probabilities above our original cutoff. Moreover, we found the same number of clusters regardless of the other hyperparameters of the model, such as the number of layers. Although 2D or 3D latent spaces are used for visualization, higher-dimensional latent embeddings are needed to describe the folding energy landscape more accurately. To choose an optimal latent-space dimension, we computed a cross-validated reconstruction loss for latent dimensions from 2 to 10; the results for Trp-cage are shown in figure 3.17A, and we chose a 5-dimensional latent space for clustering this protein. Other hyperparameters such as batch size, learning rate, number of layers, Gumbel-softmax temperature, kernel size, number of filters and pooling sizes were optimized by grid search based on the reconstruction loss; the chosen hyperparameters for each protein are listed in Table 3.2. The total, reconstruction and cross-entropy losses using these hyperparameters are shown in figure 3.16. The reconstruction and cross-entropy losses for both training and validation data decrease steadily, demonstrating the convergence of the model after 100 epochs of training. Figure 3.18A shows the 3-dimensional embedding (zdim=3) of the Trp-cage trajectories colored by RMSD with respect to the crystal structure. The gradual change of color from high RMSD (red) to low RMSD (blue) across the landscape demonstrates that the low-dimensional embedding captures the protein folding process. Figure 3.18B shows the first two dimensions of the latent embedding colored by RMSD; the high-RMSD and low-RMSD regions are well separated on this landscape. The folded state has a narrow distribution and forms the narrow wedge of the folding funnel. We computed the free energy landscape on the first two dimensions of the latent space (figure 3.18C); it shows multiple wells separated by diffuse regions in between.

Figure 3.19: Trp-cage folding transitions; the thickness of the lines corresponds to the transition probability between the two states. Transitions with probabilities less than 0.05 are not shown for clarity
The wells correspond to the centers of the GMVAE clusters, and the diffuse regions are the transition regions between different conformational states. The hard cluster assignment in the 3D latent space is shown in figure 3.20A. Next, based on figure 3.17A, we used a 5-dimensional latent space for clustering Trp-cage. To visualize the 5D latent space, we take only data points with membership assignment probabilities above 0.75 and use t-distributed stochastic neighbor embedding (t-SNE)[300] to project the 5-dimensional embedding into two dimensions. The t-SNE results for Trp-cage are shown in figure 3.18D; the clusters are highly separated on this landscape. To ensure that the GMVAE clusters correspond to different structures sampled during folding, we sampled 5000 points from the center of each cluster and computed the RMSD distribution of the protein with respect to the folded state (figure 3.18E). The folded state (cluster 5) has a narrow distribution, while the other unfolded and misfolded states have wider distributions at higher RMSD values. Representative structures of each cluster are shown in figure 3.19. We also computed the RMSD distribution of residues 11-15, comprising the 3₁₀-helix, for the different states; the results are shown in figure 3.20B.

Figure 3.20: A) Clusters of Trp-cage with 8 clusters and 3D embedding (zdim=3) B) RMSD of residues 11-15, comprising the 3₁₀-helix, for selected clusters of Trp-cage

Next, we built an MSM on the 5D embedding by choosing 300 K-means points and discretizing the trajectories according to this clustering of the GMVAE embedding. The implied timescales for this transition matrix are shown in figure 3.18F; based on this, we chose a lag time of 160 ns to build the MSM. To compute the mean first passage times (MFPTs) between the GMVAE clusters, we coarse-grained the 300-state transition matrix into 8 states corresponding to the GMVAE clusters. The folding and unfolding times based on the coarse-grained Markov model are 11.62 and 4.85 μs, respectively. These are in good agreement with the values reported by Lindorff-Larsen et al.,[180] who obtained folding and unfolding times of 14.4 and 3.1 μs for this protein from the average lifetimes of the folded and unfolded states observed in the trajectories, using a contact-based definition of the folded and unfolded states. A visualization of the 8 metastable states found by the GMVAE model is shown in figure 3.19. The arrows between states show the transitions between conformations, and the arrow thickness relates to the transition probability obtained by coarse-graining the Markov model into the 8 GMVAE clusters. The native folded state S5 accounts for about 18% of the total distribution and the unfolded ensemble represents the remaining 82%. Folding mostly proceeds via the molten globule state S0 or the near-folded state S4.

Figure 3.21: A) 2D embedding of BBA colored by RMSD to the folded state B) 2D free energy landscape of BBA based on the 2D embedding C) Clusters in the 2D embedding of BBA using kNN for cluster assignment D) t-SNE visualization of the 6D latent space colored by the argmax of the cluster assignment probabilities (only points with more than 0.75 membership probability are shown) E) Histograms of RMSD for the different clusters F) ITS plot based on the 6D latent space

3.3.6 BBA

The second example is the ββα-fold protein (BBA), a 28-residue fast-folding protein. The NMR structure of this protein is shown in figure 3.15B.
This protein contains an antiparallel β-sheet at its N terminus and a helical conformation at its C terminus. To find the optimal number of clusters, we first trained the model with 10 clusters and only 9 clusters were recovered using the 0.95 cutoff. Next, we trained the model with 9 clusters and found that all clusters had probabilities above our cutoff. We also observed that training the model with different hyperparameters yields the same number of clusters. To better visualize the latent space, we trained the model with 2 dimensions. The resulting latent space, colored by RMSD with respect to the folded state, is shown in figure 3.21A; unfolded and folded states are well separated on this 2D embedding.

Figure 3.22: BBA transitions; the arrows show the transitions between clusters and the arrow thickness represents the transition probability between the corresponding clusters. Transitions with probabilities less than 0.1 are not shown for clarity.

The free energy landscape on this embedding is shown in figure 3.21B. All clusters reside in the wells of the free energy landscape. There are also some diffuse, high-energy regions between the wells, which correspond to transitions between different metastable states; these regions are also where the model is least certain about the cluster assignment. To transform the fuzzy clustered output of the GMVAE into a hard cluster assignment, we used a k-nearest-neighbors algorithm and assigned each point to the most likely cluster in its neighborhood using 500 neighbors. The result is shown in figure 3.21C, which exhibits highly separated, non-overlapping clusters in the 2D embedding. In this embedding, state 8 corresponds to the folded state, state 6 is the near-folded (misfolded) state, and all other states are unstructured or unfolded conformations. The highly non-overlapping clusters in the GMVAE landscape showcase the ability of this model to separate a vastly diverse set of protein conformations from a protein folding trajectory. The 2D latent embedding cannot fully capture the complex folding landscape; therefore, we optimized the latent space dimension based on the cross-validated reconstruction loss in figure 3.17B. Based on this result, we used a 6-dimensional latent space for the rest of our analysis. The t-SNE visualization of this 6-dimensional landscape is shown in figure 3.21D. We studied the structural properties of each cluster by sampling 5000 data points from its center. Figure 3.21E shows the RMSD distribution of each cluster with respect to the folded state. Cluster 0 is the folded state, with the sharpest and lowest RMSD distribution; the other clusters have wider and higher RMSD distributions and correspond to misfolded or unfolded states. Representative structures for each cluster are shown in figure 3.22. We also investigated the structural features of each cluster in more detail by calculating the RMSD distribution of specific domains in BBA. Figure 3.23 shows the RMSD distribution of the antiparallel β-sheet (residues 7 to 14, left panel) and of the α-helical part of BBA (residues 16 to 26, right panel) with respect to the folded structure. The folded state (cluster 0) has the lowest RMSD in both domains, while cluster 4 has a low RMSD in the antiparallel β-sheet domain but a higher RMSD in the α-helical domain.

Figure 3.23: A) RMSD of the antiparallel β-sheet residues (7-14) with respect to the folded state of BBA B) RMSD of the α-helical domain of BBA (residues 16 to 26) with respect to the folded state, for the different clusters
To build a Markov model on this embedding, we first clustered the embedding using 500 K-means centers and discretized the trajectories onto these points. To choose the proper lag time for the MSM, we plotted the implied timescales (figure 3.21F), picked 220 ns and built the transition probability matrix. Next, to compute the transition timescales between the GMVAE clusters, we assigned each of the 500 K-means clusters to the closest GMVAE cluster and then computed the mean first passage times (MFPTs) between clusters. The folding and unfolding timescales calculated here are 15.2 and 7.42 μs, respectively, which are in close agreement with the values reported by the D. E. Shaw group.[180] Figure 3.22 illustrates the representative structures of each cluster, sampled from the mean of each distribution in the latent space. The transitions between states are shown with arrows, where the width of each arrow represents the transition probability.

3.3.7 Villin

The last example is the 35-residue villin headpiece subdomain, one of the smallest proteins that can fold autonomously. It is composed of three α-helices, denoted helix 1 (residues 4-8), helix 2 (residues 15-18) and helix 3 (residues 23-32), and a compact hydrophobic core. The observed experimental folding timescale for wild-type villin is about 4 μs, and the replacement of two lysine residues (Lys65 and Lys70) with uncharged norleucine (Nle) yields a mutant with a folding time of less than one microsecond.[158] The folding landscape of the villin double mutant has been studied both by experiments and by computer simulations.[280, 167, 62, 16] The folding of a double mutant of villin was studied using long-timescale molecular dynamics by the D. E. Shaw group and is used here.[180] The number of clusters for villin was found as described for the other proteins: we started with 7 clusters and found that only 6 clusters were recovered using the 0.95 cutoff on cluster probability. The latent embedding using a 3D latent space is shown in figure 3.24A, where each point is colored by RMSD with respect to the folded structure. The first two dimensions of this 3D embedding, colored by RMSD, are shown in figure 3.24B.

Figure 3.24: GMVAE embedding results for villin. A) 3D latent space (zdim=3) colored by RMSD B) First two dimensions of the 3D latent space colored by RMSD C) FEL based on the first 2 dimensions of the latent space D) t-SNE plot for the 5D latent space (only points with more than 0.75 membership probability are shown) E) Distribution of RMSD for villin with respect to the folded state F) ITS plot for the Markov model construction based on the 5D embedding

Figure 3.25: Transitions between the different states in the villin headpiece simulation. The thickness of the arrows corresponds to the transition probability between the two states. Transitions with less than 0.1 probability are not shown for clarity.
The optimum latent space dimension for villin was found to be 5 (figure 3.17C). The other hyperparameters for villin were optimized based on a cross-validated reconstruction loss, and the chosen hyperparameters are shown in table 1. The t-SNE visualization of this 5D latent space is shown in figure 3.24D, which shows highly separated clusters. Figure 3.24E shows the RMSD distribution of each cluster in the 5D latent space with respect to the folded structure. Cluster 3 corresponds to the folded state, where the RMSD distribution is the narrowest and smallest. Figure 3.25 shows the representative structure of each cluster in the 5D latent space.

Structural properties of specific domains in different clusters were studied using the RMSD distributions of helices 1, 2 and 3 with respect to the folded structure. The results are shown in figure 3.26. Each cluster has a different, approximately Gaussian distribution for the helical residues of the protein. Cluster S0 has a low RMSD for helices 1 and 2 but higher RMSD values for helix 3. Secondary structure calculations showed that S0 has a folded helix 1 and helix 2 but an unfolded helix 3. Most clusters have a folded or near-folded helix 1, except for cluster S4. Cluster S3 is the folded state, where all helices are folded with more than 80% probability. Helix 3 is only folded in S3 and S5, which shows the importance of this helix in the proper folding of villin.

Next, we built a Markov model on this embedding by choosing 500 K-means cluster centers for discretizing the trajectories. The implied timescales for this discretization are shown in figure 3.24F. A lag time of 220 ns was chosen to build the transition matrix. The 500 K-means clusters were then assigned to their nearest GMVAE clusters to build a coarse-grained transition matrix. The folding and unfolding times obtained from the MSM constructed on this embedding are 2.25 µs and 1.54 µs, respectively, which are in good agreement with the values reported by the D. E. Shaw group (2.8 µs) and by others building a Markov model using TICA.[180, 286, 223]

Figure 3.26: RMSD distributions of helices 1, 2 and 3 in different clusters with respect to the folded state. From left to right, the panels correspond to helices 1, 2 and 3 in the villin headpiece.

Figure 3.25 shows the structures of each cluster and the transition probabilities between different states. The highest transition probability, S3 → S0, corresponds mostly to unfolding of helix 3. Therefore, proper folding of helix 3 leads to the formation of native contacts and native helices. Piano et al.[231] studied the double mutant (Nle/Nle) of villin and found a sparsely populated intermediate that involved formation of helix 3 and the turn between helices 2 and 3. This corresponds to cluster S2 in our analysis, which has a near-folded helix 3. Mori and coworkers[207] studied the molecular mechanics of folding of villin and the Nle/Nle double mutant. They found that the Lys → Nle mutations speed up the folding transition by rigidifying helix 3.

3.3.8 Discussion and Conclusion

Here we demonstrated the use of a deep learning algorithm, the Gaussian mixture variational autoencoder (GMVAE), to help analyze and interpret the highly complex landscape of protein folding trajectories. The variational autoencoder framework has been used extensively in the field of molecular dynamics simulations for dimensionality reduction,[24, 302] enhanced sampling [244, 26] and collective variable discovery.[47, 266, 50]
Noé and coworkers proposed a time-lagged autoencoder (TAE) that can find a low-dimensional embedding for high-dimensional data while capturing the slow dynamics of the underlying processes.[325] However, Ferguson et al.[48] showed that the TAE is limited in finding the optimal embedding for a dynamical system and in general finds a mixture of slow and maximum-variance modes. Ward et al. introduced DiffNets, deep autoencoders that identify structural features for predicting biochemical differences between protein variants from MD simulation trajectories.[319]

The GMVAE model acknowledges the multi-basin nature of protein folding by enforcing a mixture of multiple Gaussians as the prior for the variational autoencoder. We applied our model to three long-timescale protein folding trajectories, namely Trp-cage, BBA and the villin headpiece, all of which have been extensively characterized in previous studies.[180] In all cases, we showed that the model is able to characterize different structural features that correspond to folded, misfolded or unfolded states. The low-dimensional embedding obtained by GMVAE for these proteins resembles the folding funnel, where the folded states lie at the bottom of the funnel and the unfolded ensemble lies outside it. This can be understood intuitively from the point of view of conformational entropy. The unfolded state has larger structural variations, which causes the variance of the Gaussian learned by GMVAE to be larger than that of the folded cluster, which has a narrower distribution. This, along with the continuity of the latent space, makes the landscape funnel-shaped.

To verify that the clusters obtained by GMVAE correspond to different structural features of proteins during folding, we computed the global and local RMSD of each cluster with respect to the folded structure. As expected, the RMSD distribution of each cluster is approximately Gaussian, where the folded state has the lowest and narrowest RMSD distribution and the unfolded (extended) structures have the highest and widest RMSD distributions.

We used normalized distance maps as the features in our machine learning model, which are a practical way to represent a protein simulation dataset. Other features such as contact maps can also be used as input to the model, but they would give a lower-resolution embedding because contact maps carry less information than distance maps. In our model, we used convolutional operations, which are well suited to recognizing and processing image-like data.

It is worth noting that our GMVAE model is different from a simple Gaussian mixture model (GMM). In a GMM, the parameters of the model are optimized iteratively through the expectation-maximization algorithm.[77] GMMs have been used to cluster the FEL of proteins. Delemotte et al. used a GMM to construct and cluster the FEL of Ca2+ binding to calmodulin and found a novel pathway involving salt-bridge breakage and formation.[326] However, a GMM requires a few handcrafted features, and a high number of collective variables can lead to over-fitting. On the other hand, since the GMVAE model is trained by gradient descent and is a deep learning architecture, it does not suffer from the same shortcomings as a GMM.
Unlike the GMVAE model proposed by Varolgüneş et al.,[302] which learns the cluster assignment through a stochastic layer, we replace this with a deterministic layer using the Gumbel-softmax distribution, which makes the model end-to-end differentiable and leads to better performance.[98, 142] The temperature parameter of the Gumbel-softmax was tuned along with the other model hyperparameters during training. The best hyperparameters for each protein were chosen based on a cross-validated reconstruction loss.

The number of clusters is a hyperparameter of the GMVAE. To find an optimum number of clusters, we first start with a high estimate of the number of clusters for each protein. Then, using a cutoff on the cluster assignment probability, we find the number of clusters with membership probability higher than the defined cutoff. Next, we train the model with the number of clusters recovered in the previous step. We showed that at this stage all clusters have membership probabilities higher than the chosen cutoff (0.95), which means the model has converged to the optimum number of clusters for the system. Notably, the number of recovered clusters was the same regardless of the other hyperparameters in the model. However, the number of clusters can depend on the chosen cutoff. This can be viewed as a hierarchical clustering, where the clustering resolution correlates with the cutoff value, so that at lower resolution different structures are embedded in the same cluster. The latent space dimension is another important hyperparameter that needs to be optimized. To find the optimum latent space dimension for each protein, we calculated a cross-validated reconstruction loss for different values of the latent space dimension. The reconstruction loss decreases as the latent space dimension increases until it reaches a plateau; for each protein, we pick the latent space dimension where the reconstruction loss reaches this plateau.

Beyond the static characterization of the protein folding trajectories, we tested whether the model is able to characterize the kinetics of protein folding. We built a high-resolution Markov model on the embedding obtained by GMVAE and computed the MFPTs between different states. Interestingly, the folding timescales obtained by the model are in good agreement with the folding times reported by other groups constructing an MSM on a TICA landscape, which characterizes the dynamics of folding. We note that our model does not use any lag time to construct the low-dimensional embedding; nevertheless, it is able to describe the folding timescales with reasonable accuracy. However, for some of the most dynamic proteins, such as villin with its fast folding timescales, only the first two implied timescales converge after 220 ns, and the other implied timescales fall below the resolution limit of the model, which makes the model unable to give meaningful information about these faster processes. This might be remedied by adding dynamical information to the model through a lag time in the training process. Further improvements to the model could include a graph embedding of protein structures instead of a distance map. This will be studied in our future work.
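As a side note on the deterministic cluster-assignment layer discussed in this section, a minimal PyTorch sketch using the built-in Gumbel-softmax relaxation might look like this; the encoder features and temperature schedule are placeholders.

```python
# Minimal sketch of a Gumbel-softmax cluster-assignment layer: the relaxation
# keeps the assignment differentiable during training. The encoder producing
# the hidden features and the temperature value are illustrative assumptions.
import torch
import torch.nn.functional as F

class ClusterAssignment(torch.nn.Module):
    def __init__(self, hidden_dim: int, n_clusters: int, temperature: float = 0.5):
        super().__init__()
        self.logits = torch.nn.Linear(hidden_dim, n_clusters)
        self.temperature = temperature   # tuned together with the other hyperparameters

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        cluster_logits = self.logits(h)
        # soft, differentiable membership probabilities; hard=True would return one-hot samples
        return F.gumbel_softmax(cluster_logits, tau=self.temperature, hard=False)

# usage: y = ClusterAssignment(hidden_dim=64, n_clusters=9)(encoder_features)
```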
3.4 GraphVAMPNet, using graph neural networks and variational approach to Markov processes for dynamical modeling of biomolecules¹

Finding low-dimensional representations of data from long-timescale trajectories of biomolecular processes such as protein folding or ligand-receptor binding is of fundamental importance, and kinetic models such as Markov models have proven useful in describing the kinetics of these systems. Recently, an unsupervised machine learning technique called VAMPNet was introduced to learn the low-dimensional representation and the linear dynamical model in an end-to-end manner. VAMPNet is based on the variational approach to Markov processes (VAMP) and relies on neural networks to learn the coarse-grained dynamics. In this contribution, we combine VAMPNet and graph neural networks to generate an end-to-end framework that efficiently learns high-level dynamics and metastable states from long-timescale molecular dynamics trajectories. This method has the advantages of graph representation learning and uses graph message passing operations to generate an embedding for each datapoint, which is used in the VAMPNet to generate a coarse-grained dynamical model. This type of molecular representation results in a higher-resolution and more interpretable Markov model than the standard VAMPNet, enabling a more detailed kinetic study of biomolecular processes. Our GraphVAMPNet approach is also enhanced with an attention mechanism to find the residues that are important for classification into the different metastable states.

¹Taken from a published paper: Ghorbani, M., Prasad, S., Klauda, J. B., Brooks, B. R. (2022). GraphVAMPNet, using graph neural networks and variational approach to Markov processes for dynamical modeling of biomolecules. The Journal of Chemical Physics, 156(18), 184103.

3.4.1 Introduction

Recent advances in computer hardware and software have enabled the generation of extensive, high-throughput molecular dynamics (MD) trajectories.[180, 272] These facilitate the thermodynamic and kinetic study of biomolecular processes such as protein folding, protein-ligand binding and conformational dynamics, to name a few. These simulations often produce large amounts of high-dimensional data, which require rigorous techniques for analyzing the thermodynamics and kinetics of the molecular processes. In recent years, the Markov state modeling approach [289, 139, 200, 237] has been greatly developed and used for understanding the long-timescale behavior of dynamical systems, and state-of-the-art software packages such as PyEMMA [263] and MSMBuilder [121] have been introduced. Markov state models provide a master equation that describes the dynamic evolution of the system using a simple transition matrix.[263] Markovianity in these systems means the kinetics are modeled by memoryless jumps between states in the state space. Combined with the advances in MD simulations, the framework for MSM construction has matured into a robust set of methods for analyzing a dynamical system.
In an MSM, the molecular conformation space is discretized into coarse-grained states, where the inter-conversion between microstates within a macrostate is fast compared to transitions between different macrostates.[274] Markov state models have previously been used to investigate the kinetics and thermodynamic properties of biophysical systems such as protein folding,[216, 29, 305, 17] protein-ligand binding [125, 236] and protein conformational changes.[154, 8, 253]

There are several steps in the pipeline of Markov model construction. The first step involves featurization, where relevant MD coordinates such as distances, contact maps or torsion angles are chosen.[199] This is followed by a dimension reduction step that retains the slow collective variables using methods such as time-lagged independent component analysis (TICA) [228, 269], dynamic mode decomposition (DMD) [202, 265, 296] or other variants of these techniques. The resulting low-dimensional space is then discretized into discrete states,[60, 330, 324] usually with K-means clustering.[263] A transition matrix is then built on the discretized trajectories, which describes the time evolution of the processes at a chosen lag time.[237, 29] This transition matrix can be further processed through its eigendecomposition to find the equilibrium and kinetic properties of the system.[237] Finally, fuzzy clustering methods such as PCCA are often used to produce a more interpretable coarse-grained model.[245]

As noted above, there are multiple steps where hyperparameters must be carefully chosen to construct the Markov model. The quality of the constructed MSM is highly dependent on these steps, which has motivated much research into optimizing the MSM pipeline using various techniques.[237, 215, 269, 138, 200, 137, 262] Moreover, complex dynamical systems require optimal choices of model parameters, which in turn require physical and chemical intuition about the system; suboptimal choices can lead to poor results in learning the dynamics from the trajectory. Recently, a variational approach for conformational dynamics (VAC) has been proposed, which helps in the selection of optimal Markov models by defining a score that measures how well a given Markov model captures the true kinetics.[332, 215, 262, 218, 200] VAC states that, given a set of n orthogonal functions of the state space, their time autocorrelations at lag time τ are lower bounds to the true eigenvalues of the Markov operator.[218] This is equivalent to underestimating the relaxation timescales and overestimating the relaxation rates.[332, 215, 218] Before VAC, the tools to diagnose the performance of MSMs were mainly visual, such as the implied timescale plot (ITS) [289] and the Chapman-Kolmogorov (CK) test.[139] The variational approach enabled the objective comparison of different model choices at the same lag time.[215]
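For concreteness, the conventional pipeline summarized above (featurization, TICA, K-means, MSM estimation and PCCA coarse-graining) can be sketched with PyEMMA; the file names and all hyperparameter values are illustrative placeholders, not settings used in this work.

```python
# Sketch of the conventional MSM pipeline: featurize -> TICA -> K-means -> MSM -> PCCA.
import pyemma

feat = pyemma.coordinates.featurizer("protein.pdb")
feat.add_distances_ca()                                    # pairwise Calpha distances
data = pyemma.coordinates.load("traj.xtc", features=feat)

tica = pyemma.coordinates.tica(data, lag=100, dim=5)       # keep slow collective variables
kmeans = pyemma.coordinates.cluster_kmeans(tica.get_output(), k=200)
msm = pyemma.msm.estimate_markov_model(kmeans.dtrajs, lag=100)
msm.pcca(4)                                                # fuzzy coarse-graining into 4 macrostates
print(msm.timescales()[:3])                                # leading implied timescales
```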
The VAC has recently been generalized to the variational approach for Markov processes (VAMP).[332, 331] VAMP was proposed for the general case of irreversible and non-stationary time series data and is based on a singular value decomposition of the Koopman operator.[331]

Using this variational principle, Mardt and coworkers introduced VAMPNets to replace the whole pipeline of Markov model construction with a deep learning framework.[194] A VAMPNet maps the configuration x to a low-dimensional fuzzy state space where each timepoint has a membership probability for each of the metastable states. VAMPNets are then optimized with a variational score (VAMP-2). This framework was further developed by directly learning the eigenfunctions of the spectral decomposition of the transfer operator that propagates the equilibrium probability distribution in time, with state-free reversible VAMPNets (SRVs).[49] Physical constraints of reversibility and stochasticity were later added to the VAMPNet model to obtain a valid transition matrix, enabling the computation of transition rates and other equilibrium properties from out-of-equilibrium simulations.[193, 192] However, a proper representation of the molecules is not discussed in these models, and traditional distance matrices or contact maps are often used.

Conformational heterogeneity of proteins during folding can complicate the selection of features for building a dynamical model; this is even more true for disordered proteins.[188] Obtaining an interpretable few-state kinetic model of protein folding from MD trajectories is still highly desirable and can be achieved with MSM approaches using carefully chosen features. For this, one would choose a set of features for the system such as distances, dihedral angles or the root-mean-square deviation with respect to some reference structure. Using more complicated feature functions, such as convolutional layers applied to distance matrices, has been proposed to enhance the kinetic resolution of the model.[188]

Graph neural networks have previously been used for molecular feature representation and are a promising tool in a variety of applications to predict properties, energies and forces of the system of interest.[268, 136] Battaglia first introduced graph neural networks with convolutional operations and graph message passing.[13, 152] Currently, there are various types of graph neural networks that differ in how the message passing operations are performed between nodes and edges and how the output of the network is generated. Traditionally, distance maps or contact maps were used to represent the structure of molecules. A more natural way of representing proteins is with graphs, where nodes represent atoms and edges represent the (real or virtual) bonds connecting them. This representation is rotationally invariant by construction. Recent advances in protein structure prediction have greatly exploited progress in geometric deep learning [36] and graph neural networks.[152, 37]

Combining the VAMPNet framework with a graph neural network improves the kinetic resolution of the resulting low-dimensional model, so that a smaller lag time can be chosen to build the transition matrix. In the original VAMPNet, the dynamics is directly coarse-grained into a few states without learning a low-dimensional latent-space representation. Using graph neural networks, however, we show that the learned graph embeddings represent useful information about the dynamical system.
Furthermore, using a graph attention network [304] gives useful insights into the importance of different nodes and edges for the different metastable states. An illustration of our GraphVAMPNet model is shown in figure 3.27.

Figure 3.27: Overview of the architecture of the GraphVAMPNet method. Given a molecular structure at time t and at a lag time later, t + τ, molecular graphs are built using the nearest neighbors of the chosen atoms. Several graph convolution operations are performed, resulting in a representation for each node. A hierarchical pooling is done to find a latent representation of the full graph, which is concatenated between times t and t + τ. The full network is then optimized by maximizing a VAMP-2 score.

3.4.2 Methods

A Markov model estimates the dynamics by a transition density, which is the probability density of a transition to state y at time t + τ given that the system was in state x at time t:

$$p_\tau(x, y) = P(x_{t+\tau} = y \mid x_t = x) \tag{3.16}$$

where x and y are two states of the system and τ is the lag time of the model from which the transition probability density P is built. Using this definition of the transition density, the time evolution of the ensemble of states in the system can be written as

$$p_{t+\tau}(y) = (\mathcal{P}_\tau \, p_t)(y) = \int p_\tau(x, y)\, p_t(x)\, dx \tag{3.17}$$

In this equation, $\mathcal{P}_\tau$ acts as a propagator that propagates the dynamics of the system in time. However, this definition of the propagator assumes a reversible and stationary dynamical system.[237] For the general case of non-reversible and non-stationary dynamics, the Koopman operator is used.[332] Koopman theory enables feature transformations into a space where the dynamics evolve, on average, linearly. The Koopman operator acts like a transition matrix for non-linear dynamics and describes the conditional future expectation values for a fixed lag time τ. In Koopman theory, the Markov dynamics at a lag time τ are approximated by a linear model of the following form:

$$\mathbb{E}[\chi_1(x_{t+\tau})] \approx K^T\, \mathbb{E}[\chi_0(x_t)] \tag{3.18}$$

In the equation above, $\chi_0(x) = (\chi_{01}(x), \ldots, \chi_{0m}(x))^T$ and $\chi_1(x) = (\chi_{11}(x), \ldots, \chi_{1m}(x))^T$ are feature transformations to a space where the dynamics evolve, on average, linearly. This approximation is exact in the limit of infinite-dimensional feature transformations; however, it was shown that, given a large enough lag time τ, low-dimensional feature transformations can become optimal.[332] Equation 3.18 can be interpreted as a finite-rank approximation of the so-called Koopman operator.[203] The optimal Koopman matrix that minimizes the regression error of equation 3.18 is

$$K = C_{00}^{-1} C_{0\tau} \tag{3.19}$$

where the mean-free covariance matrices of the transformed data are defined as

$$C_{00} = \mathbb{E}[\chi_0(x_t)\,\chi_0(x_t)^T], \quad C_{0\tau} = \mathbb{E}[\chi_0(x_t)\,\chi_1(x_{t+\tau})^T], \quad C_{\tau\tau} = \mathbb{E}[\chi_1(x_{t+\tau})\,\chi_1(x_{t+\tau})^T] \tag{3.20}$$

However, the regression error carries no information about the choice of feature transformations $\chi_0$ and $\chi_1$ and can lead to trivial solutions for these transformations.[194] VAMP, on the other hand, provides useful scoring functions that can be used to find optimal feature transformations. VAMP is based on a singular value decomposition of the Koopman operator, is used to optimize the feature functions, and is not limited to time-reversible and stationary dynamics. VAMP states that, given a set of orthogonal candidate functions, their time-autocorrelations are lower bounds to the true Koopman eigenvalues. This provides a variational score, such as the sum of estimated eigenvalues, that can be maximized to find the optimal kinetic model.
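Before turning to the VAMP score, a bare NumPy sketch of the Koopman-matrix estimate in equations 3.18-3.20 is given below; the chi_t and chi_tau arrays are hypothetical outputs of a feature transformation such as a network lobe.

```python
# Sketch of the Koopman-matrix estimate K = C00^{-1} C0tau from mean-free
# feature transformations of the trajectory at times t and t+tau.
import numpy as np

def estimate_koopman(chi_t: np.ndarray, chi_tau: np.ndarray) -> np.ndarray:
    """chi_t, chi_tau: (n_frames, m) feature matrices at times t and t+tau."""
    chi_t = chi_t - chi_t.mean(axis=0)
    chi_tau = chi_tau - chi_tau.mean(axis=0)
    n = chi_t.shape[0]
    c00 = chi_t.T @ chi_t / n
    c0t = chi_t.T @ chi_tau / n
    # a small ridge term keeps the inverse well conditioned
    return np.linalg.solve(c00 + 1e-10 * np.eye(c00.shape[0]), c0t)
```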
Wu and Noé showed that the optimal choices of $\chi_0$ and $\chi_1$ in equation 3.18 are obtained from the singular value decomposition of the Koopman matrix, setting $\chi_0$ and $\chi_1$ to its top left and right singular functions, respectively.[332, 331] The VAMP-2 score is then defined as

$$\mathrm{VAMP\text{-}2} = \sum_i \sigma_i^2 = \left\| C_{00}^{-1/2}\, C_{0\tau}\, C_{\tau\tau}^{-1/2} \right\|_F^2 + 1 \tag{3.21}$$

The leading left and right singular functions of the Koopman operator are always equal to the constant function 1; therefore, 1 is added to the basis functions. The maximum VAMP-2 score is achieved when the top m left and right Koopman singular functions are in the spans of $(\chi_{01}, \ldots, \chi_{0m})$ and $(\chi_{11}, \ldots, \chi_{1m})$, respectively. VAMP-2 also maximizes the kinetic variance captured by the model.

The feature transformations $\chi_0$ and $\chi_1$ can be learned with neural networks in the so-called VAMPNet, where there are two parallel lobes, each receiving the MD configurations $x_t$ and $x_{t+\tau}$. As done in the original VAMPNet, we assume the two lobes share parameters and use a unique basis set $\chi = \chi_0 = \chi_1$. Training is done by maximizing the VAMP-2 score to learn the low-dimensional state space produced by a softmax function.

Since K is a Markovian model, it is expected to fulfill the Chapman-Kolmogorov (CK) equation

$$K(n\tau) = K^n(\tau) \tag{3.22}$$

for any value $n \geq 1$, where $K(\tau)$ and $K(n\tau)$ indicate models estimated at lag times of τ and nτ, respectively. The implied timescales of the processes are computed as

$$t_i(\tau) = -\frac{\tau}{\ln|\lambda_i(\tau)|} \tag{3.23}$$

where $\lambda_i(\tau)$ is the i-th eigenvalue of the Koopman matrix built at a lag time τ. The smallest lag time τ is chosen for which the implied timescales $t_i(\tau)$ are approximately constant in τ. After choosing the lag time τ, we test whether the CK equation holds within statistical uncertainty.
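The VAMP-2 score (equation 3.21) and the implied timescales (equation 3.23) can likewise be sketched in NumPy from the same mean-free covariances; this is only an illustration, not the training implementation.

```python
# Sketch of the VAMP-2 score and implied timescales from mean-free covariances.
import numpy as np

def vamp2_score(chi_t, chi_tau, eps=1e-10):
    chi_t = chi_t - chi_t.mean(axis=0)
    chi_tau = chi_tau - chi_tau.mean(axis=0)
    n = chi_t.shape[0]
    c00 = chi_t.T @ chi_t / n + eps * np.eye(chi_t.shape[1])
    ctt = chi_tau.T @ chi_tau / n + eps * np.eye(chi_tau.shape[1])
    c0t = chi_t.T @ chi_tau / n
    def inv_sqrt(c):
        w, v = np.linalg.eigh(c)
        return v @ np.diag(w ** -0.5) @ v.T
    k_bar = inv_sqrt(c00) @ c0t @ inv_sqrt(ctt)
    return np.linalg.norm(k_bar, "fro") ** 2 + 1.0        # equation 3.21

def implied_timescales(koopman, lag_ns):
    """t_i(tau) = -tau / ln|lambda_i(tau)| for the non-trivial eigenvalues (equation 3.23)."""
    eigvals = np.sort(np.abs(np.linalg.eigvals(koopman)))[::-1]
    return -lag_ns / np.log(eigvals[1:])                  # skip the stationary eigenvalue ~1
```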
3.4.3 Protein graph representation

Each structure is represented as an attributed graph G = (V, E), where V = {v_1, ..., v_N} are the node features and E = {e_ij} are the edge features that capture the relations between nodes. We tested different graph neural networks (GNNs) for their ability to learn higher-resolution kinetic models from MD trajectories of protein folding simulations.[335] In all of these GNNs, the node embeddings v_i are initialized randomly and the edge embeddings e_ij are the Gaussian-expanded distances between adjacent nodes:

$$e_{ij}^t = \exp\!\left(-\frac{(d_{ij} - \mu_t)^2}{\sigma^2}\right) \tag{3.24}$$

where $d_{ij}$ is the distance between atoms i and j, $\mu_t = d_{\min} + t\,(d_{\max} - d_{\min})/K$ Å for $t = 0, 1, \ldots, K$, and $\sigma = (d_{\max} - d_{\min})/K$ Å. Here $d_{\max}$ and $d_{\min}$ are the maximum and minimum distances used to construct the Gaussian-expanded edge features. Unless noted otherwise, all graphs are built using the M nearest neighbors of the Cα atoms of the protein, with edges built from the Gaussian-expanded distances between Cα atoms.

Graph convolution layer

In this type of graph neural network, the protein graph is represented as G = (V, E), where V contains the node features and E contains the edge attributes of the graph. A separate graph is constructed for the configuration at each timestep of the simulation. The initial node representations are randomly initialized; however, a one-hot vector representation based on the atom type or amino acid type can also be used. During training, the node embedding $v_i^k$ of node i at layer k is updated using the following equations:[258, 336]

$$v_i^{k+1} = v_i^k + \sum_{j \in N_i} w_{i,j}^k \odot g\!\left(z_{i,j}^k W_c^k + b_c^k\right) \tag{3.25}$$

$$w_{i,j}^k = \sigma\!\left(z_{i,j}^k W_g^k + b_g^k\right) \tag{3.26}$$

$$z_{i,j}^k = v_i^k \oplus v_j^k \oplus e_{i,j} \tag{3.27}$$

where ⊙ denotes element-wise multiplication, ⊕ denotes concatenation, σ is the sigmoid non-linearity, and g() is the edge-gating mechanism introduced by Marcheggiani [191] to incorporate different interaction strengths among neighbors into the model. $W_g^k$, $W_c^k$, $b_g^k$ and $b_c^k$ are the gate weight matrix, convolution weight matrix, gate bias and convolution bias, respectively, for the k-th graph convolution layer. To capture the embedding of the whole graph, we use graph pooling, where the graph embedding is generated from the learned node embeddings.[119, 240] The embedding of the whole graph is obtained through a pooling function that averages over the embeddings of all nodes:

$$v_G = \frac{1}{N} \sum_{i=1}^{N} v_i \tag{3.28}$$

Other types of pooling, such as hierarchical pooling, can also be applied to build a more complicated model.[344]

SchNet

Another type of GNN is SchNet, which was introduced by Schütt and others to use continuous-filter convolutions for predicting forces and energies of small molecules according to quantum mechanical models.[268, 13] This was modified by Husic et al.[136] to learn a coarse-grained representation of molecules using a graph neural network. SchNet is employed here to learn feature representations of the nodes for learning the dynamics of protein folding; it is a subunit of our model for learning feature representations of molecules at the graph level. The initial node features (embeddings) are initialized randomly, but a one-hot encoding based on the node type (amino acid) can also be used. These embeddings are learned and updated during training of the network by a few rounds of message passing through the nodes and edges of the graph. Node embeddings are updated in multiple interaction blocks, as implemented in the original SchNet.[268] Each interaction layer contains a continuous convolution between nodes. The edge attributes are obtained using radial basis functions, e.g., Gaussians centered at different distances:

$$e_{ij} = \exp\!\left(-\gamma\,(d_{ij} - \mu)^2\right) \tag{3.29}$$

These edge attributes are then fed into a filter-generating network w that maps $e_{ij}$ to a $d_h$-dimensional filter. This filter is then applied to the node embeddings as a continuous-filter convolution:

$$z_i^k = \sum_{j \in N(i)} \alpha_{ij}^k\, w^k(e_{ij}) \odot b^k(h_j^k) \tag{3.30}$$

$$\alpha_{ij}^k = \frac{\exp\!\left(z_{ij}^k W_a^k\right)}{\sum_{j' \in N(i)} \exp\!\left(z_{ij'}^k W_a^k\right)} \tag{3.31}$$

$$z_{ij}^k = w^k(e_{ij}) \odot b^k(h_j^k) \tag{3.32}$$

Here w is a dense neural network and b is an atom-wise linear layer, as noted in the original paper.[268] Note that the sum runs over every atom j in the neighborhood of atom i. Multiple interaction blocks allow all atoms to interact with each other in the network and therefore allow the model to express complex multi-body interactions. We enhanced the standard SchNet architecture by adding an attention layer that learns the importance of the edge embeddings for updating the embedding of the receiving node in the next layer. The attention weight $\alpha_{ij}$ is learned using a softmax over the neighbors $j \in N(i)$ of the query node i, based on the node and edge embeddings. The node embeddings are updated in each interaction block, which can contain a residual connection to avoid vanishing gradients, as done in deep residual networks.[122] The residual connection is followed by a nonlinear activation applied to the output $z_i^k$ of the continuous-filter convolution:

$$h_i^{k+1} = h_i^k + g^k(z_i^k) \tag{3.33}$$

The trainable function g involves linear layers and a nonlinearity. We used a hyperbolic tangent as the activation, as proposed by Husic et al.[136] The output of the final SchNet interaction block is fed into an atom-wise dense network. The embeddings of the nodes learned after several SchNet layers are then fed into a pooling layer, as described previously, to produce a graph embedding for each timestep.
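A small sketch of the graph construction described in this section, with the M-nearest-neighbor Cα edges and the Gaussian-expanded distances of equation 3.24, is shown below; the parameter values follow the Trp-cage settings of table 3.3, and the coordinate array is a placeholder.

```python
# Sketch of the per-frame molecular graph: the M nearest Calpha neighbors
# define the edges, and each edge distance is expanded on a grid of Gaussians.
import numpy as np

def build_graph(ca_xyz, n_neighbors=7, d_min=2.0, d_max=8.0, n_gauss=12):
    """ca_xyz: (n_residues, 3) Calpha coordinates of one frame, in Angstrom."""
    dist = np.linalg.norm(ca_xyz[:, None, :] - ca_xyz[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :n_neighbors]       # (N, M) edge index

    centers = np.linspace(d_min, d_max, n_gauss)                 # grid of mu_t values
    sigma = (d_max - d_min) / n_gauss
    d_edges = np.take_along_axis(dist, neighbors, axis=1)        # (N, M) edge distances
    edge_feat = np.exp(-((d_edges[..., None] - centers) ** 2) / sigma ** 2)
    return neighbors, edge_feat                                  # (N, M), (N, M, n_gauss)
```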
3.4.4 Model selection and hyperparameters

In GraphVAMPNet, instead of traditional features such as dihedral angles, distance matrices and contact maps, we use a general graph neural network, a more natural representation of molecules and proteins. We implemented two different graph neural networks (GraphConvLayer and SchNet). The GraphVAMPNet built from each of these GNN layers has several model hyperparameters, including the dimension of the feature space (number of output states) and the lag time τ. To resolve k − 1 relaxation timescales, we need at least k output neurons in the last layer of the network, since the softmax function removes one degree of freedom. The models are trained by maximizing the VAMP-2 score on the training set, and hyperparameters are optimized using a cross-validated VAMP-2 score on the validation set, with a 0.7/0.3 split between training and validation. To have a fair comparison between different feature representations, we trained models with the same number of layers (4) and the same number of neurons per layer (16). In general, increasing the dimension of the feature space makes the dynamical model more accurate, but it may result in overfitting when the dimension is very large. A higher-dimensional feature space is also harder to interpret, since the model seeks a low-dimensional representation. Therefore, in this study, we use 5-state output models unless stated otherwise. There are multiple hyperparameters in the model that must be selected, including the GNN architecture, the number of clusters, the number of neighbors used to build the graph, and the time step for analyzing the simulation. Here we used 5 clusters for Trp-cage and NTL9 and 4 clusters for villin. In the case of villin, using a 5-state model led to finding only 4 states after applying a cutoff of 0.95 on the cluster probabilities, which is why we used a value of 4 for this protein. The hyperparameters chosen for each protein are shown in table 3.3.

Table 3.3: Hyperparameters for each system in this study. dmin, dmax and the number of Gaussians are the parameters for building the Gaussian-expanded distances (equation 3.24).

system   | graph layers | neurons | clusters | batch size | learning rate | residues | neighbors | dmin | dmax | Gaussians
TrpCage  | 4            | 16      | 5        | 1000       | 0.0005        | 20       | 7         | 2    | 8    | 12
Villin   | 4            | 16      | 4        | 1000       | 0.0005        | 35       | 10        | 2    | 10   | 16
NTL9     | 4            | 16      | 5        | 1000       | 0.0005        | 39       | 10        | 2    | 12   | 20

Table 3.4: Average VAMP-2 score for each system.

system   | Standard VAMPNet | GraphConvLayer | SchNet
TrpCage  | 4.68 ± 0.08      | 4.76 ± 0.03    | 4.79 ± 0.01
Villin   | 3.74 ± 0.02      | 3.74 ± 0.06    | 3.78 ± 0.02
NTL9     | 4.67 ± 0.03      | 4.50 ± 0.41    | 4.80 ± 0.03

3.4.5 Results

We tested the performance of our GraphVAMPNet method on three different protein folding systems: Trp-cage (PDB: 2JOF),[11] villin (PDB: 2F4K) [157] and NTL9 (PDB: 2HBA).[57] The graph neural network was implemented using PyTorch, and the deeptime [131] package was used for the VAMPNet. PyEMMA [263] was used for the free energy landscape plots. Adam was used as the optimizer in all models.
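As an illustration of the free-energy-landscape and state-assignment plots referred to throughout the results, a PyEMMA-based sketch might look as follows; the embedding and probability arrays are hypothetical outputs of the trained 2D model.

```python
# Sketch of the FEL and state-assignment plots: the 2D graph embedding is
# histogrammed into a free energy surface, and only frames with a confident
# state assignment are colored. The .npy files are placeholder model outputs.
import numpy as np
import pyemma.plots
import matplotlib.pyplot as plt

embedding_2d = np.load("graph_embedding_2d.npy")      # (n_frames, 2)
state_probs = np.load("state_probabilities.npy")      # (n_frames, n_states)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
pyemma.plots.plot_free_energy(embedding_2d[:, 0], embedding_2d[:, 1], ax=ax1)

confident = state_probs.max(axis=1) > 0.95             # color only confident frames
ax2.scatter(embedding_2d[confident, 0], embedding_2d[confident, 1],
            c=state_probs[confident].argmax(axis=1), s=1, cmap="tab10")
fig.savefig("fel_and_states.png", dpi=300)
```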
A GNN provides a framework for learning feature transformations in VAMPNets that are invariant to permutation, rotation and reflection. Moreover, graph embeddings that correspond to the different dynamic states sampled during the simulation can be learned within the GraphVAMPNet framework. To visualize the graph embeddings in 2D, we also transformed the graph embeddings in the last layer of GraphVAMPNet into 2D and trained the model by maximizing the VAMP-2 score. The free energy landscape on the graph embeddings shows highly separated metastable states divided by high-energy transition regions. The low-energy metastable states correspond to regions of high-fidelity metastable assignment, with probabilities higher than 0.95. It is important to note that this is done only for visualization purposes; higher-dimensional (16) embeddings are used for finding the metastable states in these complex protein folding systems. Furthermore, the present results do not depend on enforcing reversibility on the learned transition matrix. However, this can be done by Koopman re-weighting [332] or by learning the re-weighting vectors during training in the VAMPNet framework.[193]

Figure 3.28: Trp-cage system. A) Implied timescale (ITS) plot for SchNet as the feature transformation in VAMPNet (errors are taken from the 95% confidence interval over 10 different trainings). B) Free energy landscape (FEL) in a 2D graph embedding. C) CK-test for SchNet using a lag time of 20 ns. D) State assignment of the 2D graph embedding using a 0.95 cutoff.

Figure 3.29: Training and validation losses for the SchNet-based VAMPNet of Trp-cage.

Although we tried different graph convolution networks in the GraphVAMPNet approach, our results showed that SchNet has the best performance, with the highest VAMP-2 score among all of them. Therefore, we present the results of SchNet in the main text. The average VAMP-2 scores were calculated from the validation sets of 10 different trainings for each system and compared (table 3.4). The VAMP-2 scores for SchNet on the Trp-cage training and validation sets are plotted against the training epoch and show converging behavior after 100 epochs (figure 3.29). A comparison of implied timescales between the standard VAMPNet and the SchNet-based GraphVAMPNet is shown in figure 3.36 and table 3.5.

Table 3.5: Implied timescales calculated for TrpCage (at a lag time of 20 ns), Villin (at a lag time of 20 ns) and NTL9 (at a lag time of 200 ns) from the SchNet-based GraphVAMPNet and the standard VAMPNet.

ITS              | TrpCage    | Villin     | NTL9
1-GraphVAMPNet   | 1917 ± 28  | 1138 ± 23  | 14,682 ± 935
1-VAMPNet        | 1800 ± 101 | 697 ± 108  | 15,013 ± 738
2-GraphVAMPNet   | 419 ± 88   | 395 ± 10   | 1623 ± 93
2-VAMPNet        | 382 ± 51   | 381 ± 13   | 1285 ± 37
3-GraphVAMPNet   | 253 ± 85   | 82 ± 17    | 680 ± 76
3-VAMPNet        | 225 ± 57   | 65 ± 1     | 464 ± 47
4-GraphVAMPNet   | 179 ± 121  | -          | 409 ± 107
4-VAMPNet        | 186 ± 28   | -          | 288 ± 86

3.4.6 Trp-cage

We test our GraphVAMPNet model on an ultra-long 208 µs explicit-solvent simulation of the K8A mutant of the 20-residue Trp-cage TC10b at 290 K provided by the D. E. Shaw group.[180] The folded state of Trp-cage contains an α-helix (residues 2-8), a 3₁₀-helix and a polyproline II helix.[201] The tryptophan residue (Trp6) is caged at the center of the protein. A VAMPNet was built for each type of feature-learning neural network. The average VAMP-2 scores of the validation set over 10 training runs were compared between the different feature-learning neural networks (standard VAMPNet and the two graph layers) in table S1.
SchNet showed the highest average VAMP-2 score among different types of features leanings. The average VAMP-2 score for training and validation set of TrpCage for 10 different training examples is shown in figure 3.29. The VAMP-2 score shows a converging behavior after 100 epochs of training. Since SchNet showed the high- est VAMP-2 score we use this type of Graph Neural network for the rest of our analysis. The implied timescales learned using SchNet is shown in figure 3.28A. A comparison of the implied timescales between SchNet based GraphVAMPNet and standard VAMPNet is shown in figure 3.36A. Both Standard VAMPNet and SchNet show a fast convergence 128 of implied timescales. However, the implied timescales from SchNet have smaller error bars than the standard VAMPNet (table 3.4). Standard VAMPNet using distances shows higher variance in the implied timescales than the VAMPNet built using SchNet. All 4 timescales in SchNet converge after 20 ns which is the lagtime we choose to build the kinetic Koopman matrix. Moreover, a closer look at the implied timescales shows that the timescales learnt from standard VAMPNet layer are also smaller (table 3.4) than the implied timescales in SchNet. According to the variational approach for Markov pro- cesses, a model with longer implied timescale corresponds to less modeling error of the true dynamics of the system.[218] To validate the resulting GraphVAMPNet, we conduct a CK-test which compares the transition probability between pairs of states i? j at time k? predicted by a model at lagtime ? . CK-test shows (figure 3.28C) excellent prediction of transition probabilities even at large timescales k? = 200 ns at a lagtime of 20 ns. We next analyzed the resulting coarse-grained states built from VAMPNet using SchNet as feature transformation. The folded state (S4) possesses 18% of the total distribution and the un- folded state (S1) has 69.5% of the total distribution which is in great agreement with other studies on this dataset using Markov state models.[274, 104]. GraphVAMPNet produces an embedding for each timestep of the simulation which is then turned into a membership assignment using a softmax function. This higher dimensional embedding (16 dimension) can be visualized using dimensionality reduction methods such as t-SNE. To have a better visualization of the low-dimensional space learned by the model, we also trained a Graph- VAMPNet where in the last layer we linearly transformed the learned graph embedding into 2-D and trained the model by maximizing the VAMP-2 score. Other parameters were kept similar to the main SchNet. The 2-D free energy landscape (FEL) for this embedding is shown in figure 3.28B. This low energy states in the FEL correspond to the states with high cluster assignment probability. This low-dimensional FEL shows the ability of the GraphVAMPNet to produce an interpretable and highly clustered embedding of graphs for 129 Figure 3.30: A) Representative structure of each metastable state in TrpCage with their probabilities B) average attention score between C? atoms for each cluster C) averaged attention score for each residue of TrpCage in each cluster which is the scaled sum of rows. 130 simulation of proteins. The learned 2-D embedding of graphs during TrpCage Folding is shown in figure 3.28D where the states with more than 0.95 cluster assignment probability are colored. 
Enhancing SchNet with attention gives an interpretable model in which we can analyze the nodes and edges of the graph that are most important in each coarse-grained cluster. The scaled attention scores for Trp-cage are shown in figure 3.30. The cage residue Trp6 shows a high attention score in most clusters because it sits at the center of the protein and has a high number of connections in the graph. In the unfolded state (S1), most residues have high attention scores only with their close neighbors in the sequence, which reflects the high flexibility and lack of defined structure of the unfolded state. Other clusters, such as S2, show different hot-spot regions in their attention scores; in this hairpin-like structure, residues Ala4, Ser14 and Pro17, which form the groove, have high attention scores. A two-step folding mechanism has been proposed for Trp-cage that involves an intermediate state with a salt bridge between Asp9 and Arg16.[354] Breaking this salt bridge is thought to be a limiting step in the folding of Trp-cage. Interestingly, our model puts high attention scores on residues Arg16 and Asp9 in metastable state S3, which also has a 10% probability.

3.4.7 Villin

Villin is a 35-residue protein and is known as one of the smallest proteins that can fold autonomously. It is composed of three α-helices, denoted helix 1 (residues 4-8), helix 2 (residues 15-18) and helix 3 (residues 23-32), and a compact hydrophobic core. The double mutant of villin, with two Lys residues replaced by uncharged norleucine (Nle), was simulated by the D. E. Shaw group [180] and is studied here. Hernandez and coworkers [128] used a variational dynamics encoder to produce a low-dimensional embedding of villin folding trajectories using Cα distance maps. The optimized TICA for this protein used a lag time of 44 ns according to hyperparameter optimization.[137]

Figure 3.31: Villin system. A) Implied timescale (ITS) plot for SchNet as the feature transformation in VAMPNet (errors are taken from the 95% confidence interval over 10 different trainings). B) Free energy landscape (FEL) in a 2D graph embedding. C) CK-test for SchNet using a lag time of 20 ns. D) State assignment of the 2D graph embedding using a 0.75 cutoff.

We built VAMPNets with different types of feature functions and compared their average VAMP-2 scores over the validation sets of 10 different trainings. The VAMPNet based on SchNet showed the highest VAMP-2 score for the same number of states (4). The VAMPNet built using SchNet shows an extremely fast convergence of the implied timescales, even after 20 ns (figure 3.31A), which gives a high-resolution kinetic model for villin folding. A close comparison between the SchNet-based GraphVAMPNet and standard VAMPNet implied timescales is shown in figure 3.36B. The standard VAMPNet shows slow convergence of the implied timescales, where the first timescale only converges after a lag time of 40 ns (figure 3.35B). The timescales of the processes are also higher for SchNet than for the standard VAMPNet model, which again demonstrates the higher accuracy of GraphVAMPNet compared with a VAMPNet built on a simple distance matrix (table 3.5).

Figure 3.32: A) Representative structure of each metastable state in villin with their probabilities. B) Average attention score between Cα atoms for each cluster. C) Averaged attention score for each residue of villin in each cluster, which is the scaled sum of rows.
The CK-test for SchNet (figure 3.31C) shows excellent Markovian behavior at large timescales, kτ = 200 ns, for the model built using a lag time of 20 ns. GraphVAMPNet also provides a latent embedding of the graphs, which is another advantage of GNN features compared with the standard VAMPNet layer. To obtain an interpretable embedding, we trained a VAMPNet using SchNet in which the last embedding layer is linearly transformed into 2 dimensions, using parameters otherwise similar to those above. The 2D embedding of villin learned using GraphVAMPNet is shown in figure 3.31D, where datapoints with a cluster assignment probability higher than 0.75 are colored according to their corresponding state. The FEL on this 2D embedding is shown in figure 3.31B. This FEL features highly separated clusters, with low-energy minima corresponding to the centers of the clusters and transition regions having low membership assignment probabilities. The representative structures of each cluster of villin (misfolded: S0, unfolded: S1, partially folded: S2, folded: S3) are shown in figure 3.32 and are colored by the average attention score of each residue in that cluster. The folded state (S3) shows a high attention score for residue Arg13, in agreement with a previous study by Mardt et al.,[192] who used distance-map features and attention over neighboring residues. We found that residues Gln25 and Nle29 also have high attention scores in the folded state. These residues are in the central hydrophobic core of the protein and have a high number of connections in the graphs built for the folded state. The partially folded state (S2) has attention scores similar to the folded state, except for residue Lys7, which shows a high attention score in the partially folded state (S2) but not in the folded state (S3). The misfolded state (S0) has high attention scores for helix 2 residues, which is also in agreement with the work of Mardt et al.[192] In general, the N- and C-termini of the protein are given low attention scores due to their high flexibility. Hernandez and coworkers [128] used variational dynamics encoders to reduce the complex nonlinear folding of villin to a single embedding and used a saliency map to find important Cα contacts for the folding of villin. They found residues Lys29 and His27 to be important for the folding of villin; we found these residues to have high attention scores in our model for the partially folded and folded states.

3.4.8 NTL9

As our last example, we tested GraphVAMPNet on the NTL9 (residues 1-39) folding dataset from the D. E. Shaw group.[180] We uniformly sampled the 1.11 ms trajectory using a stride of 5 ns. Mardt et al.[194] previously used a 5-layer VAMPNet with contact maps between neighboring heavy atoms to coarse-grain the NTL9 simulation into metastable states. They showed that the relaxation timescales of a 5-state VAMPNet correspond to those of a 40-state MSM. Their implied timescales showed converging behavior after about 320 ns, which they chose as the lag time of the Koopman matrix.

Figure 3.33: NTL9 system. A) Implied timescale (ITS) plot for SchNet as the feature transformation in VAMPNet (errors are taken from the 95% confidence interval over 10 different trainings). B) Free energy landscape (FEL) in a 2D graph embedding. C) CK-test for SchNet using a lag time of 200 ns. D) State assignment of the 2D graph embedding using a 0.95 cutoff.

The comparison of VAMP-2 scores for VAMPNets built using different neural network feature transformations is shown in table S1.
The standard VAMPNet based on distance maps shows a lower VAMP-2 score than the SchNet-based VAMPNet. The implied timescales for the SchNet layer are shown in figure 3.33A, and a comparison of the implied timescales between the SchNet-based GraphVAMPNet and the standard VAMPNet for NTL9 is shown in figure 3.35C. SchNet shows better convergence behavior and also higher implied timescales than the standard VAMPNet (table 3.5), although the convergence and magnitude of the slowest implied timescale are similar for SchNet (figure 3.33) and the standard VAMPNet (figure 3.35C). A lag time of 200 ns was chosen to build the Koopman matrix. The CK-test for SchNet using a lag time of 200 ns (figure 3.33C) shows the Markovianity of the model even at long timescales of 2000 ns. As described for the other proteins, we trained a SchNet-based VAMPNet for NTL9 with a 2D embedding. Figure 3.33D shows the 2D embedding of NTL9, colored by the states with higher than 0.95 cluster membership probability. The FEL in this 2D embedding (figure 3.33B) shows low-energy metastable states separated by transition regions that correspond to points where the model is uncertain about their membership. Representative structures of each cluster in NTL9 are shown in figure 3.34 (colored by residue attention scores). The folded state (S1) and unfolded state (S3) possess 82.3% and 15.3% of the total probability distribution, respectively. Schwantes et al.[269] used a TICA-based MSM for NTL9 and showed that the slowest timescale (about 18 µs) corresponds to the folding process, while the faster timescales correspond to transitions between different register-shifted states. A register shift in each strand can also shift the hydrophobic core contacts; for instance, based on their study, a register shift in strand 3 produces a shift in the core packing in which Phe29 is packed. Interestingly, in our model high attention scores are given to β-strand residues such as Phe5, Phe29, Leu30 and Phe31. Ala36, which is part of strand 3, has a high attention score in the folded and near-folded states (S0, S1 and S2). A register shift for strand 3 was reported by Schwantes et al.[269] for NTL9.

Figure 3.34: A) Representative structure of each metastable state in NTL9 with their probabilities. B) Average attention score between Cα atoms for each cluster. C) Averaged attention score for each residue of NTL9 in each cluster, which is the scaled sum of rows.

3.4.9 Discussion and Conclusion

MSM construction has previously been a complex process involving multiple steps such as feature selection, dimension reduction, clustering, estimation of the transition matrix K, and coarse-graining of the dynamical model. Each of these steps requires choosing hyperparameters, and suboptimal choices can lead to a poor kinetic model of the system with lower kinetic resolution.[262] The variational approach for conformational dynamics (VAC) and its more general form, the variational approach for Markov processes (VAMP), have recently guided the optimal choice of hyperparameters.[332, 218, 262] A cross-validated variational score is usually used to find the set of features with the highest cross-validated VAMP-2 score.[262]

The end-to-end deep learning framework VAMPNet was proposed by Mardt et al.[194] to replace the MSM construction pipeline by training a neural network that maps the molecular configurations x to a fuzzy state space where each point has a membership probability for each of the metastable states.
A VAMPNet is trained by maximizing a VAMP score, allowing us to find the optimal state space that enables linear propagation of states through a transition matrix. VAMPNets are not restricted to stationary and equilibrium MD and can be used in the general case of non-stationary and non-equilibrium processes. In a VAMPNet, the few-state coarse-grained MSM is learned without the loss of model quality that occurs in standard pipelines such as PCCA.[245] Due to the end-to-end nature of deep neural networks, VAMPNets require less expertise to build an MSM. The VAMPNet framework was further developed into state-free reversible VAMPNets (SRVs), which do not approximate MSMs but rather directly learn nonlinear approximations to the slowest dynamical modes of an MD system obeying detailed balance.[49] In SRVs, the transfer operator, rather than a soft metastable state assignment, directly employs the variational approach under detailed balance to approximate the slow modes of the equilibrium dynamics. Ferguson and coworkers [274] showed that MSMs constructed from nonlinear SRV approximations permit the use of shorter lag times and therefore furnish the models with higher kinetic resolution. Hernandez et al.[128] introduced variational dynamics encoders (VDEs), which use a variational autoencoder to find a simple embedding of nonlinear dynamical processes by optimizing a loss function that is the sum of a trajectory reconstruction loss and an autocorrelation loss. However, both SRVs and VDEs produce an embedding of the dynamics and not the coarse-grained states; to build a Markov state model from the embedding, they rely on the traditional MSM construction pipeline (clustering, Markov model construction and coarse-graining). In VAMPNets and GraphVAMPNets, on the other hand, the entire mapping from features to Markov states is done in a single end-to-end network. Moreover, VDEs are limited to estimating a single leading eigenfunction of the dynamical propagator and fail to uncover the full spectrum of slow modes; the lack of an orthogonality constraint on the learned eigenfunctions can cause different slow modes to become entangled in systems with multiple metastable states. SRVs also use the variational approach to conformational dynamics (VAC) as their loss function and are only suitable for stationary and reversible processes, whereas VAMPNets can be applied to more general non-reversible and non-stationary processes.

Despite the success of VAMPNets, feature selection is still a process that must be done with caution. Traditionally, distance maps are used as a general feature representation of protein datasets. However, this representation does not preserve the graph-like structure of proteins, as it does not capture the 3D structure and models the protein as points on a regular grid. In this work, we have focused on representation learning for VAMPNets using a graph representation of the protein to obtain a higher-resolution kinetic model for which a smaller lag time can be chosen. Graph representations of molecules have been shown to be effective for extracting different properties with deep learning. Recently, there has been a large amount of work in the area of geometric deep learning [36] built on graph-based approaches for representing molecular structures. These methods enable automatic learning of the best representation (embedding) from the raw data of atoms and bonds for different types of predictions.[149, 191]
These methods have been applied to various tasks such as molecular feature extraction,[87, 149] protein function prediction [106] and protein design,[285] to name a few. Park et al.[225] proposed a machine learning framework (GNNFF), a graph neural network that predicts atomic forces from local environments, and showed its high accuracy and speed in force prediction. The introduction of graph message passing enhances the model's ability to recognize symmetries and invariances (permutation, rotation and translation) in the system. Hierarchical pooling from the atom level to the residue level and then to the protein level enables the model to learn global transitions between different metastable states that involve atomic-scale dynamics. Xie and coworkers [335] developed graph dynamical networks (GdyNets) to investigate atomic-scale dynamics in material systems, where each atom or node in the graph has a membership probability for each of the metastable states. The graph representation of materials in their model enabled an encoding of the local environment that is permutation, rotation and reflection invariant. The symmetry in materials facilitated identifying similar local environments throughout the material and learning the local dynamics. This type of approach can be used to learn local dynamics in biophysical problems such as nucleation and aggregation, where the local environment is important.

The introduction of graph neural networks into VAMPNets enables a higher-resolution and more interpretable kinetic model. A large increase in the VAMP-2 score is observed when switching from distance-based features to graph-based ones, which suggests the usefulness and representational capability of GNNs for further improving the kinetic embedding of MD simulations. We tested GraphVAMPNet with two different types of graph neural networks (graph convolution layer and SchNet) on three long-timescale protein folding trajectories. GraphVAMPNet showed a higher VAMP-2 score than the standard VAMPNet, and the implied timescales converged faster in GraphVAMPNet due to its more efficient representation learning. This enables choosing a smaller lag time for building the dynamical model and improves the kinetic resolution of the resulting Markov model. The timescales observed with our GraphVAMPNet are comparable to those of other methods that build Markov state models with hundreds of states on embeddings derived from state-free reversible VAMPNets.

Figure 3.35: Implied timescales for the standard VAMPNet for A) TrpCage, B) Villin, C) NTL9, and for the VAMPNet based on the graph convolution layer for D) TrpCage, E) Villin, F) NTL9.

Figure 3.36: Comparison of implied timescales from GraphVAMPNet and the standard VAMPNet for A) TrpCage, B) Villin, C) NTL9.

We should also note that the graph neural network approach for molecular representation could also be used in SRVs to learn a dynamical embedding based on graph representations. The graph embeddings resulting from GraphVAMPNet are highly interpretable and show clustered data in low-energy minima of a free energy landscape. Furthermore, the addition of an attention mechanism to SchNet enables us to decipher the residues and bonds that contribute most to each of the metastable states. However, care must be taken when interpreting the attention scores returned by the model.
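One way such per-residue attention profiles (figures 3.30, 3.32 and 3.34) can be aggregated is sketched below, assuming hypothetical arrays of per-frame attention matrices and state probabilities saved from the model; this illustrates the averaging only, not the exact analysis script.

```python
# Sketch of per-state attention aggregation: attention matrices are averaged
# over the frames assigned to each metastable state, summed over rows and
# rescaled. Both .npy inputs are hypothetical saved model outputs.
import numpy as np

attention = np.load("attention.npy")          # (n_frames, n_residues, n_residues)
state_probs = np.load("state_probs.npy")      # (n_frames, n_states) soft assignments

labels = state_probs.argmax(axis=1)
for s in range(state_probs.shape[1]):
    frames = labels == s
    mean_att = attention[frames].mean(axis=0)          # (n_residues, n_residues)
    per_residue = mean_att.sum(axis=1)                  # row sums
    per_residue /= per_residue.max()                    # scale to [0, 1]
    print(f"state {s}: top residues", np.argsort(per_residue)[::-1][:5])
```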
One main obstacle for GNNs is that they cannot go deeper than a few layers (3 or 4) and suffer from over-smoothing, in which all node representations tend to become similar to one another. An architecture that enables training deeper networks is the residual or skip connection, as deployed in the ResNet architecture, and it is used here to train a deep neural network.[122] Due to the flexibility of the graph representation of molecules, other physical properties of atoms or amino acids, such as electric charge or hydrophobicity, can be encoded into node or edge features in order to enhance the physical and chemical interpretability of the model. Moreover, hierarchical pooling layers can be applied to learn the dynamics at different resolutions of the molecule.[344] Our GraphVAMPNet is inherently transferable. This means that, theoretically, given a sufficient amount of dynamical data, transfer learning can be leveraged to reduce the number of trajectories needed for studying the dynamics of a particular system of interest.
Time-reversibility and stochasticity of the transition matrix are the two physical constraints needed to obtain a valid symmetric transition matrix for analyses such as transition path theory (TPT). These physical constraints were added to VAMPNet, and the resulting model, called revDMSM, was successfully applied to study the kinetics of disordered proteins. This allowed a valid transition matrix to be obtained, so that rates could be quantified for processes of interest. revDMSM was further extended by including experimental observables in the model as well as a novel hierarchical coarse-graining that gives different levels of detail. These physical constraints can be further added to GraphVAMPNet to obtain a valid and high-resolution transition matrix. In summary, our GraphVAMPNet automates the feature selection in VAMPNet so that features are learned by graph message passing on molecular graphs, which is a general approach for understanding coarse-grained dynamics.

Chapter 4: Molecular dynamics study of membrane proteins

4.1 Critical Sequence hotspots for binding of novel coronavirus to Angiotensin Converter Enzyme as Evaluated by Molecular Simulations

The novel coronavirus (nCOV-2019) outbreak has put the world on edge, causing millions of cases and hundreds of thousands of deaths all around the world as of June 2020, let alone the societal and economic impacts of the crisis. The spike protein of nCOV-2019 resides on the virion's surface, mediating coronavirus entry into host cells by binding its receptor binding domain (RBD) to the host cell surface receptor protein, angiotensin converter enzyme (ACE2). Our goal is to provide a detailed structural mechanism of how nCOV-2019 recognizes and establishes contacts with ACE2, and how this differs from the earlier coronavirus SARS-COV of 2002, via extensive molecular dynamics (MD) simulations. Numerous mutations have been identified in the RBD of nCOV-2019 strains isolated from humans in different parts of the world. In this study, we investigated the effect of these mutations, as well as other Ala-scanning mutations, on the stability of the RBD/ACE2 complex. It is found that most of the naturally occurring mutations to the RBD either slightly strengthen binding or have the same binding affinity to ACE2 as the wild-type nCOV-2019. This means the virus had sufficient binding affinity to its receptor at the beginning of the crisis.
This also has implications for any vaccine design endeavors, since these mutations could act as antibody escape mutants. Furthermore, in-silico Ala-scanning and long-timescale MD simulations highlight the crucial role of the residues at the interface of the RBD and ACE2, which may be used as potential pharmacophores for any drug development endeavors. From an evolutionary perspective, this study also identifies how the virus has evolved from its predecessor SARS-COV and how it could further evolve to become even more infectious.
Taken from the published paper: Ghorbani, M., Brooks, B. R., & Klauda, J. B. (2020). Critical sequence hotspots for binding of novel coronavirus to angiotensin converter enzyme as evaluated by molecular simulations. The Journal of Physical Chemistry B, 124(45), 10034-10047.

4.1.1 Introduction

The novel coronavirus (SARS-COV-2) outbreak emerging from China has become a global pandemic and a major threat to human public health. According to the World Health Organization (WHO), as of August 28th, 2020 (the time this research was done), there had been about 25 million confirmed cases and about 1 million deaths due to coronavirus in the world.[356] Coronaviruses are a family of single-stranded enveloped RNA viruses. Phylogenetic analysis of the coronavirus genome has shown that nCOV-2019 belongs to the beta-coronavirus family, which also includes MERS-COV, SARS-COV and bat-SARS-related coronaviruses.[176, 315] In all coronaviruses, a homotrimeric spike glycoprotein on the virion's envelope mediates coronavirus entry into host cells through a mechanism of receptor binding followed by fusion of the viral and host membranes.[309, 176] The spike protein in coronaviruses contains two functional subunits, S1 and S2. The S1 subunit is responsible for binding to the host cell receptor, and the S2 subunit is responsible for fusion of the viral and host cell membranes.[176, 329] The spike protein in nCOV-2019 exists in a metastable pre-fusion conformation that undergoes a substantial conformational rearrangement to fuse the viral membrane with the host cell membrane.[28, 329] nCOV-2019 is closely related to the bat coronavirus RaTG13, with about 93.1% sequence similarity in the spike protein genome. The sequence similarity of nCOV-2019 and SARS-COV is less than 80% in the spike protein sequence.[353]
Figure 4.1: A) Superposition of the RBD of SARS-COV (yellow) and nCOV-2019 (red). B) Different regions in the binding domain of nCOV-2019 defining the extended loop (non-yellow).
The S1 subunit of the spike protein includes a receptor binding domain (RBD) that recognizes and binds to the host cell receptor. The RBD of nCOV-2019 shares 72.8% sequence similarity with the SARS-COV RBD, and the root mean squared deviation (RMSD) between the two structures is 1.2 Å, which shows their high structural similarity.[315, 28] Experimental binding affinity measurements using surface plasmon resonance (SPR) have shown that the nCOV-2019 spike protein binds its receptor, human angiotensin converter enzyme (ACE2), with 10- to 20-fold higher affinity than SARS-COV binds ACE2.[329] Based on the sequence similarity between the RBDs of nCOV-2019 and SARS-COV, and also the tight binding between the RBD of nCOV-2019 and ACE2, it is most probable that SARS-COV-2 uses this receptor on human cells to gain entry into the body.[176, 309, 329, 307] The spike protein, and specifically the RBD, in coronaviruses has been a major target for therapeutic antibodies.
However, until August 28th, 2020, no monoclonal antibodies targeted to the RBD had been able to efficiently bind and neutralize nCOV-2019.[329, 292]
The core of the nCOV-2019 RBD is a five-stranded antiparallel β-sheet with connected short α-helices and loops (Figure 4.1). The binding interfaces of nCOV-2019 and SARS-COV with ACE2 are very similar, with less than 1.3 Å RMSD. An extended insertion in the core containing short strands, α-helices and loops, called the receptor binding motif (RBM), makes all the contacts with ACE2. In the nCOV-2019 RBD, the RBM forms a concave surface with a ridge loop on one side, and it binds to a convex exposed surface of ACE2. The overlay of the SARS-COV and nCOV-2019 RBD proteins is shown in Figure 4.1A. The binding interface of nCOV-2019 contains loops L1 to L4, short β-strands β5 and β6 and a short helix α5. The location of the RBM in the nCOV-2019 RBD, as well as the different helices, strands and loops, is shown in Figure 4.1B.
The sequence alignment between SARS-COV in humans, SARS-Civet, bat RaTG13 coronavirus and nCOV-2019 in the RBM is shown in Figure 4.2. There is 50% sequence similarity between the RBM of nCOV-2019 and SARS-COV. RBM mutations played an important role in the SARS epidemic in 2002.[175] Two mutations in the RBM of SARS-COV in 2002 relative to SARS-Civet were observed in strains of these viruses: K479N and S487T. These two residues are close to the virus binding hotspots on ACE2, hotspot-31 and hotspot-353. Hotspot-31 centers on the salt bridge between K31 and E35, and hotspot-353 is centered on the salt bridge between K353 and E358 on ACE2. Residues K479 and S487 in SARS-Civet are in close proximity to these hotspots, and mutations at these residues cause SARS to bind ACE2 with significantly higher affinity than SARS-Civet; they played a major role in the civet-to-human and human-to-human transmission of the SARS coronavirus in 2002.[176, 178, 174] Numerous mutations at the interface of the SARS-COV RBD and ACE2 from different strains of SARS isolated from humans in 2002 have been identified, and the effects of these mutations on binding to ACE2 have been investigated by surface plasmon resonance.[173, 333] Two identified RBD mutations (Y442F and L472F) increased the binding affinity of SARS-COV to ACE2, and two mutations (N479K, T487S) decreased the binding affinity. It was demonstrated that these mutations were viral adaptations to either human or civet ACE2.[173, 333] A pseudotyped viral infection assay of the interaction between different spike proteins and ACE2 confirmed the correlation between high-affinity mutants and their high infectivity.[16]
Figure 4.2: Sequence comparison of the receptor binding motif (RBM) in SARS-2002, SARS-Civet, bat RaTG13 and nCOV-2019. The mutations from SARS-2002 to nCOV-2019 are marked in blue. Important mutations in the RBM are marked in yellow. Red shows the 3-residue motif in SARS and Civet and the 4-residue motif in RaTG13 and nCOV-2019.
Further investigation of RBD residues in the binding of SARS-COV and ACE2 was performed through Ala-scanning mutagenesis, which resulted in the identification of residues that reduce the binding affinity to ACE2 upon mutation to alanine.[43] RBD mutations have also been identified in MERS-COV, which affected their affinity to the receptor (DPP4) on human cells.
Multiple monoclonal antibodies have been developed for SARS since 2002 that neutralized the spike glycoprotein on the SARS-COV surface.[65, 291, 76, 246] However, multiple escape mutations exist in the RBD of SARS-COV that affect neutralization by antibodies, which led to the use of a cocktail of antibodies as a more robust treatment.[247] Full genome analysis of nCOV-2019 in different countries and receptor binding surveillance have revealed multiple mutations in the RBD of the glycosylated spike. The GISAID database[91] (www.gisaid.org/) contains genomes of nCOV-2019 deposited by researchers across the world since December 2019. The latest report from the GISAID database in June 2020 showed 25 different variants of the RBD from strains of nCOV-2019 collected from different countries, along with the number of occurrences in these regions, which is listed below for the seven most frequent mutations: 213x N439K (211 Scotland, England, Romania), 65x T478I in England, 30x V483A (26 USA/WA, 2 USA/UN, USA/CT, England), 10x G476S (8 USA/WA, USA/OR, Belgium), 7x S494P (3 USA/MI, England, Spain, India, Sweden), 5x V483F (4x Spain, England), 4x A475V (2 USA/AZ, USA/NY, Australia/NSW). It is not known whether these mutations are linked to the severity of coronavirus in these regions.
Starr and coworkers[282] performed a deep mutational scanning of the nCOV-2019 RBD and used flow cytometry to measure the effect of single mutations on the expression of the folded protein as well as its binding affinity to ACE2. They showed that the RBD is very tolerant of these mutations, maintaining its expression level as well as its binding affinity to ACE2 in most cases. According to their results, most natural mutations exert binding affinities to ACE2 similar to that of wild-type nCOV-2019. Furthermore, they showed that mutations at critical positions at the RBD-ACE2 interface of nCOV-2019, such as residues Q493 and Q498, do not reduce the binding affinity to ACE2, which shows the substantial plasticity of the interface.[282]
Different groups have computationally studied the binding of the nCOV-2019 RBD with ACE2.[317, 190, 281] All these studies point to a higher binding affinity of the nCOV-2019 RBD than the SARS-COV RBD to ACE2. Interestingly, the role of water-mediated interactions has been pointed out as a driving force, which is shown to be similar for both the SARS-COV and nCOV-2019 RBDs.[190] Spinello and coworkers[281] studied the binding of the nCOV-2019 and SARS-COV RBDs to ACE2 and found that the former binds its receptor with about 30 kcal/mol higher affinity than the SARS-COV RBD. Gao et al.[317] used free energy perturbation (FEP) and showed that most amino acid mutations in the RBM from SARS-COV to nCOV-2019 increase the affinity of the RBD for ACE2. The focus of this article is to elucidate the differences between the interfaces of SARS-COV and nCOV-2019 with ACE2, to understand with atomic resolution the interaction mechanism and hotspot residues at the RBD/ACE2 interface using long-timescale molecular dynamics (MD) simulation. An alanine-scanning mutagenesis in the RBM of nCOV-2019 helped to identify the key residues in the interaction, which could be used as potential pharmacophores for future drug development. Furthermore, we performed molecular simulations on the seven most common mutations found from surveillance of RBD mutations: N439K, T478I, V483A, G476S, S494P, V483F and A475V. From an evolutionary perspective, this study shows the residues at which the virus might further evolve to become even more dangerous to human health.
4.1.2 Methods

Sequence comparison and mutant preparation

nCOV-2019 shares 76% sequence similarity with the SARS-2002 spike protein, 73% sequence identity for the RBD and 50% for the RBM.[356] Bat coronavirus RaTG13 seems to be the closest relative of nCOV-2019, sharing about 93% sequence identity in the spike protein.[309] The sequence alignment of SARS-2002, SARS-Civet, bat RaTG13 and SARS-COV-2 is shown in Figure 4.2.[309] To investigate the roles of critical mutations on the complex stability of nCOV-2019 with ACE2, the mCSM-PPI2 webserver [248] was used to find the residues in nCOV-2019 that are at the interface with ACE2. 21 different residues were identified to be in contact with ACE2 and were chosen for further MD simulation. In addition, mutations in the RBD have been observed in the full genome analysis of different nCOV-2019 variants collected from different countries and compiled in the GISAID database.[76] The selected positions are listed below along with their locations in the RBD: K417 (β3), N439 (β4), G446 (L1), G447 (L1), Y449 (L1), Y453 (β5), L455 (β5), F456 (L2), Y473 (β6), A475 (β6), G476 (L3), T478 (L3), V483 (L3), E484 (L3), F486 (L3), N487 (L3), Y489 (β6), Q493 (β5), S494 (β5), G496 (L4), Q498 (L4), T500 (L4), N501 (L4), G502 (L4), Y505 (α5).

Molecular dynamics simulations

The crystal structure of nCOV-2019 in complex with hACE2 (PDB ID: 6M0J)[159] as well as the SARS-COV complex with human ACE2 (PDB ID: 6ACJ)[279] were obtained from the Protein Data Bank. All initial structures were prepared in GROMACS.[2] The TIP3P water model was used for the solvent, and the AMBER Parm99SB-ILDN force field (FF)[181, 23] was used for all protein complexes. Neutralizing ions were added to all systems. It is important to note that none of the RBD/ACE2 complexes studied here were glycosylated. The glycosylation sites of the RBD are far from the binding interface and do not interfere with binding to ACE2. The dynamics of glycans in the spike protein and their effect on shielding are studied in the next section of this chapter. 500 steps of energy minimization were performed using the steepest descent algorithm. In all steps the LINCS algorithm was used to constrain all bonds containing hydrogen atoms. The systems were equilibrated using a velocity-rescaling thermostat to maintain the temperature at 310 K with a 0.1 ps coupling constant in the NVT ensemble under periodic boundary conditions, with harmonic restraints on the backbone and sidechain atoms of the complex.[39] A velocity-rescaling thermostat was used in all steps of the simulation. In the next step, further equilibration was done in the NPT ensemble at a pressure of 1 bar using the Berendsen barostat.[301] During the production run, the harmonic restraints were removed and all systems were simulated in the NPT ensemble, where the pressure was maintained at 1 bar using the Parrinello-Rahman barostat [226] with a compressibility of 4.5×10⁻⁵ bar⁻¹ and a coupling constant of 0.5 ps. The production run lasted 500 ns for the SARS-COV and nCOV-2019 complexes and 300 ns for all the mutants, with a 2 fs timestep and particle-mesh Ewald (PME)[69] for long-range electrostatic interactions, using the GROMACS 2018.3 package.[2] All mutant systems were constructed as described above and run for 300 ns of production. In addition, the simulation time for a few mutants (Y449A, T478I, Y489A and S494P) was extended to 500 ns.
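For readers who wish to reproduce a comparable protocol, the sketch below writes a minimal GROMACS production .mdp file reflecting the settings described above (2 fs timestep, LINCS constraints on hydrogen-containing bonds, PME electrostatics, a velocity-rescaling thermostat at 310 K and a Parrinello-Rahman barostat at 1 bar). This is a hedged approximation rather than the exact input used in this work; the run length, temperature-coupling groups and any omitted output or neighbor-list options are placeholders.

```python
# Sketch of a GROMACS production .mdp reflecting the protocol described above.
# nsteps and other unlisted options are illustrative placeholders.
mdp = """
integrator              = md
dt                      = 0.002        ; 2 fs timestep
nsteps                  = 250000000    ; 500 ns (placeholder)
constraints             = h-bonds      ; LINCS on bonds with hydrogens
constraint-algorithm    = lincs
cutoff-scheme           = Verlet
coulombtype             = PME          ; particle-mesh Ewald electrostatics
tcoupl                  = V-rescale    ; velocity-rescaling thermostat
tc-grps                 = System
tau-t                   = 0.1
ref-t                   = 310
pcoupl                  = Parrinello-Rahman
pcoupltype              = isotropic
tau-p                   = 0.5
ref-p                   = 1.0
compressibility         = 4.5e-5
"""

with open("production.mdp", "w") as handle:
    handle.write(mdp)
```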
Gibbs free energy and correlated motions

The last 400 ns of simulation was used to explore the dominant motions in SARS-COV, nCOV-2019 and the mutants with extended simulations, and the last 200 ns for all other mutants, using principal component analysis (PCA) as part of the quasiharmonic analysis method. For this method, the rotational and translational motions of the RBD of all systems were eliminated by fitting to a reference (crystal) structure. Next, 4,000 snapshots from the last 400 ns of SARS-COV, nCOV-2019 and the mutants with extended simulation time, and 2,000 snapshots from the last 200 ns of all other mutant systems, were taken to generate the covariance matrix of the Cα atoms of the RBD. For the mutant systems with extended production runs, the last 400 ns was used for this analysis. Diagonalization of this matrix resulted in a diagonal matrix of eigenvalues and their corresponding eigenvectors. The first eigenvector, which indicates the first principal component, was used to visualize the dominant global motions of all complexes through porcupine plots.
The principal components were used to calculate and plot the approximate free energy landscape (aFEL). We refer to the free energy landscape produced by this approach as approximate because the ensemble with respect to the first few PCs (the lowest-frequency quasiharmonic modes) is not close to convergence, but the analysis can still provide valuable information and insight. Hydrogen bonds were analyzed in VMD with a donor-acceptor distance cutoff of 3.2 Å and an angle cutoff of 30°. Dynamic cross-correlation maps (DCCM) were obtained using the MD TASK package to identify the correlated motions of RBD residues. In DCCM, the cross-correlation matrix C_ij is obtained from the displacements of backbone Cα atoms over a time interval Δt. The DCCM was constructed using the last 400 ns of SARS-COV, nCOV-2019 and the extended mutant systems and the last 200 ns of all other mutant systems, with a 100 ps time interval.

Binding free energy from the MMPBSA method

The Molecular Mechanics Poisson-Boltzmann Surface Area (MMPBSA) method was used to calculate the binding free energy between the RBD and ACE2 in all complexes.[310, 243] For SARS-COV and nCOV-2019, 200 snapshots from the last 400 ns, and for the mutant systems, 100 snapshots from the last 200 ns of simulation, were used for the calculation of binding free energies with an interval of 2 ns. The simulations of a few mutant systems (Y449A, T478I, Y489A and S494P) were extended, and their binding energies were calculated over the last 400 ns to assess the convergence of the free energies. The binding free energy of a ligand-receptor complex can be calculated as:

\Delta G_{bind,aq} = \Delta H - T\Delta S = \Delta G_{complex} - [\Delta G_{protein} + \Delta G_{ligand}]   (4.1)

\Delta G_{bind,aq} = \Delta E_{MM} + \Delta G_{bind,solv} - T\Delta S   (4.2)

\Delta E_{MM} = \Delta E_{covalent} + \Delta E_{elect} + \Delta E_{VDW}   (4.3)

\Delta G_{bind,solv} = \Delta G_{polar} + \Delta G_{non\text{-}polar}   (4.4)

\Delta G_{non\text{-}polar} = \gamma\,\mathrm{SASA} + b   (4.5)

In the above equations, ΔE_MM is the gas-phase molecular mechanics energy change, which includes covalent, electrostatic and van der Waals energies. Based on previous studies, the entropic change upon binding is neglected in these calculations.[338, 111, 243] ΔG_bind,solv is the solvation free energy, which comprises polar and non-polar components. The polar solvation term is calculated with the Poisson-Boltzmann method, setting the solvent and solute dielectric constants to 80 and 2, respectively. The non-polar free energy is simply estimated from the solvent-accessible surface area (SASA) of the solute using Equation 4.5.
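To make Equations 4.2-4.5 concrete, the sketch below assembles a binding free energy from per-frame component differences (complex minus receptor minus ligand), neglecting the entropic term as stated above. The variable names, the toy input values and the non-polar parameters γ and b are placeholders that depend on the specific MMPBSA implementation; this is not the script used for the results reported here.

```python
import numpy as np

def mmpbsa_binding_energy(e_covalent, e_elec, e_vdw, g_polar, sasa,
                          gamma=0.00542, b=0.92):
    """Assemble the MMPBSA binding free energy (Eqs. 4.2-4.5).

    Inputs are per-frame arrays (kcal/mol; SASA differences in A^2) of the
    complex-minus-receptor-minus-ligand quantities.  gamma and b are
    placeholder non-polar parameters; -T*dS is neglected.
    """
    dE_mm = e_covalent + e_elec + e_vdw          # Eq. 4.3
    dG_nonpolar = gamma * sasa + b               # Eq. 4.5
    dG_solv = g_polar + dG_nonpolar              # Eq. 4.4
    dG_bind = dE_mm + dG_solv                    # Eq. 4.2 with -T*dS = 0
    return dG_bind.mean(), dG_bind.std(ddof=1) / np.sqrt(len(dG_bind))

# Toy usage with 200 hypothetical snapshots (values loosely mimic the
# magnitudes reported in the text, purely for illustration).
n = 200
mean, sem = mmpbsa_binding_energy(
    e_covalent=np.zeros(n),
    e_elec=np.random.normal(-746.6, 30.0, n),
    e_vdw=np.random.normal(-89.9, 5.0, n),
    g_polar=np.random.normal(797.3, 35.0, n),
    sasa=np.random.normal(-1950.0, 50.0, n),
)
print(f"dG_bind = {mean:.2f} +/- {sem:.2f} kcal/mol")
```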
Figure 4.3: Cα RMSD plots for nCOV-2019, SARS-COV and mutants of SARS-COV-2.

4.1.3 Results

Structural dynamics

To compute the RMSD of the systems, the rotational and translational movements were removed by first fitting the Cα atoms of the RBD to the crystal structure and then computing the RMSD with respect to the Cα atoms of the RBD in each system. Figure 4.3 shows the RMSD plots for the RBD of SARS-COV, nCOV-2019 and some of its variants. Comparison of the RMSDs of SARS-COV and nCOV-2019 shows that SARS-COV has a larger RMSD throughout the 500 ns simulation. In nCOV-2019, the RMSD is very stable with a value of 1.5 Å, whereas in SARS-COV the RMSD increases up to 4 Å after 100 ns and then fluctuates between 3 and 4 Å. The change in RMSD of SARS-COV is partially related to the motion of the C-terminus, which is a flexible loop. In most variants of nCOV-2019 the RMSD is very stable during the 300 ns simulation, which shows the tolerance of the interface for mutations, although for some mutations (Y489A, Y505A and N487A) the RMSD slightly increases.
To characterize the dynamic behavior of each amino acid in the RBD, we analyzed the root mean square fluctuation (RMSF) of all systems.
Figure 4.4: RMSF plots for nCOV-2019-WT, SARS-COV, and the Y505A, N487A, G496A and E484A mutants. The red shaded region shows the fluctuations in L1 and the green shaded area shows the fluctuations in L3. The orange shaded region in SARS-COV shows the fluctuations in the C-terminus. For comparison, the RMSF of nCOV-2019 WT is shown in cyan in the other plots.
The RMSF plots for nCOV-2019, SARS-COV and four other mutants are shown in Figure 4.4. nCOV-2019 shows smaller fluctuations than SARS-COV. L3 in nCOV-2019, corresponding to residues 476 to 487 (shown in red in Figure 4.4), has a smaller RMSF (1.5 Å) than the SARS-COV L3 residues 463 to 474. L1 in both nCOV-2019 and SARS-COV (green) has small fluctuations (less than 1.5 Å). Moreover, the C-terminal residues of SARS-COV show high fluctuations (orange). A few mutants show higher fluctuations in L1: mutants Y505A and S494A had an RMSF of 2.5 Å and mutant N487A had an RMSF of about 4 Å in L1, while mutation Y449A had a higher RMSF of about 3 Å in L1. Mutants G496A, E484A and G447A show a high fluctuation of about 4.5 Å in L3.

PCA and approximate free energy landscape

Most of the combined motions were captured by the first ten eigenvectors generated from the last 400 ns for SARS-COV, nCOV-2019 and the extended mutant systems, and the last 200 ns for the other nCOV-2019 mutants. The percentage of the motions captured by the first three eigenvectors was 51% for nCOV-2019 and 68% for SARS-COV. In all mutants, more than 50% of the motions were captured by the first three eigenvectors. The first few PCs describe the largest motions in a protein, which are related to functional motions such as binding or unbinding of the protein from its receptor. The first three eigenvectors were used to calculate the approximate FEL (aFEL) using the last 400 ns of simulation for nCOV-2019 and SARS-COV, shown in Figure 4.5, which displays the variance in conformational motion. SARS-COV showed two distinct low free energy states (shown in blue) separated by a metastable state. There is a clear separation between the two regions by a free energy barrier of about 6-7.5 kcal/mol. These two states correspond to the loop motions in L3 as well as the motion of the C-terminal residues of SARS-COV.
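The approximate free energy landscape used here can be obtained by Boltzmann inversion of a two-dimensional histogram of the leading principal components. The sketch below is a generic version of that procedure; the bin count is illustrative, and the synthetic input arrays stand in for the actual PC projections from the quasiharmonic analysis.

```python
import numpy as np

def approximate_fel(pc1, pc2, temperature=310.0, bins=60):
    """Approximate free energy landscape from two principal components.

    F(pc1, pc2) = -kT ln P(pc1, pc2), shifted so the global minimum is zero.
    Returns the free energy surface in kcal/mol and the bin edges.
    """
    kT = 0.0019872041 * temperature          # Boltzmann constant in kcal/mol/K
    hist, xedges, yedges = np.histogram2d(pc1, pc2, bins=bins, density=True)
    prob = np.where(hist > 0, hist, np.nan)  # mask empty bins
    fel = -kT * np.log(prob)
    return fel - np.nanmin(fel), xedges, yedges

# Toy usage with synthetic projections standing in for the real PCs.
pc1 = np.random.normal(size=20000)
pc2 = np.random.normal(size=20000)
fel, xe, ye = approximate_fel(pc1, pc2)
print(np.nanmax(fel))                        # depth of the sampled landscape
```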
The L3 motion in nCOV-2019 is stabilized by an H-bond between N487 on the RBD and Y83 on ACE2, as well as a π-stacking interaction between F486 and Y83. It is evident that the nCOV-2019 RBD is more stable than the SARS-COV RBD and exists in one conformation, whereas the SARS-COV interface fluctuates and its aFEL is separated into two different regions. The first two eigenvectors were used to calculate and plot the aFEL as a function of the first two principal components using the last 200 ns of the simulation for the mutant systems.
Figure 4.5: Mapping the principal components of the RBD for the aFEL from the last 400 ns of simulations for SARS-COV (top row) and nCOV-2019-WT (bottom row). The color bar is relative to the lowest free energy state.

Dynamic Cross Correlation Maps (DCCM)

The correlated motions of the RBD atoms were analyzed with DCCM based on the Cα atoms of the RBD from the last 400 ns of simulation for nCOV-2019, SARS-COV and the extended mutant systems and the last 200 ns for the other mutants (Figure 4.6). The DCCM for nCOV-2019 showed a correlation of motions between residues 490-505 (containing the β5, L4 and α5 regions) and residues 440-455 (containing the β4, L1 and β5 regions), shown in the red rectangle in Figure 4.6. Another important correlation that appears in the DCCM of nCOV-2019 is between residues 473-481 and residues 482-491. These residues are in the L3 and β6 regions, and their correlation in nCOV-2019 is stronger than in SARS-COV. This is due to the presence of the β6 strand in nCOV-2019, whereas in SARS-COV these residues all belong to L3. This indicates that L3 in nCOV-2019 has evolved from SARS-COV to adopt a new secondary structure, which causes the strong correlation and makes the loop act as a recognition region for binding.
Some of the mutations disrupted the patterns of correlation and anti-correlation in the nCOV-2019 RBD. Mutation N487A showed a stronger correlation in L3 and the β6 strand than the wild-type RBD. For mutation E484A, the correlation in L3 is stronger than in the wild type. It is worth mentioning that mutation F486A disrupts the DCCM of nCOV-2019 by introducing strong correlations in the core region of the RBD as well as in the extended loop region. Residue F486 resides in L3 and plays a crucial role in stabilizing the recognition loop by making a π-stacking interaction with residue Y83 on ACE2.
Figure 4.6: Dynamic cross-correlation maps for nCOV-2019, SARS-COV and mutants, with residue numbers of the RBD domain. The red box shows the correlation between the β5, L4 and α5 regions and the β4, L1 and β5 regions. The blue box shows the correlation between the L3 and β6 regions.

Binding free energies

The binding energetics between ACE2 and the RBD of SARS-COV, nCOV-2019 and all its mutants were investigated by the MMPBSA method.[310] The binding energy was partitioned into its individual components, including electrostatic, van der Waals, polar solvation and solvent accessible surface area (SASA) terms, to identify important factors affecting the interface of the RBD and ACE2 in all complexes. nCOV-2019 has a total binding energy of -50.22±1.93 kcal/mol, whereas SARS-COV has a weaker binding energy of -18.79±1.53 kcal/mol. Decomposition of the binding energy into its components shows that the most striking difference between nCOV-2019 and SARS-COV is the electrostatic contribution, which is -746.69±2.66 kcal/mol for nCOV-2019 and -600.14±7.65 kcal/mol for SARS-COV. This high electrostatic contribution is compensated by a large polar solvation free energy, which is 797.30±3.12 kcal/mol for nCOV-2019 and 659.61±8.98 kcal/mol for SARS-COV.
nCOV-2019 also possesses a larger VDW contribution (-89.93±0.46 kcal/mol) than SARS-COV (-70.07±1.22 kcal/mol). Furthermore, the SASA contribution to binding was -8.30±0.15 kcal/mol for SARS-COV and -10.58 kcal/mol for nCOV-2019. Both hydrophobic and electrostatic interactions play major roles in the higher affinity of the nCOV-2019 RBD than the SARS-COV RBD for ACE2.
The binding free energies for nCOV-2019 and SARS-COV were decomposed into per-residue binding energies to find the residues that contribute strongly to the binding and are responsible for the higher binding affinity of nCOV-2019 than SARS-COV (Figure 4.7). Most of the residues in the RBM of nCOV-2019 had more favorable contributions to the total binding energy than those of SARS-COV. Residues Q498, Y505, N501, Q493 and K417 in the nCOV-2019 RBM each contributed more than 5 kcal/mol to the binding affinity and are crucial for complex formation. A few residues, such as E484 and S494, contributed unfavorably to the total binding energy. Among all the interface residues, K417 had the highest contribution to the total binding energy (-12.34±0.23 kcal/mol). The corresponding residue in SARS-COV, V404, had only a -0.02±0.01 kcal/mol contribution, which points to the importance of this residue for nCOV-2019 binding to ACE2. Residue Q498 contributed -6.72±0.18 kcal/mol, and its corresponding residue in SARS-COV, Y484, contributed only -1.83±0.06 kcal/mol to the total binding. Other important residues, Y505 and N501, have more negative contributions to the total binding energy than their counterparts in SARS-COV, residues Y491 and T487, respectively (Figure 4.7). Residue D480 in SARS-COV contributed unfavorably to the binding energy by 6.2±0.15 kcal/mol, and the corresponding residue in nCOV-2019, S494, lowers this unfavorable contribution to only 1.17±0.06 kcal/mol. The mutation D480A/G appeared to be a dominant mutation in SARS-COV in 2002-2003.[51] This mutation was reported to escape neutralization by antibody 80R.[52] To investigate the effect of this point mutation on the binding of the SARS-COV RBD to ACE2, we performed an additional simulation and calculated the binding affinity for this mutant in the SARS-COV RBD with the same approach used for the other mutations in this study. The D480A mutation showed a binding affinity of -23.46±3.07 kcal/mol, which is about 5 kcal/mol stronger than that of the wild-type SARS-COV RBD. In SARS-COV, residue R426 had the highest contribution to the total binding energy (-6.27±0.22 kcal/mol), although the corresponding residue in nCOV-2019 is N439, with a contribution of -0.32±0.02 kcal/mol. These important mutations in the RBM of nCOV-2019 relative to SARS-COV cause the RBD of nCOV-2019 to bind ACE2 with much stronger (about 30 kcal/mol) affinity.
Figure 4.7: Binding energy decomposition per residue for the RBD of nCOV-2019 and SARS-COV.
The binding free energy decomposition into individual components for all mutants is presented in Table 4.1. In all complexes, a large positive polar solvation free energy disfavors binding and complex formation, and this is compensated by a large negative electrostatic free energy of binding. All variants had similar solvent accessible surface area energies. The VDW free energy of binding ranged from -84.68±0.68 kcal/mol for mutant Q493A to -103.85±0.66 kcal/mol for Y489A. Mutant K417A had the lowest electrostatic contribution to binding (-415.67±5.07 kcal/mol), and mutants N439K and E484A had the highest electrostatic binding contributions, of -989.80±5.6 kcal/mol and -941.20±3.95 kcal/mol, respectively.
Most alanine substitutions exhibited total binding affinities similar to or weaker than that of nCOV-2019; however, a few mutants had a higher binding affinity than the wild type. Mutant Y489A had a total binding energy of -61.78±2.59 kcal/mol, which was about 11 kcal/mol more favorable than the wild-type binding energy. Mutants G446A, G447A and T478I also demonstrated higher total binding affinities than nCOV-2019. Other alanine substitutions had total binding energies similar to or weaker than nCOV-2019. Mutant G502A has the lowest binding affinity among all the mutants, with a binding energy of -24.31±2.98 kcal/mol. Mutants K417A, L455A, T500A and N501A are the other systems with total binding affinities significantly lower than that of the wild-type complex; the electrostatic component of binding contributes the most to the low binding affinities of these mutants. The contributions of RBM residues to binding with ACE2 for nCOV-2019 were mapped onto the RBD structure and are shown in Figure 4.9B.
Most natural mutants exhibited binding affinities similar to wild-type nCOV-2019, with a few exceptions. Mutation T478I, which is one of the most frequent mutations in the GISAID database, has a binding affinity about 6 kcal/mol stronger than the wild type. S494P and A475V showed slightly lower binding affinities than the wild-type complex. The other natural mutants showed binding affinities similar to the wild-type RBD. N439K demonstrated a high electrostatic energy, which is compensated by a large polar solvation energy, and this mutant has a total binding energy of -48.27±3.07 kcal/mol, similar to nCOV-2019.

Important interactions at the RBD-ACE2 interface

The important hydrogen bonds (H-bonds) and salt bridges between the nCOV-2019 RBD or SARS-COV RBD and ACE2 over the last 400 ns of the trajectories are shown in Table 4.2. The nCOV-2019 RBD makes 10 H-bonds and 1 salt bridge with ACE2, whereas SARS-COV makes only 5 H-bonds and 1 salt bridge with ACE2 with more than 30% persistence.
Figure 4.8: Total binding energy of SARS-COV, nCOV-2019 and mutants. Natural mutants are marked with an X at the base of the bar.
Table 4.1: Binding free energy decomposition in kcal/mol for nCOV-2019, SARS-COV and mutants of SARS-COV-2.

System       VDW             Electrostatic     Polar-solv        SASA            Total
SARS-COV     -70.07±1.22     -600.14±7.65      659.61±8.98       -8.39±0.15      -18.79±1.53
SARS-D480A   -88.30±0.69     -897.14±3.8       972.07±3.90       -10.14±0.07     -23.46±3.07
nCOV-2019    -89.93±0.46     -746.59±2.66      797.3±3.12        -10.58±0.05     -50.22±1.93
K417A        -88.23±0.58     -415.67±5.07      484.87±4.89       -10.29±0.09     -29.56±2.95
N439K        -95.4±0.63      -989.84±5.57      1047.70±5.08      -10.86±0.06     -48.27±3.07
G446A        -91.9±0.50      -730.12±3.68      774.7±4.18        -10.6±0.08      -57.79±2.92
G447A        -93.95±0.75     -756.61±4.63      803.48±5.02       -11.08±0.09     -58.37±2.32
Y449A        -97.98±0.56     -717.67±2.71      774.49±3.50       -10.80±0.05     -51.91±2.33
Y453A        -92.3±0.65      -712.76±3.61      765.63±3.98       -10.96±0.07     -49.98±2.92
L455A        -84.96±0.56     -734.72±4.44      795.41±4.06       -10.63±0.08     -33.47±2.93
F456A        -95.43±0.57     -770.12±3.38      832.56±4.14       -11.59±0.07     -44.84±3.38
Y473A        -90.48±0.54     -725.17±4.27      779.23±4.16       -10.61±0.06     -47.23±2.59
A475V        -93.12±0.49     -712.12±4.59      769.23±4.71       -10.68±0.08     -46.07±4.11
G476A        -92.84±0.57     -746.46±4.92      796.97±4.84       -10.99±0.08     -53.57±2.58
G476S        -92.25±0.64     -712.08±4.10      767.40±4.32       -10.77±0.08     -47.38±2.86
T478I        -88.95±0.69     -753.93±4.00      797.03±4.01       -10.33±0.08     -56.06±2.83
V483A        -90.74±0.60     -737.48±3.94      789.22±3.74       -10.76±0.07     -49.55±3.17
V483F        -87.23±0.6      -738.08±4.73      782.82±4.04       -10.59±0.09     -53.70±3.35
E484A        -95.7±0.62      -941.2±3.95       1002.21±4.92      -11.15±0.08     -46.67±2.89
F486A        -90.07±0.66     -724.64±4.39      779.73±5.01       -10.68±0.09     -45.23±3.66
N487A        -102.23±0.72    -724.44±3.69      791.41±4.3        -11.37±0.08     -46.33±2.73
Y489A        -103.85±0.66    -773.72±3.15      827.52±4.36       -11.59±0.05     -61.78±2.59
Q493A        -84.68±0.68     -713.28±3.67      758.56±3.63       -9.87±0.07      -48.19±2.64
S494A        -94.98±0.66     -736.93±4.05      793.94±3.84       -11.02±0.07     -49.09±3.26
S494P        -89.39±0.60     -737.54±5.25      789.35±5.74       -10.67±0.08     -47.90±2.59
G496A        -93.17±0.55     -728.93±4.39      784.81±4.93       -10.89±0.07     -48.38±2.67
Q498A        -90.48±0.61     -756.18±4.6       812.84±4.75       -11.02±0.09     -45.4±2.74
T500A        -93.64±0.62     -704.44±4.1       769.65±4.73       -10.86±0.08     -39.27±3.18
N501A        -88.59±0.66     -730.53±3.63      788.41±4.3        -10.75±0.08     -40.36±3.3
G502A        -87.61±0.63     -706.13±4.56      780.08±4.61       -10.51±0.07     -24.31±2.98
Y505A        -91.35±0.7      -746.12±4.31      802.16±5.29       -10.78±0.08     -46.49±2.92

The evolution of the coronavirus from SARS-COV to nCOV-2019 has reshaped the interfacial hydrogen bonds with ACE2. G502 in nCOV-2019 has a persistent H-bond with residue K353 on ACE2. This residue was G488 in SARS-COV, which also makes an H-bond with K353 on ACE2. Q493 in nCOV-2019 makes one H-bond with E35 and another with K31 on ACE2. This residue was N479 in SARS-COV, which makes only one H-bond, with K31 on ACE2. An important mutation from SARS-COV to nCOV-2019 is residue Q498, which was Y484 in SARS-COV. Q498 makes two H-bonds, with residues D38 and K353 on ACE2, whereas Y484 in SARS-COV does not make any H-bonds. Importantly, a salt bridge between K417 and D30 in the nCOV-2019/ACE2 complex contributes -12.34±0.23 kcal/mol to the total binding energy. This residue is V404 in SARS-COV, which is not able to make any salt bridge and does not make an H-bond with ACE2. Gao et al. used a free energy perturbation (FEP) approach and showed that the V404-to-K417 mutation changes the binding free energy of the nCOV-2019 RBD to ACE2 by -2.2±0.9 kcal/mol. A salt bridge between R426 on the RBD and E329 on ACE2 stabilizes the complex in SARS-COV/ACE2. This residue is N439 in nCOV-2019, which is unable to make a salt bridge with ACE2 residue E329.
One of the most frequently observed mutations in nCOV-2019 according to the GISAID database is N439K, which recovers some of the electrostatic interactions with ACE2 at this position. Y436 in SARS-COV and Y449 in nCOV-2019 both make H-bonds with D38 on ACE2. The unchanged T486 in SARS-COV corresponds to T500 in nCOV-2019, both of which make consistent H-bonds with ACE2 residue D355.

Table 4.2: H-bonds and salt bridges between the nCOV-2019 RBD/ACE2 and SARS-COV RBD/ACE2 complexes with their percent occupancies (the K417-D30 and R426-E329 pairs are salt bridges).

#    nCOV-2019   ACE2    % Occupancy    SARS-COV   ACE2    % Occupancy
1    G502        K353    89             Y436       D38     96
2    Q493        E35     83             R426       E329    87
3    N487        Y83     80             T486       D355    83
4    Q498        D38     73             G488       K353    80
5    K417        D30     55             N479       K31     52
6    T500        D355    53             Y440       H34     47
7    Y505        E37     52
8    Q498        K353    49
9    Y449        D38     45
10   G496        K353    37
11   Q493        K31     32

Hydrophobic interactions also play an important role in stabilizing the RBD/ACE2 complex in nCOV-2019. An important interaction between the nCOV-2019 RBD and ACE2 is the π-stacking interaction between F486 (RBD) and Y83 (ACE2). This interaction helps stabilize L3 in nCOV-2019 compared with SARS-COV, where this residue is L472. It was observed by Gao et al. that the mutation of L472 to F486 in nCOV-2019 results in a net change in binding free energy of -1.2±0.2 kcal/mol. Other interfacial residues in the nCOV-2019 RBD that participate in hydrophobic interactions with ACE2 are L455, F456, Y473, A475 and Y489. It is interesting to note that all these residues except Y489 have mutated from SARS-COV. Spinello and co-workers[281] performed long-timescale (1 μs) simulations of nCOV-2019/ACE2 and SARS-COV/ACE2 and found that L3 in nCOV-2019 is more stable due to the presence of the β6 strand and the existence of two H-bonds in L3 (G485-C488 and Q474-G476). Importantly, an amino acid insertion in L3 makes this loop longer than L3 in SARS-COV and enables it to act as a recognition loop and make more persistent H-bonds with ACE2. L455 in the nCOV-2019 RBD is important for hydrophobic interaction with ACE2, and mutation L455A lowers the VDW contribution to the binding affinity by about 5 kcal/mol. The H-bonds between the RBDs of nCOV-2019 and SARS-COV and ACE2 are shown in Figure 4.9A. The structural details discussed here are in agreement with other structural studies of the nCOV-2019 RBD/ACE2 complex.[315, 290]
H-bond analysis was also performed for the mutant systems, and the results are reported for H-bonds with more than 40% occupancy. A few of the alanine substitutions increase the number of interfacial H-bonds between the nCOV-2019 RBD and ACE2. Interestingly, the Ala substitution Y489A increased the number of H-bonds relative to the wild-type complex. Mutations of some residues that form consistent H-bonds in the wild-type complex, such as Q498A and Q493A, surprisingly maintain the number of H-bonds of the wild-type complex. This indicates the plasticity of the H-bond network in the RBM of nCOV-2019, which can reshape and strengthen other H-bonds upon mutation at these locations. However, a few mutations decrease the number of H-bonds relative to the wild-type complex. Alanine substitution at residue G502 has a significant effect on the network of H-bonds between nCOV-2019 and ACE2. This residue is located at the end of the L4 loop near two other important residues, Q498 and T500, and this mutation breaks the H-bonds at these residues. Mutation K417A decreases the number of H-bonds to only 5, where the H-bond at residue Q498 is broken. This indicates the fragility of the H-bond from residue Q498, which can easily be broken upon Ala substitution at other residues.
Furthermore, mutation N487A also decreases the number of H-bonds by breaking the H-bond at Q498.

4.1.4 Discussion and Conclusion

In this work, we performed MD simulations to unveil the detailed molecular mechanism of receptor binding by nCOV-2019 and to compare it with SARS-COV. The roles of key residues at the interface of nCOV-2019 with ACE2 were investigated by computational Ala-scanning. A rigorous 500 ns MD simulation was performed for nCOV-2019, SARS-COV and a few mutants (Y449A, T478I, Y489A and S494P), as well as 300 ns MD simulations for each of the other mutants. These simulations aided our understanding of the dynamic role of the RBD/ACE2 interface residues and allowed estimation of the binding free energies of these variants, shedding light on residues crucial for RBD/ACE2 complex stability. Moreover, numerous mutations have been identified in the RBD of different nCOV-2019 strains from all over the world that are not known to be critical for infection.[100] The effects of these mutations on the stability of the RBD/ACE2 complex were investigated to shed light on their role in the viral infection of coronavirus.
Figure 4.9: A) H-bonds between the RBD of nCOV-2019 and ACE2. B) Mapping of the contributions of interface residues onto the RBD structure of nCOV-2019. The RBD domain is purple and ACE2 is yellow. The RBD in contact with ACE2 is rendered in surface format, with red indicating a favorable contribution to binding (more negative) and blue an unfavorable one (more positive).
Changes in the RBD structure of nCOV-2019, SARS-COV and the mutants relative to their crystal structures were analyzed by RMSD and RMSF. nCOV-2019 showed a stable structure with an RMSD of 1.5 Å, whereas SARS-COV had a larger RMSD value between 3 and 4 Å during the simulation. Most mutants of nCOV-2019 maintained stability similar to the wild type. However, a few nCOV-2019 mutations resulted in larger deviations (>2 Å), i.e., Y489A, F456A, Y505A, N487A, K417A, Y473A and Y449A. We further investigated the structure of the extended loop domain (Figure 4.1B) and found that in nCOV-2019 it is stable, with an RMSD of less than 1 Å, whereas the extended loop in SARS-COV shows an RMSD of about 3 Å during the simulation. Some mutants showed high RMSD in this region. Alanine substitution at residue N487 increased the extended loop RMSD to 2.5 Å. Other mutations that increased the extended loop RMSD (>2 Å) include Y449A, G477A and E484A.
The dynamic behavior of the RBD was further investigated by analyzing the RMSF of all systems. As shown in Figure 4.4, nCOV-2019 shows less fluctuation in L3 than SARS-COV. This is due to the presence of a 4-residue motif (GQTQ) in nCOV-2019 L3, which forces the loop to adopt a compact structure by making two H-bonds (G485-C488 and Q474-G476), thereby reducing the fluctuations in the loop. Residues F486 and N487 play major roles in stabilizing the recognition loop by making π-stacking and H-bond interactions with residue Y83 on ACE2. Alanine substitution at N487 introduced a large RMSF in L1. Mutation of L472 to F486 in SARS-COV was shown to favor binding by -1.2±0.2 kcal/mol using FEP.[317] In addition, this mutation was shown to be among the five mutations that produce a super-affinity ACE2 binder based on the SARS-COV RBD.[309] Alanine mutations at residues Y449, G447 and E484 increased the motion in L3, characterized by a large RMSF in this region.
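The RMSD/RMSF analysis summarized above can be reproduced with standard trajectory-analysis libraries. The sketch below uses mdtraj with placeholder file names and an illustrative residue range; the analysis in this work was carried out with the GROMACS/VMD toolchain, so this is only an equivalent outline under those assumptions.

```python
import numpy as np
import mdtraj as md

# Placeholder file names; the reference PDB is assumed to share the
# same atom ordering as the trajectory topology.
traj = md.load("rbd_ace2.xtc", top="rbd_ace2.pdb")
ref = md.load("rbd_crystal.pdb")

# C-alpha atoms of the RBD (the residue range is illustrative).
ca = traj.topology.select("name CA and resid 0 to 193")

# Remove rotation/translation by fitting to the crystal structure.
traj.superpose(ref, atom_indices=ca)

# Per-frame RMSD with respect to the crystal structure (mdtraj uses nm; x10 -> Angstrom).
rmsd_A = 10.0 * md.rmsd(traj, ref, atom_indices=ca)

# Per-residue RMSF about the average structure of the aligned trajectory.
xyz = traj.xyz[:, ca, :]                      # (n_frames, n_CA, 3) in nm
mean_xyz = xyz.mean(axis=0)
rmsf_A = 10.0 * np.sqrt(((xyz - mean_xyz) ** 2).sum(axis=-1).mean(axis=0))

print(f"mean RMSD = {rmsd_A.mean():.2f} A, max RMSF = {rmsf_A.max():.2f} A")
```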
Using principal component analysis, the approximate free energy landscapes for nCOV-2019 and SARS-COV demonstrated that the former occupies only one low-energy state, whereas the latter forms two distinct low-energy basins separated by a metastable state with a barrier of about 6-7.5 kcal/mol. This confirms that the binding of the RBD domain is weaker in SARS-COV due to the presence of two basins. Similarly, alanine substitution at a few residues caused the free energy landscape to degenerate into multiple separate low-energy regions.
To better characterize the functional motions of the RBD, DCCMs for all systems were constructed and are shown in Figure 4.6. nCOV-2019 showed a large correlation between the β4-L1-β5 and β5-L4-α5 regions. This correlation was stronger in SARS-COV and in a few mutants such as Y449A, G447A and E484A. Another important correlation in nCOV-2019 is within L3 and β6. This correlation is stronger in nCOV-2019 than in SARS-COV due to the presence of β6, which makes the loop adopt correlated motions. A few mutants, such as N487A, impact the correlation in this region. Interestingly, mutant F486A, which is in L3 and participates in binding through a π-stacking interaction with Y83 on ACE2, disrupts the DCCM of wild-type nCOV-2019 and introduces strong correlations in the extended loop region as well as in the core structure of the RBD.
The details of the hydrogen bond and salt bridge patterns between nCOV-2019 or SARS-COV and ACE2 (Table 4.2) are key to the virus attachment to the host. nCOV-2019 residues participate in 10 H-bonds and 1 salt bridge with ACE2, whereas SARS-COV has only 5 H-bonds and 1 salt bridge with ACE2. This contributes significantly to the 30 kcal/mol difference in the total binding free energies of nCOV-2019 and SARS-COV. The binding energies calculated here for nCOV-2019 and SARS-COV (-50.22±1.93 and -18.79±1.53 kcal/mol, respectively) are in good agreement with the binding energies calculated using the Generalized Born (GB) method by Spinello et al.[281] Moreover, the pattern of H-bonds between nCOV-2019 and ACE2 has also been characterized by other groups,[317, 281] in agreement with our work. An important H-bond between nCOV-2019 and ACE2 is between G502 on the RBD and K353 on ACE2. G502 is in the L4 region, which is populated by 5 H-bonds between the RBD and ACE2. The contribution of this residue to the total binding energy is -2.03±0.04 kcal/mol, and the Ala substitution at G502 has the largest effect on the binding energy among all the residues, lowering the total binding affinity to -24.31±2.98 kcal/mol, the weakest among all mutants. This mutation breaks the other H-bonds in L4, such as those from residues Q498 and T500. This residue is preserved and corresponds to residue G488 in SARS-COV, which also makes an H-bond with residue K353 on ACE2. Residue Q493 in nCOV-2019 participates in binding ACE2 by making two H-bonds, with residues E35 and K31 on ACE2. Q493 corresponds to residue N479 in SARS-COV, which makes only one H-bond, with residue K31 on ACE2. This causes Q493 to have a larger contribution to the total binding than its counterpart N479. However, alanine substitution at Q493 did not affect the total binding energy, and this mutant had a total binding energy similar to the wild-type complex, as it maintains the number of H-bonds of the wild-type complex. Residues Q498 and T500 in nCOV-2019 are crucial for binding by making H-bonds with ACE2 residues D38, D355 and K353. Residue Q498 corresponds to residue Y484 in SARS-COV, which does not make any H-bond in the SARS-COV/ACE2 complex.
Q498 contributes -6.72±0.18 kcal/mol to binding, which is more than the contribution of Y484 in SARS-COV (-1.83±0.06 kcal/mol). Ala substitution at Q498 did not show a large impact on the total binding energy. Residue T500 is conserved and corresponds to residue T486, which also makes an H-bond with D355 on ACE2. Mutation of T500 to alanine lowers the binding affinity by about 10 kcal/mol. Residue N487 in nCOV-2019 is located in L3 and plays a crucial role in stabilizing the recognition loop by making an H-bond with Y83 on ACE2. This residue contributes -1.52±0.06 kcal/mol to the total binding energy of nCOV-2019, whereas its corresponding residue in SARS-COV does not show any contribution to the binding energy (-0.02±0.05 kcal/mol). This demonstrates that L3 has evolved from SARS-COV to become an important recognition loop in nCOV-2019, which participates in binding with ACE2. Residue K417 in nCOV-2019 has the largest contribution to the total binding energy (-12.34±0.23 kcal/mol) by making a salt bridge with residue D30 on ACE2. This residue is crucial for the binding of the RBD and ACE2, and alanine substitution lowers the total binding energy to -29.56±2.95 kcal/mol. This salt bridge is also found to be important for the stability of the crystal structure of the RBD/ACE2 complex in nCOV-2019. K417 is V404 in SARS-COV, which does not participate in binding ACE2.
Figure 4.10: Binding energy decomposition for the nCOV-2019, T478I and N439K systems.
Another important residue in nCOV-2019 is L455, which contributes -1.86±0.03 kcal/mol to binding. This residue is important for hydrophobic interaction with ACE2, and mutating it to alanine lowers the total binding affinity by about 17 kcal/mol. The hydrophobic residue F456 in nCOV-2019 also has a favorable contribution to the binding energy, and F456A lowers the binding affinity by 5 kcal/mol. These results are in fair agreement with experimental binding measurements from deep mutational scanning of the RBD of nCOV-2019, where flow cytometry at different ACE2 concentrations was used to measure the dissociation constant KD.[282] It was shown that mutations at K417, N487, T500 and G502 are detrimental for binding to ACE2, which agrees with the results here. These experiments also showed that mutations at Q493 and Q498 do not impact the binding affinity of the RBD to ACE2, which demonstrates the high plasticity of the network of H-bonds at the interface, where upon mutation at these residues the network can reshape to form new H-bonds. Mutations at the hydrophobic residues L455 and F456 were shown to reduce the binding affinity in these experiments.
Total binding energy calculations for all the variants showed that mutation Y489A has the highest binding affinity among all systems, about 11 kcal/mol stronger than that of the nCOV-2019 complex. This residue is located in β6, which is part of the recognition region of the RBD for binding to ACE2. Removal of this bulky hydrophobic residue at the interface with ACE2 caused the extended loop to move closer to the ACE2 interface and make more H-bonds with ACE2. A high electrostatic interaction energy is the reason for the stronger binding energy of mutant Y489A compared with the wild-type complex. It is interesting to note that among the five residues L455, F456, Y473, A475 and Y489 that make hydrophobic interactions with ACE2, Y489 is the only residue that is conserved from SARS-COV. However, the experimental binding affinity measurements using deep mutational scanning showed that mutations at this position lower the binding affinity to ACE2.
Other alanine substitutions that increase the binding affinity are G446A and G447A. Residues G446 and G447 reside in L1, and mutation to alanine can make L1 adopt a more rigid form. However, experiments showed that these mutations have binding affinities to ACE2 similar to or lower than the wild-type RBD, and care must be taken when interpreting these results. This discrepancy could be due to force field inaccuracy and deficiencies in the PBSA treatment of the solvent in the binding energy calculation. Further studies are needed to investigate whether these mutations increase the binding affinity to ACE2. Deep mutational scanning using flow cytometry is a qualitative method for measuring the impact of a large number of mutations on protein-protein interactions, and further experiments such as surface plasmon resonance (SPR) or isothermal titration calorimetry (ITC), which are conventional methods for measuring binding affinities, are needed to study the effect of these mutations in detail.
Important mutations found in naturally occurring nCOV-2019 appear to influence the binding to ACE2 to some extent. Mutation T478I, which is one of the most frequent mutations according to the GISAID database, increases the binding affinity of nCOV-2019 to ACE2 by about 6 kcal/mol. Mutation N439K has the highest occurrence among all strains of coronavirus in the GISAID database and demonstrated the highest electrostatic interaction among all studied systems. This residue corresponds to R426 in SARS-COV, which makes a salt bridge interaction with E329 on ACE2. Mutation N439K recovers some of this ACE2 interaction; however, it exhibits a binding affinity similar to that of the wild-type RBD. The contributions of important interface residues to the binding affinity were compared for mutations T478I, N439K and wild-type nCOV-2019 (Figure 4.10). The most striking differences between the wild-type RBD and mutation T478I are at residues Y449 and Q498, which have significantly larger contributions to binding than in the wild type. Most other residues at the interface have contributions similar to those in nCOV-2019. A higher H-bond persistence is also seen for these two residues, Y449 and Q498, compared with the wild-type RBD, which is the reason for their larger contribution to the total binding energy. Mutation N439K has a slightly lower binding affinity to ACE2 than the wild-type RBD. Per-residue binding energy decomposition showed that K439 in this system has a favorable contribution of -1.80±0.15 kcal/mol to the total binding energy, which is balanced by a lower contribution of K417, resulting in a binding affinity similar to the wild-type RBD. Mutant E484A, which is also one of the observed mutations in the GISAID database, demonstrates a high electrostatic interaction with ACE2. E484 contributes 3.56±0.15 kcal/mol to binding, whereas the corresponding residue in SARS-COV, P469, contributes -0.27±0.01 kcal/mol to the binding of SARS-COV to ACE2. This residue is close to D30 on ACE2 and has electrostatic repulsion with this residue. Most natural mutants, including N439K, A475V, G476S, V483A, V483F, E484A and S494P, showed binding affinities to ACE2 similar to or slightly lower than the wild-type complex, which agrees with experimental binding measurements.[282] However, the experimental binding affinity for T478I also showed a binding affinity similar to the wild-type complex.
This difference could be due to the use of the MMPBSA approach for the calculation of the polar solvation term, and further studies are needed to examine the effect of this mutation on the viral infectivity of coronavirus.
Additional sequence differences between nCOV-2019 and SARS-COV influence RBD/ACE2 binding. Residue D480 in SARS-COV contributes unfavorably to the total binding energy (6.25±0.14 kcal/mol), and mutating this residue to S494 in nCOV-2019 lowers this unfavorable contribution to 1.17±0.06 kcal/mol. D480 in SARS-COV is located in a region of high negative charge formed by residues E35, E37 and D38 on ACE2. Electrostatic repulsion between D480 on SARS-COV and these acidic residues on ACE2 is the reason for the highly unfavorable contribution of this residue to the binding of SARS-COV to ACE2. Mutation to S494 at this location removes this unfavorable contribution. Gao and coworkers[315] computed the relative binding free energies due to mutations from the RBD-ACE2 of SARS-COV to the corresponding residues in nCOV-2019. They used a free energy perturbation approach and showed that the mutation D480S in SARS-COV changed the binding free energy by -1.9±0.8 kcal/mol, which is consistent with our study. Furthermore, we performed an additional simulation on the D480A mutant in SARS-COV and found that this mutation has a binding affinity of -23.46±3.07 kcal/mol, which is about 5 kcal/mol stronger than the wild-type SARS-COV RBD. In addition, experimental binding affinity measurements showed that mutations of S494 to an acidic residue greatly reduce the binding affinity to ACE2, which supports this hypothesis.
Previous computational studies have found that nCOV-2019 binds to ACE2 with a total binding affinity about 30 kcal/mol stronger than SARS-COV, in fair agreement with the results here. The critical roles of interface residues were computationally investigated here and in other articles, and the results of all these studies indicate the importance of these residues for the stability of the complex and identify hotspot residues for the interaction with the receptor ACE2.[317, 281, 35] It is interesting to note the role of L3 in the stability of the RBD/ACE2 complex. The amino acid insertions in L3 in nCOV-2019 have converted an unessential part of the RBD in SARS-COV into a functional domain of the RBD. This loop participates in binding ACE2 by making H-bonds as well as a π-stacking interaction with ACE2, which makes this region act as a recognition loop. Previous studies on SARS-COV have shown that there is a correlation between higher binding affinity to the receptor and a higher infection rate by coronavirus.[309, 178] The higher binding affinity of nCOV-2019 for ACE2 compared with SARS-COV is suggested to be the reason for its higher infection rate. Most natural mutations showed binding affinities similar to the wild type, which indicates that the virus was already effective at binding ACE2 at the beginning of the crisis. A few mutations, such as Y489A and T478I, are shown to increase the binding affinity to ACE2; however, more studies are needed to investigate the effect of these mutations in detail. Mutations of the nCOV-2019 RBD that do not change the binding affinity and complex stability could have implications for antibody design, since they could act as antibody escape mutants. Escape from monoclonal antibodies was observed for mutations of SARS-COV in 2002, and such mutations should be considered in any antibody design endeavors.
175 4.2 Exploring dynamics and network analysis of spike glycoprotein of SARS-COV-2 2The ongoing pandemic caused by coronavirus SARS-COV-2 continues to rage with devastating consequences on human health and global economy. The spike glycoprotein on the surface of coronavirus mediates its entry into host cells and is the target of all cur- rent antibody design efforts to neutralize the virus. The glycan shield of the spike helps the virus to evade the human immune response by providing a thick sugar-coated barrier against any antibody. To study the dynamic motion of glycans in the spike protein, we performed microsecond-long MD simulation in two different states that correspond to the receptor binding domain in open or closed conformations. Analysis of this microsecond- long simulation revealed a scissoring motion on the N-terminal domain of neighboring monomers in the spike trimer. Role of multiple glycans in shielding of spike protein in different regions were uncovered by a network analysis, where the high betweenness cen- trality of glycans at the apex revealed their importance and function in the glycan shield. Microdomains of glycans were identified featuring a high degree of intra-communication in these microdomains. An antibody overlap analysis revealed the glycan microdomains as well as individual glycans that inhibit access to the antibody epitopes on the spike protein. Overall, the results of this study provide detailed understanding of the spike glycan shield, which may be utilized for therapeutic efforts against this crisis. 4.2.1 Introduction Severe acute respiratory syndrome coronavirus 2 (SARS-COV-2) has rapidly spread worldwide since early 2020 and has been considered one of the most challenging global 2Taken from published paper: Ghorbani, M., Brooks, B. R., Klauda, J. B. (2021). Exploring dynamics and network analysis of spike glycoprotein of SARS-COV-2. Biophysical Journal, 120(14), 2902-2913. 176 health crises within the century. SARS-COV-2 has caused more than 100 million cases and more than 2 million deaths worldwide as of February 2021.[353] Drug and vaccine development are underway and multiple vaccines have entered clinical trials and some of them are in the last stages of development.[82, 184, 224] SARS-COV-2 is a lipid-enveloped single stranded RNA virus belonging to the beta- coronavirus family, which also includes MERS, SARS and bat related coronaviruses.[315, 309, 45, 68] A major characteristic of all coronaviruses is the spike protein (S), which pro- trudes outward from the viral membrane and plays a key role in the entry of the pathogen into the host cells by binding to human angiotensin converting enzyme-2 (h-ACE2).[315, 329, 337] The structure of each monomer of the trimeric spike protein in SARS-COV- 2 (Figure 4.11) can be divided into two subunits (S1 and S2), which can be cleaved at residues 685-686 (furin cleavage site) by TMPRSS protease after binding to host cell re- ceptor h-ACE2.[73] S1 subunit in S trimer includes a N-terminal domain (NTD) and a receptor binding domain (RBD) that is responsible for binding to h-ACE2.[307] S2 subunit contains fusion peptides (FP), heptad repeats (HR1 and HR2), a transmembrane (TM) and a cytoplasmic domain (CP). 
The homotrimeric spike protein is highly glycosylated with 22 predicted N-linked glycosylation and 4 O glycosylation sites per monomer, most of which are confirmed by Cryo-EM studies.[328, 271, 350] Glycosylation of proteins plays a cru- cial role in numerous biological process such as protein folding and evasion of immune response.[242] The spike protein in SARS-COV-2 is highly glycosylated with 22 N-linked glycosyla- tion sites and 4 O-linked glycosylation sites.[322, 323] For example, the HIV-1 envelope glycoprotein (Env) features about 93 N-linked glycosylation sites with mostly high man- nose glycans (Man5-9), which covers most of the surface of the spike protein in HIV-1 and comprises over half its mass.[284, 355, 66] N-linked glycosylation starts with synthesis of precursor oligosaccharides, which are modified to high mannose forms by glucosidases 177 Figure 4.11: Structure of spike protein and its glycosylation pattern A) Different regions of spike protein including N-terminal domain (NTD), receptor binding domain (RBD), Furin cleavage site for cleaving between S1 and S2 subdomains (FS), Fusion peptides (FP), Heptad repeat (HR2), transmembrane (TM) and cytoplasmic (CP) regions. The spike protein is divided into a head and a stalk region B) Glycans on the spike protein color-coded based on their types C) Sequence of full-length spike protein with domain assignments. and then trimmed to complex forms in the Golgi by glucosyltransferases for signaling and other glycobiological functions.[164] A higher degree of processing is usually indicative of exposure or accessibility of glycans to enzymes.[284] Dense crowding glycan regions limit the activity of processing enzymes at these locations. In this study, we report the microsecond long MD simulation of all-atom solvated fully glycosylated spike protein embedded in a viral membrane model in both RBD-up or open state (PDB:6VSB)[329] and RBD-down (PDB:6VXX)[307] or closed state. The struc- ture of the two states for glycosylated S protein in viral membrane were taken from the 178 CHARMM-GUI website for this study.[328] Details of modeling different regions were presented in detail by Im and coworkers.[328] Structural changes in the spike protein hap- pens on the order of microsecond timescale, where a scissoring motion is observed between the NTDs in the RBD-up conformation. We have used network analysis and centrality measures in graph theory to pinpoint the structural features of the glycan shield in the con- text of glycan-glycan interactions as well as binding of antibodies to the spike protein. A modularity algorithm helped us to find glycan microdomains featuring high glycan-glycan interactions with breaches between microdomains for antibodies to bind and neutralize the virus. 4.2.2 Methods Molecular dynamcis simulations Structures for glycosylated spike protein in the viral membrane with RBD-up and RBD- down states were taken from the CHARMM-GUI website. Im et. al.[328] provided 8 dif- ferent structures of spike protein in open and closed conformation where two models were built for heptad repeat linker (HR2), two models for the HR2-TM domain and two mod- els for the cytoplasmic (CP) region. We used model 1 2 1 in which the missing loops for RBD were built by template-based modeling and N-terminal loop was constructed based on electron density map. 
Ab initio monomer structure prediction and ab initio trimer docking were used for the HR2 linker domain using PDB:5SZS[308] from SARS-COV-1 where two models were built (model 1 used here). Two models were constructed for HR2-TM junc- tion using PDB:5JYN[81] as template, a model using a template structure (model 2 used here) and another model with more structural differences. Finally, the Cys-rich CP domain was constructed using PDB:5L5K[155] as a template. Moreover, the palmitoylation sites for this model are at residues C1236 and C1241.CHARMM36 forcefield[23, 153, 116] was 179 used for protein, lipids and carbohydrates in this study. The glycan composition for each site in the spike protein represents the most abundant based on experimental mass spec- troscopy data. The selected glycan sequences include 22 N-linked and 1 O-linked for each monomer of trimeric spike.[321, 271] GROMACS[301] software was used for molecular dynamics (MD) simulation. Energy minimization was performed in 5000 steps using the steepest descent algorithm. A LINCS algorithm in all steps constrained the bonds con- taining hydrogen atoms. Equilibration was performed with a standard 6-step equilibration scripts from CHARMM-GUI with restraints on protein, lipid and glycan atoms.[143] For the first three steps of equilibration, each step included 125 ps using Berendsen thermostat at temperature 310.15 K and a coupling constant of 1 ps. In the last 4 equilibration steps a Berendsen barostat was used to maintain the pressure at 1 bar. For the production step, all restraints were removed, and the system was simulated under a NPT ensemble using the Parrinello-Rahman barostat[226] with a compressibility of 4.510?5 bar?1 and a coupling constant of 5 ps. The temperature was maintained at 310 K using a Nose?-Hoover ther- mostat with a temperature coupling constant of 1 ps.[96] The production run lasted 1s for each system (RBD-up and RBD-down) with a 2 fs timestep and the particle-mesh Ewald (PME)[69] for long range electrostatic interactions using GROMACS 2018.3 package.[2] Solvent accessible surface area (SASA) SASA was calculated using VMD[135] with a probe radius of 7.2 A? which represents the hypervariable region of antibodies.[41] Multiple regions were chosen for SASA calcu- lation: RBM (residues 440 to 508). RBD (residues of RBD away from RBM 330-440 and 509-520) and NTD (residues 13 to 310) 180 Network analysis In the glycan network, each glycan is represented with a node (69 nodes for 69 gly- cans). To assign edges between nodes in the graph we first calculated the distance between heavy atoms of different glycans in the starting structure and if the distance between two glycan heavy atoms is less than 50 A?, an edge is assigned between the two nodes. Next to incorporate simulation data and dynamics of glycans into the network, we calculated the absolute value of average non-bonded interaction energy between every two glycans and normalized the values to be between 0 and 1 and assigned the as the edge weights. To sim- plify the network, we removed the edges with weights less than 0.05. This further reduced the number of edges form the starting graph. These normalized interaction energies repre- sent the adjacency matrix for the graph from which different centrality measurements can be made. Two different centrality measurements were used to analyze the network of gly- cans. 
Betweenness centrality quantifies the number of times a node acts as a bridge along the shortest path connecting every two nodes in the graph and is an important indicator of the influence of the node within the network. Eigenvector centrality measures the node's importance in the network by considering the importance of the neighbors of that node. If the node is connected to many other nodes that are themselves well connected, that node is assigned a high eigenvector centrality score. For a graph with adjacency matrix A, the relative centrality score of node i is defined as:

x_i = \frac{1}{\lambda} \sum_{j \in M(i)} x_j \quad (4.6)

The sum is over all j such that nodes i and j are connected. To write this in matrix form, let x be the vector containing the centrality scores and A be the adjacency matrix. Then we can write:

A x = \lambda x \quad (4.7)

With the constraint that all components of the eigenvector x be positive, there is only one eigenvalue that satisfies the equation, and therefore a unique centrality score is assigned to each node in the graph. A modularity algorithm in Gephi was used to find glycan microdomains that have a high number of edges. All network analysis was performed in networkx and graph visualizations were done using Gephi.[25, 117, 12]

Glycan-antibody overlap analysis

Antibodies that bind and neutralize the spike protein are divided into three categories:[185] antibodies that bind to the exposed part of the RBD (RBM-binders), antibodies that bind to epitopes away from the RBM in the RBD (RBD-binders), and antibodies that bind to epitopes on the NTD (NTD-binders). Three different antibodies were used in this study: B38 as RBM-binder (PDB:7BZ5)[334], S309 as RBD-binder (PDB:6WPT for the open and PDB:6WPS for the closed state)[233] and the 4A8 antibody as NTD-binder (PDB:7C2L).[55]

4.2.3 Results

Dynamical motions of the spike protein

The root-mean-square deviation (RMSD) for each region of the spike protein in the RBD-up and RBD-down states is represented in Figure 4.12. The stalk region in both RBD-up and RBD-down states shows a fluctuating RMSD, which is due to a bending motion in this region of the spike protein. The tilting of the spike head is also observed in high-resolution cryo-ET images as well as other recent MD studies of the glycosylated spike in a viral membrane.[275, 298] The bending dynamics in the stalk region is suggested to assist the virus in scanning the cell surface for receptor proteins more efficiently.[298] The angle distributions for tilting of the head and stalk domains were calculated from the total 2 μs of simulation. The head region had an angle distribution of 18±10° and the stalk region angle distribution was 34.5±6°. The head of the spike protein was shown to bend up to 90° toward the membrane; however, observing larger tilting angles requires more sampling of the spike protein in the viral membrane.[340] A snapshot of the open system at 1 μs is represented in Figure 4.13, which shows the incline between the head and the stalk domains of the spike. Glycan root-mean-square fluctuations (RMSF) were calculated for the heavy atoms in each glycan, followed by computing the mean and standard deviation for each glycan (Figure 4.14). Consistently, in all chains, a few of the NTD glycans such as N74 show high RMSF, which is due to the high solvent exposure of this glycan in the NTD compared to other regions. The glycans near the RBD in chain A (RBD-up), such as N234 and T323, showed less fluctuation than in other chains. N234 and T323 are sandwiched between the RBD and NTD of neighboring monomers in the trimeric spike protein.
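To make the glycan-network construction and centrality measures described in the Methods above concrete, here is a minimal, hedged Python sketch using networkx. The interaction-energy and distance matrices are hypothetical placeholders, and the convention of converting weights to inverse "lengths" for betweenness is an assumption rather than the exact pipeline used in this work.

```python
import numpy as np
import networkx as nx

# Hypothetical inputs: 69 glycan labels, a symmetric matrix of absolute average
# non-bonded interaction energies, and minimum heavy-atom distances (Angstrom)
# between glycan pairs in the starting structure.
glycans = [f"G{i}" for i in range(69)]
energies = np.abs(np.load("glycan_energies.npy"))   # shape (69, 69), placeholder
min_dist = np.load("glycan_min_dist.npy")           # shape (69, 69), placeholder

weights = energies / energies.max()                 # normalize edge weights to [0, 1]

G = nx.Graph()
G.add_nodes_from(glycans)
for i in range(len(glycans)):
    for j in range(i + 1, len(glycans)):
        # Edge only if glycans are within 50 A in the starting structure
        # and the normalized interaction survives the 0.05 pruning threshold.
        if min_dist[i, j] < 50.0 and weights[i, j] >= 0.05:
            G.add_edge(glycans[i], glycans[j], weight=float(weights[i, j]))

# Eigenvector centrality treats the weight directly as connection strength.
eig = nx.eigenvector_centrality(G, weight="weight", max_iter=1000)

# Betweenness centrality interprets the weight as a distance, so strongly
# interacting pairs are given a short "length" (one common convention).
for u, v, d in G.edges(data=True):
    d["length"] = 1.0 / d["weight"]
btw = nx.betweenness_centrality(G, weight="length")

print(sorted(btw, key=btw.get, reverse=True)[:5])   # most "bridge-like" glycans
```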
Glycans in the stalk region showed high RMSF values demon- strating the effective shielding of the spike in this region for both RBD-up and RBD-down states. A principal component analysis (PCA) was performed on the head region (residues 1- 1140) of both RBD-up and RBD-down states to extract the fundamental motions of the trimeric protein. The first two PCs are visualized in 2D plot (Figure 4.15A and B). Both RBD-down and -up states feature similar behaviors in their PCA plot. In RBD up, PC-1 captures 56% of conformational motion whereas in RBD-down, PC-1 captures only 44% of all conformational motion. This shows the higher conformational change in RBD-up state compared to RBD-down state. The first eigenvector was used to construct the porcupine plots to visualize the most dominant motions in RBD up (Figure 4.16A) and RBD-down (Figure 4.16C) states. In the RBD-up state, a scissoring motion is observed between the 183 Figure 4.12: RMSD of different regions of the spike protein for A)open and B)closed conformation Figure 4.13: A) snapshot of spike in open state after 1000ns. Different monomers of the spike trimer are color coded with monomer A(up) in red, B in blue and C in purple B) RMSF of glycans in the open state of spike for different chains A to C from top to bottom. 184 Figure 4.14: RMSF of glycans in the open and closed state of spike ofr different chains A to C from top to bottom NTD of chain A and NTD of chain B. Based on the PCA for RBD-up state, the total simulation time was separated into three clusters; cluster-1:0-200ns, cluster-2: 200-600ns and cluster-3: 600-1000ns. The distribution of distance between center of mass of NTD of different chains are calculated for the three different clusters and shown in Figure 4.15B. In the open state, NTD of chain B goes toward the center of apex. The distribution of distance between the center of apex and each NTD is shown in Figure 4.16C. In the RBD-down state, PC-1 shows the motion in RBD of chain A toward the open conformation (Figure 4.16B). The simulation for RBD-down states was separated into two clusters and distribution of the distance between center of mass of RBD and the apex center was calculated for all the chains in both clusters. A clear separation is observed where RBD of chain A is more distant from the apex center in the second cluster (Figure 4.15D). 185 Figure 4.15: A) 2-dimensional PCA for the open state of spike protein head B) distribution of dis- tance between the center of masses of NTDs on different monomers in the spike trimer for clusters of simulation data 0-200ns, 200-600ns and 600-1000ns C) 2D PCA for the closed spike head D) distribution of distance between the center of mass of RBD of different monomer and the center of apex for 0-500ns and 500-1000ns of simulation 186 Figure 4.16: A) porcupine plot of first principal component (PC1) for RBD-up state B) distribution of distance between center of mass of NTD of each monomer in open state from the center of apex. From top to bottom corresponds to 0-200 ns, 200-600ns and 600-1000ns of trajectory C) porcupine plot of (PC1) for the closed state. 187 Occupancy of spike protein by glycans Despite the highly dense glycan shield in the spike protein, there are breaches within the shield that antibodies can bind and neutralize the virus.[41] A volume map of glycans in the spike protein in both up and down conformations are shown in Figure 4.17, where isosurfaces were visualized for glycans from the 1 ?s MD simulation trajectory. 
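As an illustration of the trajectory PCA described above (Figures 4.15 and 4.16), the following minimal sketch projects aligned Cα coordinates of the spike head onto the first two principal components. The file names, residue selection syntax, and the use of mdtraj plus scikit-learn are assumptions for illustration only, not the analysis scripts used in this work.

```python
import mdtraj as md
from sklearn.decomposition import PCA

# Load a trajectory of the spike head and keep only C-alpha atoms
# (head region, residues 1-1140 in the text); file names are placeholders.
traj = md.load("spike_head.xtc", top="spike_head.pdb")
ca = traj.topology.select("name CA and resid 0 to 1139")
traj = traj.atom_slice(ca)

# Remove overall rotation/translation by superposing on the first frame,
# then flatten coordinates into a (n_frames, 3*n_atoms) matrix.
traj.superpose(traj, frame=0)
X = traj.xyz.reshape(traj.n_frames, -1)

pca = PCA(n_components=2)
proj = pca.fit_transform(X)

# Fraction of conformational variance captured by PC-1 and PC-2
# (compare with the ~56% / ~44% PC-1 values quoted for the up/down states).
print("explained variance ratio:", pca.explained_variance_ratio_)
print("projection shape:", proj.shape)   # one (PC1, PC2) point per frame
```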
The stalk region in the spike protein is highly shielded by the glycans and it is unlikely for antibodies to bind to this region. The spike head on the other hand, shows glycan holes providing opportunities for antibodies to bind. Importantly, NTD of all chains show epitopes free from glycans. The RBM of chain A in the up conformation is completely free from glycans presents the least shielded domain of spike. Furthermore, there are regions on the RBD away from the RBM, that show epitopes for antibodies and not shielded by glycans. We have computed the solvent accessible surface area (SASA) of different regions of the spike for antibodies with a probe radius of 7.2 A? that represents the hypervariable domains of antibodies.[41] RBM of chain A in up conformation shows the highest SASA in all chains followed by RBM of chains B and C. On the other hand, RBD of chain A shows the lowest SASA among the three monomers. NTD of all chains show high SASA in all chains. In the closed state, RBM and NTD of chain A shows higher SASA than other chains which is due its conformational change toward the open state. In summary, epitopes on RBM, RBD and NTD of spike protein show high SASA for antibodies to bind and neutralize the spike. These findings are consistent with observations from Amaro et al. In their study RBM of the open conformation showed higher SASA than closed using different probe sizes. Moreover, using a probe radius of 7.2 A? they showed that RBD has a lower SASA than RBM and NTD which is consistent with the findings in our study. 188 Figure 4.17: A) Glycan occupacy (grey surface) of different regions in the spike head for RBD up state. Monomer A (up) shown as red, monomer B as blue and C as cyan. Glycan-Glycan interaction and network analysis of glycans A network analysis was carried out on the glycan shield of the spike protein in both RBD-up and RBD-down states to find the glycans that are most important for an effec- tive shield. This approach has recently been applied to the glycan shield of HIV-1 spike protein.[170, 44, 20] In the glycan network, each glycan represents a node in the graph and two nodes are connected by an edge if the glycans have a distance less than 50 A? in the starting structure. Edges are weighted by the normalized absolute value of non-bonded interaction energy (vdw+electrostatic) between each pair of glycans. A network was built for the whole simulation time in both RBD-up and -down states. All network analysis was performed in networkx and graph visualizations were done using Gephi. The adjacency matrix for these two networks are calculated. Two centrality measurements in graph the- ory were utilized to analyze the network of glycans. Betweenness centrality quantifies the 189 Figure 4.18: Centrality measurements (eigenvector and betweenness) for A) RBD-up and B) RBD- down states. number of times a node acts as a bridge along the shortest path connecting every two nodes in the graph and is an important indicator of the influence of the node within the network. Eigenvector centrality measures the node?s importance in the network by considering the importance of the neighbors of that node. If the node is connected to many other nodes that are themselves well connected, that node is assigned a high eigenvector centrality score (Figure 4.18). Glycans in the stalk region show high eigenvector centrality in both RBD-up and down states. 
This means that glycan-glycan interactions in the stalk region is strong and glycans in this region are well connected which result in effective shielding of the stalk against antibodies. In contrast, connections in the spike head and specially at the apex are sparser as the eigenvector centralities are small for this region. Two apex glycans (N234A and N165B) near RBD of chain A (up) in the open state, show a high betweenness centrality (BC). These two glycans show a low BC in the closed state. Glycan N616B in RBD-up and N603B in RBD-down show highest BC in their corresponding network which shows the great impact of this glycan in the proper shielding of the spike protein. Glycans N603 and 190 N616 connect lower head with upper head glycans and are highly central in the network. Glycans in head region are well separated from glycans that shield the stalk of spike protein. Consequently, we performed network analysis of glycans in the head region of spike protein for the total simulation time. Centrality measurements for open and closed conformations of spike head are shown in Figures 4.19A and B. A modularity maximiza- tion algorithm is used, which resulted in identifying 5 different glycan microdomains for RBD-up (Figure 4.20) and 4 glycan microdomains in RBD-down states (Figure 4.20B). Microdomains feature a high glycan-glycan interaction among them and lower number of edges between different microdomains.[170] The higher number of microdomains in RBD- up state shows that the spike protein is more vulnerable when RBD is in up conformation. Glycans in the lower head all belong to the same microdomain (Cyan-I) in both RBD- up and RBD-down states. This demonstrates the effective shielding of the lower head by glycans regardless of the RBD conformation as all these glycans belong to the same mi- crodomain. Glycan microdomains in RBD-up and -down conformations were mapped onto the spike head to visualize these clusters on the protein (figure 4.20). Overall connectiv- ity is lower near the RBD as this region is divided into three different microdomains in RBD-down and four microdomains in RBD-up state. In RBD-down state, most glycans have similar BC and only one glycan N603B shows a relatively higher BC in the network. This is a high mannose glycan, which connects the upper head with the lower head region in the glycan network and is crucial for effective shielding. In both RBD-up and -down states, glycans at the lower head also showed high eigenvector centrality, indicating the ef- fective shield of this region. When RBD is in up conformation, glycans near RBD of chain A (up) can interact with glycans from NTD of chain B. As a result, when RBD is open, glycans N234A, T323A, N331A and N343A all belong to the microdomain that comprises glycans of chain B (shown as Green). This leads to encompassing the RBD of chain (A) in RBD-up state by the same microdomain (Green-III), which enhances the shielding of RBD 191 Figure 4.19: A) network centralities for the head region in open state of spike and the total simula- tion time B) network centralities for the head in closed state and total simulation time. C) change in betweenness centrality of open state between first (0-200 ns) and second (200-600 ns) BC(1-2) and second and third (600 ? 1000 ns) BC(2-3). D) change in betweenness centrality of closed state from 0-500 ns to 500-1000 ns. 192 region away from the RBM. This is also demonstrated by the lower SASA of the region of RBD of chain A away from the binding interface with ACE2 (Figure 4.17B). 
However, in RBD-down state, the mentioned glycans of chain A are distant from glycans of chain B and therefore they belong to the microdomain that includes other glycans from chain A (shown as Orange-II). Furthermore, glycans N234A and N165B show high BC in RBD-up state, which is due to their interaction with other glycans from the space left open from RBD of chain A. Glycan N616, which is a fucosylated complex glycan, also shows a high BC in RBD-up state. Interestingly, most of the glycans in the lower head region are oligo- mannose, whereas the glycans at the upper head region (apex) are mostly complex glycans. Complex glycans have a higher degree of processing by glycan processing enzymes. This correlates with the higher number of microdomains in upper head region where connec- tions between glycans are sparser than the lower head region. Glycan sparsity was shown before to correlate with the degree of processing. Dynamic motions of the spike protein can affect the patterns of glycan-glycan interac- tions by bringing glycans of different domains in closer proximity. To study whether the conformational motions identified by PCA are coupled with any changes in centrality of glycans, we performed network analysis on the clusters of simulation found from PCA ( three cluster for open and two for closed conformation). BCs were calculated for glycans in these clusters and changes in betweenness centrality ?BC were measured between these networks of glycans built on different clusters (Figure 4.19C and D). For the open state simulation, the scissoring motion in NTD is coupled with increasing the BC of N234A, N603A and N165B. As the NTD of chains A and B come closer together glycans in these two chains make stronger connections especially near RBD of chain A where due to the open state, glycan N234A, which is inserted in the space left open by RBD in open con- formation, can freely interact with glycans of chain B. N603 is a high mannose glycan in the middle regions of spike and the NTD scissoring motion grants a higher BC for this 193 Figure 4.20: A) microdomains in the open state of spike head with each microdomain color-coded. Glycans are connected through and the thickness of the edge shows the edge weight. B) mi- crodomains in the closed state of spike head C) microdomains color coded on the spike protein in open state D) Microdomains in the closed state of spike head 194 glycan in RBD-up state. In the RBD-down state, the motion in RBD of chain A to the up conformation increases the BC of glycans such as T323A and N657A and does not affect the BC of most other glycans. Glycan T323A is located at the tail of RBD and N657A is in close proximity of RBD of chain A in the middle head region. The conformational change in RBD brings these glycans closer resulting in a more compact network of glycans in the middle head region and higher BC of these glycans and other neighboring glycans in chain A (Figure 4.19D). Antibody overlap analysis Neutralizing antibodies look for breaches in the glycan shield, where the glycan densi- ties are sparse.[170] Within a microdomain, glycans are highly connected by a high number of edges in the network and most antibodies bind the regions between these microdomains as comparatively sparse edges connect different microdomains.[284] Therefore, these mi- crodomains help identify susceptible regions of spike protein for immunological studies. 
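The microdomain identification described above relies on modularity maximization (performed in Gephi in this work). A roughly equivalent, hedged sketch using the Louvain implementation in networkx is given below; the weighted glycan graph is assumed to have been built as in the Methods, the edge-list file is a placeholder, and the resulting partition is only expected to be comparable, not identical, to the Gephi result.

```python
import networkx as nx

# Weighted glycan graph of the spike head, with normalized interaction
# energies stored in the 'weight' edge attribute (placeholder file; see the
# network construction sketch earlier in this chapter).
G = nx.read_weighted_edgelist("spike_head_glycans.edgelist")

# Louvain community detection maximizes modularity, analogous to the
# modularity algorithm used in Gephi (requires networkx >= 2.8).
communities = nx.community.louvain_communities(G, weight="weight", seed=0)

for k, domain in enumerate(communities, start=1):
    print(f"microdomain {k}: {sorted(domain)}")

# Modularity score of the partition, useful when comparing the RBD-up and
# RBD-down networks.
Q = nx.community.modularity(G, communities, weight="weight")
print("modularity Q =", round(Q, 3))
```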
Antibodies for spike protein were divided into three categories: RBM-binder, RBD-binder (region of RBD away from RBM) and NTD-binder. The antibodies chosen for each cate- gory are presented in the methods section. To investigate the relation between binding of antibodies to known epitopes of the spike protein and the identified glycan microdomains, we utilized an antibody overlap analysis.[284] Antibodies were first overlaid with the spike protein by fitting to their corresponding region in the spike protein. Next, we calculated the average number of clashes between each overlaid antibody and the glycan heavy atoms in each microdomain with a cutoff distance of 5 A? during 1 s simulation. Results of this analysis are shown in Figure 4.21A for RBD-up and 4.21B for RBD-down states. The RBM-binder antibody had the lowest number of clashes among all antibodies with microdomains. RBD-binder antibody of chain A had the highest number of clashes among 195 Figure 4.21: A) Antibody overlap analysis for the open state. Microdomains are representated by their group and their color in figure 6 with Yellow-V (Y), Purple-IV (P), Green-III (G), Orange-II (O) and Cyan-I (C). The character after each region in the x-axis specifies the chain on which the antibody was overlaid, and the number of clashes was calculated B) Antibody overlap analysis for closed state. Since all the RBD?s are in closed conformation we didn?t calculate clashes with RBM- A C) overlaid antibodies with spike protein. Top left RBD binder antibody (pink) with closed state spike. Top right NTD binder antibody (silver) with closed spike. Lower left RBD binder antibody (pink) with open state of spike and lower right RBM binder antibody (blue) with spike protein. all chains with glycans in microdomain III(G). RBD-binder antibodies of chains B and C have lower number of clashes with glycan microdomains than chain A, where RBD is in the up state. Antibody binding to NTD of chain C (NTD-C) in open state shows a high number of clashes with microdomain IV(P). Similarly, in the closed state, NTD-binder antibody in chain C (NTD-C) also shows a high number of clashes with cluster IV(P). NTD of chain C also showed a lower SASA than NTD of other chains. In the open state, microdomain III(G) comprises the high BC glycans N234A, N165B and T323B. The high BC of glycans in this microdomain correlates with its high number of clashes with antibodies of RBD-A and NTD-A. In RBD-down state, RBD-binder antibodies seem to have similar number of clashes with different glycan microdomains. To identify glycans that have most effect on antibody binding, we also quantified the number of clashes of antibodies with each glycan in different chains and averaged over the simulation time of 1?s (Figure 4.22). RBM 196 antibody has only a low number of clashes with N165A glycan. In open state, RBD-A antibody has a high number of clashes with multiple glycans (N122B, N149B, N331A and N343A). RBD antibodies bound to the other chains (RBD-B and RBD-C) show lower number of clashes with glycans N122 and N165. NTD antibody of chain C (NTD-C) in both open and closed states show a high number of clashes with glycans N74C and N149C. 4.2.4 Discussion and Conclusion Understanding the structure and dynamics of glycan shield in the spike protein of SARS-COV-2 is an indispensable requirement for any antibody and vaccine design endeavors.[323, 51] To this end, we have performed MD simulations of fully glycosylated spike protein of SARS-COV-2 in both open and closed states. 
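Before continuing the discussion, here is a minimal sketch of the antibody-glycan clash counting used in the overlap analysis above: for each frame, antibody heavy atoms within 5 Å of any glycan heavy atom are counted and the count is averaged over the trajectory. The array names, the use of scipy's KD-tree, and this particular definition of a "clash" are illustrative assumptions rather than the exact procedure used here.

```python
import numpy as np
from scipy.spatial import cKDTree

def average_clashes(antibody_xyz, glycan_xyz, cutoff=5.0):
    """Average number of antibody heavy atoms within `cutoff` (Angstrom) of any
    glycan heavy atom, averaged over frames.

    antibody_xyz, glycan_xyz: arrays of shape (n_frames, n_atoms, 3).
    """
    clashes_per_frame = []
    for ab, gl in zip(antibody_xyz, glycan_xyz):
        tree = cKDTree(gl)
        # number of glycan atoms within the cutoff of each antibody atom
        n_neighbors = tree.query_ball_point(ab, r=cutoff, return_length=True)
        # count antibody atoms with at least one nearby glycan atom
        clashes_per_frame.append(np.count_nonzero(n_neighbors))
    return float(np.mean(clashes_per_frame))

# Hypothetical coordinate arrays (e.g. extracted with an MD analysis library
# and converted to Angstrom); one call per antibody / microdomain combination.
antibody_xyz = np.load("b38_overlaid_xyz.npy")         # placeholder
microdomain_xyz = np.load("microdomain_III_xyz.npy")   # placeholder
print("avg clashes:", average_clashes(antibody_xyz, microdomain_xyz))
```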
Analysis of dynamics for 2 ?s trajectories showed a tilting motion in the stalk region, which was also demonstrated with experimental cryo-ET images[275, 298, 340] and suggested to aid the virus with screening the host cells for receptor proteins (Figure 4.13). Glycan motions were characterized by RMSF (Figures 4.14), which showed higher values for stalk glycans demonstrating the high shielding po- tential of this normally solvent-exposed region. PCA of the head region of open state of spike demonstrated a scissoring motion between NTDs of neighboring monomers A(up) and B. This scissoring motion resulted in trimer asymmetry with NTD of monomer B advancing toward the center of the apex region and NTD of monomer A showing an an- gular motion centering on the apex center and toward NTD of monomer B. Based on the PCA, the simulation was divided into three different clusters with distribution of distances between the center of masses of NTDs of different monomers showing different asymmet- rical trimers in each cluster (Figures 4.15 and 4.16). A scissoring motion in the trimeric spike of HIV-1 on the sub-microsecond timescale was observed by Leminn et. al[170] which was suggested to be essential for receptor binding. The NTD scissoring motion is 197 Figure 4.22: Avg number of clashes of each glycan with the overlaid antibody in A) open and B) closed states of the spike only observed in the open state and this could suggest, the scissoring motion is a means employed by the virus to camouflage parts of the spike protein (such as regions of RBD excluding the RBM) when RBD is in the up conformation. The first PC for RBD-down state was visualized (Figure 4.16C) and shows the conformational change of RBD in chain A from down toward up conformation. Distribution of distance between center of mass of RBD in each monomer from the center of apex region (Figure 4.15D) exhibited this motion for two clusters of data 0-500ns and 500-1000 ns separated based on the PCA of open state simulation. The abundance of information in MD simulations of glycosylated spike protein may hinder identifying important biological features of the glycan shield. Therefore, a net- work analysis approach is used to identify collective behavior of glycans. The most central region of spike based on eigenvector centrality of the network, is shown to be the stalk domain and the lower head region of spike where a dense array of glycans gives rise to 198 resilience to enzymatic actions. Most of the glycans at the lower head region and upper stalk domain are high mannose with the lower degree of processing which correlates with their high centralities in the graph. High eigenvector centrality of lower head and the stalk glycans also makes it hard for neutralizing enzymes to target and to date, no epitopes for antibodies have been found that target this region.[350, 322]Glycans on the head region demonstrated different behaviors depending on the RBD state (up or down conformation). Interestingly two glycan at the open state, N234A and N165B show a high BC. N234A occupies the volume left open by the RBD in the open state, whereas in the closed state it is directed toward the solvent. The highly conserved glycan N165B is inserted between RBD of chain A(up) and NTD of chain B and in the open state it occupies the volume of left vacant by RBD (A) in the up conformation. 
Amaro and coworkers[41] studied the fully glycosylated spike protein of SARS-COV-2 computationally and showed that N234A and N165B are crucial for stabilizing the RBD in the up conformation in the open state of spike. Their simulation showed that mutating N234 and N165 to Ala destabilized the RBD in the open state. Furthermore, experimental negative stain electron microscopy and single particle cryo-EM showed that the equilibrium population ratio between the open and closed state is 1:1. Deletion of N234 glycan shifts this ratio to 1:4 favoring the closed state and deletion of glycan at N165 increased the population of open state with a ratio of 2:1. It was shown that N234 glycan stabilizes the open state of the RBD and inhibits the up-to-down conformational change and N165 glycan sterically inhibits the down-to-up conformational change of RBD. Here we have shown that these two glycans in the open state exhibit a high BC, which is due to their interaction with each other as well as other neighboring glycans at RBD of chain A as well as NTD of chain B. We have further demonstrated that the BC of glycans in the head region is coupled with the scissoring motion between NTDs of monomers A and B. In addition, the scissoring motion gives rise to high BC for gly- cans in the middle region of spike for glycans N616A and N616B. This is caused by the 199 tighter packing of glycans in the asymmetric trimer. In the closed state, the BC of most glycans do not change, which correlates with lower fluctuation of RBD-down simulation. It was demonstrated for spike protein of HIV virus that glycans with high BC display a high degree of interaction with other neighboring glycans and are less accessible to glycan processing enzymes. These highly central glycans such as N603 and N234 are essential to maintaining the mannose character of the glycan shield. Regions with dense crowding glycans have steric constraints on glycans that limit their processing by carbohydrate pro- cessing enzymes. Experimental studies using mass spectroscopy along with site-directed glycan removal are needed to understand how deletion of these glycans such as N603, N616 and N234 can affect the glycan processing of the neighboring glycans on the spike protein. Modularity maximization in network analysis allowed us to find 5 microdomains of glycans in RBD-up and 4 microdomains in RBD-down states. The higher number of microdomains at the apex of RBD-up state shows the more vulnerability of spike protein to antibodies in the open state. Glycans at the lower head in both open and closed states belong to the same microdomain (Cyan-I), which shows large number of edges between glycans in this region and thereby effective shielding. Apex glycans are divided into three microdomains in closed and four microdomains in open state. Glycan microdomains was shown in HIV-1 Env glycan shield to have a broad implication for anticipating immune escape.[120] The antibody overlap analysis showed that RBM binder antibody in open state (up) shows the lowest number of clashes with glycan microdomains (Figure 4.20A). The RBM of chain A also showed the highest SASA among other epitopes on spike protein which shows its great potential for antibody design strategies. NTD-binder antibodies can also bind to epi- topes on the NTD of spike protein. These antibodies showed high number of clashes with glycans N74 and N149 of the respective monomer that they bind. 
Experimental studies are needed to explore the effect of deletion of these glycans on the neutralization effect by antibodies. The glycans on the surface of spike protein exert a collective behavior, which 200 is an important property that needs to be considered in the context of vaccine and antibody design. In this work, we have studied the microsecond time dynamics and network analysis of the glycosylated spike protein in SARS-COV-2. To answer the need for quantification of the glycan shield of the spike protein of SARS-COV-2, we have utilized MD simulation and network analysis to aid in understanding the collective behavior of glycans. The role of glycans N234A and N165B as the central glycans in the network of glycans in the RBD-up system is discussed. Glycan microdomains are identified featuring high interaction inside them and lower interaction of glycans between different microdomains, which indicates most neutralizing antibodies would bind to regions in between these microdomains. Higher number of microdomains in the open state suggest the higher vulnerability of spike protein in the open state. An antibody overlap analysis identified the microdomains of glycans with higher number of clashes with antibodies. Collectively, this work present insights to design antibodies and vaccines against coronavirus. 201 4.3 Molecular dynamics of ligand binding to PAS domain of EAG channel The KCNH voltage gated potassium channels (including ether-a-go-go EAG, ERG, ELK) are major regulators of cellular excitability and play important roles in diseases such as epilepsy, schizophrenia, cancer and cardiac long QT syndrome type 2.[320, 14, 256, 348, 40] KCNH channels are tetrameric proteins containing 6 transmembrane (TM) helices (S1- S6). The S1-S4 construct the voltage sensing domain (VS) whereas the S5 and S6 of all four subunits together with the pore-forming loops form the centrally located pore domain of the channel.[318] A PAS domain (Per-Arnt-Sim) exists in the N-terminal interacellular part of EAG channels which is structurally similar to the PAS domain of non-ion channels where they act as ligand binding domains. The interacellular C-terminal contains a cyclic nucleotide binding homology domain (CNBHD) which is connected to the channel pore via a C-linker domain. The CNBHD region in KCNH is structurally similar to the cyclic nucleotide binding domain of hyperpolarization-activated cyclic nucleotide-gated (HCN) channels and cyclic nucleotide gated (CNG) channels.[316, 103] However, CNBHD of KCNH channels are not directly modulated by binding of cyclic nucleotides but instead a short beta strand known as intrinsic ligand occupies the cavity where the cyclic nucleotide would bind to HCN and CNG channels.[34, 33] Structure of full-length EAG channel is shown in figure 4.23. Despite a highly variable amino acid sequence, the PAS domain fold is well conserved. [103] PAS domains in other proteins act as ligand binding domains. [127, 205]. However, the ability of PAS to regulate KCNH channel via ligand binding is not well studied. Bre- lidze et al. have previously shown that a small molecule drug chlorpromazine hydrochlo- ride (CPZ) binds to the PAS domain of EAG channel (figure 3.25) and inhibits the current through the channel in a concentration dependent manner.[318] CPZ was found as a strong binder of PAS domain with a binding affinity KD of 1?0.7?M. 
According to their study, deletion of the PAS domain significantly reduced the apparent affinity and channel inhibition by CPZ, which points to the fact that the PAS domain regulates EAG through binding to CPZ. CPZ, a widely used antipsychotic drug, could be repurposed for the treatment of cancer and of neurological disorders associated with increased EAG activity.[318] Deletion of the PAS domain significantly reduced the inhibition by CPZ in the mEAG1 channel in their study. The IC50 of CPZ was 29.7±0.7 μM for the WT channel and 53.6±8.2 μM for the ΔPAS mEAG channel. It is important to note that most functional mutations in EAG channels associated with epilepsy and with Zimmerman-Laband and Temple-Baraitser syndromes involve an increase in EAG channel activity.[351, 277] Therefore, inhibition of the EAG channel by small molecules has a high therapeutic potential for the treatment of cancer and different neurological disorders. Here, we study ligand binding to the PAS domain of the EAG channel through molecular dynamics simulations and electrophysiology measurements (experiments done at Georgetown University in the Brelidze lab). We performed docking and binding free energy calculations of CPZ and a few other ligands to the PAS domain of the EAG channel. Importantly, a residue, Tyr71, was found to block the entrance to the binding pocket of PAS. Replica exchange solute tempering (REST2) simulations were performed to find a structure of PAS in which the binding pocket is open for ligands. Finally, we studied the structural effects of ligand binding to the PAS domain on the full-length structure of EAG by molecular simulation and network analysis. Our analysis showed that there is allostery between the ligand binding site on the PAS and the channel pore: a network of residue-residue fluctuations causes the current inhibition in the channel upon ligand binding.

4.3.1 Methods

Figure 4.23: Structure of the full-length EAG channel embedded in a membrane. VSD stands for the voltage sensing domain, which includes the S1-S4 transmembrane helices. The pore domain (PD) includes transmembrane helices S5 and S6.

Replica exchange solute tempering: Initial docking of CPZ to the PAS domain using AutoDock Vina[295] led to a binding pose outside the binding cavity. A Tyr71 residue was found to block the entrance of the cavity in the PAS crystal structure. To sample conformations of the PAS domain where the cavity is accessible for ligands, we used replica exchange solute tempering (REST2).[186] The details of this method are given in the introduction section of the dissertation. In summary, the initial structure of the protein (PDB ID: 4hoi)[3] was prepared in CHARMM-GUI.[143] Na+ and Cl- ions were added to the system up to a buffer concentration of 0.15 M. MD simulations were performed using the CHARMM36[23] forcefield for the protein and the TIP3P model for the waters.[144] Simulations were performed using the NAMD program with REST2 support.[230] A force-switching function was used for van der Waals and electrostatic interactions between 10 and 12 Å.[283] Long-range interactions were computed with the particle mesh Ewald (PME) method.[69] A Langevin piston was used to maintain the pressure at 1 bar.[97] An integration timestep of 2 fs was used for the equilibration and all production simulations, with the SHAKE algorithm constraining bonds involving hydrogen atoms.[252] We first ran a 10 ns standard MD simulation to equilibrate the PAS domain. REST simulations ran with 20 replicas between effective temperatures of 310 and 610 K.
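As an illustration of how a REST2 effective-temperature ladder and the corresponding solute scaling factors can be generated, here is a minimal numpy sketch. A geometric spacing between 310 K and 610 K is assumed for the 20 replicas, since the exact spacing scheme is not stated here; the λ definitions follow the standard REST2 scaling of solute-solute and solute-solvent interactions.

```python
import numpy as np

T0, T_max, n_replicas = 310.0, 610.0, 20   # effective temperature range (K)

# Assumed geometric spacing of effective temperatures, a common choice for
# replica exchange ladders (the spacing actually used may differ).
temps = T0 * (T_max / T0) ** (np.arange(n_replicas) / (n_replicas - 1))

# In REST2 only the solute (here, the protein) Hamiltonian is scaled:
# solute-solute interactions by lambda_m = T0 / T_m and
# solute-solvent interactions by sqrt(lambda_m).
lambdas = T0 / temps

for m, (T, lam) in enumerate(zip(temps, lambdas)):
    print(f"replica {m:2d}: T_eff = {T:6.1f} K, lambda = {lam:.3f}, "
          f"sqrt(lambda) = {np.sqrt(lam):.3f}")
```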
Replica exchanges were attempted every 2 ps between neighboring replicas along the temperature scale. Each replica ran for 50 ns and the total accumulated simulation time was 1 μs. The protein was chosen as the hot region in the REST2 simulations.

PAS-ligand simulations and binding free energy calculations: To simulate the ligand-bound PAS with different ligands, we selected a snapshot of the replica exchange simulation in which the binding pocket was readily open and Tyr71 was no longer blocking the entrance (shown in Figure 4.24B). Next, we used AutoDock Vina to dock 5 small molecules to the binding pocket: chlorpromazine (CPZ), imipramine (IMP), promazine (PRZ), cyamemazine (CMZ) and 2-chlorophenothiazine (CFT). These dockings placed the ligands inside the cavity of PAS. MD simulations were performed with the GROMACS 2018 software.[301] The AMBER99SB-ILDN[23] forcefield was used for the protein, the TIP3P model for the waters, and the ligands were parameterized with the general AMBER forcefield (GAFF).[312] Na+ and Cl- ions were added to the system to a final concentration of 0.15 M. Simulations were performed with a 2 fs timestep at 310 K and a pressure of 1 bar. A velocity-rescaling thermostat was used to maintain the temperature at 310 K. During equilibration, the pressure was maintained at 2 bar using a Berendsen barostat. During the production run, the system was simulated under an NPT ensemble with a Parrinello-Rahman barostat to maintain the pressure at 1 bar, using a compressibility of 4.5x10^-5 bar^-1 and a coupling constant of 0.5 ps. The simulations lasted 100 ns for each complex. Binding free energies were calculated using the MMPBSA method. A description of MMPBSA is given in chapter 3.1 for the RBD-ACE2 free energy calculation in SARS-COV-2. Here we used 80 and 2 as the solvent and solute dielectric constants, respectively.

Full-length EAG channel simulation: The full-length EAG channel (PDB ID: 5K7L)[327] was simulated in both the apo and ligand-bound states to study the effect of ligand binding on the structure of the channel. The apo state was prepared in CHARMM-GUI with 500 POPC lipids per leaflet. AMBER99SB-ILDN[181] parameters were used for the protein and AMBER parameters for the lipids. The protein-membrane system was solvated with TIP3P water, and Na+ and Cl- ions were added to a final concentration of 0.15 M. To simulate the bound state, we aligned the PAS domain of each chain in the tetramer with the ligand-bound PAS from docking and used the coordinates of the ligand-bound PAS domain. This was done because the Tyr71 residue blocks the binding pocket in the crystal structure of EAG, and we had performed replica exchange simulations in order to dock the ligand (Figure 4.24B). The system was then prepared as for the apo state. Both the apo and bound states embedded in the membrane were equilibrated according to the CHARMM-GUI 6-step procedure. During the equilibration phase, the temperature was controlled with a Berendsen thermostat and the pressure was maintained at 1 bar using a Berendsen barostat. For the production run, we used a Nose-Hoover thermostat with a coupling constant of 1 ps to maintain the temperature at 310 K and a Parrinello-Rahman barostat with a compressibility of 4.5x10^-5 bar^-1 for the pressure. The simulations ran for 1 μs for each of the apo and ligand-bound systems.

Current flow analysis from MD simulation: We followed the method laid out by Delemotte and coworkers[147] to perform current flow analysis through the channel.
In this approach, a continuous contact map is first calculated from the distance d_{ij}(t) between atoms i and j at time t:

K(d_{ij}(t)) = \begin{cases} 1 & d_{ij}(t) \le c \\ \exp\!\left[-d_{ij}(t)^2/2\sigma^2\right] \big/ \exp\!\left[-c^2/2\sigma^2\right] & \text{otherwise} \end{cases} \quad (4.8)

The cutoff c = 4.5 Å was used for the heavy atoms in the simulation, as suggested. We also used K(d_{cut}) = 10^{-5}, where d_{cut} = 0.8 nm, leading to σ ≈ 0.138 nm. The final contact map was then averaged over frames. Correlation of residue movements was calculated through mutual information (MI). The MI between residues s_i and s_j was estimated based on the distances from their equilibrium positions, where the position of each residue was defined as the centroid of its heavy atoms. Thus the MI is calculated as:

MI_{ij} = H_i + H_j - H_{ij} \quad (4.9)

where H_i is the entropy of residue s_i, defined as:

H_i = -\int_X \rho_i(x) \ln \rho_i(x)\, dx \quad (4.10)

where the density ρ_i(x) was estimated using a Gaussian mixture model as proposed by Delemotte et al.[147] Bootstrapping was performed 10 times and the final MI matrix was averaged over all bootstrap samples. The MI matrix and the semi-binary contact map were used to build the full adjacency matrix A_{ij} = C_{ij} I_{ij}. Using this adjacency matrix, the current flow (or information flow), which measures the flow of information from a set of source (S0) nodes to a set of sink (S1) nodes, is calculated. The results highlight the nodes that carry the most information from source to sink and give valuable information about allosteric pathways in the protein. For a tetrameric protein, the current flow is computed with each subunit replicated, summed over the structure and then averaged. This approach was used by Delemotte et al. to study allostery in KCNQ potassium channels and other membrane proteins. The source nodes in our analysis were the PAS domain residues (13-138), and as sink nodes we used the residues lining the channel pore (residue Gln503 on each monomer).

4.3.2 Results

Initially, we attempted to dock the CPZ ligand to the binding pocket of the PAS domain. This led to a binding pose outside the cavity of the PAS domain. We then performed three replicas of MD simulation for the docked pose. After 500 ns of simulation, the ligand drifted away from the binding pocket. After careful examination of the structure of the PAS domain, we found that the entrance to the binding pocket was blocked by the Tyr71 residue. Next, we performed replica exchange solute tempering (REST2) simulations to sample different conformations of the PAS domain in which the cavity is open for ligand binding. In this type of enhanced sampling, the conformational exchanges are done for the hot region (the solute, i.e., the protein) while the solvent remains cold. Figure 4.24A shows the distribution of enthalpies P(ΔH, T) of the replicas at different effective temperatures. The significant overlap between replicas leads to frequent exchanges between neighboring replicas, and the average exchange rate was calculated to be 20%. Figure 4.24C shows the random walk of replicas over temperatures for the first three replicas at the lowest temperatures. The frequent exchange of these replicas with replicas at higher temperatures shows that sampling is effective and that REST is able to sample conformations at higher temperatures. After the replica exchange simulations, we found a conformation of the PAS domain in which the Tyr71 residue had drifted away from the binding pocket and was no longer blocking the cavity. The conformational change of the PAS domain and the Tyr71 residue from the crystal structure to the state after replica exchange is shown in Figure 4.24B.
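The current-flow calculation described in the Methods above can be approximated with networkx's current-flow betweenness restricted to chosen source and sink nodes. The sketch below uses a hypothetical adjacency matrix A = C*I and illustrative 0-based residue indices; it is not the exact pipeline of Delemotte and coworkers, only a hedged stand-in for the same idea.

```python
import numpy as np
import networkx as nx

# Hypothetical adjacency matrix A_ij = C_ij * I_ij built from the averaged
# contact map C and the mutual-information matrix I (placeholder file).
A = np.load("adjacency.npy")                 # shape (n_residues, n_residues)
G = nx.from_numpy_array(A)                   # weighted, undirected graph

# Source: PAS domain residues; sink: a pore-lining residue. Indices are
# illustrative placeholders, not the channel's actual residue numbering.
sources = list(range(13, 139))               # PAS domain (residues 13-138)
sinks = [503]                                # e.g. the Gln503 pore residue of one subunit

# Current-flow betweenness restricted to these sources and sinks highlights
# residues carrying the most "information current" between PAS and the pore.
cf = nx.current_flow_betweenness_centrality_subset(
    G, sources=sources, targets=sinks, weight="weight")

top = sorted(cf, key=cf.get, reverse=True)[:10]
print("Residues with highest current flow:", top)
```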
The ligand bound to the PAS domain is shown in Figure 4.24D. Next, after the REST2 simulations, we docked to the PAS domain five small-molecule ligands that were experimentally shown to regulate the EAG channel. The docked poses of these ligands are shown in Figure 4.25. In most of the binding poses, the ligand faces the binding pocket with its tricyclic ring system rather than its tail, except for the PRZ ligand, which docked into the binding pocket from the tail. All 5 complexes were then subjected to a 100 ns MD simulation using GROMACS. In these simulations, the ligands remained in the binding pocket. However, in the PRZ-PAS complex, the ligand drifted away from the binding pocket. This is most probably due to the initial binding pose for this ligand, in which the lid was inside the pocket and the hydrophobic rings were outside, contrary to the other ligands where the rings are inside the binding pocket. This pose led to a lower binding affinity for PRZ.

Figure 4.24: Replica exchange solute tempering. A) Distribution of enthalpies for different replicas showing the high overlap B) Transition of the Tyr71 residue from the crystal structure to a state after REST simulations where it no longer blocks the binding pocket C) Random walk of replicas 1, 2 and 3 in the replica space with their neighbors over the course of the simulations D) Docking of CPZ to the binding pocket of PAS after the REST simulation.

Binding free energies were calculated using MMPBSA.[310] A detailed description of this approach to binding free energy calculation is given in the methods section of the dissertation. The different components of the binding free energy, including van der Waals (vdW), electrostatic, polar solvation and solvent accessible surface area (SASA), are given in Table 4.3. The breakdown of the binding free energies into their components shows that binding is driven by hydrophobic interactions between hydrophobic residues in the binding pocket and the hydrophobic rings of the ligands. Electrostatic interactions play a negligible role in the binding for all molecules.

Figure 4.25: Docked poses of the 5 ligands

Table 4.3: Binding free energies (kcal/mol)
Ligand   VDW            Elec          Polar Solv    SASA          Total
CPZ      -43.34±0.67    -1.0±0.1      12.33±0.45    -4.07±0.05    -36.32±0.79
IMP      -42.73±0.23    -0.2±0.1      12.59±0.17    -4.40±0.02    -34.76±0.25
PRZ      -27.26±0.32    -0.33±0.05     9.08±0.17    -3.05±0.02    -21.51±0.25
CMZ      -48.27±0.22    -2.45±0.07    14.73±0.16    -4.55±0.02    -40.55±0.28
CFT      -33.34±0.23    -0.45±0.05     9.60±0.10    -3.07±0.01    -27.25±0.25

Figure 4.26: Binding free energy decomposition for residues with a higher than 0.5 kcal/mol contribution to the binding free energy of the different ligands

PRZ has the lowest binding affinity, which is due to its initial binding pose facing the binding pocket with its lid rather than its hydrophobic rings. CFT had the next lowest binding affinity among all tested ligands; in experiments, this ligand was also shown not to bind the PAS domain and to increase rather than decrease the current through the channel. Unlike the other ligands, CFT does not have a lid in this structure and has only the tricyclic ring. We found that during the MD simulation this ligand can penetrate deeper into the cavity of the PAS domain. On the other hand, for the other ligands the lid makes the binding more stable. We also decomposed the binding free energies of the ligands on a per-residue basis to find the residues that contribute most to the binding affinity. Most of these residues are hydrophobic. Residues with more than 0.5 kcal/mol contribution to binding are shown in Figure 4.26.
Figure 4.27 shows the electrostatic character of the residues in the binding pocket. Most residues interacting with the ligands in the binding pocket are hydrophobic, which drives the binding, as also shown in the binding free energy calculations (high vdW component).

Figure 4.27: Electrostatic nature of the binding pocket. Acidic residues are shown in red, basic residues in blue, hydrophobic residues in white and polar residues in green.

Tyr71 acts as a gatekeeper of the PAS domain, suggesting its important role in ligand entry into the binding cavity. The role of this residue in ligand binding has also been confirmed in other computational and experimental studies.[74] Phe87 in the CPZ-PAS complex has a high contribution to the binding affinity, which is mainly driven by non-polar interactions. Other residues that contribute to the binding affinity are Val31, Trp40, Cys67, Val80, Ile113, Phe126 and Cys128.

Full-length EAG simulation

It is unknown how conformational changes in PAS correlate with the channel's voltage-dependent activation process. To investigate how the conformational changes upon ligand binding at the PAS domain affect gate opening in the EAG channel, we performed MD simulations of the mEAG channel at physiological temperature in both the apo state and the state with CPZ bound to all 4 PAS domains. Each of the apo and bound states ran for 1 μs. The RMSD of each region of the EAG channel during the simulation for both the apo and bound states is shown in Figure 4.28. The PAS domain in the bound state shows a higher RMSD than in the apo state, while the other domains have similar RMSDs between the apo and bound states.

Figure 4.28: RMSD of different regions of the EAG channel for the apo and bound states

Since most of the difference in RMSD is in the PAS domain, we next compared the root mean squared fluctuations (RMSF) of the PAS domain in the apo and bound states. This is shown in Figure 4.29, where each region of the PAS domain is colored to show the different locations. Most regions had a higher RMSF in the bound state than in the apo state. The PAS-cap residues have a similar or even lower fluctuation in the bound state than in the apo state. Residues near the binding pocket in the αC and αD helices have a high fluctuation in the bound state. The βA and βB residues, which are at the interface with CNBHD, also show a higher fluctuation in the bound state. These high fluctuations could affect the interface of PAS with CNBHD. We also computed the H-bonds and salt bridges between PAS and CNBHD during the simulation for the apo and bound states. The results are shown in Figure 4.30.

Figure 4.29: RMSF of different regions of the PAS

Figure 4.30: H-bonds and salt bridges between PAS and CNBHD for the apo and bound states

The hydrogen bonds and salt bridges are mostly similar between the apo and bound states. While some of the H-bonds such as N34-E633 and Q62-V635 are weakened after ligand binding, other H-bonds or electrostatic interactions, such as Y198-E627, Q62-T698 and K122-E633, are enhanced. The binding free energy is expected not to change considerably after ligand binding. It was shown by Brelidze et al. using SPR that ligand binding slightly enhances the affinity of the PAS-CNBHD complex. We can reason that this is due to the formation of new electrostatic interactions that were not present in the apo state. The conformational changes are communicated to the pore domain via a network of residue-residue interactions;[67, 346] network analysis was used to identify the allosteric pathways between PAS and the pore in the bound and apo conformations.
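Before turning to the network analysis, here is a minimal, hedged sketch of the PAS-CNBHD hydrogen-bond comparison above, using the Baker-Hubbard criterion in mdtraj to list H-bonds present in a given fraction of frames. The file names and the 50% persistence threshold are assumptions, and a real analysis would restrict the selection to PAS and CNBHD atoms rather than the whole system.

```python
import mdtraj as md

def persistent_hbonds(traj_file, top_file, freq=0.5):
    """Return donor-acceptor residue pairs with H-bonds present in at least
    `freq` of frames, using the Baker-Hubbard geometric criterion."""
    traj = md.load(traj_file, top=top_file)
    # For a focused PAS-CNBHD comparison one would first atom_slice the
    # trajectory to those two domains; here the whole system is scanned.
    hbonds = md.baker_hubbard(traj, freq=freq, exclude_water=True)
    labels = set()
    for donor, _hydrogen, acceptor in hbonds:
        d = traj.topology.atom(donor).residue
        a = traj.topology.atom(acceptor).residue
        labels.add(f"{d}-{a}")
    return labels

# Placeholder trajectory files for the apo and CPZ-bound full-length channel.
apo = persistent_hbonds("eag_apo.xtc", "eag_apo.pdb")
bound = persistent_hbonds("eag_bound.xtc", "eag_bound.pdb")

print("weakened or lost after binding:", sorted(apo - bound))
print("formed or enhanced after binding:", sorted(bound - apo))
```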
Network analysis has been previously used by Delemotte and coworkers[147] to shed light on the VSD-pore coupling pathway. In this approach, the MD trajectories are first converted to a residue interaction network, where each node corresponds to an individual amino acid residue within the full-length EAG channel. The weights on the edges are defined by the spatial proximity and correlated motions of residues in the MD trajectories. The final network encodes all residues (nodes) and interactions (edges) in the channel. The allosteric pathway is measured by calculating the flow of information through the network, defining the PAS domain (binding pocket) as the source of information and the channel pore as the sink. The underlying idea is that a perturbation of residue interactions, such as those induced by movements in the PAS domain, spreads to other residues (the pore) via diffusion in the network of residue-residue interactions. A current-flow betweenness analysis is performed to account for all pathways between source and sink. Key residues and pathways for the allostery are identified by this method. PAS domain residues were chosen as the source and gate residues as the sink for the current-flow analysis.

Figure 4.31 shows the current flow projected onto the structure of the full-length EAG channel such that a darker color indicates higher current flow. The CNBHD region, owing to its large interface with the PAS domain, carries most of the information flow to the pore domain. To study how ligand binding influences the allosteric network, we calculated the difference in the current-flow profiles of the apo and bound states of the channel. The resulting difference in information flow (Delta-Information) is projected onto the full-length structure and is shown in figure 4.32. This shows the difference at different regions of the EAG channel due to the conformational changes upon ligand binding. The apo state showed higher peaks in information flow at the PAS-CNBHD interface compared to the bound state, while the post-CNBHD region showed a higher value of information flow in the bound state. The post-CNBHD-PAS interaction is attributed to the ligand-bound state.

Figure 4.31: Current flow analysis for the bound state and the current flow plots.

Some of the residues at the PAS-CNBHD interface with higher current flow (CF) in the bound state are Val43, His56, Phe17 and Gln14 on the PAS domain and Tyr666, Val634, Ser625 and Gly639 on the CNBHD. Residues interacting with the intrinsic ligand have a lower CF in the bound state, whereas residues on the post-CNBHD have a high CF in the bound state. Some notable residues with higher CF in the bound state than in the apo state are Ile37, Arg24, Phe17 and Tyr44 on PAS and Gln598, Val600, Ala603, Gly624, Gly639 and Cys667 on the CNBHD. The peaks at the S1-S4/PAS interface are weaker than those at the PAS-CNBHD interface, showing that the allosteric coupling between PAS and S1-S4 is weaker than that between PAS and the CNBHD. The concurrent reduction of flow at the pore residues and increase at the PAS domain hints that these two regions are allosterically anti-coupled. The reduction in current flow at the pore coincides with the inactivation of the channel in the bound state. The critical coupling motifs on the CNBHD, PAS and C-linker are closely correlated with the locations that feature enhanced information flow in the apo state. This means that the bound conformation transmits the allosteric signal less efficiently through these important structural regions.
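A minimal sketch of this current-flow betweenness calculation is given below, using NetworkX [117]. It assumes the residue-interaction network has already been built from the trajectories, with edge weights that grow with contact frequency and correlated motion; the toy edge list, the source (PAS) and sink (gate) residue sets, and the function names are placeholders, not the actual model used in this work.

# Minimal sketch of the current-flow (information-flow) analysis: nodes are
# residues, edge weights encode communication strength, PAS residues are the
# sources and pore-gate residues are the sinks. Toy numbers throughout.
import networkx as nx

def build_example_network():
    G = nx.Graph()
    # (residue_i, residue_j, weight ~ contact_freq * |correlation|) -- toy values
    example_edges = [(1, 2, 0.9), (2, 3, 0.7), (3, 4, 0.8),
                     (2, 4, 0.3), (4, 5, 0.6), (5, 6, 0.9)]
    G.add_weighted_edges_from(example_edges)
    return G

def delta_information(G_apo, G_bound, sources, sinks):
    """Per-residue current-flow betweenness restricted to source->sink paths,
    and its apo-minus-bound difference (the Delta-Information profile)."""
    cf_apo = nx.current_flow_betweenness_centrality_subset(
        G_apo, sources=sources, targets=sinks, weight="weight")
    cf_bound = nx.current_flow_betweenness_centrality_subset(
        G_bound, sources=sources, targets=sinks, weight="weight")
    return {r: cf_apo.get(r, 0.0) - cf_bound.get(r, 0.0) for r in G_apo.nodes}

G_apo, G_bound = build_example_network(), build_example_network()
G_bound[4][5]["weight"] = 0.2            # pretend ligand binding weakens one coupling
pas_residues, gate_residues = [1, 2], [5, 6]   # placeholder source and sink sets
print(delta_information(G_apo, G_bound, pas_residues, gate_residues))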
Figure 4.32: Current flow difference.

The decrease in flow strength at the channel pore is accompanied by a flow reduction at the CNBHD and a flow increase at the PAS domain. The PAS-cap region was previously shown to undergo large conformational changes when the CNBHD is not bound. The current-flow analysis thus identifies key structural motifs that are allosterically downstream of the PAS-CNBHD interaction and indicates that the perturbation in PAS is propagated to the pore through interactions with the CNBHD.

4.3.3 Discussion and Conclusion

In this study, we investigated ligand binding to the PAS domain of the EAG potassium channel and its effect on channel activity using MD simulations, free energy calculations and network analysis. Experiments on the electrophysiology of the EAG channel were performed by the Brelidze lab at Georgetown University. The PAS domain in the N-terminal intracellular part of the EAG channel is known to bind ligands in other proteins. However, its ligand binding properties had not been studied in detail for the EAG channel, which could have major implications for pharmaceuticals targeting the EAG channel. Brelidze et al. previously showed that a small-molecule ligand, CPZ, binds to the PAS domain of EAG and inhibits the current through the channel.[318] Inhibition of the EAG channel by CPZ or other small-molecule ligands has a high potential for therapeutic use in the treatment of cancer and other neurological disorders. We performed docking of ligands to the PAS domain. The residue Tyr71 was found to block the entrance to the binding pocket. We performed a replica exchange solute tempering simulation to sample conformations of PAS in which the binding pocket is open, which led to Tyr71 drifting away from the binding pocket entrance. Next, we docked 5 different small-molecule ligands provided to us by our experimental collaborators to the binding pocket and performed 100 ns MD simulations. Binding free energy calculations using MMPBSA showed favorable binding of these ligands to the PAS domain, driven mostly by hydrophobic interactions. The binding poses of all ligands faced the binding pocket from the three-membered ring, with the tail outside.

The PAS domain of the EAG channel has been investigated in several studies. Some evidence suggests that the PAS domain interacts with the S1-S4 linker and directly regulates the voltage-sensing domain movements.[177] Other evidence points to the interaction of PAS with the CNBHD.[114, 211] Zagotta and coworkers used fluorescence anisotropy and found that the PAS domain in the EAG channel directly interacts with the CNBHD with an affinity of 13.2±2.3 μM.[118] The interface of PAS and CNBHD is shown in figure 4.30. By mapping disease-related mutations in KCNH channels onto the structure of the PAS-CNBHD complex, it was shown that most LQT2 and cancer-related mutations are located at the interface of PAS with the CNBHD. For example, N34 in the PAS domain forms a hydrogen bond with V634 on the CNBHD, and its mutation (N33T in hERG1) was shown to cause LQT2. Y44 of PAS interacts with I637 and G639; the mutation Y44H in hEAG1 correlates with large intestine carcinoma[7] and the corresponding mutation in hERG1 (Y43C) causes LQT2.[214] The interface between PAS and the CNBHD can be divided into three subregions: 1) the intrinsic ligand of the CNBHD interacts with the αB helix of PAS; 2) the βA and βB strands of PAS interact with the post-CNBHD helix; 3) the N-terminus of PAS contains an amphipathic helix, the α-cap, which interacts with the β-roll of the CNBHD.[118]
Mutations in the intrinsic ligand were shown to regulate channel activity.[114, 7] The post-CNBHD region, which immediately follows the intrinsic ligand of the CNBHD, also interacts with the PAS domain. It was shown that this region regulates KCNH channels through a variety of cellular signaling events such as phosphorylation and interaction with Ca2+-calmodulin.[54] The interface also includes a salt bridge between R57 on the αB helix of PAS and D642 on β6 of the CNBHD. This salt bridge is conserved throughout the KCNH family, and mutations at this site cause LQT2 in hERG1.[118] The PAS-cap comprises the first 25 residues of the PAS domain and was shown to be critical for activation and inactivation of KCNH channels.[114, 118] The PAS-cap helix interacts with the CNBHD, and the α-cap is positioned near the β4-β5 strands and the β8-β9 loop of the CNBHD. Alignment of the PAS-cap region with other NMR structures shows that it takes very different orientations in isolated EAG domains from the hERG1 and ELK channels.[3] It was therefore proposed that the PAS-cap exerts its function through interaction with the CNBHD. The surroundings of the PAS-cap are rich in cancer-related mutations (hEAG1 E19D) and other hERG1 LQT2 mutations such as E788D (E627 in mEAG1) in the β4 strand of the CNBHD.[118, 114] These mutations can change the gating properties of the channel by destabilizing the interaction of the PAS-cap with the CNBHD. These studies examined the gating properties of PAS-cap mutations (R7A-R8A and R7E-R8E) and CNBHD mutations (E727A, E627R), which involve highly conserved residues in the KCNH family. Activation of the PAS-cap mutants was shifted to more depolarized potentials compared to wild-type channels; similarly, the CNBHD mutants demonstrated a robust depolarizing shift in potential.

Potential interactions between residues were computed from the distances between their non-hydrogen (heavy) atoms (a short sketch of this criterion is given at the end of this section). Our results showed that there is an allosteric coupling between conformational fluctuations in PAS and the CNBHD and the pore residues. Movements of residues in the PAS domain are transmitted to the pore through interactions with the CNBHD, which is connected to the C-linker region in direct contact with the pore. This chain of interactions constitutes the coupling pathway between the ligand binding site and the channel pore, which leads to the inactivation. We simulated the full-length EAG channel both in the apo state and with CPZ bound to all PAS domains to study the effect of ligand binding on the interface of PAS and the CNBHD and on the channel pore. H-bonds were computed between PAS and the CNBHD. Some hydrogen bonds, such as N34-E633 and Q62-V635, are destabilized by ligand binding, while other hydrogen bonds, such as Q62-T698, Y198-E627 and K122-E633, appear. The binding affinity between PAS and the CNBHD is therefore expected to be slightly higher in the bound state. Experimental measurements using SPR in the Brelidze lab indeed showed that the binding affinity between PAS and the CNBHD increases upon ligand binding. We used information flow analysis to determine whether there is an allosteric pathway between the ligand binding site and the channel pore, and to identify the regions and residues along the pathway that carry most of the information flow. Our analysis showed that the CNBHD carries most of the information flow from PAS to the channel pore. Moreover, the pore residues had a lower current flow in the bound state. Since the simulated protein is in the inactive state, this implies a further stabilization of the closed state of the EAG channel.
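The heavy-atom distance criterion mentioned above, used to flag potential residue-residue interactions such as the PAS-CNBHD contacts, can be sketched as follows with MDTraj's closest-heavy contact scheme. The file names, the mapping of residue numbers to topology indices, and the 4.5 Å cutoff are assumptions for illustration only.

# Minimal sketch (assumed file names, residue indices and cutoff) of the
# heavy-atom distance criterion for residue-residue interactions.
import mdtraj as md
import numpy as np

traj = md.load("eag_bound.xtc", top="eag_bound.gro")    # placeholder trajectory

# Candidate interface pairs as 0-based residue indices; e.g. N34-E633 would be
# (33, 632) only if the topology numbers residues from 1 -- an assumption here.
pairs = [(33, 632), (61, 634), (197, 626), (121, 632)]

# 'closest-heavy' uses the minimum distance between non-hydrogen atoms.
distances, _ = md.compute_contacts(traj, contacts=pairs, scheme="closest-heavy")
cutoff_nm = 0.45                                        # 4.5 Angstrom
occupancy = (distances < cutoff_nm).mean(axis=0)        # fraction of frames in contact
for (i, j), occ in zip(pairs, occupancy):
    print(f"residue {i + 1} - residue {j + 1}: contact occupancy {occ:.2f}")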
Chapter 5: Conclusions and open problems

In this dissertation, I have explored various computational techniques such as molecular dynamics, Markov state modeling and machine learning to study biomolecular processes, for example protein folding, protein-ligand binding and protein-membrane interactions. In the second chapter, I studied membrane active peptides (MAPs). Cell penetrating peptides (CPPs) are a class of MAPs with the ability to cross the cell membrane and deliver biomolecular cargo into the cell. The secondary structure of CPPs during their interaction with the cell membrane affects their translocation efficacy. We studied two cell penetrating peptides, MPG and Hst5. Our results showed that MPG enters the membrane via its hydrophobic N-terminus, whereas Hst5 remains attached to the phosphate plane. Further simulations of MPG showed that this peptide forms a β-sheet conformation early during its interaction with the membrane but adopts an α-helical conformation upon deeper insertion into the membrane core. This structural polymorphism is important for the internalization route of CPPs. Antimicrobial peptides (AMPs) are another class of MAPs that have been proposed as a potential solution against multi-drug-resistant pathogens. Designing AMPs requires exploration of a vast chemical space, making it a challenging problem. In the second part of chapter 2, we developed a machine learning model, a variational-attention-based variational autoencoder, to generate novel, diverse and high-quality antimicrobial peptides. This model learns a latent space of real AMPs, which are represented as sequences of unique numbers. The attention mechanism helps with the diversity and quality of the generated AMPs. A future direction of this project will be post-generation evaluation of the generated peptides, checking the minimum inhibitory concentration (MIC), toxicity and other biological properties of the in silico generated AMPs. This will give candidate sequences that could be synthesized and experimentally evaluated.

In the third chapter, I studied the kinetics and thermodynamics of protein folding using Markov state models and molecular dynamics. In the first part of this chapter, I studied an amyloid-forming protein, β2m. Conformational fluctuations in the monomeric form of this protein could lead to aggregation-prone intermediate states. I performed MD simulations of this protein for 250 μs and then applied an MSM to the trajectories. Transitions between folded and misfolded/partially folded states happen on timescales of tens of μs. The intermediate states have unfolded outer strands, which leads to exposure of the hydrophobic core of the protein to the solvent, an important factor in the formation of higher-order oligomers and eventually aggregates. It will be interesting to compare the kinetics of misfolding of native-state β2m with that of natural variants such as the D76N and ΔN6 mutants. In the second part of chapter 3, I developed a machine learning model, GMVAE, for simultaneous dimensionality reduction and clustering of protein folding trajectories. GMVAE can learn a reduced representation of the free energy landscape of protein folding in which metastable states form well-separated clusters. We showed that the GMVAE embedding resembles the folding funnel for protein folding trajectories. A future direction for this section would be to include a lag time in the GMVAE to allow it to learn kinetically metastable states.
Another direction would be to use graph neural networks as the features instead of the pairwise distances used here. In the last section of this chapter, I developed a novel method, GraphVAMPNet, to learn a low-dimensional representation and a linear dynamical model of simulation trajectories in an end-to-end manner. We combined VAMPNet, which is based on the variational approach for Markov processes and uses a neural network to learn a coarse-grained dynamical model, with graph neural networks as the feature representation of the molecule. This gives the model the advantages of a graph representation and uses graph message passing to generate the embedding of each data point in the VAMPNet. We showed that this type of representation results in higher-resolution and more interpretable Markov models than the standard VAMPNet. Moreover, the attention over the neighbors of each residue in the graph gave us insight into the importance of residues for each metastable state. It would be interesting to further develop GraphVAMPNet with different pooling mechanisms. For example, one could use graph pooling such that a VAMP score is maximized for each domain of the protein, which amounts to learning the local dynamics of different domains or different parts of a protein. On the other hand, since we have not used any hand-crafted features, the GraphVAMPNet model is transferable, and it would be interesting to study the transferability of the learned embeddings in GraphVAMPNet.

In chapter 4, I studied two membrane proteins: the spike protein of SARS-COV-2 and the EAG potassium channel. The spike protein of SARS-COV-2 makes contact with the cell receptor protein hACE2 through its receptor binding domain (RBD). The RBD of SARS-COV-2 is rife with mutations relative to the earlier SARS-COV of 2002. We showed that these mutations have given the RBD of SARS-COV-2 a higher affinity for hACE2 than SARS-COV, which is one of the reasons for its higher infection rate. The important residues at the interface of the RBD and hACE2 in SARS-COV-2 were identified by decomposing the binding free energies per residue. We found residues whose mutation strongly enhanced the binding affinity (such as V404 to K417) and mutations that lowered the binding affinity (such as R426 to N439). Furthermore, we simulated mutants of the RBD (either natural mutants or alanine scanning) and found residues that are crucial for binding between the RBD and hACE2. In the second part of our SARS-COV-2 work, we investigated the dynamics of glycans in the spike protein using MD simulation and network analysis. The glycan shield on the spike protein provides a barrier against antibodies. Our network analysis unraveled the role of different glycans in providing an effective shield using betweenness centrality measurements. We uncovered microdomains of glycans that feature a high degree of intra-communication and used antibody overlap analysis to find microdomains that inhibit access to antibody epitopes. In the last section of chapter 4, I studied ligand binding to the PAS domain of the EAG channel. We showed that the residue Tyr71 blocks the entrance of the binding pocket. Using replica exchange solute tempering, we found structures of PAS in which Tyr71 drifted away from the binding pocket, which allowed us to dock ligands into the binding pocket. Binding free energy computations using MMPBSA showed the binding affinities of the different ligands and the residues contributing most to the binding affinity.
Using mutual information and information flow analysis on the MD simulations of the full-length EAG channel in the bound state, we studied the allosteric pathways that lead to current inhibition in the channel as well as the residues that are important along the pathway. Interestingly, we found that ligand binding in the PAS domain reduces the information flow at the sink residues (the channel pore), which coincides with the current inhibition through the channel. For future work, it will be interesting to study the effect of ligand binding on channel opening using enhanced sampling methods such as metadynamics.

Bibliography

[1] Hervé Abdi and Lynne J Williams. Principal component analysis. Wiley interdisciplinary reviews: computational statistics, 2(4):433–459, 2010. [2] Mark James Abraham, Teemu Murtola, Roland Schulz, Szilárd Páll, Jeremy C Smith, Berk Hess, and Erik Lindahl. Gromacs: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX, 1:19–25, 2015. [3] Ricardo Adaixo, Carol A Harley, Artur F Castro-Rodrigues, and João H Morais-Cabral. Structural properties of pas domains from the kcnh potassium channels. PloS one, 8(3):e59265, 2013. [4] Diletta Ami, Stefano Ricagno, Martino Bolognesi, Vittorio Bellotti, Silvia Maria Doglia, and Antonino Natalello. Structure, stability, and aggregation of β-2 microglobulin mutants: insights from a fourier transform infrared study in solution and in the crystalline state. Biophysical journal, 102(7):1676–1684, 2012. [5] Hareesh Bahuleyan. Natural language generation with neural variational models. arXiv preprint arXiv:1808.09012, 2018. [6] Mukund Balasubramanian, Eric L Schwartz, Joshua B Tenenbaum, Vin de Silva, and John C Langford. The isomap algorithm and topological stability. Science, 295(5552):7–7, 2002. [7] Sally Bamford, Emily Dawson, Simon Forbes, Jody Clements, Roger Pettett, Ahmet Dogan, A Flanagan, Jon Teague, P Andrew Futreal, Michael R Stratton, et al. The cosmic (catalogue of somatic mutations in cancer) database and website. British journal of cancer, 91(2):355–358, 2004. [8] Rahul Banerjee, Honggao Yan, and Robert I Cukier. Conformational transition in signal transduction: metastable states and transition pathways in the activation of a signaling protein. The Journal of Physical Chemistry B, 119(22):6591–6602, 2015. [9] Alessandro Barducci, Massimiliano Bonomi, and Michele Parrinello. Metadynamics. Wiley Interdisciplinary Reviews: Computational Molecular Science, 1(5):826–843, 2011. [10] Alessandro Barducci, Giovanni Bussi, and Michele Parrinello. Well-tempered metadynamics: a smoothly converging and tunable free-energy method. Physical review letters, 100(2):020603, 2008. [11] Bipasha Barua, Jasper C Lin, Victoria D Williams, Phillip Kummler, Jonathan W Neidigh, and Niels H Andersen. The trp-cage: optimizing the stability of a globular miniprotein. Protein Engineering, Design & Selection, 21(3):171–185, 2008. [12] Mathieu Bastian, Sebastien Heymann, and Mathieu Jacomy. Gephi: an open source software for exploring and manipulating networks. In Proceedings of the international AAAI conference on web and social media, volume 3, pages 361–362, 2009. [13] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
[14] CK Bauer and JR Schwarz. Physiology of eag k+ channels. The Journal of mem- brane biology, 182(1):1?15, 2001. [15] Javier L Baylon, Josh V Vermaas, Melanie P Muller, Mark J Arcario, Taras V Pogorelov, and Emad Tajkhorshid. Atomic-level description of protein?lipid in- teractions using an accelerated membrane model. Biochimica et Biophysica Acta (BBA)-Biomembranes, 1858(7):1573?1583, 2016. [16] Kyle A Beauchamp, Daniel L Ensign, Rhiju Das, and Vijay S Pande. Quan- titative comparison of villin headpiece subdomain simulations and triplet?triplet energy transfer experiments. Proceedings of the National Academy of Sciences, 108(31):12734?12739, 2011. [17] Kyle A Beauchamp, Robert McGibbon, Yu-Shan Lin, and Vijay S Pande. Simple few-state models reveal hidden complexity in protein folding. Proceedings of the National Academy of Sciences, 109(44):17807?17813, 2012. [18] Vittorio Bellotti, Maurizio Gallieni, Sofia Giorgetti, and Diego Brancaccio. Dynamic of ?2-microglobulin fibril formation and reabsorption: The role of proteolysis. In Seminars in dialysis, volume 14, pages 117?122. Wiley Online Library, 2001. [19] Rafael C Bernardi, Marcelo CR Melo, and Klaus Schulten. Enhanced sampling techniques in molecular dynamics simulations of biological systems. Biochimica et Biophysica Acta (BBA)-General Subjects, 1850(5):872?877, 2015. 226 [20] Zachary T Berndsen, Srirupa Chakraborty, Xiaoning Wang, Christopher A Cottrell, Jonathan L Torres, Jolene K Diedrich, Cesar A Lo?pez, John R Yates, Marit J van Gils, James C Paulson, et al. Visualization of the hiv-1 env glycan shield across scales. Proceedings of the National Academy of Sciences, 117(45):28014?28025, 2020. [21] Martina Bertazzo, Dorothea Gobbo, Sergio Decherchi, and Andrea Cavalli. Machine learning and enhanced sampling simulations for computing the potential of mean force and standard binding free energy. Journal of chemical theory and computation, 17(8):5287?5300, 2021. [22] Robert B Best, Gerhard Hummer, and William A Eaton. Native contacts determine protein folding mechanisms in atomistic simulations. Proceedings of the National Academy of Sciences, 110(44):17874?17879, 2013. [23] Robert B Best, Xiao Zhu, Jihyun Shim, Pedro EM Lopes, Jeetain Mittal, Michael Feig, and Alexander D MacKerell Jr. Optimization of the additive charmm all-atom protein force field targeting improved sampling of the backbone ? , ? and side-chain ?1 and ?2 dihedral angles. Journal of chemical theory and computation, 8(9):3257? 3273, 2012. [24] Debsindhu Bhowmik, Shang Gao, Michael T Young, and Arvind Ramanathan. Deep clustering of protein folding simulations. BMC bioinformatics, 19(18):47?58, 2018. [25] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008, 2008. [26] Luigi Bonati, Yue-Yu Zhang, and Michele Parrinello. Neural networks-based vari- ationally enhanced sampling. Proceedings of the National Academy of Sciences, 116(36):17641?17647, 2019. [27] Massimiliano Bonomi and Michele Parrinello. Enhanced sampling in the well- tempered ensemble. Physical review letters, 104(19):190601, 2010. [28] Berend Jan Bosch, Ruurd Van der Zee, Cornelis AM De Haan, and Peter JM Rottier. The coronavirus spike protein is a class i virus fusion protein: structural and func- tional characterization of the fusion core complex. Journal of virology, 77(16):8801? 8811, 2003. 
[29] Gregory R Bowman, Kyle A Beauchamp, George Boxer, and Vijay S Pande. Progress and challenges in the automated construction of markov state models for full protein systems. The Journal of chemical physics, 131(12):124101, 2009. 227 [30] Gregory R Bowman, Vijay S Pande, and Frank Noe?. An introduction to Markov state models and their application to long timescale molecular simulation, volume 797. Springer Science & Business Media, 2013. [31] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015. [32] GD Brand, MHS Ramada, TC Genaro-Mattos, and C Bloch. Towards an experimen- tal classification system for membrane active peptides. Scientific reports, 8(1):1?11, 2018. [33] Tinatin I Brelidze, Anne E Carlson, Banumathi Sankaran, and William N Zagotta. Structure of the carboxy-terminal region of a kcnh channel. Nature, 481(7382):530? 533, 2012. [34] Tinatin I Brelidze, Anne E Carlson, and William N Zagotta. Absence of direct cyclic nucleotide modulation of meag1 and herg1 channels revealed with fluorescence and electrophysiological methods. Journal of Biological Chemistry, 284(41):27989? 27997, 2009. [35] Esther S Brielle, Dina Schneidman-Duhovny, and Michal Linial. The sars-cov-2 exerts a distinctive strategy for interacting with the ace2 human receptor. Viruses, 12(5):497, 2020. [36] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Van- dergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18?42, 2017. [37] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013. [38] Nicolae-Viorel Buchete and Gerhard Hummer. Peptide folding kinetics from replica exchange molecular dynamics. Physical Review E, 77(3):030902, 2008. [39] Giovanni Bussi, Davide Donadio, and Michele Parrinello. Canonical sampling through velocity rescaling. The Journal of chemical physics, 126(1):014101, 2007. [40] Javier Camacho. Ether a go-go potassium channels and cancer. Cancer letters, 233(1):1?9, 2006. [41] Lorenzo Casalino, Zied Gaieb, Jory A Goldsmith, Christy K Hjorth, Abigail C Dom- mer, Aoife M Harbison, Carl A Fogarty, Emilia P Barros, Bryn C Taylor, Jason S McLellan, et al. Beyond shielding: the roles of glycans in the sars-cov-2 spike pro- tein. ACS central science, 6(10):1722?1734, 2020. 228 [42] Michele Ceriotti, Gareth A Tribello, and Michele Parrinello. Simplifying the repre- sentation of complex free-energy landscapes using sketch-map. Proceedings of the National Academy of Sciences, 108(32):13023?13028, 2011. [43] Samitabh Chakraborti, Ponraj Prabakaran, Xiaodong Xiao, and Dimiter S Dimitrov. The sars coronavirus s glycoprotein receptor binding domain: fine mapping and functional characterization. Virology journal, 2(1):1?10, 2005. [44] Srirupa Chakraborty, Zachary T Berndsen, Nicolas W Hengartner, Bette T Korber, Andrew B Ward, and S Gnanakaran. Quantification of the resilience and vulner- ability of hiv-1 native glycan shield at atomistic detail. Iscience, 23(12):101836, 2020. [45] Jasper Fuk-Woo Chan, Kin-Hang Kok, Zheng Zhu, Hin Chu, Kelvin Kai-Wang To, Shuofeng Yuan, and Kwok-Yung Yuen. Genomic characterization of the 2019 novel human-pathogenic coronavirus isolated from a patient with atypical pneumonia after visiting wuhan. Emerging microbes & infections, 9(1):221?236, 2020. 
[46] Charles H Chen, Charles G Starr, Evan Troendle, Gregory Wiedman, William C Wimley, Jakob P Ulmschneider, and Martin B Ulmschneider. Simulation-guided rational de novo design of a small pore-forming antimicrobial peptide. Journal of the American Chemical Society, 141(12):4839?4848, 2019. [47] Wei Chen and Andrew L Ferguson. Molecular enhanced sampling with autoen- coders: On-the-fly collective variable discovery and accelerated free energy land- scape exploration. Journal of computational chemistry, 39(25):2079?2102, 2018. [48] Wei Chen, Hythem Sidky, and Andrew L Ferguson. Capabilities and limitations of time-lagged autoencoders for slow mode discovery in dynamical systems. The Journal of Chemical Physics, 151(6):064123, 2019. [49] Wei Chen, Hythem Sidky, and Andrew L Ferguson. Nonlinear discovery of slow molecular modes using state-free reversible vampnets. The Journal of chemical physics, 150(21):214114, 2019. [50] Wei Chen, Aik Rui Tan, and Andrew L Ferguson. Collective variable discovery and enhanced sampling using autoencoders: Innovations in network architecture and error function design. The Journal of chemical physics, 149(7):072312, 2018. [51] Xiangyu Chen, Ren Li, Zhiwei Pan, Chunfang Qian, Yang Yang, Renrong You, Jing Zhao, Pinghuang Liu, Leiqiong Gao, Zhirong Li, et al. Human monoclonal antibodies block the binding of sars-cov-2 spike protein to angiotensin converting enzyme 2 receptor. Cellular & molecular immunology, 17(6):647?649, 2020. 229 [52] Xumin Chen, Chen Li, Matthew T Bernards, Yao Shi, Qing Shao, and Yi He. Sequence-based peptide identification, generation, and property prediction with deep learning: a review. Molecular Systems Design & Engineering, 6(6):406?428, 2021. [53] John TJ Cheng, John D Hale, Melissa Elliot, Robert EW Hancock, and Suzana K Straus. Effect of membrane composition on antimicrobial peptides aurein 2.2 and 2.3 from australian southern bell frogs. Biophysical Journal, 96(2):552?565, 2009. [54] Alessia Cherubini, Giovanna Hofmann, Serena Pillozzi, Leonardo Guasti, Olivia Crociani, Emanuele Cilia, Paola Di Stefano, Simona Degani, Manuela Balzi, Mas- simo Olivotto, et al. Human ether-a-go-go-related gene 1 channels are physically linked to ?1 integrins and modulate adhesion-dependent signaling. Molecular biol- ogy of the cell, 16(6):2972?2983, 2005. [55] Xiangyang Chi, Renhong Yan, Jun Zhang, Guanying Zhang, Yuanyuan Zhang, Meng Hao, Zhe Zhang, Pengfei Fan, Yunzhu Dong, Yilong Yang, et al. A neu- tralizing human antibody binds to the n-terminal domain of the spike protein of sars-cov-2. Science, 369(6504):650?655, 2020. [56] Fabrizio Chiti, Palma Mangione, Alessia Andreola, Sofia Giorgetti, Massimo Ste- fani, Christopher M Dobson, Vittorio Bellotti, and Niccolo? Taddei. Detection of two partially structured species in the folding process of the amyloidogenic protein ?2-microglobulin. Journal of molecular biology, 307(1):379?391, 2001. [57] Jae-Hyun Cho, Wenli Meng, Satoshi Sato, Eun Young Kim, Hermann Schindelin, and Daniel P Raleigh. Energetically significant networks of coupled interactions within an unfolded protein. Proceedings of the National Academy of Sciences, 111(33):12079?12084, 2014. [58] Kyunghyun Cho, Bart Van Merrie?nboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014. [59] John D Chodera and Frank Noe?. Markov state models of biomolecular conforma- tional dynamics. Current opinion in structural biology, 25:135?144, 2014. 
[60] John D Chodera, Nina Singhal, Vijay S Pande, Ken A Dill, and William C Swope. Automatic discovery of metastable states for the construction of markov models of macromolecular conformational dynamics. The Journal of chemical physics, 126(15):04B616, 2007. [61] John D Chodera, William C Swope, Jed W Pitera, and Ken A Dill. Long-time pro- tein folding dynamics from short-time molecular dynamics simulations. Multiscale Modeling & Simulation, 5(4):1214?1226, 2006. 230 [62] Song-Ho Chong and Sihyun Ham. Examining a thermodynamic order parameter of protein folding. Scientific reports, 8(1):1?9, 2018. [63] Song-Ho Chong, Jooyeon Hong, Sulgi Lim, Sunhee Cho, Jinkeong Lee, and Sihyun Ham. Structural and thermodynamic characteristics of amyloidogenic intermediates of ? -2-microglobulin. Scientific reports, 5(1):1?9, 2015. [64] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. Advances in neural information processing systems, 28, 2015. [65] Melissa M Coughlin and Bellur S Prabhakar. Neutralizing human monoclonal an- tibodies to severe acute respiratory syndrome coronavirus: target, mechanism of action, and therapeutic potential. Reviews in medical virology, 22(1):2?17, 2012. [66] Max Crispin, Andrew B Ward, and Ian A Wilson. Structure and immune recognition of the hiv glycan shield. Annual review of biophysics, 47:499?523, 2018. [67] Jianmin Cui. Voltage-dependent gating: novel insights from kcnq1 channels. Bio- physical journal, 110(1):14?25, 2016. [68] Jie Cui, Fang Li, and Zheng-Li Shi. Origin and evolution of pathogenic coron- aviruses. Nature Reviews Microbiology, 17(3):181?192, 2019. [69] Tom Darden, Darrin York, and Lee Pedersen. Particle mesh ewald: An n log (n) method for ewald sums in large systems. The Journal of chemical physics, 98(12):10089?10092, 1993. [70] Payel Das, Tom Sercu, Kahini Wadhawan, Inkit Padhi, Sebastian Gehrmann, Flaviu Cipcigan, Vijil Chenthamarakshan, Hendrik Strobelt, Cicero Dos Santos, Pin-Yu Chen, et al. Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations. Nature Biomedical Engineering, 5(6):613?623, 2021. [71] Payel Das, Kahini Wadhawan, Oscar Chang, Tom Sercu, Cicero Dos Santos, Matthew Riemer, Vijil Chenthamarakshan, Inkit Padhi, and Aleksandra Mojsilovic. Pepcvae: Semi-supervised targeted design of antimicrobial peptide sequences. arXiv preprint arXiv:1810.07743, 2018. [72] Ruslan L Davidchack, Richard Handel, and MV Tretyakov. Langevin thermostat for rigid body dynamics. The Journal of chemical physics, 130(23):234101, 2009. [73] Andrew D Davidson, Maia Kavanagh Williamson, Sebastian Lewis, Deborah Shoe- mark, Miles W Carroll, Kate J Heesom, Maria Zambon, Joanna Ellis, Philip A Lewis, Julian A Hiscox, et al. Characterisation of the transcriptome and proteome of sars-cov-2 reveals a cell passage induced in-frame deletion of the furin-like cleavage site from the spike glycoprotein. Genome medicine, 12(1):1?15, 2020. 231 [74] Joa?o V de Souza, Sylvia Reznikov, Ruidi Zhu, and Agnieszka K Bronowska. Drug- gability assessment of mammalian per?arnt?sim [pas] domains using computational approaches. Medchemcomm, 10(7):1126?1137, 2019. [75] Scott N Dean, Barney M Bishop, and Monique L Van Hoek. Natural and synthetic cathelicidin peptides with anti-microbial and anti-biofilm activity against staphylo- coccus aureus. BMC microbiology, 11(1):1?13, 2011. 
[76] Damon Deming, Timothy Sheahan, Mark Heise, Boyd Yount, Nancy Davis, Amy Sims, Mehul Suthar, Jack Harkema, Alan Whitmore, Raymond Pickles, et al. Vac- cine efficacy in senescent mice challenged with recombinant sars-cov bearing epi- demic and zoonotic spike variants. PLoS medicine, 3(12):e525, 2006. [77] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1?22, 1977. [78] Nan-jie Deng, Wei Dai, and Ronald M Levy. How kinetics within the unfolded state affects protein folding: An analysis based on markov state models and an ultra-long md trajectory. The Journal of Physical Chemistry B, 117(42):12787?12799, 2013. [79] Daniele Derossi, Alain H Joliot, Gerard Chassaing, and Alain Prochiantz. The third helix of the antennapedia homeodomain translocates through biological membranes. Journal of Biological Chemistry, 269(14):10444?10450, 1994. [80] Se?bastien Deshayes, Marc Decaffmeyer, Robert Brasseur, and Annick Thomas. Structural polymorphism of two cpp: an important parameter of activity. Biochimica et Biophysica Acta (BBA)-Biomembranes, 1778(5):1197?1205, 2008. [81] Jyoti Dev, Donghyun Park, Qingshan Fu, Jia Chen, Heather Jiwon Ha, Fadi Ghan- tous, Tobias Herrmann, Weiting Chang, Zhijun Liu, Gary Frey, et al. Structural ba- sis for membrane anchoring of hiv-1 envelope spike. Science, 353(6295):172?175, 2016. [82] Kuldeep Dhama, Khan Sharun, Ruchi Tiwari, Maryam Dadar, Yashpal Singh Malik, Karam Pal Singh, and Wanpen Chaicumpa. Covid-19, an emerging coronavirus in- fection: advances and prospects in designing and developing vaccines, immunother- apeutics, and therapeutics. Human vaccines & immunotherapeutics, 16(6):1232? 1238, 2020. [83] Ken A Dill, S Banu Ozkan, M Scott Shell, and Thomas R Weikl. The protein folding problem. Annu. Rev. Biophys., 37:289?316, 2008. [84] Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. Deep unsupervised clustering 232 with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016. [85] Christopher M Dobson. Protein folding and misfolding. Nature, 426(6968):884? 890, 2003. [86] Han Du, Sumant Puri, Andrew McCall, Hannah L Norris, Thomas Russo, and Mira Edgerton. Human salivary protein histatin 5 has potent bactericidal activity against eskape pathogens. Frontiers in Cellular and Infection Microbiology, 7:41, 2017. [87] David Duvenaud, Dougal Maclaurin, Jorge Aguilera-Iparraguirre, Rafael Go?mez- Bombarelli, Timothy Hirzel, Ala?n Aspuru-Guzik, and Ryan P Adams. Convo- lutional networks on graphs for learning molecular fingerprints. arXiv preprint arXiv:1509.09292, 2015. [88] Timo Eichner, Arnout P Kalverda, Gary S Thompson, Steve W Homans, and Sheena E Radford. Conformational conversion during amyloid formation at atomic resolution. Molecular cell, 41(2):161?172, 2011. [89] Timo Eichner and Sheena E Radford. A generic mechanism of ?2-microglobulin amyloid assembly at neutral ph involving a specific proline switch. Journal of molec- ular biology, 386(5):1312?1326, 2009. [90] Emel??a Eir??ksdo?ttir, Karidia Konate, U?lo Langel, Gilles Divita, and Se?bastien De- shayes. Secondary structure of cell-penetrating peptides controls membrane in- teraction and insertion. Biochimica et Biophysica Acta (BBA)-Biomembranes, 1798(6):1119?1128, 2010. [91] Stefan Elbe and Gemma Buckland-Merrett. 
Data, disease and diplomacy: Gisaid?s innovative contribution to global health. Global challenges, 1(1):33?46, 2017. [92] Charles A English and Angel E Garc??a. Charged termini on the trp-cage roughen the folding energy landscape. The Journal of Physical Chemistry B, 119(25):7874? 7881, 2015. [93] G Esposito, R Michelutti, G Verdone, P Viglino, H Hernandez, CV Robinson, A Amoresano, F Dal Piaz, M Monti, P Pucci, et al. Removal of the n-terminal hexapeptide from human ?2-microglobulin facilitates protein aggregation and fibril formation. Protein Science, 9(5):831?845, 2000. [94] Ulrich Essmann, Lalith Perera, Max L Berkowitz, Tom Darden, Hsing Lee, and Lee G Pedersen. A smooth particle mesh ewald method. The Journal of chemical physics, 103(19):8577?8593, 1995. 233 [95] S??lvia G Esta?cio, Heinrich Krobath, Diogo Vila-Vic?osa, Miguel Machuqueiro, Eu- gene I Shakhnovich, and Patr??cia FN Fa??sca. A simulated intermediate state for folding and aggregation provides insights into ?n6 ?2-microglobulin amyloidogenic behavior. PLoS computational biology, 10(5):e1003606, 2014. [96] Denis J Evans and Brad Lee Holian. The nose?hoover thermostat. The Journal of chemical physics, 83(8):4069?4074, 1985. [97] Scott E Feller, Yuhong Zhang, Richard W Pastor, and Bernard R Brooks. Constant pressure molecular dynamics simulation: The langevin piston method. The Journal of chemical physics, 103(11):4613?4621, 1995. [98] Jhosimar Arias Figueroa and Ad??n Ram??rez Rivera. Is simple better?: Revisit- ing simple generative models for unsupervised clustering. In Second workshop on Bayesian Deep Learning (NIPS), 2017. [99] Centers for Disease Control, Prevention, et al. Antibiotic resistance threats in the United States, 2019. US Department of Health and Human Services, Centres for Disease Control and . . . , 2019. [100] Peter Forster, Lucy Forster, Colin Renfrew, and Michael Forster. Phylogenetic net- work analysis of sars-cov-2 genomes. Proceedings of the National Academy of Sci- ences, 117(17):9241?9243, 2020. [101] Alan D Frankel and Carl O Pabo. Cellular uptake of the tat protein from human immunodeficiency virus. Cell, 55(6):1189?1193, 1988. [102] Haoyi Fu, Zicheng Cao, Mingyuan Li, and Shunfang Wang. Acep: improving an- timicrobial peptides recognition through automatic feature fusion and amino acid embedding. BMC genomics, 21(1):1?14, 2020. [103] Barry Ganetzky, Gail A Robertson, Gisela F Wilson, Matthew C Trudeau, and Steven A Titus. The eag family of k+ channels in drosophila and mammals. An- nals of the New York Academy of Sciences, 868(1):356?369, 1999. [104] Mahdi Ghorbani, Samarjeet Prasad, Jeffery B Klauda, and Bernard R Brooks. Vari- ational embedding of protein folding simulations using gaussian mixture variational autoencoders. The Journal of Chemical Physics, 155(19):194108, 2021. [105] Aldo Glielmo, Brooke E Husic, Alex Rodriguez, Cecilia Clementi, Frank Noe?, and Alessandro Laio. Unsupervised learning methods for molecular simulation data. Chemical Reviews, 2021. [106] Vladimir Gligorijevic?, P Douglas Renfrew, Tomasz Kosciolek, Julia Koehler Le- man, Daniel Berenberg, Tommi Vatanen, Chris Chandler, Bryn C Taylor, Ian M Fisk, Hera Vlamakis, et al. Structure-based protein function prediction using graph convolutional networks. Nature communications, 12(1):1?14, 2021. 234 [107] Ba?rbara Gomes, Marcelo T Augusto, Ma?rio R Fel??cio, Axel Hollmann, Octa?vio L Franco, So?nia Gonc?alves, and Nuno C Santos. Designing improved active peptides for therapeutic approaches against infectious diseases. 
Biotechnology advances, 36(2):415?429, 2018. [108] Zifan Gong, Svetlana P Ikonomova, and Amy J Karlsson. Secondary structure of cell-penetrating peptides during interaction with fungal cells. Protein Science, 27(3):702?713, 2018. [109] Helmut Grubmu?ller, Helmut Heller, Andreas Windemuth, and Klaus Schulten. Gen- eralized verlet algorithm for efficient molecular dynamics simulations with long- range interactions. Molecular Simulation, 6(1-3):121?142, 1991. [110] Jiuxiang Gu, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, Xingxing Wang, Gang Wang, Jianfei Cai, et al. Recent advances in convolutional neural networks. Pattern Recognition, 77:354?377, 2018. [111] Shan-shan Guan, Wei-wei Han, Hao Zhang, Song Wang, and Ya-ming Shan. Insight into the interactive residues between two domains of human somatic angiotensin- converting enzyme and angiotensin ii by mm-pbsa calculation and steered molecular dynamics simulation. Journal of Biomolecular Structure and Dynamics, 34(1):15? 28, 2016. [112] Chunsheng Guo, Jialuo Zhou, Huahua Chen, Na Ying, Jianwu Zhang, and Di Zhou. Variational autoencoder with optimizing gaussian mixture model priors. IEEE Ac- cess, 8:43992?44005, 2020. [113] Anvita Gupta and James Zou. Feedback gan for dna optimizes protein functions. Nature Machine Intelligence, 1(2):105?111, 2019. [114] Ahleah S Gustina and Matthew C Trudeau. Herg potassium channel regulation by the n-terminal eag domain. Cellular signalling, 24(8):1592?1598, 2012. [115] Olgun Guvench and Alexander D MacKerell. Comparison of protein force fields for molecular dynamics simulations. Molecular modeling of proteins, pages 63?88, 2008. [116] Olgun Guvench, Sairam S Mallajosyula, E Prabhu Raman, Elizabeth Hatcher, Kenno Vanommeslaeghe, Theresa J Foster, Francis W Jamison, and Alexander D MacK- erell Jr. Charmm additive all-atom force field for carbohydrate derivatives and its utility in polysaccharide and carbohydrate?protein modeling. Journal of chemical theory and computation, 7(10):3162?3180, 2011. [117] Aric Hagberg, Pieter Swart, and Daniel S Chult. Exploring network structure, dynamics, and function using networkx. Technical report, Los Alamos National Lab.(LANL), Los Alamos, NM (United States), 2008. 235 [118] Yoni Haitin, Anne E Carlson, and William N Zagotta. The structural mechanism of kcnh-channel regulation by the eag domain. Nature, 501(7467):444?448, 2013. [119] William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 1025?1035, 2017. [120] Audra A Hargett, Qing Wei, Barbora Knoppova, Stacy Hall, Zhi-Qiang Huang, Amol Prakash, Todd J Green, Zina Moldoveanu, Milan Raska, Jan Novak, et al. Defining hiv-1 envelope n-glycan microdomains through site-specific heterogeneity profiles. Journal of virology, 93(1):e01177?18, 2019. [121] Matthew P Harrigan, Mohammad M Sultan, Carlos X Herna?ndez, Brooke E Husic, Peter Eastman, Christian R Schwantes, Kyle A Beauchamp, Robert T McGibbon, and Vijay S Pande. Msmbuilder: statistical models for biomolecular dynamics. Bio- physical journal, 112(1):10?15, 2017. [122] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770?778, 2016. [123] Niels HH Heegaard. ?2-microglobulin: from physiology to amyloidosis. Amyloid, 16(3):151?173, 2009. 
[124] Rainer Hegger, Alexandros Altis, Phuong H Nguyen, and Gerhard Stock. How complex is the dynamics of peptide folding? Physical review letters, 98(2):028102, 2007. [125] Martin Held, Philipp Metzner, Jan-Hendrik Prinz, and Frank Noe?. Mechanisms of protein-ligand association and its modulation by protein mutations. Biophysical journal, 100(3):701?710, 2011. [126] Iddo Heller, Gerrit Sitters, Onno D Broekmans, Ge?raldine Farge, Carolin Menges, Wolfgang Wende, Stefan W Hell, Erwin JG Peterman, and Gijs JL Wuite. Sted nanoscopy combined with optical tweezers reveals protein dynamics on densely cov- ered dna. Nature methods, 10(9):910?916, 2013. [127] Jonathan T Henry and Sean Crosson. Ligand-binding pas domains in a genomic, cellular, and structural context. Annual review of microbiology, 65:261?286, 2011. [128] Carlos X Herna?ndez, Hannah K Wayment-Steele, Mohammad M Sultan, Brooke E Husic, and Vijay S Pande. Variational encoding of complex dynamics. Physical Review E, 97(6):062412, 2018. [129] Ren Higashida and Yasuhiro Matsunaga. Enhanced conformational sampling of nanobody cdr h3 loop by generalized replica-exchange with solute tempering. Life, 11(12):1428, 2021. 236 [130] Sepp Hochreiter and Ju?rgen Schmidhuber. Long short-term memory. Neural com- putation, 9(8):1735?1780, 1997. [131] Moritz Hoffmann, Martin Konrad Scherer, Tim Hempel, Andreas Mardt, Brian de Silva, Brooke Elena Husic, Stefan Klus, Hao Wu, J Nathan Kutz, Steven Brunton, and Frank Noe?. Deeptime: a python library for machine learning dynamical models from time series data. Machine Learning: Science and Technology, 2021. [132] William G Hoover. Canonical dynamics: Equilibrium phase-space distributions. Physical review A, 31(3):1695, 1985. [133] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359?366, 1989. [134] Adam Hospital, Josep Ramon Gon?i, Modesto Orozco, and Josep L Gelp??. Molecular dynamics simulations: advances and applications. Advances and applications in bioinformatics and chemistry: AABC, 8:37, 2015. [135] William Humphrey, Andrew Dalke, and Klaus Schulten. Vmd: visual molecular dynamics. Journal of molecular graphics, 14(1):33?38, 1996. [136] Brooke E Husic, Nicholas E Charron, Dominik Lemm, Jiang Wang, Adria? Pe?rez, Maciej Majewski, Andreas Kra?mer, Yaoyi Chen, Simon Olsson, Gianni de Fabritiis, et al. Coarse graining molecular dynamics with graph neural networks. The Journal of Chemical Physics, 153(19):194101, 2020. [137] Brooke E Husic, Robert T McGibbon, Mohammad M Sultan, and Vijay S Pande. Optimized parameter selection reveals trends in markov state models for protein folding. The Journal of chemical physics, 145(19):194103, 2016. [138] Brooke E Husic and Vijay S Pande. Ward clustering improves cross-validated markov state models of protein folding. Journal of chemical theory and compu- tation, 13(3):963?967, 2017. [139] Brooke E Husic and Vijay S Pande. Markov state models: From an art to a science. Journal of the American Chemical Society, 140(7):2386?2396, 2018. [140] Piet Hut, Jun Makino, and Steve McMillan. Building a better leapfrog. The Astro- physical Journal, 443:L93?L96, 1995. [141] Rieko Ishima and Dennis A Torchia. Protein dynamics from nmr. Nature structural biology, 7(9):740?743, 2000. [142] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparametrization with gumble- softmax. In International Conference on Learning Representations (ICLR 2017). OpenReview. net, 2017. 
237 [143] Sunhwan Jo, Taehoon Kim, Vidyashankara G Iyer, and Wonpil Im. Charmm-gui: a web-based graphical user interface for charmm. Journal of computational chemistry, 29(11):1859?1865, 2008. [144] William L Jorgensen, Jayaraman Chandrasekhar, Jeffry D Madura, Roger W Impey, and Michael L Klein. Comparison of simple potential functions for simulating liquid water. The Journal of chemical physics, 79(2):926?935, 1983. [145] Atsushi Kameda, Masaru Hoshino, Takashi Higurashi, Satoshi Takahashi, Hironobu Naiki, and Yuji Goto. Nuclear magnetic resonance characterization of the refolding intermediate of ?2-microglobulin trapped by non-native prolyl peptide bond. Jour- nal of molecular biology, 348(2):383?397, 2005. [146] Motoshi Kamiya and Yuji Sugita. Flexible selection of the solute region in replica exchange with solute tempering: Application to protein-folding simulations. The Journal of chemical physics, 149(7):072304, 2018. [147] Po Wei Kang, Annie M Westerlund, Jingyi Shi, Kelli McFarland White, Alex K Dou, Amy H Cui, Jonathan R Silva, Lucie Delemotte, and Jianmin Cui. Calmodulin acts as a state-dependent switch to control a cardiac potassium channel opening. Science advances, 6(50):eabd6798, 2020. [148] Xinyue Kang, Fanyi Dong, Cheng Shi, Shicai Liu, Jian Sun, Jiaxin Chen, Haiqi Li, Hanmei Xu, Xingzhen Lao, and Heng Zheng. Dramp 2.0, an updated data repository of antimicrobial peptides. Scientific data, 6(1):1?10, 2019. [149] Steven Kearnes, Kevin McCloskey, Marc Berndl, Vijay Pande, and Patrick Riley. Molecular graph convolutions: moving beyond fingerprints. Journal of computer- aided molecular design, 30(8):595?608, 2016. [150] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [151] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. [152] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convo- lutional networks. arXiv preprint arXiv:1609.02907, 2016. [153] Jeffery B Klauda, Richard M Venable, J Alfredo Freites, Joseph W O?Connor, Dou- glas J Tobias, Carlos Mondragon-Ramirez, Igor Vorobyov, Alexander D MacK- erell Jr, and Richard W Pastor. Update of the charmm all-atom additive force field for lipids: validation on six lipid types. The journal of physical chemistry B, 114(23):7830?7843, 2010. 238 [154] Kai J Kohlhoff, Diwakar Shukla, Morgan Lawrenz, Gregory R Bowman, David E Konerding, Dan Belov, Russ B Altman, and Vijay S Pande. Cloud-based simulations on google exacycle reveal ligand modulation of gpcr activation pathways. Nature chemistry, 6(1):15?21, 2014. [155] Youxin Kong, Bert JC Janssen, Tomas Malinauskas, Vamshidhar R Vangoor, Char- lotte H Coles, Rainer Kaufmann, Tao Ni, Robert JC Gilbert, Sergi Padilla-Parra, R Jeroen Pasterkamp, et al. Structural basis for plexin activation and regulation. Neuron, 91(3):548?560, 2016. [156] Susanna Kube and Marcus Weber. A coarse graining method for the identification of transition rates between molecular conformations. The Journal of chemical physics, 126(2):024103, 2007. [157] Jan Kubelka, Thang K Chiu, David R Davies, William A Eaton, and James Hofrichter. Sub-microsecond protein folding. Journal of molecular biology, 359(3):546?553, 2006. [158] Jan Kubelka, William A Eaton, and James Hofrichter. Experimental tests of villin subdomain folding simulations. Journal of molecular biology, 329(4):625?630, 2003. 
List of publications

1. Mahdi Ghorbani, Bernard R Brooks, and Jeffery B Klauda. Critical sequence hotspots for binding of novel coronavirus to angiotensin converter enzyme as evaluated by molecular simulations. The Journal of Physical Chemistry B, 124(45):10034-10047, 2020.

2. Mahdi Ghorbani, Bernard R Brooks, and Jeffery B Klauda. Exploring dynamics and network analysis of spike glycoprotein of SARS-CoV-2. Biophysical Journal, 120(14):2902-2913, 2021.

3. Mahdi Ghorbani, Phillip S Hudson, Michael R Jones, Félix Aviat, Rubén Meana-Pañeda, Jeffery B Klauda, and Bernard R Brooks. A replica exchange umbrella sampling (REUS) approach to predict host-guest binding free energies in the SAMPL8 challenge. Journal of Computer-Aided Molecular Design, 35(5):667-677, 2021.

4. Mahdi Ghorbani, Samarjeet Prasad, Bernard R Brooks, and Jeffery B Klauda. Deep attention based variational autoencoder for antimicrobial peptide discovery. bioRxiv, 2022.

5. Mahdi Ghorbani, Samarjeet Prasad, Jeffery B Klauda, and Bernard R Brooks. Variational embedding of protein folding simulations using Gaussian mixture variational autoencoders. The Journal of Chemical Physics, 155(19):194108, 2021.

6. Mahdi Ghorbani, Samarjeet Prasad, Jeffery B Klauda, and Bernard R Brooks. GraphVAMPNet, using graph neural networks and variational approach to Markov processes for dynamical modeling of biomolecules. The Journal of Chemical Physics, 156(18):184103, 2022.

7. Mahdi Ghorbani, Eric Wang, Andreas Krämer, and Jeffery B Klauda. Molecular dynamics simulations of ethanol permeation through single and double-lipid bilayers. The Journal of Chemical Physics, 153(12):125101, 2020.

8. S Nikfarjam, M Ghorbani, S Adhikari, AJ Karlsson, EV Jouravleva, TJ Woehl, and MA Anisimov. Irreversible nature of mesoscopic aggregates in lysozyme solutions. Colloid Journal, 81(5):546-554, 2019.

Talks and Presentations

1. Ghorbani M. "GraphVAMPNet, using graph neural networks and variational approach to Markov processes for dynamical modeling of biomolecules" (Conference talk, ACS 2022, San Diego, US)

2. Ghorbani M. "Unraveling the allosteric activation of GPCRs using metadynamics and deep learning" (Conference talk, BPS 2022, San Francisco, US)

3. Ghorbani M. "Dynamical coarse graining of molecular systems using GraphVAMPNets" (LCB seminar series, 2021, NIH, Bethesda)

4. Ghorbani M.; Brooks B. R.; Klauda J. B. "An integrative MD simulation and network analysis approach to study glycosylation of spike in SARS-CoV-2" (Virtual poster presentation, BPS 2021)

5. Ghorbani M. "Gaussian mixture variational autoencoders for dimensionality reduction and clustering of protein folding simulations" (LCB seminar series, 2020, NIH, Bethesda)

6. Ghorbani M. "Investigating dynamics and network analysis of spike protein in SARS-CoV-2" (LCB seminar series, 2020, NIH)

7. Ghorbani M., Harron M., Wang E., Klauda J. B. "Mechanism of permeability and toxicity of alcohols to cell membranes by MD simulations" (Poster presentation, ACS 2019, San Diego, US)

8. Ghorbani M., Wang E., Klauda J. B. "Calculating ethanol permeability of membranes through molecular dynamics simulations" (Poster presentation, BPS 2019, Baltimore, US)