ABSTRACT

Title of Dissertation: APPLICATION OF ADVANCED MACHINE
LEARNING STRATEGIES FOR BIOMEDICAL
RESEARCH

Renee Ti Chou
Doctor of Philosophy, 2023

Dissertation Directed by: Professor Michael P. Cummings
Department of Biology

Biomedical research delves deeply into understanding individual health and disease
mechanisms. Recent technological advances have further transformed the field with
large-scale data sets, enabling data-driven approaches to identify important patterns and
relationships within them. However, these data sets are often noisy and unstructured.
Moreover, missing values and high dimensionality further complicate the analysis
processes aimed at yielding meaningful results. With examples in ocular diseases and
malaria, this dissertation presents novel strategies employing machine learning to tackle
some of the challenges in biomedical research.

In ocular diseases, sustained ocular drug delivery is critical to maintain therapeutic
levels and improve patient adherence to dosing schedules. To enhance the sustained delivery
system, we engineer peptide sequences as adapters to impart desired properties to ocular
drugs. Specifically, we develop machine learning models separately for three properties:
melanin binding, cell penetration, and non-toxicity. We employ data reduction techniques
to reduce the number of features while maintaining the machine learning model
performance and apply interpretable machine learning techniques to explain model predictions
on the three properties. Experimental validation in rabbits shows a two-fold increase in drug
retention time with the selected peptide candidate. The developed machine learning
framework can be further tailored to engineer other properties in molecular sequences, with a
wide range of potential biomedical applications.

Malaria is an infectious disease caused by protozoan parasites of the genus Plasmodium and
has been a burden on global health. Developing malaria vaccines is challenging due to the
diversity in parasite antigen sequences, which may lead to immune escape. To facilitate the
vaccine development process, we leverage the wealth of systems data collected from various
sources. For facile data management, a database is constructed to store the structured
data processed from the results of the bioinformatics tools. Because only a small fraction of
Plasmodium proteins are labeled as known antigens, while the remaining proteins are not
known to be either antigens or non-antigens, a positive-unlabeled machine learning method is
applied to identify potential vaccine antigen candidates. Beyond malaria, our approach
provides a promising framework for identifying and prioritizing vaccine antigen candidates
for a broad range of disease pathogens.


APPLICATION OF ADVANCED MACHINE LEARNING
STRATEGIES FOR BIOMEDICAL RESEARCH

by

Renee Ti Chou

Dissertation submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment

of the requirements for the degree of
Doctor of Philosophy

2023

Advisory Committee:
Dr. Michael P. Cummings, Chair/Advisor
Dr. Najib El-Sayed, Dean’s Representative
Dr. Laura M. Ensign
Dr. Philip Johnson
Dr. Brian Pierce
Dr. Shannon Takala Harrison


© Copyright by
Renee Ti Chou

2023


Acknowledgments

This endeavor would not have been possible without the individuals who played pivotal
roles in shaping my Ph.D. journey.

First and foremost, I would like to express my deepest gratitude to my advisor and
committee chair, Dr. Michael P. Cummings. I joined Dr. Cummings’ lab at the Center
for Bioinformatics and Computational Biology with a passion for learning interdisciplinary
research skills and communicating with researchers from diverse scientific backgrounds,
focusing on biomedical research. Dr. Cummings possesses extensive research experience in
applying computational and mathematical methods to biological problems, which aligns
with my interest in developing a machine learning platform for studying biomedical
sciences. Dr. Cummings is also exceptionally supportive in both research and professional
development. He is the best mentor, deeply caring for his students, and has taught me not
to be afraid of any challenges. Whether it is about research or fellowship applications, he
is always there to guide and support me.

I am extremely grateful to the two collaborators on my Ph.D. research projects,
Dr. Shannon Takala-Harrison and Dr. Laura M. Ensign. Since joining Dr. Cummings’
lab in June 2019, I have been working on malaria research in collaboration with
Dr. Shannon Takala-Harrison’s group at the Center for Vaccine Development and Global Health,
University of Maryland School of Medicine. I would like to thank Dr. Takala-Harrison for
her invaluable insights into malaria, from the initial stages of the research to the process of
manuscript writing. In each meeting with Dr. Takala-Harrison, I consistently learn from
her unique perspectives, which enable me to improve the research further based on her
helpful suggestions.

In September 2019, I began working with Dr. Laura M. Ensign’s lab at the Center
for Nanomedicine, Wilmer Eye Institute, which is part of the Johns Hopkins University
School of Medicine. Dr. Ensign’s expertise in drug delivery provides valuable insights and
supplements the interdisciplinary research. She is always very supportive of my professional
development and loves to share her personal stories of challenges she has encountered during
her academic career. These stories have been a great source of encouragement for me when
facing failures along the way. I am deeply grateful to have Dr. Ensign as my Ph.D. project
collaborator.

I also want to extend my sincere thanks to my other dissertation committee members,
Dr. Najib El-Sayed, Dr. Philip Johnson, and Dr. Brian Pierce, for their support over the
past four years, including my initial committee meeting, qualifying examination, as well
as seminars in Computational Biology, Bioinformatics, and Genomics, and the Center for
Bioinformatics and Computational Biology seminars. They have been instrumental in
helping my research in various ways, from teaching and commenting to supporting my
fellowship applications. I cannot thank them enough for their informative and insightful
advice on my dissertation.

Special thanks go to my colleagues in Dr. Cummings’ lab, Alexis S. Boleda, Rana
Khalil, and Yi Chen, for their helpful feedback and comments during lab meetings. I am also
thankful to Jason Fan for his suggestions at the early stages of the malaria research. I would
also like to express my gratitude to all the members of the Center for Bioinformatics and
Computational Biology, the Biological Sciences Graduate Program, and the Computation
and Mathematics for Biological Networks (COMBINE) program who have shown their
support for my research through seminars and courses. My dissertation has been supported
by the COMBINE fellowship and the Ann G. Wylie Dissertation Fellowship awarded by
the Graduate School.

I would also like to acknowledge my family, who have had the most significant impact
on the beginning of my Ph.D. journey. My parents have been consistently supportive,
no matter what decisions I have made. My sister has been my role model since I was
little, and her success as a researcher in the biomedical field has inspired me greatly. Most
importantly, I want to mention the unwavering support of my husband, Henry Hsueh,
throughout my Ph.D. study. Since college, we have been supporting each other’s dreams of
becoming researchers in our respective fields of passion. He is not only my supporter but
also an important collaborator on one of my dissertation projects. Thanks should also go
to my two corgis, Pika and Pichu, for their unconditional love and emotional support.

Without the support and resources provided by these kind and impactful individuals
around me, my Ph.D. journey at the University of Maryland, College Park, would not have
been as fruitful, and my research would not have achieved such high quality.


Table of Contents

Acknowledgments ii

Table of Contents v

List of Tables x

List of Figures xii

I Overview of the Dissertation 1

Chapter 1: Introduction 2
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

II Multifunctional Peptide Engineering 8

Chapter 2: Machine Learning-Driven Multifunctional Peptide Engineering for
Sustained Ocular Drug Delivery 9

2.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.3.1 Development of high throughput melanin binding peptide microarray
methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.3.2 Training of the melanin binding regression model . . . . . . . . . . 13
2.3.3 Training of cell-penetration and cytotoxicity classification models . 16
2.3.4 Validation of predicted peptide properties in vitro . . . . . . . . . . 18
2.3.5 Analysis of peptide variables that contribute to observed properties 19
2.3.6 Characterization and validation of a peptide-drug conjugate in vivo 23

2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

2.5.1 Material sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5.2 Melanin nanoparticle synthesis and characterization . . . . . . . . . 36
2.5.3 Optimization of processing conditions for peptide microarray . . . . 37
2.5.4 Random forest classification model training with the pilot 119-peptide
microarray . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.5.5 Expansion of the peptide microarray . . . . . . . . . . . . . . . . . 40
2.5.6 Variable reduction of the machine learning input data . . . . . . . 41
2.5.7 Machine learning model training for melanin binding predictions . . 42
2.5.8 Machine learning model training for cell-penetration predictions . . 44
2.5.9 Machine learning model training for cytotoxicity predictions . . . . 46
2.5.10 Peptide generation for machine learning model validation . . . . . . 47
2.5.11 Peptide synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.5.12 Melanin binding assay for machine learning model validation . . . . 48
2.5.13 Cell-penetration assay with ARPE19 cell type for machine learning
model validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.5.14 Shapley additive explanations (SHAP) analysis of variable
contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.5.15 Adversarial computational controls . . . . . . . . . . . . . . . . . . 50
2.5.16 Peptide design space visualization . . . . . . . . . . . . . . . . . . . 51
2.5.17 Traceless linker system for conjugating HR97 to brimonidine . . . . 51
2.5.18 In vitro melanin binding assay . . . . . . . . . . . . . . . . . . . . . 53
2.5.19 In vitro stability test for HR97-brimonidine conjugate . . . . . . . . 54
2.5.20 Cathepsin cleavage assay for HR97 and HR97-brimonidine conjugate 55
2.5.21 Cell viability assay of HR97 peptide . . . . . . . . . . . . . . . . . 55
2.5.22 Animal studies—Animal welfare statement . . . . . . . . . . . . . . 56
2.5.23 Rabbit IOP measurements, topical dosing, and ICM injection . . . 56
2.5.24 Measurement of brimonidine in ocular tissues . . . . . . . . . . . . 58
2.5.25 Statistical analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

Chapter 3: Engineered Peptide-Drug Conjugate Provides Sustained Protection of
Retinal Ganglion Cells with Topical Administration in Rats 60

3.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

3.3.1 Conjugation of HR97 peptide to sunitinib increases melanin binding
in vitro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

3.3.2 A deep learning object detection model was more accurate in
counting RGCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

3.3.3 HR97-SunitiGel showed prolonged neuroprotective effects compared
to SunitiGel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

3.3.4 HR97-SunitiGel provided increased intraocular residence time in rats
and therapeutically relevant drug delivery to the posterior segment
in rabbits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

3.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.4.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

3.5 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.5.1 Material sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.5.2 Traceless linker system for conjugating HR97 to sunitinib . . . . . 76
3.5.3 In vitro stability test for HR97-sunitinib conjugate . . . . . . . . . 78
3.5.4 Cathepsin cleavage assay for HR97-sunitinib conjugate . . . . . . . 79
3.5.5 In vitro melanin binding assay . . . . . . . . . . . . . . . . . . . . . 79
3.5.6 In vitro cell uptake assay . . . . . . . . . . . . . . . . . . . . . . . . 81
3.5.7 Characterization of drug solubility . . . . . . . . . . . . . . . . . . 81
3.5.8 Animal studies—Animal welfare statement . . . . . . . . . . . . . . 82
3.5.9 Rat optic nerve head (ONH) crush model . . . . . . . . . . . . . . 82
3.5.10 Retinal ganglion cell staining and imaging . . . . . . . . . . . . . . 83
3.5.11 Retinal ganglion cell counting . . . . . . . . . . . . . . . . . . . . . 84
3.5.12 Pharmacokinetic studies . . . . . . . . . . . . . . . . . . . . . . . . 86
3.5.13 Measurement of sunitinib in ocular tissues . . . . . . . . . . . . . . 86
3.5.14 Statistical analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

III Malaria Vaccine Antigen Identification 89

Chapter 4: Positive-Unlabeled Learning Identifies Vaccine Candidate Antigens in
the Malaria Parasite Plasmodium falciparum 90

4.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

4.3.1 Identification of potential P. falciparum candidate antigens . . . . . 94
4.3.2 Training positive-unlabeled random forest models . . . . . . . . . . 96
4.3.3 Classification tree filtering using reference antigens . . . . . . . . . 97
4.3.4 Proximity of top-ranked candidates to reference antigens . . . . . . 101
4.3.5 Variable importance of candidate antigen groups . . . . . . . . . . 102
4.3.6 Characteristics of identified potential vaccine antigen targets . . . . 103

4.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.5 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

4.5.1 Known antigen protein collection . . . . . . . . . . . . . . . . . . . 108
4.5.2 Collection of Plasmodium data and bioinformatic analyses . . . . . 109
4.5.3 Data set assembly . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.5.4 Positive-unlabeled simulation . . . . . . . . . . . . . . . . . . . . . 113
4.5.5 Positive-unlabeled random forest algorithm implementation . . . . 113
4.5.6 Positive-unlabeled random forest evaluation . . . . . . . . . . . . . 114
4.5.7 Variable space weighting . . . . . . . . . . . . . . . . . . . . . . . . 115
4.5.8 Ensemble constituent filtering . . . . . . . . . . . . . . . . . . . . . 116
4.5.9 Positive-unlabeled random forest validation . . . . . . . . . . . . . 116
4.5.10 Candidate antigen clustering and comparisons . . . . . . . . . . . . 117
4.5.11 Variable importance analyses . . . . . . . . . . . . . . . . . . . . . 118
4.5.12 Variable value comparisons of top important variables . . . . . . . 119
4.5.13 Gene ontology enrichment analysis . . . . . . . . . . . . . . . . . . 119
4.5.14 Candidate antigen characterization . . . . . . . . . . . . . . . . . . 120
4.5.15 Statistical analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

Chapter 5: Plasmodium vivax Antigen Candidate Prediction Improves with the
Addition of Plasmodium falciparum Data 123

5.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

5.3.1 Data engineering and model training . . . . . . . . . . . . . . . . . 126
5.3.2 Comparison of single-species models and the combined model . . . 128
5.3.3 Effects of heterologous positives and unlabeled proteins on combined
model performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.3.4 Analysis of model prediction space and species effect . . . . . . . . 133
5.3.5 Variables contributing to Plasmodium antigen prediction . . . . . . 136
5.3.6 Characterization of top vaccine antigen candidates . . . . . . . . . 137

5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.5 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

5.5.1 Data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.5.2 Known antigen labeling . . . . . . . . . . . . . . . . . . . . . . . . 146
5.5.3 Machine learning data assembly and data combinations . . . . . . . 147
5.5.4 Positive-unlabeled random forest training . . . . . . . . . . . . . . 148
5.5.5 Positive-unlabeled random forest evaluation . . . . . . . . . . . . . 149
5.5.6 Adversarial control analysis . . . . . . . . . . . . . . . . . . . . . . 150
5.5.7 Comparison of models trained with different data combinations . . 151
5.5.8 Model interpretation of the combined model . . . . . . . . . . . . . 152
5.5.9 Clustering and amino acid composition analyses of model predictions 153
5.5.10 Variable importance analysis . . . . . . . . . . . . . . . . . . . . . 154
5.5.11 Clustering of top candidate antigens . . . . . . . . . . . . . . . . . 155
5.5.12 Gene ontology enrichment analysis . . . . . . . . . . . . . . . . . . 156

IV Appendices 157

Appendix A: Supplementary Information for Machine Learning-Driven
Multifunctional Peptide Engineering for Sustained Ocular Drug Delivery 158

A.1 Supplementary Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
A.1.1 Machine learning input data sets . . . . . . . . . . . . . . . . . . . 158
A.1.2 Machine learning cross-validation results . . . . . . . . . . . . . . . 159
A.1.3 Adversarial control machine learning cross-validation results . . . . 163

A.2 Supplementary Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
A.3 Supplementary Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

Appendix B: Supplementary Information for Engineered Peptide-Drug Conjugate
Provides Sustained Protection of Retinal Ganglion Cells with Topical
Administration in Rats 185

B.1 Supplementary Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

Appendix C: Supplementary Information for Positive-Unlabeled Learning Identifies
Vaccine Candidate Antigens in the Malaria Parasite Plasmodium
falciparum 194

C.1 Supplementary Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
C.2 Supplementary Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206

Appendix D: Supplementary Information for Plasmodium vivax Antigen Candidate
Prediction Improves with the Addition of Plasmodium falciparum Data 209

D.1 Supplementary Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
D.2 Supplementary Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222

Bibliography 223


List of Tables

4.1 Significantly enriched gene ontology terms with false discovery rate (FDR)
<0.05 in gene ontology enrichment analysis of candidate antigen groups with
the background proteome of P. falciparum 3D7. . . . . . . . . . . . . . . . 122

5.1 P. vivax and P. falciparum known antigen prediction accuracies of PURF
models trained separately on P. vivax, P. falciparum, and combined data sets. 128

5.2 Different combinations of data from P. vivax and P. falciparum and their
corresponding model types. . . . . . . . . . . . . . . . . . . . . . . . . . . 132

A.1 Cross-validation performance (mean ± SEM) of the melanin binding general
and adversarial control models. . . . . . . . . . . . . . . . . . . . . . . . . 180

A.2 Cross-validation performance (mean ± SEM) of the cell-penetration general
and adversarial control models. . . . . . . . . . . . . . . . . . . . . . . . . 180

A.3 Cross-validation performance (mean ± SEM) of the cytotoxicity general and
adversarial control models. . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

A.4 Ocular grading 7 days after a single ICM injection of saline, HR97
(equivalent to the amount of HR97 in HR97-brimonidine conjugate), or a physical
mixture of HR97 and brimonidine tartrate in solution (HR97 + brimonidine,
200 µg brimonidine equivalent) in Dutch Belted rabbits (n = 5 per group). 181

A.5 Ocular grading 14 days after a single ICM injection of saline, HR97
(equivalent to the amount of HR97 in HR97-brimonidine conjugate), or a physical
mixture of HR97 and brimonidine tartrate in solution (HR97 + brimonidine,
200 µg brimonidine equivalent) in Dutch Belted rabbits (n = 5 per group). 182

A.6 Ocular grading 21 days after a single ICM injection of saline, HR97
(equivalent to the amount of HR97 in HR97-brimonidine conjugate), or a physical
mixture of HR97 and brimonidine tartrate in solution (HR97 + brimonidine,
200 µg brimonidine equivalent) in Dutch Belted rabbits (n = 5 per group). 183

A.7 Ocular grading 28 days after a single ICM injection of saline, HR97
(equivalent to the amount of HR97 in HR97-brimonidine conjugate), or a physical
mixture of HR97 and brimonidine tartrate in solution (HR97 + brimonidine,
200 µg brimonidine equivalent) in Dutch Belted rabbits (n = 5 per group). 184

C.1 Top important variables (upper part) and variable categories (lower part)
in group 1 candidate antigens. Ranks in groups 2 and 3 individual variable
and variable category importance are also shown (MDA: Mean Decrease
Accuracy). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206

C.2 Top important variables (upper part) and variable categories (lower part) in
group 2 candidate antigens. Ranks in groups 1 and 3 variable and variable
category importance are also shown (MDA: Mean Decrease Accuracy). . . 207

C.3 Top important variables (upper part) and variable categories (lower part) in
group 3 candidate antigens. Ranks in groups 1 and 2 variable and variable
category importance are also shown (MDA: Mean Decrease Accuracy). . . 208

D.1 Associations between Plasmodium species and antigen predictions from
models trained on different combinations of autologous and heterologous data
(CI: confidence interval). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222


List of Figures

2.1 Pilot 119 melanin binding peptide microarray screening with machine
learning analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.2 Schematic of the machine learning pipeline based on the super learner
framework for the melanin binding data set. . . . . . . . . . . . . . . . . 17

2.3 Experimental validations of final model predictions on melanin binding and
cell-penetration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.4 Melanin binding, cell-penetration model interpretation, and variable
contributions to HR97 multifunctional peptide predictions. . . . . . . . . 22

2.5 Visualization of the peptide design space based on sequences and
physiochemical properties. . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.6 Characterization of HR97-brimonidine in vitro and in vivo. . . . . . . . . . 27

3.1 Characterization of HR97-sunitinib stability and solubility. . . . . . . . . . 64
3.2 Characterization of HR97-sunitinib melanin binding and cell uptake in vitro. 65
3.3 Comparison between SSD-MobileNet, Faster R-CNN Inception ResNet v2,
and CellProfiler software. . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.4 HR97-SunitiGel extended RGC protection to at least 2 weeks after the last
topical dose in rat model of optic nerve injury. . . . . . . . . . . . . . . . 69
3.5 Characterization of intraocular drug concentrations after topical dosing with
SunitiGel or HR97-SunitiGel in rats and rabbits. . . . . . . . . . . . . . . 70

4.1 Database schema of P. falciparum vaccine target identification. . . . . . . 95
4.2 Model evaluation and validation of positive-unlabeled random forest models. 99
4.3 Positive-unlabeled random forest model interpretation based on known
antigens. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.4 Clustering of top 200 candidate antigens based on proximity measured from
tree-based model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

5.1 Performance of PURF models with the optimal hyper-parameter setting. . 129
5.2 Probability score distributions of PURF models. . . . . . . . . . . . . . . . 131
5.3 Visualization of the prediction space of the combined PURF model. . . . . 135
5.4 Model interpretation of the combined PURF model on the prediction of
known antigens. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
5.5 Venn diagram of top 10 important variables from different PURF models. . 140

A.1 Characterization of melanin nanoparticles (mNPs) and biotinylated-melanin
nanoparticles (b-mNPs). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

A.2 Interaction profilings of b-mNPs against peptides in the pilot 119 microarray. 168
A.3 Variable reduction of peptide data sets with random forests. . . . . . . . . 169
A.4 Base model coefficients in final super learners. . . . . . . . . . . . . . . . . 170
A.5 Comparison of melanin binding and cell-penetration of candidate peptides
in non-induced ARPE-19 cells. . . . . . . . . . . . . . . . . . . . . . . . . 171
A.6 Cytotoxicity model interpretation. . . . . . . . . . . . . . . . . . . . . . . . 172
A.7 Variable contributions to the prediction of the adversarial models. . . . . . 173
A.8 Cytotoxicity validation of the HR97 peptide. . . . . . . . . . . . . . . . . . 174
A.9 NMR spectrum of brimonidine. . . . . . . . . . . . . . . . . . . . . . . . . 175
A.10 NMR spectrum of Mc-VC-PAB-Cl (Maleimidocaproyl-L-valine-L-citrulline-
p-aminobenzyl chloride). . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
A.11 NMR spectrum of Mc-VC-PAB-brimonidine. . . . . . . . . . . . . . . . . . 177
A.12 MALDI-TOF spectrum of the HR97-brimonidine conjugate. . . . . . . . . 178
A.13 Comparison of intraocular pressure (IOP) change from baseline. . . . . . . 179

B.1 Synthesis scheme for HR97-sunitinib. . . . . . . . . . . . . . . . . . . . . . 185
B.2 NMR spectrum of sunitinib base. . . . . . . . . . . . . . . . . . . . . . . . 186
B.3 NMR spectrum of Mc-VC-PAB-Cl. . . . . . . . . . . . . . . . . . . . . . . 187
B.4 NMR spectrum of Mc-VC-PAB-sunitinib. . . . . . . . . . . . . . . . . . . 188
B.5 Molecular structure of HR97-sunitinib conjugate and the MALDI-TOF
spectrum. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
B.6 HPLC analysis of cathepsin cleavage assay of the HR97-sunitinib conjugate. 190
B.7 RGC quantification using SSD-MobileNet. . . . . . . . . . . . . . . . . . . 191
B.8 RGC quantification using the Faster R-CNN Inception ResNet v2 model. . 192
B.9 Time course of RGC loss in the rat optic nerve head crush model. . . . . . 193

C.1 Database schema of P. falciparum reverse vaccinology data. . . . . . . . . 195
C.2 Evaluation of model performance on simulated data set. . . . . . . . . . . 196
C.3 Hyper-parameter tuning before variable space weighting. . . . . . . . . . . 197
C.4 Evaluation of known antigen predictions before variable space weighting. . 198
C.5 Hyper-parameter tuning after variable space weighting. . . . . . . . . . . . 199
C.6 Evaluation of known antigen predictions after variable space weighting. . . 200
C.7 Comparison of mean differences in probability scores after known antigen
label removal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
C.8 Probability scores of candidate antigen groups. . . . . . . . . . . . . . . . . 202
C.9 Statistical comparisons of distances between candidate and reference antigens. 203
C.10 Statistical comparisons of variable values of top important variables between
the candidate antigen groups and randomly selected non-antigens. . . . . . 204
C.11 Candidate antigen characterization across various P. falciparum life stages. 205

D.1 Hyper-parameter tuning for PURF model trained on the P. vivax data set. 210
D.2 Evaluation of known antigen predictions of the P. vivax model. . . . . . . 211
D.3 Hyper-parameter tuning for PURF model trained on the combined data set. 212
D.4 Evaluation of known antigen predictions of the combined model. . . . . . . 213
D.5 Validation of PURF models. . . . . . . . . . . . . . . . . . . . . . . . . . . 214
D.6 Evaluation of known antigen predictions of PURF models. . . . . . . . . . 215
D.7 Relationship between proportion of labeled positives in the data set and
mean tree depth in the PURF model. . . . . . . . . . . . . . . . . . . . . . 216
D.8 Visualization of hierarchical clustering dendrogram investigation. . . . . . 217
D.9 Variable importance for the P. vivax model. . . . . . . . . . . . . . . . . . 218
D.10 Comparison of variable importance values between PURF models. . . . . . 219
D.11 Clustering analysis of top candidate antigens. . . . . . . . . . . . . . . . . 220
D.12 Gene ontology (GO) enrichment analysis of candidate antigen groups. . . . 221


Part I

Overview of the Dissertation


Chapter 1: Introduction

1.1 Background

The advancement of biological and computational technologies has enabled the gener-

ation of large and complex data in biological sciences research, and has promoted the broad

application of machine learning in various biomedical domains over the past decades [1,2].

As the volume of data increases, multiple research fields gradually transitioned from tradi-

tional, model-focused approaches, to approaches that are more data-driven [3]. However,

challenges emerge when extracting meaningful patterns and relationships from the large

amount of data. Machine learning, including both statistical methods and computational

algorithms, aims at learning relationships among data, which can gain important insights

from complex and large-scale data by computing the underlying and inherent structures

within a data set. However, in biomedical application, there is a wide variety of biological

data types, such as genome sequences, gene expressions, and molecular structures [2]. Be-

cause of such diversity, the selection of representative features from the high-dimensional

data set and the usage of machine learning algorithms are usually problem-specific [4].

Moreover, the rapid growth of data could lead to a lack of substantial labeling, hamper-

ing the model performance due to insufficient information [5]. Therefore, it is critical to

develop adaptive and advanced strategies to solve the biomedical research problems more



effectively and efficiently.

The dissertation delves into two types of biomedical problems: sustained ocular drug

delivery and malaria vaccine antigen identification. This dissertation introduces machine

learning-based platforms that can be further extended to various other biomedical appli-

cations. In the research domain of sustained ocular drug delivery, patient adherence may

be enhanced through maintaining the drug therapeutic levels in the eye. Utilizing melanin

residing in the pigmented tissues in the eye, drugs with melanin binding and cell penetra-

tion properties can be stored and slowly released from the depot. To impart the desired

properties to drugs, peptides, which are short sequences of amino acids, can be used as an

adapter and be conjugated to drugs. For peptides with lengths ranging between 7 and 12

amino acids, the number of possible sequences is ∼4.3 × 10^15 (the sum of 20^n for n = 7 to 12), given 20 different amino

acids. Among other methods, machine learning is an appropriate approach to rationally

design peptides with desired properties. By applying interpretable machine learning

techniques, the predictions of the model can be explained, leading to reproducible and

transparent results.
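The size of this sequence space can be checked with a short calculation. The following Python sketch is illustrative only: the function name `peptide_space` is hypothetical, and it simply sums 20^n over the stated peptide lengths, assuming every position can hold any of the 20 standard amino acids.

```python
def peptide_space(min_len: int, max_len: int, alphabet_size: int = 20) -> int:
    """Count all sequences with lengths from min_len to max_len (inclusive)."""
    return sum(alphabet_size ** n for n in range(min_len, max_len + 1))


# Peptides of 7 to 12 residues built from 20 amino acids.
total_sequences = peptide_space(7, 12)
print(f"{total_sequences:.2e}")  # on the order of 4.3e15
```

The count is dominated by the longest peptides (20^12 alone is about 4.1 × 10^15), which is why exhaustive experimental screening is impractical.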

Regarding the research of identifying malaria vaccine antigen candidates, effective

malaria vaccines targeting either of the most-predominant species, Plasmodium falciparum

and Plasmodium vivax, are an unmet need. Reverse vaccinology, which leverages the wealth

of systemic data derived from pathogen genomes, has been adopted to facilitate the process

of vaccine development. However, most methods involved filtering candidate antigens with

criteria solely based on domain knowledge, and a more comprehensive, data-centric machine

learning approach is less explored. Without prior assumptions about the importance of

protein variables, machine learning assists in learning the variable importance through



training data sets, and, if provided, the corresponding labels, which indicate whether a

protein is an antigen or non-antigen. Nevertheless, because validating the

antigenicity of a protein requires several rigorous experiments and is thus time-consuming,

the antigenic labeling of the proteins is sparse, with only a few proteins labeled as antigens

and the remaining proteins unlabeled. To overcome this challenge, an advanced

approach of positive-unlabeled learning is adapted to identify potential antigen candidates

with the goal to further improve the reverse vaccinology pipeline in vaccine development.
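The idea behind positive-unlabeled learning can be sketched with a bagging scheme: unlabeled proteins are repeatedly sampled as provisional negatives, a classifier is trained against the known positives, and each unlabeled protein is scored only in rounds where it was left out of the sample. The Python sketch below is a simplified illustration under stated assumptions, not the method used in this dissertation: it swaps in a nearest-centroid rule for the tree-based learner, and all names (`pu_bagging_scores`, the toy data) are hypothetical.

```python
import random


def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pu_bagging_scores(positives, unlabeled, n_rounds=200, seed=0):
    """Score unlabeled items by how often they look positive when out of bag."""
    rng = random.Random(seed)
    pos_center = centroid(positives)
    votes = [0] * len(unlabeled)
    counts = [0] * len(unlabeled)
    for _ in range(n_rounds):
        # Draw a bootstrap sample of unlabeled items as provisional negatives.
        bag = [rng.randrange(len(unlabeled)) for _ in positives]
        neg_center = centroid([unlabeled[i] for i in bag])
        in_bag = set(bag)
        for i, row in enumerate(unlabeled):
            if i in in_bag:
                continue  # only score items not used as negatives this round
            counts[i] += 1
            if sq_dist(row, pos_center) < sq_dist(row, neg_center):
                votes[i] += 1
    return [v / c if c else None for v, c in zip(votes, counts)]


# Toy one-feature data: the first two unlabeled items resemble the positives.
toy_positives = [[1.0], [1.1], [0.9]]
toy_unlabeled = [[1.0], [1.05], [0.0], [0.1], [-0.1], [0.05], [0.2], [-0.05]]
toy_scores = pu_bagging_scores(toy_positives, toy_unlabeled)
```

In this toy setting, the unlabeled items that resemble the known positives receive higher scores than the rest, which is the behavior used to rank candidate antigens among unlabeled proteins.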

1.2 Dissertation Outline

The dissertation is structured so that each chapter corresponds to a manuscript.

Part I: Overview of the Dissertation, provides the background and scope of the problem

domains, as well as a brief introduction to each chapter. Part II: Multifunctional Peptide

Engineering, including Chapters 2 and 3, focuses on using an ensemble machine learning

method to engineer multifunctional peptides to improve sustained ocular drug delivery.

Part III: Malaria Vaccine Identification, consisting of Chapters 4 and 5, emphasizes

using the positive-unlabeled learning technique to identify potential candidates for malaria

vaccine antigens to facilitate the vaccine development process. The appendices in Part IV

provide additional materials related to the research findings.

Chapter 2: Machine Learning-Driven Multifunctional Peptide Engineering for Sus-

tained Ocular Drug Delivery, presents research results published in Nature Communi-

cations (https://doi.org/10.1038/s41467-023-38056-w), authored by H. T. Hsueh,

R. T. Chou (co-first author), U. Rai, W. Liyanage, Y. C. Kim, M. B. Appell, J. Pejavar,



K. T. Leo, C. Davison, P. Kolodziejski, A. Mozzer, H. Kwon, M. Sista, N. M. Anders,

A. Hemingway, S. V. K. Rompicharla, M. Edwards, I. Pitha, J. Hanes, M. P. Cummings,

and L. M. Ensign. The research addresses the challenge of delivering drugs into the eye,

which stems from the inherent ocular barriers and clearing mechanisms [6, 7], resulting in

an intensive dosing schedule that discourages patient compliance. Thus, it is important

to develop effective ocular drug delivery systems that can maintain sustained therapeutic

drug levels. To assist in the delivery of drugs to the depot formed by the melanin in-

side the pigmented tissues in the eye, the research leverages machine learning models to

guide the engineering of multifunctional peptide adapters, which impart melanin binding

and cell penetrating properties to ocular drugs. My contributions to this work include:

(i) developing melanin binding peptide microarray assays; (ii) designing an ensemble ma-

chine learning pipeline to predict melanin binding, cell penetration, and low cytotoxicity

peptides; and (iii) conducting interpretable machine learning analyses to understand and

explain model predictions. The corresponding supplementary information is in Appendix

A. H. T. Hsueh, R. T. Chou, J. Hanes, M. P. Cummings, and L. M. Ensign are named as

inventors on the U.S. Provisional Patent Application No. 63/340,714, which covers aspects

of this work.

Chapter 3: Engineered Peptide-Drug Conjugate Provides Sustained Protection of

Retinal Ganglion Cells with Topical Administration in Rats, presents further application

of the selected peptide candidate from machine learning models trained in Chapter 2 to

another ocular drug, sunitinib, that protects retinal ganglion cells. The research work

is published in the Journal of Controlled Release (https://doi.org/10.1016/j.jconrel.2023.08.058),

and is authored by H. T. Hsueh, R. T. Chou (co-first author), U. Rai,



P. Kolodziejski, W. Liyanage, J. Pejavar, A. Mozzer, C. Davison, M. B. Appell, Y. C. Kim,

K. T. Leo, H. Kwon, M. Sista, N. M. Anders, A. Hemingway, S. V. K. Rompicharla,

I. Pitha, D. J. Zack, J. Hanes, M. P. Cummings, and L. M. Ensign. The research focuses on

improving the drug delivery system to treat chronic diseases of the posterior segment

of the eye (such as the retina, choroid, and optic nerve), with the ultimate goal of enhancing

patient adherence for better disease management. My contributions to this manuscript

include: (i) participating in conceptualizing, designing, and interpreting experiments and

results; (ii) using machine learning models to predict and select peptide candidates to be

conjugated to the drug; and (iii) applying an object detection technique to facilitate the

measurement of cell survival rates to validate the effectiveness of the peptide-drug conjugate

in the drug delivery system. The supplementary materials are described in Appendix B.

Chapter 4: Positive-Unlabeled Learning Identifies Vaccine Candidate Antigens in the

Malaria Parasite Plasmodium falciparum, discusses research that studies approaches to fa-

cilitate malaria vaccine development. The manuscript is currently under review by npj

Systems Biology and Applications, and is a collaborative work by the authors, R. T. Chou,

A. Ouattara, M. Adams, A. A. Berry, S. Takala-Harrison, and M. P. Cummings. Malaria is

a mosquito-borne infectious disease caused by Plasmodium species. The parasite has mul-

tiple life stages, and exhibits various immune evasion strategies, such as extremely variable

surface antigens [8]. Thus, it is critical to identify conserved potential vaccine antigens that

are less variable but with subdominant immunogenicity. The research employs a machine

learning-based reverse vaccinology approach to identify potential vaccine antigen candi-

dates for malaria. Since only a few known antigens are selected based on our stringent

criteria, the data set is largely unlabeled. My contributions to this research include: (i)



adapting a positive-unlabeled learning algorithm to classify P. falciparum proteins into

antigens or non-antigens while tackling the problem with sparse antigenic labeling; (ii) im-

proving the machine learning model by utilizing the tree structure in the positive-unlabeled

random forest model; and (iii) performing downstream computational analyses to charac-

terize top antigen candidates and to further select a smaller set of antigen candidates based

on desired properties for future experimental validation. The supplementary

details for Chapter 4 can be found in Appendix C.

Chapter 5: Plasmodium vivax Antigen Candidate Prediction Improves with the Addi-

tion of Plasmodium falciparum Data, highlights research findings of a comprehensive study

conducted to improve the identification of vaccine antigen candidates for P. vivax, the

second-most prevalent species causing malaria, by integrating data from the well-studied

species, P. falciparum. The study also employs positive-unlabeled learning to construct a

machine learning model with multiple different training sets generated by integrating the

data of the two species. The research work is jointly conducted by the authors, R. T. Chou,

A. Ouattara, S. Takala-Harrison, and M. P. Cummings, and will be submitted to npj Systems

Biology and Applications soon. My contributions to the manuscript include: (i) applying

the positive-unlabeled learning framework described in Chapter 4 to various combinations

of training data from P. vivax and P. falciparum; (ii) decomposing and quantifying the ef-

fects of the addition of known antigens and/or unlabeled proteins; and (iii) characterizing

top candidate antigens, analyzing important protein variables for identifying top candidates,

and comparing important variables identified across various machine learning

models to gain insights into the proposed integration methodology. Additional information

for Chapter 5 is provided in Appendix D.



Part II

Multifunctional Peptide Engineering



Chapter 2: Machine Learning-Driven Multifunctional Peptide Engineering

for Sustained Ocular Drug Delivery

2.1 Abstract

Sustained drug delivery strategies have many potential benefits for treating a range

of diseases, particularly chronic diseases that require treatment for years. For many chronic

ocular diseases, patient adherence to eye drop dosing regimens and the need for frequent in-

traocular injections are significant barriers to effective disease management. Here, we utilize

peptide engineering to impart melanin binding properties to peptide-drug conjugates to act

as a sustained-release depot in the eye. We develop a super learning-based methodology to

engineer multifunctional peptides that efficiently enter cells, bind to melanin, and have low

cytotoxicity. When the lead multifunctional peptide (HR97) is conjugated to brimonidine,

an intraocular pressure lowering drug that is prescribed for three times per day topical dos-

ing, intraocular pressure reduction is observed for up to 18 days after a single intracameral

injection in rabbits. Further, the cumulative intraocular pressure lowering effect increases

∼17-fold compared to free brimonidine injection. Engineered multifunctional peptide-drug

conjugates are a promising approach for providing sustained therapeutic delivery in the

eye and beyond.



2.2 Introduction

In many disease settings, sustained delivery of therapeutic levels of drug can improve

treatment efficacy, reduce side effects, and avoid challenges with patient adherence to in-

tensive dosing regimens [9, 10]. This is particularly critical in the management of chronic

diseases, where long-term adherence to medication usage and clinical monitoring can suf-

fer [11, 12]. In the ophthalmic setting, the leading causes of irreversible blindness and low

vision are primarily age-related, chronic diseases, such as glaucoma and age-related mac-

ular degeneration [13–15]. Recent approvals of devices that provide sustained therapeutic

release, such as the Durysta® intracameral implant for continuous delivery of an intraocu-

lar pressure (IOP) lowering agent, and the surgically implanted port-delivery system that

provides continuous intravitreal delivery of ranibizumab, highlight the importance of these

next generation approaches for ocular disease management [16–19]. Conventionally, sus-

tained therapeutic effect is achieved by an injectable or implantable device that controls the

release of the therapeutic moiety into the surrounding environment. However, these devices

typically require injection through larger gauge needles or a surgery for implantation, with

both procedures having associated risks [20–22]. Further, the buildup of excipient material,

the need for device removal, and the potential for foreign body reaction can cause further

issues [18, 23, 24].

One approach for circumventing the issues associated with sustained release devices is

to impart enhanced retention time and therapeutic effect to drugs upon administration to

the eye without the need for an excipient matrix/implant. Binding to melanin, a pigment

present within melanosomes in multiple ocular cell types, was previously reported to affect



ocular drug biodistribution [25]. Due to the low turnover rate of ocular melanin, a drug

that can bind to melanin may accumulate in pigmented eye tissues, leading to drug toxicity

or drug sequestration [26, 27]. However, with the right balance of melanin-binding affinity

and capacity, melanin may act as a sustained-release drug depot in the eye that results in

prolonged therapeutic action [28]. Several drugs have been demonstrated to have intrinsic

melanin binding properties due to particular physicochemical properties, which in some

cases, prolongs the pharmacologic activity in the eye [28–30].

To impart beneficial melanin-binding properties to drugs, one approach is to engi-

neer peptides with high melanin binding that could be conjugated to small molecule drugs

through a reducible linker. Thus, the peptide would provide enhanced retention time,

while the linker would ensure that the drug could be released and exert its therapeutic action

in a sustained manner. In addition, there are available databases describing how peptide

sequence affects cell-penetration [31,32], and separately cytotoxicity [33], enabling the po-

tential for engineering multifunctional peptides that can be chemically conjugated to drugs.

Incorporating multiple functions into one peptide sequence remains challenging, and thus

multifunctional peptides are often designed by fusing peptides via a linker, thus forgoing

potentially more efficient rational design, or by testing additional properties on peptides

with known functions [34–36]. In contrast, machine learning could allow for designing

peptide sequences that simultaneously provide multiple desired properties.

Here, we describe the development of engineered peptides informed by machine learn-

ing, which have three properties: high binding to melanin, cell-penetration (to enter cells

and access melanin in the melanosomes), and low cytotoxicity. As there was no prior infor-

mation for how peptide sequences affect melanin binding, we experimentally determine the



effect of peptide sequence on melanin binding using a microarray. We then apply machine

learning-based analyses to identify peptide sequences that display all three desired proper-

ties. Importantly, with the Shapley additive explanation (SHAP) analysis [37] of peptide

variables, the machine learning model interpretation provides additional insights and rea-

soning for the multifunctionality of the peptides. As a proof-of-principle, we demonstrate

here that an engineered peptide, HR97, can be conjugated to the intraocular pressure (IOP)

reducing drug, brimonidine tartrate. A single intracameral (ICM) injection of the HR97-

brimonidine conjugate is able to provide sustained IOP reduction in normotensive rabbits

compared to ICM injection of an equivalent amount of brimonidine tartrate, or a topical

dose of Alphagan® P 0.1% eye drops. Further, the maximum measured change in IOP

from baseline (∆IOP) is increased with ICM injection of the HR97-brimonidine conjugate.

We anticipate that engineered peptide-drug conjugates will facilitate the development of

implant-free injectables for use in a variety of ophthalmic indications.

2.3 Results

2.3.1 Development of high throughput melanin binding peptide microarray

methodology

To determine how peptide sequence affects melanin binding properties, we adapted

a high-throughput flow-based peptide microarray system to characterize melanin binding

events (Fig. 2.1a). Commercially available eumelanin was processed into nanoparticles

(mNPs) to prevent sedimentation and provide reproducible surface area available for binding

to peptides printed on the substrate surface. The mNPs had a mean size of 200.7 ± 5.99



nm and ζ-potential of −23.7 ± 1.39 mV (Fig. 2.1b, c). The mNPs were further biotinylated

(b-mNPs) to facilitate fluorescent labeling with streptavidin DyLight680. The b-mNPs

showed a slightly larger mean size of 216.0 ± 14.85 nm and ζ-potential of −21.2 ± 2.15 mV

(Fig. 2.1b, c), and maintained similar spherical morphology (Fig. A.1a) and binding to

small molecule drugs brimonidine tartrate and sunitinib malate (Fig. A.1b). The first

microarray was printed with 119 peptides to screen flow conditions for the highest fluorescent

reporter signal, which identified 500 µg/mL of biotinylated mNPs in pH

6.5 PBS buffer at room temperature was optimal (Fig. 2.1d and Fig. A.2). We then used

the fluorescent reporter signals to construct a melanin binding classification random forest

model. The prediction accuracy was 0.92. The permutation-based variable importance

analysis [38] further revealed that the net charge, basic amino acids, and isoelectric point

(pI) may contribute to distinguishing melanin binding and non-melanin binding peptides

(Fig. 2.1e).
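Permutation-based variable importance measures how much prediction accuracy drops when the values of one variable are shuffled across samples. The following Python sketch shows the general idea in a model-agnostic form; it is an illustration under stated assumptions rather than the random forest implementation used here (which relies on the procedure in [38]), and the function names and toy classifier are hypothetical.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which the prediction matches the label."""
    return sum(1 for row, label in zip(X, y) if predict(row) == label) / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean decrease in accuracy after shuffling each variable in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between variable j and y
            X_perm = [list(row[:j]) + [value] + list(row[j + 1:])
                      for row, value in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_classifier(row):
    """Hypothetical rule: call a peptide 'binding' if variable 0 is positive."""
    return 1 if row[0] > 0 else 0


# Variable 0 drives the label; variable 1 is unrelated noise.
toy_X = [[charge, i % 7] for i, charge in enumerate([-2, -1, 1, 2] * 5)]
toy_y = [toy_classifier(row) for row in toy_X]
toy_importance = permutation_importance(toy_classifier, toy_X, toy_y)
```

A variable the model ignores has an importance of zero, whereas shuffling an informative variable lowers accuracy, which is how the ranking in Fig. 2.1e should be read.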

2.3.2 Training of the melanin binding regression model

A second larger peptide screen was implemented to generate melanin binding data

to use for additional model generation (Fig. 2.2a). Specifically, we used the trained

random forest model to predict melanin binding for ∼630,000 randomly generated pep-

tides, and those classified as melanin binding were selected. A total of 5499 peptides were

printed in duplicate, and the fluorescent reporter intensities were reported as the amount

of the b-mNPs that bind to the printed peptides on the microarray. Surprisingly, we

identified 780 peptides displaying higher levels of fluorescent reporter intensities than any



Figure 2.1 Pilot 119 melanin binding peptide microarray screening with machine
learning analysis. a Schematic illustration of the first peptide microarray. Peptides were
anchored to a microarray, and melanin nanoparticles (mNPs) with surface biotinylation (b-mNPs)
were flowed over to characterize binding events. The fluorescence intensity of the biotin was
detected using DyLight 680-conjugated streptavidin to quantify melanin binding for each peptide.
An initial classification model was trained using the data generated. Random peptides were then
classified by the model as melanin binding or non-melanin binding. Created with BioRender.com.
b,c Plot showing the sizes (b) and ζ-potential (c) of mNPs (black dots, n = 6) and b-mNPs
(gray squares, n = 6). Data are presented as mean ± SD. Group means were compared using
Student’s t tests (two-tailed). d The optimal interaction profiling of b-mNPs against 16 positive
control peptides (peptide numbers: 1–16) and 103 random peptides (peptide numbers: 17–119).
e Permutation-based variable importance analysis of the melanin binding classification random
forest. The x-axis indicates the mean decrease in prediction accuracy after variable permutation.
The values are shown at the end of the bars. The top 20 important variables ranked by mean
decrease in accuracy are shown.



of the 16 peptides described in the literature that bound to human melanoma cells [39]

and melanized C. neoformans [40], which were previously screened by the phage display

technique. Furthermore, there were 758 peptides showing higher fluorescent values than

the highest melanin binding peptides (661.5 arb. units) from the 119-peptide microarray,

demonstrating the enrichment of melanin binding properties from training the random

forest model. Next, the fluorescent reporter intensity values were used as the response

variable in training a regression model. Applying a variable reduction procedure using random

forest to eliminate less informative variables from the data set reduced the number of

variables from 1094 to 64 (Fig. A.3a), and model performance measured by the coefficient

of determination (R²) improved from 0.48 to 0.53. A wide array of machine learning

models was explored and trained on the variable-reduced data set and integrated with a

super learning (SL) framework that combined various types of base models weighted using

a meta-learner. By applying the iterative base model filtering procedure (Fig. 2.2b), the

complexity of the SL was further reduced. To explore other combinations of base models in

the SL ensemble, homogeneous base models consisting of models from only one algorithm

family were constructed. A nested cross-validation (Fig. 2.2c) was applied to estimate an

unbiased generalization performance. All SL models with base model reduction were se-

lected as the top model in the inner loop cross-validations, and the performance evaluated

in the outer loop cross-validation improved to R² = 0.54 ± 0.01 (Table A.1). The reduced

SL was selected amongst 31 competitive models as the final melanin binding regression

model. When the same set of models was trained on the whole data set, the number of base

models in the SL was reduced from 907 to 38 (Fig. 2.2d). Adversarial computational control

was performed, and the generalization performance was R² = −0.04 ± 0.02, indicating



that the machine learning was effective in learning meaningful relationships in the melanin

binding data set.
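The nested cross-validation described above separates model selection (inner loop) from performance estimation (outer loop). The Python sketch below generates only the index splits, with random (Monte Carlo) outer splits and modulo-assigned inner folds. It is a minimal illustration: the 20% outer test fraction, the seed, and all function names are assumptions, since the text does not specify them here.

```python
import random


def monte_carlo_splits(n_samples, n_splits=10, test_fraction=0.2, seed=0):
    """Yield (train, test) index lists from repeated random splits."""
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield sorted(indices[n_test:]), sorted(indices[:n_test])


def modulo_splits(indices, n_folds=10):
    """Yield (train, test) lists where fold membership is position % n_folds."""
    for fold in range(n_folds):
        test = [idx for pos, idx in enumerate(indices) if pos % n_folds == fold]
        train = [idx for pos, idx in enumerate(indices) if pos % n_folds != fold]
        yield train, test


def nested_cv_plan(n_samples, n_outer=10, n_inner=10):
    """Pair every outer split with inner folds built from its training part."""
    plan = []
    for outer_train, outer_test in monte_carlo_splits(n_samples, n_outer):
        inner_folds = list(modulo_splits(outer_train, n_inner))
        plan.append((outer_train, outer_test, inner_folds))
    return plan


toy_plan = nested_cv_plan(50)
```

Models are compared using only the inner folds, and the selected model is then scored on the outer test indices, which never take part in selection; this is what makes the outer-loop estimate unbiased with respect to model choice.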

2.3.3 Training of cell-penetration and cytotoxicity classification models

Engineered peptides must enter cells to reach and bind to melanin within the melano-

somes and should be minimally toxic to cells. Thus, the SkipCPP-Pred [31] and the Toxin-

Pred [33] databases were used to create SL classification ensembles to engineer tri-functional

peptides. Variable reduction decreased the number of variables from 1094 to 11 for the

cell-penetration data set (Fig. A.3b) and from 1094 to 56 for the cytotoxicity data set

(Fig. A.3c). The prediction accuracies calculated from out-of-bag samples improved from

0.91 to 0.93 and from 0.951 to 0.958 for cell-penetration and cytotoxicity, respectively. We

subsequently trained base models and SL ensembles, and the generalization performances

in terms of Matthews correlation coefficient (MCC), F1 (harmonic mean of precision and re-

call), and balanced accuracy for cell-penetration were 0.79±0.01, 0.90±0.01, and 0.90±0.01,

respectively; and those for cytotoxicity were 0.88±0.004, 0.92±0.002, and 0.95±0.002, re-

spectively (Tables A.2, A.3). The number of base models in the reduced SL models trained

on the whole data sets was decreased from 310 to 65 for cell-penetration, and from 311

to 22 for cytotoxicity (Fig. A.4). There were 300 competitive cell-penetration models and

175 competitive cytotoxicity models. A gradient boosting machine (GBM) model and the reduced SL were selected as

the final predictive cell-penetration and cytotoxicity models, respectively. Similar to melanin binding,

adversarial controls had decreased generalization performances, where the MCC, F1, and

balanced accuracy were −0.002±0.05, 0.52±0.03, and 0.50±0.03 for cell-penetration, and



Figure 2.2 Schematic of the machine learning pipeline based on the super learner
framework for the melanin binding data set. a Scheme of a larger microarray, which
includes 5499 peptides used to train a regression super learner. Random peptides were generated
based on position-dependent amino acid frequencies calculated using the second peptide array
data, and the melanin binding levels were predicted. Peptides with desired melanin binding levels
were selected for further experimental validation. Created with BioRender.com. b Scheme of
the super learner complexity reduction. Holdout predictions of peptides (shown as rows) were
generated for each base model (shown as columns) with tenfold cross-validation (CV) on the input
data set. A meta-learner (generalized linear model) was fitted on the holdout predictions with
another tenfold cross-validation. The number of base models was reduced by applying an iterative
reduction procedure (see Section 2.5). The final super learner ensemble was trained on the input
data set with the optimal combination of the selected base models. c Scheme of the machine
learning pipeline for an unbiased model performance evaluation. The nested cross-validation
includes an outer loop for model evaluation and an inner loop for
model selection (cyan). The outer loop generated 10 sets of train-test splits using a Monte Carlo
method, and the inner loop generated 10 sets of train-test splits using a modulo method. d Plot
of the base models of the final melanin binding super learner. Coefficients of determination (R²)
are denoted with color and conveyed as white text on the bars or gray text adjacent to the bars. Base
model coefficients are indicated at the bar ends. One model with a zero coefficient is
not shown. See Sections 2.5 and A.1 for information about model hyperparameter details and
statistics of model performance.

0.001 ± 0.01, 0.05 ± 0.02, and 0.62 ± 0.04 for cytotoxicity.
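For reference, the three classification metrics reported above can be computed directly from the confusion matrix. The following Python sketch uses the standard definitions; the function name and the toy labels are hypothetical and serve only to make the formulas concrete.

```python
import math


def classification_metrics(y_true, y_pred):
    """Return MCC, F1, and balanced accuracy for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0

    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Balanced accuracy averages the per-class recalls.
    balanced_accuracy = (recall + specificity) / 2
    # MCC uses all four confusion-matrix cells; 0 corresponds to chance level.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"mcc": mcc, "f1": f1, "balanced_accuracy": balanced_accuracy}


toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 1]
toy_metrics = classification_metrics(toy_true, toy_pred)
```

An MCC near 0 and a balanced accuracy near 0.5 indicate predictions no better than chance, which is the reference point for interpreting the adversarial control values.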

2.3.4 Validation of predicted peptide properties in vitro

A position-dependent amino acid frequency matrix was used to generate 127 peptides

that spanned the range of low to high predicted melanin binding. Among the 127 peptide

candidates, 113 peptides were classified as cell-penetrating and 117 peptides were predicted

as non-toxic. To experimentally measure melanin binding in vitro, biotinylated peptides

were incubated with mNPs, and the bound fraction was calculated using an avidin-based

fluorescent reporter (Fig. 2.3a). The Pearson correlation coefficient was computed to com-

pare the predicted and experimental melanin binding values, and the correlation coefficient

was r = 0.84, showing a high level of correlation between the predicted and experimental

values (Fig. 2.3b). We next characterized how the predicted cell-penetrating properties

of the peptides affected cell uptake in a retinal pigment epithelium cell line (ARPE-19).

ARPE-19 cells were cultured using standard methods (non-induced, n = 3) and using

culture conditions that induce melanin production (induced, n = 3) [28]. A positive corre-

lation was observed between the measured in vitro melanin binding of the peptides and the

intracellular peptide concentrations in melanin-induced cells for cell-penetrating peptides

(r = 0.77, p < 2.2 × 10^−16) but not non-cell-penetrating peptides (r = 0.28, Fig. 2.3c, d),



suggesting correlation between the two properties. Further, peptides predicted to be cell-

penetrating demonstrated significantly higher intracellular concentrations (median 229.4

pmol/100 K cells) than those of non-cell-penetrating peptides (median 26.7 pmol/100 K

cells) in the melanin-induced cells (p = 6.9 × 10^−6, Fig. 2.3e). In contrast, the intracellular

peptide concentrations were not affected by the predicted properties in non-induced cells

(Fig. A.5).
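Generating candidates from a position-dependent amino acid frequency matrix can be sketched as follows. This Python example is illustrative only: it assumes all input peptides share one length, uses two sequences quoted in this chapter as toy input rather than the microarray data, and the function names are hypothetical.

```python
import random
from collections import Counter

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")


def position_frequencies(peptides):
    """Build one amino acid frequency table per sequence position."""
    length = len(peptides[0])
    if any(len(p) != length for p in peptides):
        raise ValueError("this sketch assumes peptides of equal length")
    matrix = []
    for pos in range(length):
        counts = Counter(p[pos] for p in peptides)
        total = sum(counts.values())
        matrix.append({aa: counts[aa] / total for aa in AMINO_ACIDS})
    return matrix


def sample_peptide(matrix, rng):
    """Draw one residue per position according to that position's frequencies."""
    return "".join(
        rng.choices(AMINO_ACIDS, weights=[column[aa] for aa in AMINO_ACIDS])[0]
        for column in matrix
    )


# Toy input only: HR97 and TAT(47-57), both 11 residues long.
toy_seeds = ["FSGKRRKRKPR", "YGRKKRRQRRR"]
toy_matrix = position_frequencies(toy_seeds)
toy_rng = random.Random(0)
toy_candidates = [sample_peptide(toy_matrix, toy_rng) for _ in range(5)]
```

Because each position is sampled independently, generated peptides keep the positional residue preferences of the input set while recombining them into new sequences.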

2.3.5 Analysis of peptide variables that contribute to observed properties

To identify which peptide variables contributed to the properties observed in vitro,

Shapley additive explanation (SHAP) analysis of the final predictive models was performed.

The results showed that peptide property predictions were based on contribution from mul-

tiple variables. More specifically, basic peptides and higher net charge variables had higher

contributions to melanin binding predictions (Fig. 2.4a), which was consistent with the top

variables identified by the random forest classification model trained on the pilot peptide

microarray. Similarly, higher net charge and higher isoelectric point contributed more to

cell-penetration (Fig. 2.4b), and fewer free cysteines had more influence on non-toxic predictions

(Fig. A.6). To understand how reliable the interpretation results were, adversarial

controls were constructed with the final predictive models using a 10-fold cross-validation.

Indeed, the distributions and levels of variable contributions changed for melanin binding,

cell-penetration, and cytotoxicity (Fig. A.7). Among all the peptide candidates, HR97

(FSGKRRKRKPR) was selected based on the combination of the three peptide properties (melanin

binding_HR97 = 79.1 ± 0.7%, cell uptake_HR97 = 759.9 ± 19.6 pmol/100 K cells, non-toxic_HR97



Figure 2.3 Experimental validations of final model predictions on melanin binding and
cell-penetration. a Schematic showing an in vitro melanin binding assay with melanin nanopar-
ticles (mNPs) using a biotin quantification kit. The DyLight 494-tagged avidin emitted fluores-
cence when the biotinylated peptides displaced the weakly interacting 4′-hydroxyazobenzene-2-
carboxylic acid (HABA or H). Created with BioRender.com. b Plot of the relationship between
predicted melanin binding and binding measured experimentally in vitro. The x-axis indicates
melanin binding predictions from the final super learner, and the y-axis indicates the experimental
melanin binding values (n = 4 for each peptide). Dots represent the mean value for peptides. The
black linear trend line conveys the Pearson correlation relationship (two-tailed), and the gray area
indicates the 95% confidence interval. c, d Comparison of melanin binding and cell-penetration
in melanin-induced human adult retinal pigment epithelial (ARPE-19) cells. Blue triangles de-
note predicted non-cell-penetrating peptides (non-CPP), and magenta dots represent predicted
cell-penetrating peptides (CPP). The x-axes indicate melanin binding measured in vitro (n = 4
for each peptide), and the y-axes convey intracellular peptide concentration measured from the
cell uptake assay (n = 3 for each peptide). Black linear trend lines indicate Pearson correlation
relationships, with 95% confidence intervals shown as shaded areas. The correlation coefficients
and p-values (two-tailed) are shown. e Summary of CPP (n = 113)
and non-CPP (n = 14) intracellular concentrations. Box plot conveys median (middle line),
25th and 75th percentiles (box), and the 1.5 × interquartile range (whiskers). The p value was
calculated using a Mann–Whitney U test (two-tailed).

= 96.9%, Fig. 2.4c). HR97 had the highest intracellular concentration, which outperformed

the well-characterized cell-penetrating peptide fragment of the HIV trans-activator protein

(TAT_47–57, YGRKKRRQRRR). HR97 demonstrated increased cell uptake compared to

TAT_47–57 in both the induced ARPE-19 cells (cell uptake_HR97 = 759.9 ± 19.6 pmol/100 K

cells, cell uptake_TAT47–57 = 457.1 ± 34.2 pmol/100 K cells) and the non-induced cell type

(cell uptake_HR97 = 82.5 ± 9.1 pmol/100 K cells, cell uptake_TAT47–57 = 68.3 ± 4.6 pmol/100

K cells). In addition, HR97 showed no sign of cytotoxicity in ARPE-19 cells at concen-

trations up to 5 mg/mL (Fig. A.8). HR97 predictions embodied all the properties that

were the largest contributors to each functionality, including being basic (63.64% basic

amino acids), possessing a high net charge (6.98) and a high isoelectric point value (12.99),

and containing no cysteines (Fig. 2.4d–f). Visualizing the peptide design space defined by the

sequences and variables used in training the desired functional properties showed that the peptide

candidates with high melanin binding predictions appeared in the same cluster, with

similar sequence motifs and physicochemical properties (Fig. 2.5a, b). Further, peptides

predicted to have high melanin binding were mostly predicted to be cell-penetrating, but

cell-penetrating peptides may not be melanin binding (Fig. 2.5c). The results also suggest

that some melanin binding peptides may be toxic (Fig. 2.5d).
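
The sequence-derived variables discussed above (fraction of basic residues, cysteine count) are simple to compute. The following is a minimal sketch, not the pipeline used in this work; it assumes that "basic" means arginine, lysine, and histidine, and the function names are hypothetical. Net charge and isoelectric point depend on a chosen pKa scale and are not reproduced here.

```python
# Residues treated as basic in this sketch (assumption: Arg, Lys, and His).
BASIC_RESIDUES = frozenset("RKH")


def basic_fraction(sequence: str) -> float:
    """Return the fraction of residues in `sequence` that are basic."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(residue in BASIC_RESIDUES for residue in sequence) / len(sequence)


def cysteine_count(sequence: str) -> int:
    """Return the number of cysteine residues in `sequence`."""
    return sequence.upper().count("C")


# TAT(47-57) sequence as quoted in the text above.
TAT_47_57 = "YGRKKRRQRRR"
print(f"basic residues: {100 * basic_fraction(TAT_47_57):.2f}%")
print(f"cysteines: {cysteine_count(TAT_47_57)}")
```

The same two functions could be applied to any candidate sequence to check it against the property profile described for HR97.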



Figure 2.4 Melanin binding, cell-penetration model interpretation, and variable con-
tributions to HR97 multifunctional peptide predictions. Overall variable contributions to
model predictions for (a) melanin binding and (b) cell-penetration. The top important variables
analyzed using Shapley additive explanations (SHAP) are shown. Dots represent peptides from
cross-validation test sets. The x-axes indicate SHAP values, indicative of variable contributions
to model prediction ranging from 0 to 100. The variables were ranked based on the difference
between the maximum and minimum SHAP values. The color gradient indicates the variable
values normalized by percentile ranks. Higher variable values are indicated by darker magenta
color and lower values by darker blue color. The minimum and maximum variable values are
noted on the right of each subplot. c Scatter plot showing the
in vitro melanin binding, in vitro cell-penetration, and predicted cytotoxicity values of the 127
candidate peptides. Dots represent peptides. HR97 (black dot) was selected based on the optimal
multifunctional combination. d–f Variable contributions to HR97 multifunctional predictions
for melanin binding, cell-penetration, and cytotoxicity. The top variables ranked by absolute
SHAP values are shown. Magenta bars indicate positive contributions, and blue bars are negative
contributions. The y-axis labels convey variable names and their values for HR97. E[f(X)] denotes
the expected prediction value, and f(x) is the final prediction, calculated from the sum of all SHAP
values plus E[f(X)].
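
The additive relationship stated in the caption can be written compactly. The symbols used here are notation assumed for illustration: \phi_j(x) is the SHAP value of variable j for peptide x, and M is the number of variables in the model.

```latex
f(x) = \mathbb{E}[f(X)] + \sum_{j=1}^{M} \phi_j(x)
```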

2.3.6 Characterization and validation of a peptide-drug conjugate in vivo

To investigate the effect of peptide conjugation on drug pharmacodynamics, we chose

brimonidine tartrate, a topical IOP lowering drug prescribed for glaucoma treatment.

The HR97 peptide was conjugated to brimonidine (HR97-brimonidine) via a quaternary-

ammonium traceless linker system, and the structure of the intermediates and the purified

conjugate were validated by NMR and MALDI-TOF (Figs. A.9–A.12). Conjugation to

HR97 provided a ∼10-fold increase in the in vitro melanin binding capacity of brimonidine

(Kd = 5.9 × 10^−7 M vs. 5.0 × 10^−8 M), which brought the binding capacity closer to other

drugs with high intrinsic melanin binding, such as sunitinib malate (Fig. 2.6a) [28, 41–45].

When incubated in human aqueous fluid, only ∼7% of the brimonidine was released from

the HR97-brimonidine conjugate over 28 days in vitro (Fig. 2.6b). However, upon incuba-

tion with supraphysiological concentrations of human cathepsin cocktails to enzymatically

cleave the linker, ∼52% of the brimonidine was liberated within 48 h (Fig. 2.6c). The effect

of the HR97-brimonidine conjugate on IOP was then evaluated in normotensive Dutch

Belted rabbits. A single topical dose with the commercial brimonidine eye drop (n = 5) was

found to provide a peak reduction in IOP from baseline (∆IOP) of −3.0±0.82 mmHg that



Figure 2.5 Visualization of the peptide design space based on sequences and physio-
chemical properties. a t-distributed stochastic neighbor embedding (t-SNE, used to visualize
high-dimensional data) plots showing the peptide design space defined by the combination of
one-hot encoded peptide sequences and variables used in melanin binding, cell-penetration, and
cytotoxicity model training. Dots represent control peptides from Howell et al. [39] (magenta)
and Nosanchuk et al. [40] (blue); peptides used in the pilot (purple) and second (gray and yellow)
melanin binding peptide microarrays; and multifunctional peptide candidates (black and yellow)
used in the validation experiments. HR97 and TAT are noted. b t-SNE plot of peptides colored
by melanin binding prediction. Higher melanin binding values are colored by darker magenta and
lower by darker blue. c t-SNE plot of peptides colored by cell-penetration prediction. Magenta
dots represent predicted cell-penetrating peptides (CPP), and blue dots are predicted non-cell-
penetrating peptides (non-CPP). d t-SNE plot of peptides colored by cytotoxicity prediction.
Blue dots denote predicted toxic peptides, and magenta dots indicate non-toxic peptides.



recovered to baseline within 8 h (Fig. 2.6d). In contrast, a single ICM injection of the HR97-

brimonidine conjugate resulted in a greater peak ∆IOP compared to an ICM injection of

brimonidine solution at 2 days (−4.9 ± 0.46 mmHg vs. −2.6 ± 1.65 mmHg, p < 0.05, red

arrow). In a separate experiment, ICM injection of saline or HR97 (n = 5 for each) resulted

in a similar decrease in IOP that returned to baseline by day 3, and ICM injection of a

physical mixture of HR97 and brimonidine tartrate (n = 5) resulted in a similar IOP profile

to the brimonidine solution, returning to baseline by day 8 (Fig. A.13). To ensure that the

dramatic decrease in IOP with the HR97-brimonidine conjugate was not due to toxicity,

a board-certified ophthalmologist evaluated the eyes injected with the HR97-brimonidine

conjugate on day 7. It was observed that the lids, lashes, and conjunctiva were normal, the

corneas were clear, the corneal endothelium was normal without any pigment deposition,

the anterior chambers were normal depth, there was no apparent inflammation or fibrin

strands, the lenses were clear, and the iris pigmentation was symmetric. According to the

same evaluation methods, no ocular toxicity was observed upon ICM injection of saline,

HR97, or a physical mixture of HR97 and brimonidine tartrate for at least 28 days (Tables

A.4–A.7). The mean ∆IOP in the HR97-brimonidine conjugate group remained signifi-

cantly larger than in the rabbits dosed with brimonidine solution or the physical mixture

of HR97 and brimonidine tartrate for up to 14 days (Figs. 2.6d, A.13). Further, the time

for the mean ∆IOP to return to baseline was 20 days in the HR97-brimonidine conjugate

group compared to 8 days in both groups of rabbits dosed with brimonidine solution or

the physical mixture of HR97 and brimonidine tartrate. When summing the area under

the curve (AUClast) for the cumulative ∆IOP over the 20-day measurement period after

ICM injection, the HR97-brimonidine conjugate showed a ∼17-fold greater AUC compared



to brimonidine solution (p < 0.001) (Fig. 2.6e). A pharmacokinetic study was conducted

separately to characterize the intraocular distribution of brimonidine after ICM injection

of HR97-brimonidine in Dutch Belted rabbits. The brimonidine concentration remained

relatively high in the pigmented iris tissue (980 ng/g) compared to less pigmented parts

of the eye, such as the aqueous humor (0.4 ng/g) and the retina (8.3 ng/g) up to 28 days

after a single ICM injection (Fig. 2.6f). The brimonidine concentration in the aqueous on

day 7 (83.3 ng/g) was similar to what we previously reported at 2 h after a single drop

of Alphagan P (0.15%) (105 ng/g), which was the time with the largest IOP reduction

in that study [46]. On day 14 after ICM injection of HR97-brimonidine, the brimonidine

concentration in the aqueous (3.9 ng/g) was similar to what we previously reported at 4 h

after a single drop of Alphagan P (0.15%) (4 ng/g) [46].
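
The cumulative ∆IOP comparison above rests on an area-under-the-curve calculation up to the last measured time point. A minimal sketch of such a calculation follows. It assumes the linear trapezoidal rule, which the text does not specify, and the time points and ∆IOP values are hypothetical placeholders rather than measured data.

```python
def auc_last(times, values):
    """Trapezoidal area under `values` from the first to the last time point."""
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two paired time points and values")
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        if width <= 0:
            raise ValueError("time points must be strictly increasing")
        # Area of one trapezoid between consecutive measurements.
        area += 0.5 * (values[i - 1] + values[i]) * width
    return area


# Hypothetical mean change in IOP from baseline (mmHg) on each day.
days = [0, 2, 4, 8, 14, 20]
delta_iop = [0.0, -4.0, -3.0, -2.0, -1.0, 0.0]
print(auc_last(days, delta_iop), "mmHg x day")
```

A fold difference between two treatment groups would then be the ratio of their AUC values computed over the same measurement period.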

2.4 Discussion

Chronic eye diseases such as glaucoma require continuous treatment to prevent disease

progression. Eye drops are the most common dosage form of glaucoma therapy, though low

adherence to intensive drop dosage schedules is a major challenge in disease management

[11,47,48]. One study using an electronic monitoring device found that only 64% of patients

adhered to the three-times daily dosing schedule for brimonidine eye drops over a 4-week

period, even though they were aware of the monitoring [49]. Sustained drug delivery

systems may be an attractive alternative for the management of chronic ocular diseases

like glaucoma. The first sustained-release polymer-based implant for glaucoma treatment,

Durysta®, was recently approved for sustained IOP lowering for several months with a



Figure 2.6 Characterization of HR97-brimonidine in vitro and in vivo. a In vitro
binding capacity and dissociation constant of HR97-biotin, HR97-brimonidine, and brimonidine
characterized using a melanin nanoparticle (mNP) assay (red dots, n = 3–5). Values shown for
comparison include those we previously measured for sunitinib and N-desethyl sunitinib [28], and
literature values for other ophthalmic drugs [41–45]. b In vitro stability of HR97-brimonidine
conjugate in human aqueous humor for 28 days. The percent remaining was normalized to the
starting concentration on day 0 (n = 3). Data are shown as mean ± SD. c Cathepsin cleavage
assay of the HR97-brimonidine conjugate. HR97-brimonidine (n = 3) was incubated with human
cathepsin cocktails or buffer only for 48 h at 37 °C (two-tailed t-test). Data are shown as mean ±
SD. d Comparison of the intraocular pressure (IOP) change from baseline (∆IOP) after a single
ICM injection of HR97-brimonidine conjugate (white dots), brimonidine solution (black dots, 200
µg brimonidine equivalent), and a single drop of Alphagan P (gray dots, 0.15%) in normotensive
Dutch Belted rabbits (n = 5 per group). The IOP was measured
every 1–2 days until returning to the baseline. The red arrow highlights the further decrease in
IOP provided by the HR97-brimonidine. Two-tailed t-test was used, ∗p < 0.05 (adjusted p values
for days 2, 3, 4, 6, and 8 were 0.044, 0.007, 0.038, 0.007, 0.007, respectively). Data are presented
as mean ± SEM. e Cumulative ∆IOP of brimonidine (black dots) and HR97-brimonidine (gray
squares) after ICM injection. The cumulative ∆IOP was characterized by calculating the area
under the curve over the 20-day measurement period (AUClast, n = 5). Two-tailed t-test was used.
Data are presented as mean ± SD. f Levels of brimonidine in the iris (black dots), aqueous (gray
squares), and retina (white dots, n = 3–4) over time after ICM injection of HR97-brimonidine
(200 µg brimonidine equivalent). The concentrations of brimonidine measured in the aqueous
after a single drop of Alphagan P (0.15%) as part of a previous study [46] at 2 h (maximal IOP
lowering time point; dotted line) and 4 h (dashed line) after dosage are shown. Data are shown
as mean ± SD.

single ICM injection [17]. However, the polymer matrix typically took longer to biodegrade

than the duration of drug release, and repeated injection with additional implants was

associated with increased risk of corneal endothelial cell loss and other corneal adverse

reactions [50]. In contrast to conventional polymer-based sustained drug delivery systems,

the approach we describe here does not require an implant or large amounts of excipients

that will remain in the eye for extended periods. By utilizing short peptide sequences that

impart melanin binding to the drug conjugate, a sustained intraocular drug release system

was created without the need for a polymer matrix.

Ocular melanin is a biopolymer that resides within melanosomes in pigmented ocular

tissues, including the iris, ciliary body, choroid, and retinal pigment epithelium (RPE) [51].

Although the amount of pheomelanin in the eye varies depending on eye color, the amount

of eumelanin in ocular tissues, including the RPE, iris pigment epithelium, and pigmented

ciliary epithelium is more consistent across the population [52]. It has been described that

drug binding to melanin and accumulation inside cells may diminish therapeutic effect by

sequestering the drug or causing ocular toxicity [26,27]. In the case of atropine, the intrinsic



melanin binding properties were shown to lead to prolonged residence time in pigmented

rabbit ocular tissues [53], and a sustained miotic response in pigmented rabbits [29]. In ad-

dition, we previously demonstrated that improving the intraocular absorption of sunitinib,

a drug with relatively high melanin binding capacity, with a novel gel-forming hypotonic

eye drop led to prolonged therapeutic effect of up to 1 week after dosing [28]. Indeed,

a recent study used machine learning methods to characterize the structural features of

small molecule drugs that impact intrinsic melanin binding, leading to the development of

a model that predicted intrinsic melanin binding with 91% accuracy [30]. These findings

motivated us to develop engineered adaptors designed to impart tunable melanin binding

properties to small molecule drugs used to treat ocular diseases. Further, as melanin is

contained within cells, the engineered adaptor should additionally provide cell penetra-

tion. Here, we developed a machine learning-based methodology to engineer tri-functional

peptides that displayed melanin binding, cell-penetration, and non-toxic properties. The

peptide sequence that provided the optimal combination of high melanin binding, high

cell-penetration, and low cytotoxicity, HR97, was then conjugated to brimonidine as a

proof-of-principle. The HR97-brimonidine conjugate provided up to 18 days of IOP lower-

ing with a single ICM injection in normotensive rabbits, which contrasts with the 8 h-effect

provided by a brimonidine eye drop.

Peptides are short sequences of amino acids that can have many combinations with

diverse biological functions. Compared to other aptamers and small molecule drug libraries,

peptides are relatively cost effective to synthesize and are relatively easy to modify or

conjugate to small molecule drugs [54]. Currently, there are more than 80 FDA-approved

peptide drugs and more than 600 in clinical and preclinical trials [55–57]. Peptides optimized



for a single function, either exhibiting cell-penetration or cell targeting properties, have been

widely exploited as drug carriers to shuttle drugs across biological barriers [58–60]. Peptides

such as TAT, penetratin, PEP-1, and polyarginine (R6 or R8) have been conjugated

with various cargos for targeting the anterior and posterior segment [61–68]. For example,

various fluorescein conjugated peptides were screened for the ability to cross porcine cornea

ex vivo [68, 69]. Penetratin (PNT) showed an eightfold increase in permeability compared

to PEP-1, though most of the peptide was found to be sequestered within cells rather

than having crossed the cornea [68, 69]. In another study, TAT peptide was conjugated to

human acidic fibroblast growth factor (aFGF) and applied topically to rat eyes [70]. The

conjugates reached the retina with a tmax of 30–60 min, possibly via a

conjunctival-scleral penetration route [70]. However, it is known that drugs

can more easily reach the posterior segment with topical administration in rat and mouse

eyes compared to larger eyes, such as rabbits [71–73].

Many peptide screening technologies have been developed for identifying novel func-

tional peptides, including phage display, mRNA display, and peptide microarray [74–76].

Phage display and mRNA display are capable of screening a larger number of peptides

(∼10^11–10^13) compared to peptide microarray (∼10^5). However, in phage display and

mRNA display, the peptide sequences are randomly generated with fixed ratios of amino

acids [75]. In contrast, coupling computationally generated peptide sequences with pep-

tide microarrays has the advantage of rapidly improving peptide design through machine

learning model refinement. Peptides can be computationally represented by physicochem-

ical and structural descriptors [77] or encoded using various rules such as binary encoding

and evolution-based encoding [78]. Since peptide sequence is the source of functionality,



a machine learning-based approach can be employed to develop predictors that learn the

relationships between peptide variables derived from the sequence and the desired func-

tional property [79–81]. Peptide databases have also been made available for data-driven

functional peptide design, including cell-penetration and toxicity [32, 33]. However, there

is only a limited number of studies of, and no database for, melanin binding peptides. For

example, the two studies that reported peptide sequences characterized as melanin binding

used phage display to identify 8 peptides that bound to melanin in human melanoma cells

[39] and 8 peptides that bound to melanized C. neoformans [40]. However, in our peptide

microarray, 8 of these peptides did not demonstrate

detectable melanin binding, and overall, we identified 780 peptides displaying higher levels

of melanin binding than any of these peptides described in the literature. Furthermore, the

second peptide microarray designed using the initial machine learning model provided more

potent melanin binding peptides compared to the first peptide microarray, demonstrating

the rapid improvement in design by machine learning model refinement.
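
As an illustration of the binary encoding mentioned above, the sketch below one-hot encodes a peptide sequence into a flat 0/1 vector with 20 entries per position. The alphabet order, the zero-padding of shorter peptides to a common length, and the function name are assumptions made for this example and are not taken from the original analysis.

```python
# The 20 standard amino acids; the ordering is an arbitrary choice for this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_encode(sequence, length=None):
    """Return a flat 0/1 list with 20 entries per position.

    Positions beyond the end of the peptide are left as all zeros so that
    peptides of different lengths map to vectors of the same size.
    """
    sequence = sequence.upper()
    length = len(sequence) if length is None else length
    if len(sequence) > length:
        raise ValueError("sequence is longer than the requested length")
    vector = []
    for position in range(length):
        block = [0] * len(AMINO_ACIDS)
        if position < len(sequence):
            # str.index raises ValueError for a non-standard residue.
            block[AMINO_ACIDS.index(sequence[position])] = 1
        vector.extend(block)
    return vector


encoded = one_hot_encode("YGRKKRRQRRR", length=15)
print(len(encoded), sum(encoded))
```

Vectors of this kind can be concatenated with physicochemical descriptors to form the combined representation used for visualization or model training.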

Multifunctional peptides with dual or triple pharmacological properties have also

been integrated into drug delivery systems through conjugation to drugs or drug-loaded

cargos [34,82,83]. However, it is challenging to design peptides with multiple functions con-

tained in a single sequence. Often single function peptides are fused directly or by a linker

peptide [83–85], which may increase the peptide length and reduce the desired functional

properties of each component. Another approach is to optimize additional functional prop-

erties by substituting amino acids on a template peptide with a known function [35, 36],

which may require extensive laboratory screening and is time-consuming. Generating mul-

tifunctional peptides with the flexibility to choose the desired functional levels is a less



explored research area [86, 87]. Here, our machine learning and model interpretation ap-

proach guided the engineering of multifunctional peptides. The peptide properties were

analyzed using the shared variable set, revealing mutually important variables contribut-

ing to both melanin binding and cell-penetration, where peptides with moderate to high

net charge and containing more basic amino acids tend to possess both melanin binding

and cell-penetrating properties. Further, we unexpectedly observed a correlation between

melanin binding and cell-penetration in the in vitro cell uptake assay. Thus, the highest intracellular

accumulation was achieved by increasing the amount of peptide that can access intracellu-

lar melanosomes, where the peptides can then bind to melanin and provide sustained drug

release.

Many machine learning models including random forest, support vector machines,

and deep learning have been developed to predict how amino acid sequence governs peptide

properties [88]. Super learning is an ensemble machine learning method that takes

advantage of various machine learning models. The predictive performance of a super

learner (SL) ensemble is expected, asymptotically, to be at least as good as that of the

best-performing base model [89, 90].

The same model types with varying hyperparameter combinations can be included in a

SL ensemble. Recently, it was described that base model hyperparameter tuning could

improve overall SL model performance [91]. Based on this finding, we further developed a

procedure to systematically select optimal base model composition by iteratively filtering

out models that contribute less to the SL ensemble. Indeed, we obtained better

SL model performance compared to the one including all base models. In this study, we

explored a wide array of possible machine learning models and identified multiple com-

petitive models through statistical analyses. SL provided a framework to integrate these



explored models. Although the meta-learner may add a layer of complexity, it demon-

strated an interpretable summary of the model importance in terms of their contributions

to the final predictions. In addition, the complexity of the machine learning architecture

was reduced by variable reduction of the data sets and base model filtration of SL. Further,

interpretable machine learning that extracts relevant information such as variable contribu-

tions to output predictions from the data relationships learned by the model is important

for explaining model predictions [92, 93]. Many of the functional peptide predictors and

other drug discovery tools do not have information on how and why top candidates were

identified [94–96]. In this study, we showed that interpretation of machine learning models

can provide insights to improve the design of multifunctional peptides. The SHAP analysis

not only indicated important variables contributing most to the model prediction, but also

showed the relationships between variable values and prediction outputs.
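
The idea of weighting base models with a meta-learner and iteratively filtering out low-contribution models can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in this study: the meta-learner is taken to be a least-squares fit with non-negative weights that sum to one, the weights are found with Frank-Wolfe steps, and the `min_weight` cutoff, the function names, and the example predictions are all hypothetical.

```python
def ensemble_predictions(cv_preds, weights):
    """Weighted sum of base-model predictions for every sample."""
    return [sum(w * p for w, p in zip(weights, row)) for row in cv_preds]


def fit_meta_weights(cv_preds, y, max_iter=1000):
    """Least-squares weights constrained to be non-negative and sum to 1.

    cv_preds[i][j] is the cross-validated prediction of base model j for
    sample i. Solved with Frank-Wolfe steps and an exact line search.
    """
    n_models = len(cv_preds[0])
    sse = [sum((row[j] - t) ** 2 for row, t in zip(cv_preds, y))
           for j in range(n_models)]
    weights = [0.0] * n_models
    weights[sse.index(min(sse))] = 1.0  # start from the best single model
    for _ in range(max_iter):
        current = ensemble_predictions(cv_preds, weights)
        resid = [p - t for p, t in zip(current, y)]
        # Gradient of 0.5 * squared error with respect to each weight.
        grad = [sum(r * row[j] for r, row in zip(resid, cv_preds))
                for j in range(n_models)]
        target = grad.index(min(grad))
        # Change in predictions when moving all weight toward `target`.
        delta = [row[target] - p for row, p in zip(cv_preds, current)]
        denom = sum(d * d for d in delta)
        if denom == 0.0:
            break
        step = -sum(r * d for r, d in zip(resid, delta)) / denom
        step = max(0.0, min(1.0, step))
        if step == 0.0:
            break  # no feasible direction lowers the error
        weights = [(1.0 - step) * w for w in weights]
        weights[target] += step
    return weights


def filter_base_models(cv_preds, y, names, min_weight=0.05):
    """Refit repeatedly, dropping models whose weight is below min_weight."""
    keep = list(range(len(names)))
    while True:
        subset = [[row[j] for j in keep] for row in cv_preds]
        weights = fit_meta_weights(subset, y)
        survivors = [j for j, w in zip(keep, weights) if w >= min_weight]
        if not survivors:  # always retain the highest-weighted model
            survivors = [keep[weights.index(max(weights))]]
        if len(survivors) == len(keep):
            return {names[j]: w for j, w in zip(keep, weights)}
        keep = survivors


# Hypothetical cross-validated predictions for three base models.
y = [0.0, 1.0, 0.0, 1.0]
cv_preds = [[0.1, 0.5, 0.9], [0.9, 0.5, 0.1], [0.2, 0.5, 0.8], [0.8, 0.5, 0.2]]
print(filter_base_models(cv_preds, y, ["good", "flat", "bad"]))
```

The returned weights double as a summary of how much each retained base model contributes to the final prediction, which is the interpretability benefit noted above.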

The studies described here are not without limitations. First, while the in vitro

ARPE-19 cell assay helped validate the cell-penetrating and melanin binding performance,

the methodology used here did not differentiate between peptides that were free or bound

to melanin or other structures within the cell. Indeed, there was a baseline level of peptide

associated with non-pigmented cells, but a substantial increase in cellular localization was

observed when the cells were induced to produce melanin. Second, the traceless linker

conjugation yield of the HR97-brimonidine was low and requires further optimization. The

cathepsin-labile linker was chosen because cathepsins are largely located intracellularly

and are present in minimal amounts in extracellular fluids such as aqueous humor [97–100].

Thus, the intracamerally delivered HR97-brimonidine would be stable until it had localized

within melanin-containing cells. However, the level of brimonidine measured in rabbit



iris tissue remained high, suggesting that further optimization of the linker cleavage and

brimonidine release rate may also extend the duration of the therapeutic effect. Finally,

the duration of IOP lowering reported here (20 days) was sufficient to demonstrate the

proof-of-principle in normotensive rabbits but would not be clinically translatable. Future

work with more potent drugs may increase the duration of action.

The approach we described here to apply ensemble machine learning to peptide mi-

croarray enabled the efficient design of multi-functional peptides, which in this application

enhanced the intraocular pharmacokinetics and pharmacodynamics of the ophthalmic drug

brimonidine. Engineered HR97 peptide demonstrated increased cell-penetrating properties

compared to known cell-penetrating peptides, such as TAT, and simultaneously possessed

high melanin binding capacity and low cytotoxicity. In the current context, utilizing short

peptide sequences that impart melanin binding to a drug conjugate may provide an av-

enue for creating safe and effective implant-free sustained intraocular drug release systems.

More broadly, the approach described here can be applied to generate multifunctional

peptide-drug conjugates for a variety of biomedical applications.

2.5 Methods

2.5.1 Material sources

Brimonidine was purchased from TCI America. Eumelanin from Sepia officinalis,

0.22 µm Millex-GV PVDF filter, ferric ammonium citrate, bovine serum albumin (BSA),

Tween 20, fetal bovine serum (FBS), trifluoroacetic acid (TFA), tert-Butyl methyl ether

(MTBE), thionyl chloride, Tetrabutylammonium iodide, N,N-diisopropylethylamine, hu-



man cathepsins B, K, L and S, Whatman® Anotop® 0.02 µm syringe filter and Triton

X-100 were purchased from Sigma Aldrich (St. Louis, MO, USA). ARPE-19 (ATCC CRL-

2302, lot No. 70013110), and DMEM:F12 medium were purchased from the American

Type Culture Collection (Manassas, VA, USA). EZ-Link™ Amine-PEG2-Biotin, BupH

MES buffer saline pack (2-(N-morpholino)ethanesulfonic acid buffer), EDC (1-ethyl-3-(3-

dimethylaminopropyl)carbodiimide hydrochloride), NHS (N-hydroxysuccinimide), Pierce™

Fluorescence Biotin Quantitation Kit, rapid equilibrium dialysis (RED) 8 K device,

PrestoBlue™ HS Cell Viability Reagent, DMEM with high glucose and pyruvate, Trypsin-

EDTA (0.25%) with phenol, RIPA lysis buffer, Streptavidin DyLight 680, and penicillin/

streptomycin were purchased from Thermo Fisher Scientific (Waltham, MA, USA). Dispos-

able PD-10 desalting columns were purchased from VWR. Dulbecco’s Phosphate Buffered

Saline (DPBS), 1 × phosphate buffered saline (PBS), 10 × PBS, high-performance liquid

chromatography (HPLC) grade acetonitrile, dimethylformamide (DMF), and water were

purchased from Fisher Scientific (Hampton, NH, USA). Mc-Val-Cit-PAB was purchased

from Cayman Chemical (Ann Arbor, MI, USA). Endotoxin-Free Ultra-pure Water was

purchased from MilliporeSigma (Burlington, MA, USA). A Hamilton 1700 Series gastight

syringe (25 µL, Model 1702 RN, 27 gauge) was purchased from Hamilton Company (Reno,

NV, USA). BD 1 mL TB syringes with 28 G needles were purchased from BD (San Jose,

CA, USA). Isoflurane was purchased from Baxter (Deerfield, IL, USA). Reverse-action for-

ceps were purchased from World Precision Instruments (Sarasota, FL, USA). Neomycin,

polymyxin b, and bacitracin zinc ophthalmic ointment was purchased from Akorn (Lake

Forest, IL, USA).



2.5.2 Melanin nanoparticle synthesis and characterization

Melanin nanoparticles (mNPs) were synthesized from the eumelanin of Sepia offici-

nalis. In brief, 10 mg/mL of eumelanin was suspended in the DPBS using an ultrasonic

probe sonicator (Sonics, Vibra Cell VCX-750 with model CV334 probe, Newtown, CT,

USA) by pulsing 1 s on/off at 40% amplitude for 30 min in a 4 °C water bath. The

suspension was then filtered through a 0.22 µm Millex-GV PVDF filter and transferred to PD-10

desalting columns. The resulting mNPs solution was lyophilized for 7 days and stored at

−20 °C until further use. For mNP biotinylation (b-mNPs), mNPs were suspended in 2

mL MES buffer with 2.4 mg of EDC and 3.6 mg of NHS for 15 min at room temperature to

first activate the carboxylic acid groups. To increase the buffer pH above pH 7.4 for amine

reaction, 400 µL of 10 × PBS was directly added to the mixture and incubated for 5 min.

Various amounts of EZ-Link™ Amine-PEG2-Biotin (5, 15, 20, 30 mg) were reacted with

activated mNPs for 2 or 6 h at room temperature. Since all conditions led to a similar

degree of mNP biotinylation, reaction conditions using 5 mg of amine-PEG2-biotin with

2 h incubation at room temperature were used moving forward. The reaction mixture was

then transferred to PD-10 desalting columns to further collect the b-mNPs. To transfer the

b-mNPs to different solvents (water, pH 6.5 PBS, pH 7.4 PBS) for optimization of the pep-

tide microarray, PD10 columns were first equilibrated with buffer, and then the b-mNPs

were added. Particle size and ζ-potential were determined by dynamic light scattering and

laser Doppler anemometry, respectively, using a Zetasizer Nano ZS90 (Malvern

Instruments). Size measurements were performed at 25 °C at a scattering angle of 173°. Samples

were diluted in 10 mM NaCl solution (pH 7), and measurements were performed according



to instrument instructions. Pierce™ Fluorescence Biotin Quantitation Kits were used to

quantify the biotin content on the b-mNPs. B-mNPs (1 mg/mL) were diluted 1:50, 1:100, and

1:200 with 1 × PBS and the standard biocytin concentrations (10–60 pmol/10 µL) were

freshly prepared for measuring the biotin concentration. Transmission electron microscopy

(H7600; Hitachi High Technologies America) was conducted to determine the morphology

of mNPs and b-mNPs.

2.5.3 Optimization of processing conditions for peptide microarray

A total of 119 peptides, including 8 peptides of length 7 amino acids (aa) and 8 pep-

tides of length 10 aa from the literature [39,40], and 103 random 15 aa peptides generated

with a frequency of 5% for each of the 20 amino acids, were printed in duplicate on peptide

microarrays by PEPperPRINT. The peptide microarrays contained hemagglutinin (HA)

peptides (YPYDVPDYAG; 9 spots) as internal quality controls. Various screening

conditions for the peptide microarray were tested. A spectrum scan of melanin nanoparticles

(mNPs) and biotinylated mNPs confirmed that the autofluorescence was near background

levels at emission wavelengths above 650 nm. Streptavidin DyLight 680, which had the

longest wavelengths (Ex = 675 nm, Em = 705 nm) that PEPperPRINT could use in their peptide

microarray system, was selected to minimize detection of melanin. Two peptide microarray copies

were first pre-stained with streptavidin DyLight680 (0.2 µg/ml) and the control antibody

(manufacturer: BioxCell & PEPperPrint, catalogue numbers: #RT0268, PEPperCHIP®

Mouse Monoclonal anti-HA (12CA5)-DyLight800 Control; 1:2000 dilution or 0.5 µg/ml) in

incubation buffer (pH 6.5 PBS with 0.005% Tween 20 and 10% Rockland blocking buffer



MB-070) for 45 min at room temperature to examine background interactions and internal

quality control. No background interaction of streptavidin DyLight680 or the control

antibody with the 119 different peptides was observed. To screen for the optimal melanin binding

condition, six different washing buffers were prepared: PBS at pH 6.5 with or without

0.005% Tween 20, PBS at pH 7.4 with or without 0.005% Tween 20, and Ultra-pure water

with or without 0.005% Tween 20. The Rockland blocking buffer MB-070 was used to

incubate all peptide microarrays for 30 min before the melanin binding assay. Six different

incubation buffers were formulated with 10% of blocking buffer in the six different

washing buffers mentioned earlier. b-mNPs (10, 100, or 500 µg/mL) in six different incubation

buffers were incubated with the peptide microarray for 16 h at 4 °C or room

temperature. All microarrays were subsequently washed with the same type of washing buffers

and incubated with 0.2 µg/mL of streptavidin DyLight680 for 45 min in the same type of

incubation buffer at room temperature for detecting the b-mNPs. The peptide microarrays

were then washed for 3 × 10 s with the same type of washing buffers and proceeded to

quantification of spot intensity. The pilot tests suggested that 500 µg/mL of biotinylated

mNPs in pH 6.5 PBS buffer at room temperature was optimal (optimal condition shown

in Fig. 2.1d, remaining conditions shown in Fig. A.2). With the optimal flow conditions, 10

of the 16 peptides reported in the literature had detectable fluorescence intensities due to

binding by b-mNPs. Quantification of spot intensities and peptide annotation were based

on the 16-bit gray scale Tag Image File Format files that exhibit a higher dynamic range

than the 24-bit colorized Tag Image File Format files. Microarray image analysis was done

with PepSlide® Analyzer, version 1.4. The software algorithm decomposed fluorescence

intensities of each spot into raw, foreground and background signal, and calculated mean



median foreground intensities and spot-to-spot deviations of spot duplicates. Based on

mean median foreground intensities, intensity maps were generated and interactions in the

peptide maps highlighted by an intensity color code with red for high and white for low

spot intensities. The PEPperPRINT protocol tolerated a maximum spot-to-spot

deviation of 40%; otherwise, the corresponding intensity value was zeroed. We labeled the top

20% of peptides ranked by intensities as melanin binding (23 peptides), which included 10

literature-reported peptides with non-zero fluorescent signal. The remaining peptides were

labeled as non-melanin binding (96 peptides).
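
The labeling rule described above can be illustrated with a minimal Python sketch. It is not the PepSlide Analyzer or PEPperPRINT implementation. The function name, the dictionary inputs, and the toy values are hypothetical, and rounding the top 20% down is an assumption that happens to agree with the reported 23 of 119 peptides.

```python
import math

def label_peptides(intensities, deviations, max_dev=0.40, top_frac=0.20):
    """Zero unreliable intensities, then flag the top fraction as binders.

    intensities: dict mapping peptide -> foreground intensity
    deviations:  dict mapping peptide -> spot-to-spot deviation (as a fraction)
    """
    # Intensities whose duplicate spots deviate by more than max_dev are zeroed.
    cleaned = {p: (0.0 if deviations[p] > max_dev else v)
               for p, v in intensities.items()}
    # Rank peptides from highest to lowest cleaned intensity.
    ranked = sorted(cleaned, key=cleaned.get, reverse=True)
    # Assumption: the size of the top group is rounded down.
    n_top = math.floor(top_frac * len(ranked))
    binders = set(ranked[:n_top])
    return {p: (p in binders) for p in ranked}, cleaned

# Toy data: ten hypothetical peptides with intensities 100, 90, ..., 10.
toy_intensities = {f"pep{i}": float(100 - 10 * i) for i in range(10)}
toy_deviations = {p: 0.10 for p in toy_intensities}
toy_deviations["pep0"] = 0.55  # exceeds the 40% tolerance, so it is zeroed

labels, cleaned = label_peptides(toy_intensities, toy_deviations)
```

In this toy example the brightest peptide is zeroed because its duplicates disagree too strongly, so the two labeled binders are the next two peptides in the ranking. Ties are broken by input order here, which is also an assumption.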

2.5.4 Random forest classification model training with the pilot 119-peptide microarray

Random forest is an ensemble tree-based statistical machine learning model and is

robust to variable noise and insensitive to variable scales [38]. Physicochemical variables and

numerical representations of peptides were computed using the R packages Peptides, version

2.4.4 [101] and protr, version 1.6–2 [102]. The resulting 1094 variables include composition,

transition, distribution, autocorrelation, conjoint triad, quasi-sequence-order descriptors,

and pseudo-amino acid and amphiphilic pseudo-amino acid composition descriptors. The

maximum value of lag was set to 6, so the minimum length of a peptide to be analyzed

without generating a missing value is 7. A random forest classification model with 100,000

trees and balanced sampling was trained on the melanin binding data set. The model

was built using the R package randomForest, version 4.7–1.1 [103]. For each tree in the

random forest, a bootstrap sample of ∼63.2% of the melanin binding peptides and the

same number of non-melanin binding peptides was generated to construct the tree. The

remaining peptides were considered out-of-bag to the tree and were used to evaluate the

performance of the random forest by calculating the aggregated out-of-bag predictions

across all trees. The out-of-bag class errors were calculated and a classification threshold

of 0.5 proportion of votes was used. As part of the same analysis, permutation variable

importance was obtained with the importance function in the randomForest package. For

each tree in the random forest, out-of-bag instances were permuted for each variable in the

subset, and the decrease in accuracy was recorded. The mean decrease in accuracy for each

variable was calculated over all 100,000 trees and normalized by dividing the mean by the

standard error.
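
The balanced per-tree sampling and the origin of the ∼63.2% figure can be sketched in Python under stated assumptions. This is not the randomForest package code. It assumes that each tree draws, with replacement, as many items from each class as there are melanin binding peptides; the source does not spell out the exact sampling settings. The function and variable names are hypothetical.

```python
import random

def balanced_bootstrap(pos, neg, rng):
    """Return one tree's in-bag sample and the corresponding out-of-bag items.

    Assumption: len(pos) draws with replacement are taken from each class.
    """
    n = len(pos)
    in_bag = ([rng.choice(pos) for _ in range(n)]
              + [rng.choice(neg) for _ in range(n)])
    drawn = set(in_bag)
    # Items never drawn for this tree are out-of-bag for it.
    out_of_bag = [p for p in pos + neg if p not in drawn]
    return in_bag, out_of_bag

def expected_unique_fraction(n):
    """Expected fraction of distinct items in n draws with replacement from n items."""
    return 1.0 - (1.0 - 1.0 / n) ** n

rng = random.Random(0)
pos = [f"binder_{i}" for i in range(23)]      # placeholder peptide names
neg = [f"nonbinder_{i}" for i in range(96)]
in_bag, out_of_bag = balanced_bootstrap(pos, neg, rng)

# Average fraction of distinct binders that end up in-bag across many trees.
fractions = []
for _ in range(2000):
    bag, _oob = balanced_bootstrap(pos, neg, rng)
    fractions.append(len(set(bag) & set(pos)) / len(pos))
mean_fraction = sum(fractions) / len(fractions)
```

For 23 items the expected distinct fraction is about 0.64, and it approaches 1 − 1/e ≈ 0.632 as the class size grows, which is where the quoted percentage comes from. Out-of-bag predictions for a peptide are then aggregated only over the trees for which that peptide was not drawn.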

2.5.5 Expansion of the peptide microarray

Melanin binding candidate peptides were generated randomly with a frequency of 5%

for each of the 20 amino acids. Peptides classified as melanin binding by the trained random

forest model were selected, resulting in 5483 peptides of length ranging from 7 to 12

aa. Along with the 16 known melanin binding peptides from the literature, a total of 5499

peptides were printed in duplicate along with HA controls (YPYDVPDYAG; 68 spots)

on peptide microarrays by PEPperPRINT. Peptide sequences were printed in duplicate

on a custom peptide microarray. Pre-staining of a peptide microarray copy was done with

streptavidin DyLight680 (0.2 µg/ml) and the control antibody (mouse monoclonal anti-HA

(12CA5) DyLight800; 0.5 µg/ml) in incubation buffer to characterize non-specific binding.

Subsequent incubation of another peptide microarray with the b-mNPs at a concentration

of 500 µg/mL in incubation buffer (PBS at pH 6.5 with 0.005% Tween 20 and 10% Rockland

blocking buffer MB-070) was followed by staining with streptavidin DyLight680 (0.5

µg/mL) and the control antibody (0.5 µg/mL). The control staining of the HA epitopes

was done simultaneously as internal quality control to confirm the assay quality and the

peptide microarray integrity. Quantification of spot intensities was performed as described

earlier.
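
The candidate generation step at the start of this section can be sketched in Python as follows. The sketch assumes that peptide lengths are drawn uniformly from 7 to 12 residues (the source reports only the length range of the selected peptides) and that duplicate sequences are discarded. The classifier is represented by a hypothetical callable; `toy_rule` is an arbitrary placeholder with no biological meaning and is not the trained random forest.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def random_peptide(rng, min_len=7, max_len=12):
    """Draw one peptide; each residue is chosen with probability 1/20 (5%)."""
    length = rng.randint(min_len, max_len)  # assumption: uniform length
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))

def select_candidates(n_draws, predicts_binding, seed=0):
    """Generate random peptides and keep those the classifier accepts."""
    rng = random.Random(seed)
    drawn = {random_peptide(rng) for _ in range(n_draws)}  # set removes repeats
    return sorted(p for p in drawn if predicts_binding(p))

def toy_rule(peptide):
    # Hypothetical stand-in for the trained model: an arbitrary residue count.
    return sum(peptide.count(aa) for aa in "WRK") >= 3

candidates = select_candidates(1000, toy_rule)
```

In practice the callable would wrap the trained model's prediction for a peptide's computed descriptors, and sampling would continue until enough predicted binders had been collected.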

2.5.6 Variable reduction of the machine learning input data

To reduce the number of variables and improve the model performance, a variable

reduction procedure was applied to the machine learning input data before model training.

Permutation-based variable importance was first computed on the data set with random

forest (100,000 trees) using the R package ranger, version 0.14.1 [104], with balanced sam