ABSTRACT

Title of Dissertation: APPLICATION OF ADVANCED MACHINE LEARNING STRATEGIES FOR BIOMEDICAL RESEARCH

Renee Ti Chou
Doctor of Philosophy, 2023

Dissertation Directed by: Professor Michael P. Cummings
Department of Biology

Biomedical research delves deeply into understanding individual health and disease mechanisms. Recent advancements in technologies have further transformed the field with large-scale data sets, enabling data-driven approaches to identify important patterns and relationships. However, these data sets are often noisy and unstructured. Moreover, missing values and high dimensionality further complicate the analysis processes aimed at yielding meaningful results. With examples in ocular diseases and malaria, this dissertation presents novel strategies employing machine learning to tackle some of the challenges in biomedical research.

In ocular diseases, sustained ocular drug delivery is critical to retain therapeutic levels and improve patient adherence to dosing schedules. To enhance the sustained delivery system, we engineer peptide sequences as an adapter to impart desired properties to ocular drugs. Specifically, we develop machine learning models separately for three properties: melanin binding, cell-penetration, and non-toxicity. We employ data reduction techniques to reduce the number of features while maintaining the machine learning model performance and apply interpretable machine learning techniques to explain model predictions on the three properties. Experimental validation in rabbits shows a two-fold increase in drug retention time with the selected peptide candidate. The developed machine learning framework can be further tailored to engineer other properties in molecular sequences, with a wide variety of potential biomedical applications.

Malaria is an infectious disease caused by protozoans of the genus Plasmodium and has been a burden on global health.
Developing malaria vaccines is challenging due to the diversity in parasite antigen sequences, which may lead to immune escape. To facilitate the vaccine development process, we leverage the wealth of systems data collected from various sources. For facile data management, a database is constructed to store the structured data processed from the results of the bioinformatics tools. Because only a small fraction of Plasmodium proteins are labeled as known antigens, and the antigen status of the remaining proteins is unknown, a positive-unlabeled machine learning method is applied to identify potential vaccine antigen candidates. Beyond malaria, our approach provides a promising framework for identifying and prioritizing vaccine antigen candidates for a broad range of disease pathogens.

APPLICATION OF ADVANCED MACHINE LEARNING STRATEGIES FOR BIOMEDICAL RESEARCH

by

Renee Ti Chou

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy
2023

Advisory Committee:
Dr. Michael P. Cummings, Chair/Advisor
Dr. Najib El-Sayed, Dean’s Representative
Dr. Laura M. Ensign
Dr. Philip Johnson
Dr. Brian Pierce
Dr. Shannon Takala Harrison

© Copyright by Renee Ti Chou 2023

Acknowledgments

This endeavor would not have been possible without the individuals who played pivotal roles in shaping my Ph.D. journey.

First and foremost, I would like to express my deepest gratitude to my advisor and committee chair, Dr. Michael P. Cummings. I joined Dr. Cummings’ lab at the Center for Bioinformatics and Computational Biology with a passion for learning interdisciplinary research skills and communicating with researchers from diverse scientific backgrounds, focusing on biomedical research. Dr.
Cummings possesses extensive research experience in applying computational and mathematical methods to biological problems, which aligns with my interest in developing a machine learning platform for studying biomedical sciences. Dr. Cummings is also exceptionally supportive in both research and professional development. He is the best mentor, deeply caring for his students, and has taught me not to be afraid of any challenges. Whether it is about research or fellowship applications, he is always there to guide and support me.

I am extremely grateful to the two collaborators on my Ph.D. research projects, Dr. Shannon Takala-Harrison and Dr. Laura M. Ensign. Since joining Dr. Cummings’ lab in June 2019, I have been working on malaria research in collaboration with Dr. Shannon Takala-Harrison’s group at the Center for Vaccine Development and Global Health, University of Maryland School of Medicine. I would like to thank Dr. Takala-Harrison for her invaluable insights into malaria, from the initial stages of the research to the process of manuscript writing. In each meeting with Dr. Takala-Harrison, I consistently learn from her unique perspectives, which enable me to improve the research further based on her helpful suggestions. In September 2019, I began working with Dr. Laura M. Ensign’s lab at the Center for Nanomedicine, Wilmer Eye Institute, which is part of the Johns Hopkins University School of Medicine. Dr. Ensign’s expertise in drug delivery provides valuable insights and supplements the interdisciplinary research. She is always very supportive of my professional development and loves to share her personal stories of challenges she has encountered during her academic career. These stories have been a great source of encouragement for me when facing failures along the way. I am deeply grateful to have Dr. Ensign as my Ph.D. project collaborator.

I also want to extend my sincere thanks to my other dissertation committee members, Dr. Najib El-Sayed, Dr.
Philip Johnson, and Dr. Brian Pierce, for their support over the past four years, including my initial committee meeting, qualifying examination, as well as seminars in Computational Biology, Bioinformatics, and Genomics, and the Center for Bioinformatics and Computational Biology seminars. They have been instrumental in helping my research in various ways, from teaching and commenting to supporting my fellowship applications. I cannot thank them enough for their informative and insightful advice on my dissertation.

Special thanks go to my colleagues in Dr. Cummings’ lab, Alexis S. Boleda, Rana Khalil, and Yi Chen, for their helpful feedback and comments during lab meetings. I am also thankful to Jason Fan for his suggestions at the early stages of the malaria research. I would also like to express my gratitude to all the members of the Center for Bioinformatics and Computational Biology, the Biological Sciences Graduate Program, and the Computation and Mathematics for Biological Networks (COMBINE) program who have shown their support for my research through seminars and courses. My dissertation has been supported by the COMBINE fellowship and the Ann G. Wylie Dissertation Fellowship awarded by the Graduate School.

I would also like to acknowledge my family, who have had the most significant impact on the beginning of my Ph.D. journey. My parents have been consistently supportive, no matter what decisions I have made. My sister has been my role model since I was little, and her success as a researcher in the biomedical field has inspired me greatly. Most importantly, I want to mention the unwavering support of my husband, Henry Hsueh, throughout my Ph.D. study. Since college, we have been supporting each other’s dreams of becoming researchers in our respective fields of passion. He is not only my supporter but also an important collaborator on one of my dissertation projects.
Thanks should also go to my two corgis, Pika and Pichu, for their unconditional love and emotional support.

Without the support and resources provided by these kind and impactful individuals around me, my Ph.D. journey at the University of Maryland, College Park, would not have been as fruitful, and my research would not have achieved such high quality.

Table of Contents

Acknowledgements
Table of Contents
List of Tables
List of Figures

I Overview of the Dissertation
  Chapter 1: Introduction
    1.1 Background
    1.2 Dissertation Outline

II Multifunctional Peptide Engineering
  Chapter 2: Machine Learning-Driven Multifunctional Peptide Engineering for Sustained Ocular Drug Delivery
    2.1 Abstract
    2.2 Introduction
    2.3 Results
      2.3.1 Development of high throughput melanin binding peptide microarray methodology
      2.3.2 Training of the melanin binding regression model
      2.3.3 Training of cell-penetration and cytotoxicity classification models
      2.3.4 Validation of predicted peptide properties in vitro
      2.3.5 Analysis of peptide variables that contribute to observed properties
      2.3.6 Characterization and validation of a peptide-drug conjugate in vivo
    2.4 Discussion
    2.5 Methods
      2.5.1 Material sources
      2.5.2 Melanin nanoparticle synthesis and characterization
      2.5.3 Optimization of processing conditions for peptide microarray
      2.5.4 Random forest classification model training with the pilot 119-peptide microarray
      2.5.5 Expansion of the peptide microarray
      2.5.6 Variable reduction of the machine learning input data
      2.5.7 Machine learning model training for melanin binding predictions
      2.5.8 Machine learning model training for cell-penetration predictions
      2.5.9 Machine learning model training for cytotoxicity predictions
      2.5.10 Peptide generation for machine learning model validation
      2.5.11 Peptide synthesis
      2.5.12 Melanin binding assay for machine learning model validation
      2.5.13 Cell-penetration assay with ARPE19 cell type for machine learning model validation
      2.5.14 Shapley additive explanations (SHAP) analysis of variable contributions
      2.5.15 Adversarial computational controls
      2.5.16 Peptide design space visualization
      2.5.17 Traceless linker system for conjugating HR97 to brimonidine
      2.5.18 In vitro melanin binding assay
      2.5.19 In vitro stability test for HR97-brimonidine conjugate
      2.5.20 Cathepsin cleavage assay for HR97 and HR97-brimonidine conjugate
      2.5.21 Cell viability assay of HR97 peptide
      2.5.22 Animal studies: Animal welfare statement
      2.5.23 Rabbit IOP measurements, topical dosing, and ICM injection
      2.5.24 Measurement of brimonidine in ocular tissues
      2.5.25 Statistical analysis
  Chapter 3: Engineered Peptide-Drug Conjugate Provides Sustained Protection of Retinal Ganglion Cells with Topical Administration in Rats
    3.1 Abstract
    3.2 Introduction
    3.3 Results
      3.3.1 Conjugation of HR97 peptide to sunitinib increases melanin binding in vitro
      3.3.2 A deep learning object detection model was more accurate in counting RGCs
      3.3.3 HR97-SunitiGel showed prolonged neuroprotective effects compared to SunitiGel
      3.3.4 HR97-SunitiGel provided increased intraocular residence time in rats and therapeutically relevant drug delivery to the posterior segment in rabbits
    3.4 Discussion
      3.4.1 Conclusion
    3.5 Methods
      3.5.1 Material sources
      3.5.2 Traceless linker system for conjugating HR97 to sunitinib
      3.5.3 In vitro stability test for HR97-sunitinib conjugate
      3.5.4 Cathepsin cleavage assay for HR97-sunitinib conjugate
      3.5.5 In vitro melanin binding assay
      3.5.6 In vitro cell uptake assay
      3.5.7 Characterization of drug solubility
      3.5.8 Animal studies: Animal welfare statement
      3.5.9 Rat optic nerve head (ONH) crush model
      3.5.10 Retinal ganglion cell staining and imaging
      3.5.11 Retinal ganglion cell counting
      3.5.12 Pharmacokinetic studies
      3.5.13 Measurement of sunitinib in ocular tissues
      3.5.14 Statistical analyses

III Malaria Vaccine Antigen Identification
  Chapter 4: Positive-Unlabeled Learning Identifies Vaccine Candidate Antigens in the Malaria Parasite Plasmodium falciparum
    4.1 Abstract
    4.2 Introduction
    4.3 Results
      4.3.1 Identification of potential P. falciparum candidate antigens
      4.3.2 Training positive-unlabeled random forest models
      4.3.3 Classification tree filtering using reference antigens
      4.3.4 Proximity of top-ranked candidates to reference antigens
      4.3.5 Variable importance of candidate antigen groups
      4.3.6 Characteristics of identified potential vaccine antigen targets
    4.4 Discussion
    4.5 Methods
      4.5.1 Known antigen protein collection
      4.5.2 Collection of Plasmodium data and bioinformatic analyses
      4.5.3 Data set assembly
      4.5.4 Positive-unlabeled simulation
      4.5.5 Positive-unlabeled random forest algorithm implementation
      4.5.6 Positive-unlabeled random forest evaluation
      4.5.7 Variable space weighting
      4.5.8 Ensemble constituent filtering
      4.5.9 Positive-unlabeled random forest validation
      4.5.10 Candidate antigen clustering and comparisons
      4.5.11 Variable importance analyses
      4.5.12 Variable value comparisons of top important variables
      4.5.13 Gene ontology enrichment analysis
      4.5.14 Candidate antigen characterization
      4.5.15 Statistical analyses
  Chapter 5: Plasmodium vivax Antigen Candidate Prediction Improves with the Addition of Plasmodium falciparum Data
    5.1 Abstract
    5.2 Introduction
    5.3 Results
      5.3.1 Data engineering and model training
      5.3.2 Comparison of single-species models and the combined model
      5.3.3 Effects of heterologous positives and unlabeled proteins on combined model performance
      5.3.4 Analysis of model prediction space and species effect
      5.3.5 Variables contributing to Plasmodium antigen prediction
      5.3.6 Characterization of top vaccine antigen candidates
    5.4 Discussion
    5.5 Methods
      5.5.1 Data collection
      5.5.2 Known antigen labeling
      5.5.3 Machine learning data assembly and data combinations
      5.5.4 Positive-unlabeled random forest training
      5.5.5 Positive-unlabeled random forest evaluation
      5.5.6 Adversarial control analysis
      5.5.7 Comparison of models trained with different data combinations
      5.5.8 Model interpretation of the combined model
      5.5.9 Clustering and amino acid composition analyses of model predictions
      5.5.10 Variable importance analysis
      5.5.11 Clustering of top candidate antigens
      5.5.12 Gene ontology enrichment analysis

IV Appendices
  Appendix A: Supplementary Information for Machine Learning-Driven Multifunctional Peptide Engineering for Sustained Ocular Drug Delivery
    A.1 Supplementary Notes
      A.1.1 Machine learning input data sets
      A.1.2 Machine learning cross-validation results
      A.1.3 Adversarial control machine learning cross-validation results
    A.2 Supplementary Figures
    A.3 Supplementary Tables
  Appendix B: Supplementary Information for Engineered Peptide-Drug Conjugate Provides Sustained Protection of Retinal Ganglion Cells with Topical Administration in Rats
    B.1 Supplementary Figures
  Appendix C: Supplementary Information for Positive-Unlabeled Learning Identifies Vaccine Candidate Antigens in the Malaria Parasite Plasmodium falciparum
    C.1 Supplementary Figures
    C.2 Supplementary Tables
  Appendix D: Supplementary Information for Plasmodium vivax Antigen Candidate Prediction Improves with the Addition of Plasmodium falciparum Data
    D.1 Supplementary Figures
    D.2 Supplementary Tables

Bibliography

List of Tables

4.1 Significantly enriched gene ontology terms with false discovery rate (FDR) <0.05 in gene ontology enrichment analysis of candidate antigen groups with the background proteome of P. falciparum 3D7.
5.1 P. vivax and P. falciparum known antigen prediction accuracies of PURF models trained separately on P. vivax, P. falciparum, and combined data sets.
5.2 Different combinations of data from P. vivax and P. falciparum and their corresponding model types.
A.1 Cross-validation performance (mean ± SEM) of the melanin binding general and adversarial control models.
A.2 Cross-validation performance (mean ± SEM) of the cell-penetration general and adversarial control models.
A.3 Cross-validation performance (mean ± SEM) of the cytotoxicity general and adversarial control models.
A.4 Ocular grading 7 days after a single ICM injection of saline, HR97 (equivalent to the amount of HR97 in HR97-brimonidine conjugate), or a physical mixture of HR97 and brimonidine tartrate in solution (HR97 + brimonidine, 200 µg brimonidine equivalent) in Dutch Belted rabbits (n = 5 per group).
A.5 Ocular grading 14 days after a single ICM injection of saline, HR97 (equivalent to the amount of HR97 in HR97-brimonidine conjugate), or a physical mixture of HR97 and brimonidine tartrate in solution (HR97 + brimonidine, 200 µg brimonidine equivalent) in Dutch Belted rabbits (n = 5 per group).
A.6 Ocular grading 21 days after a single ICM injection of saline, HR97 (equivalent to the amount of HR97 in HR97-brimonidine conjugate), or a physical mixture of HR97 and brimonidine tartrate in solution (HR97 + brimonidine, 200 µg brimonidine equivalent) in Dutch Belted rabbits (n = 5 per group).
A.7 Ocular grading 28 days after a single ICM injection of saline, HR97 (equivalent to the amount of HR97 in HR97-brimonidine conjugate), or a physical mixture of HR97 and brimonidine tartrate in solution (HR97 + brimonidine, 200 µg brimonidine equivalent) in Dutch Belted rabbits (n = 5 per group).
C.1 Top important variables (upper part) and variable categories (lower part) in group 1 candidate antigens. Ranks in groups 2 and 3 individual variable and variable category importance are also shown (MDA: Mean Decrease Accuracy).
C.2 Top important variables (upper part) and variable categories (lower part) in group 2 candidate antigens. Ranks in groups 1 and 3 variable and variable category importance are also shown (MDA: Mean Decrease Accuracy).
C.3 Top important variables (upper part) and variable categories (lower part) in group 3 candidate antigens. Ranks in groups 1 and 2 variable and variable category importance are also shown (MDA: Mean Decrease Accuracy).
D.1 Associations between Plasmodium species and antigen predictions from models trained on different combinations of autologous and heterologous data (CI: confidence interval).

List of Figures

2.1 Pilot 119 melanin binding peptide microarray screening with machine learning analysis.
2.2 Schematic of the machine learning pipeline based on the super learner framework for the melanin binding data set.
2.3 Experimental validations of final model predictions on melanin binding and cell-penetration.
2.4 Melanin binding, cell-penetration model interpretation, and variable contributions to HR97 multifunctional peptide predictions.
2.5 Visualization of the peptide design space based on sequences and physiochemical properties.
2.6 Characterization of HR97-brimonidine in vitro and in vivo.
3.1 Characterization of HR97-sunitinib stability and solubility.
3.2 Characterization of HR97-sunitinib melanin binding and cell uptake in vitro.
3.3 Comparison between SSD-MobileNet, Faster R-CNN Inception ResNet v2, and CellProfiler software.
3.4 HR97-SunitiGel extended RGC protection to at least 2 weeks after the last topical dose in rat model of optic nerve injury.
3.5 Characterization of intraocular drug concentrations after topical dosing with SunitiGel or HR97-SunitiGel in rats and rabbits.
4.1 Database schema of P. falciparum vaccine target identification.
4.2 Model evaluation and validation of positive-unlabeled random forest models.
4.3 Positive-unlabeled random forest model interpretation based on known antigens.
4.4 Clustering of top 200 candidate antigens based on proximity measured from tree-based model.
5.1 Performance of PURF models with the optimal hyper-parameter setting.
5.2 Probability score distributions of PURF models.
5.3 Visualization of the prediction space of the combined PURF model.
5.4 Model interpretation of the combined PURF model on the prediction of known antigens.
5.5 Venn diagram of top 10 important variables from different PURF models.
A.1 Characterization of melanin nanoparticles (mNPs) and biotinylated-melanin nanoparticles (b-mNPs).
A.2 Interaction profilings of b-mNPs against peptides in the pilot 119 microarray.
A.3 Variable reduction of peptide data sets with random forests.
A.4 Base model coefficients in final super learners.
A.5 Comparison of melanin binding and cell-penetration of candidate peptides in non-induced ARPE-19 cells.
A.6 Cytotoxicity model interpretation.
A.7 Variable contributions to the prediction of the adversarial models.
A.8 Cytotoxicity validation of the HR97 peptide.
A.9 NMR spectrum of brimonidine.
A.10 NMR spectrum of Mc-VC-PAB-Cl (Maleimidocaproyl-L-valine-L-citrulline-p-aminobenzyl chloride).
A.11 NMR spectrum of Mc-VC-PAB-brimonidine.
A.12 MALDI-TOF spectrum of the HR97-brimonidine conjugate.
A.13 Comparison of intraocular pressure (IOP) change from baseline.
B.1 Synthesis scheme for HR97-sunitinib.
B.2 NMR spectrum of sunitinib base.
B.3 NMR spectrum of Mc-VC-PAB-Cl.
B.4 NMR spectrum of Mc-VC-PAB-sunitinib.
B.5 Molecular structure of HR97-sunitinib conjugate and the MALDI-TOF spectrum.
B.6 HPLC analysis of cathepsin cleavage assay of the HR97-sunitinib conjugate.
B.7 RGC quantification using SSD Mobile-Net.
B.8 RGC quantification using the Faster R-CNN Inception Resnet v2 model.
B.9 Time course of RGC loss in the rat optic nerve head crush model.
C.1 Database schema of P. falciparum reverse vaccinology data.
C.2 Evaluation of model performance on simulated data set.
C.3 Hyper-parameter tuning before variable space weighting.
C.4 Evaluation of known antigen predictions before variable space weighting.
C.5 Hyper-parameter tuning after variable space weighting.
C.6 Evaluation of known antigen predictions after variable space weighting.
C.7 Comparison of mean differences in probability scores after known antigen label removal.
C.8 Probability scores of candidate antigen groups.
C.9 Statistical comparisons of distances between candidate and reference antigens.
C.10 Statistical comparisons of variable values of top important variables between the candidate antigen groups and randomly selected non-antigens.
C.11 Candidate antigen characterization across various P. falciparum life stages.
D.1 Hyper-parameter tuning for PURF model trained on the P. vivax data set.
D.2 Evaluation of known antigen predictions of the P. vivax model.
D.3 Hyper-parameter tuning for PURF model trained on the combined data set.
D.4 Evaluation of known antigen predictions of the combined model.
D.5 Validation of PURF models.
D.6 Evaluation of known antigen predictions of PURF models.
D.7 Relationship between proportion of labeled positives in the data set and mean tree depth in the PURF model.
D.8 Visualization of hierarchical clustering dendrogram investigation.
D.9 Variable importance for the P. vivax model.
D.10 Comparison of variable importance values between PURF models.
D.11 Clustering analysis of top candidate antigens.
D.12 Gene ontology (GO) enrichment analysis of candidate antigen groups.

Part I
Overview of the Dissertation

Chapter 1: Introduction

1.1 Background

The advancement of biological and computational technologies has enabled the generation of large and complex data in biological sciences research, and has promoted the broad application of machine learning in various biomedical domains over the past decades [1,2]. As the volume of data increases, multiple research fields gradually transitioned from traditional, model-focused approaches, to approaches that are more data-driven [3]. However, challenges emerge when extracting meaningful patterns and relationships from the large amount of data.
Machine learning, including both statistical methods and computational algorithms, aims at learning relationships among data and can yield important insights from complex and large-scale data by computing the underlying and inherent structures within a data set. However, in biomedical applications, there is a wide variety of biological data types, such as genome sequences, gene expressions, and molecular structures [2]. Because of such diversity, the selection of representative features from the high-dimensional data set and the usage of machine learning algorithms are usually problem-specific [4]. Moreover, the rapid growth of data could lead to a lack of substantial labeling, hampering the model performance due to insufficient information [5]. Therefore, it is critical to develop adaptive and advanced strategies to solve the biomedical research problems more effectively and efficiently.

The dissertation delves into two types of biomedical problems: sustained ocular drug delivery and malaria vaccine antigen identification. This dissertation introduces machine learning-based platforms that can be further extended to various other biomedical applications.

In the research domain of sustained ocular drug delivery, patient adherence may be enhanced through maintaining the drug therapeutic levels in the eye. Utilizing melanin residing in the pigmented tissues in the eye, drugs with melanin binding and cell penetration properties can be stored and slowly released from the depot. To impart the desired properties to drugs, peptides, which are short sequences of amino acids, can be used as an adapter and be conjugated to drugs. For peptides with lengths ranging between 7 and 12 amino acids, the number of possible sequences is ∼4.3 × 10^15, given 20 different amino acids. Among other methods, machine learning is an appropriate approach to rationally design peptides with desired properties.
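
As a quick check of the design-space size quoted above: the number of distinct peptides of lengths 7 through 12 over a 20-letter amino acid alphabet is the sum of 20^n for n = 7, ..., 12, which is roughly 4.3 × 10^15. The short Python sketch below performs this arithmetic. The function name `count_peptides` and its parameters are illustrative only and do not come from the dissertation.

```python
def count_peptides(min_len: int = 7, max_len: int = 12, alphabet_size: int = 20) -> int:
    """Count distinct sequences with lengths in [min_len, max_len].

    Each position can hold any of `alphabet_size` residues, so there are
    alphabet_size ** n sequences of length n; the total is the sum over lengths.
    """
    return sum(alphabet_size ** n for n in range(min_len, max_len + 1))


if __name__ == "__main__":
    total = count_peptides()
    print(total)           # 4311578880000000
    print(f"{total:.2e}")  # 4.31e+15
```

The longest peptides dominate the total: 20^12 alone is about 4.1 × 10^15, which is why exhaustive experimental screening of this space is impractical.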
By performing interpretable machine learning techniques, the predictions of the model can be explained, leading to reproducible and transparent results.

Regarding the research of identifying malaria vaccine antigen candidates, effective malaria vaccines targeting either of the most-predominant species, Plasmodium falciparum and Plasmodium vivax, remain an unmet need. Reverse vaccinology, which leverages the wealth of systemic data derived from pathogen genomes, has been adopted to facilitate the process of vaccine development. However, most methods involve filtering candidate antigens with criteria solely based on domain knowledge, and a more comprehensive, data-centric machine learning approach is less explored. Without prior assumptions about the importance of protein variables, machine learning assists in learning the variable importance through training data sets, and, if provided, the corresponding labels, which indicate whether a protein is an antigen or non-antigen. Nevertheless, because validating the antigenicity of a protein requires several rigorous experiments and is thus time-consuming, the antigenic labeling of the proteins is sparse, with only a few proteins labeled as antigens, and the remaining proteins being unlabeled. To overcome this challenge, an advanced approach of positive-unlabeled learning is adapted to identify potential antigen candidates with the goal of further improving the reverse vaccinology pipeline in vaccine development.

1.2 Dissertation Outline

The dissertation is structured so that each chapter corresponds to a manuscript. Part I: Overview of the Dissertation, provides the background and scope of the problem domains, as well as a brief introduction to each chapter. Part II: Multifunctional Peptide Engineering, including Chapters 2 and 3, focuses on using an ensemble machine learning method to engineer multifunctional peptides to improve sustained ocular drug delivery.
Part III: Malaria Vaccine Identification, consisting of Chapters 4 and 5, emphasizes using the positive-unlabeled learning technique to identify potential candidates for malaria vaccine antigens to facilitate the vaccine development process. The appendices in Part IV provide additional materials related to the research findings.

Chapter 2: Machine Learning-Driven Multifunctional Peptide Engineering for Sustained Ocular Drug Delivery, presents research results published in Nature Communications (https://doi.org/10.1038/s41467-023-38056-w), authored by H. T. Hsueh, R. T. Chou (co-first author), U. Rai, W. Liyanage, Y. C. Kim, M. B. Appell, J. Pejavar, K. T. Leo, C. Davison, P. Kolodziejski, A. Mozzer, H. Kwon, M. Sista, N. M. Anders, A. Hemingway, S. V. K. Rompicharla, M. Edwards, I. Pitha, J. Hanes, M. P. Cummings, and L. M. Ensign. The research addresses the challenge of delivering drugs into the eye, which stems from the inherent ocular barriers and clearing mechanisms [6, 7], resulting in an intensive dosing schedule that discourages patient compliance. Thus, it is important to develop effective ocular drug delivery systems that can maintain sustained therapeutic drug levels. To assist in the delivery of drugs to the depot formed by the melanin inside the pigmented tissues in the eye, the research leverages machine learning models to guide the engineering of multifunctional peptide adapters, which impart melanin binding and cell penetrating properties to ocular drugs. My contributions to this work include: (i) developing melanin binding peptide microarray assays; (ii) designing an ensemble machine learning pipeline to predict melanin binding, cell penetration, and low-cytotoxicity peptides; and (iii) conducting interpretable machine learning analyses to understand and explain model predictions. The corresponding supplementary information is in Appendix A. H. T. Hsueh, R. T. Chou, J. Hanes, M. P.
Cummings, and L. M. Ensign are named as inventors on the U.S. Provisional Patent Application No. 63/340,714, which covers aspects of this work.

Chapter 3: Engineered Peptide-Drug Conjugate Provides Sustained Protection of Retinal Ganglion Cells with Topical Administration in Rats, presents further application of the selected peptide candidate from machine learning models trained in Chapter 2 to another ocular drug, sunitinib, that protects retinal ganglion cells. The research work is published in the Journal of Controlled Release (https://doi.org/10.1016/j.jconrel.2023.08.058), and is authored by H. T. Hsueh, R. T. Chou (co-first author), U. Rai, P. Kolodziejski, W. Liyanage, J. Pejavar, A. Mozzer, C. Davison, M. B. Appell, Y. C. Kim, K. T. Leo, H. Kwon, M. Sista, N. M. Anders, A. Hemingway, S. V. K. Rompicharla, I. Pitha, D. J. Zack, J. Hanes, M. P. Cummings, and L. M. Ensign. The research focuses on improving the drug delivery system to treat chronic diseases related to the posterior segment of the eye, such as the retina, choroid, and optic nerve, with the ultimate goal of enhancing patient adherence for better disease management. My contributions to this manuscript include: (i) participating in conceptualizing, designing, and interpreting experiments and results; (ii) using machine learning models to predict and select peptide candidates to be conjugated to the drug; and (iii) applying an object detection technique to facilitate the measurement of cell survival rates to validate the effectiveness of the peptide-drug conjugate in the drug delivery system. The supplementary materials are described in Appendix B.

Chapter 4: Positive-Unlabeled Learning Identifies Vaccine Candidate Antigens in the Malaria Parasite Plasmodium falciparum, discusses research that studies approaches to facilitate malaria vaccine development.
The manuscript is currently under review by npj Systems Biology and Applications, and is a collaborative work by the authors, R. T. Chou, A. Ouattara, M. Adams, A. A. Berry, S. Takala-Harrison, and M. P. Cummings. Malaria is a mosquito-borne infectious disease caused by Plasmodium species. The parasite has multiple life stages, and exhibits various immune evasion strategies, such as extremely variable surface antigens [8]. Thus, it is critical to identify conserved potential vaccine antigens that are less variable but with subdominant immunogenicity. The research employs a machine learning-based reverse vaccinology approach to identify potential vaccine antigen candidates for malaria. Since only a few known antigens are selected based on our stringent criteria, the data set is largely unlabeled. My contributions to this research include: (i) adapting a positive-unlabeled learning algorithm to classify P. falciparum proteins into antigens or non-antigens while tackling the problem of sparse antigenic labeling; (ii) improving the machine learning model by utilizing the tree structure in the positive-unlabeled random forest model; and (iii) performing downstream computational analyses to characterize top antigen candidates and to further select a smaller set of antigen candidates based on desired properties for future experimental validation. The supplementary details for Chapter 4 can be found in Appendix C.

Chapter 5: Plasmodium vivax Antigen Candidate Prediction Improves with the Addition of Plasmodium falciparum Data, highlights research findings of a comprehensive study conducted to improve the identification of vaccine antigen candidates for P. vivax, the second-most prevalent species causing malaria, by integrating data from the well-studied species, P. falciparum.
The study also employs positive-unlabeled learning to construct a machine learning model with multiple different training sets generated by integrating the data of the two species. The research work is jointly conducted by the authors, R. T. Chou, A. Ouattara, S. Takala-Harrison, and M. P. Cummings, and will be submitted to npj Systems Biology and Applications soon. My contributions to the manuscript include: (i) applying the positive-unlabeled learning framework described in Chapter 4 to various combinations of training data from P. vivax and P. falciparum; (ii) decomposing and quantifying the effects of the addition of known antigens and/or unlabeled proteins; and (iii) characterizing top candidate antigens, analyzing important protein variables for identifying top candidates, and comparing important variables identified across various machine learning models to gain insights into the proposed integration methodology. Additional information for Chapter 5 is provided in Appendix D.

Part II Multifunctional Peptide Engineering

Chapter 2: Machine Learning-Driven Multifunctional Peptide Engineering for Sustained Ocular Drug Delivery

2.1 Abstract

Sustained drug delivery strategies have many potential benefits for treating a range of diseases, particularly chronic diseases that require treatment for years. For many chronic ocular diseases, patient adherence to eye drop dosing regimens and the need for frequent intraocular injections are significant barriers to effective disease management. Here, we utilize peptide engineering to impart melanin binding properties to peptide-drug conjugates to act as a sustained-release depot in the eye. We develop a super learning-based methodology to engineer multifunctional peptides that efficiently enter cells, bind to melanin, and have low cytotoxicity.
When the lead multifunctional peptide (HR97) is conjugated to brimonidine, an intraocular pressure lowering drug that is prescribed for topical dosing three times per day, intraocular pressure reduction is observed for up to 18 days after a single intracameral injection in rabbits. Further, the cumulative intraocular pressure lowering effect increases ∼17-fold compared to free brimonidine injection. Engineered multifunctional peptide-drug conjugates are a promising approach for providing sustained therapeutic delivery in the eye and beyond.

2.2 Introduction

In many disease settings, sustained delivery of therapeutic levels of drug can improve treatment efficacy, reduce side effects, and avoid challenges with patient adherence to intensive dosing regimens [9, 10]. This is particularly critical in the management of chronic diseases, where long-term adherence to medication usage and clinical monitoring can suffer [11, 12]. In the ophthalmic setting, the leading causes of irreversible blindness and low vision are primarily age-related, chronic diseases, such as glaucoma and age-related macular degeneration [13–15]. Recent approvals of devices that provide sustained therapeutic release, such as the Durysta® intracameral implant for continuous delivery of an intraocular pressure (IOP) lowering agent, and the surgically implanted port-delivery system that provides continuous intravitreal delivery of ranibizumab, highlight the importance of these next generation approaches for ocular disease management [16–19]. Conventionally, sustained therapeutic effect is achieved by an injectable or implantable device that controls the release of the therapeutic moiety into the surrounding environment. However, these devices typically require injection through larger gauge needles or a surgery for implantation, with both procedures having associated risks [20–22].
Further, the buildup of excipient material, the need for device removal, and the potential for foreign body reaction can cause further issues [18, 23, 24].

One approach for circumventing the issues associated with sustained release devices is to impart enhanced retention time and therapeutic effect to drugs upon administration to the eye without the need for an excipient matrix/implant. Binding to melanin, a pigment present within melanosomes in multiple ocular cell types, was previously reported to affect ocular drug biodistribution [25]. Due to the low turnover rate of ocular melanin, a drug that can bind to melanin may accumulate in pigmented eye tissues, leading to drug toxicity or drug sequestration [26, 27]. However, with the right balance of melanin-binding affinity and capacity, melanin may act as a sustained-release drug depot in the eye that results in prolonged therapeutic action [28]. Several drugs have been demonstrated to have intrinsic melanin binding properties due to particular physicochemical properties, which, in some cases, prolong the pharmacologic activity in the eye [28–30].

To impart beneficial melanin-binding properties to drugs, one approach is to engineer peptides with high melanin binding that could be conjugated to small molecule drugs through a reducible linker. Thus, the peptide would provide enhanced retention time, while the linker would ensure that drug could be released and exert its therapeutic action in a sustained manner. In addition, there are available databases describing how peptide sequence affects cell-penetration [31,32], and separately cytotoxicity [33], enabling the potential for engineering multifunctional peptides that can be chemically conjugated to drugs.
Incorporating multiple functions into one peptide sequence remains challenging, and thus multifunctional peptides are often designed by fusing peptides via a linker, thus forgoing potentially more efficient rational design, or by testing additional properties on peptides with known functions [34–36]. In contrast, machine learning could allow for designing peptide sequences that simultaneously provide multiple desired properties.

Here, we describe the development of engineered peptides informed by machine learning, which have three properties: high binding to melanin, cell-penetration (to enter cells and access melanin in the melanosomes), and low cytotoxicity. As there was no prior information for how peptide sequences affect melanin binding, we experimentally determine the effect of peptide sequence on melanin binding using a microarray. We then apply machine learning-based analyses to identify peptide sequences that display all three desired properties. Importantly, with the Shapley additive explanation (SHAP) analysis [37] of peptide variables, the machine learning model interpretation provides additional insights and reasoning for the multifunctionality of the peptides. As a proof-of-principle, we demonstrate here that an engineered peptide, HR97, can be conjugated to the intraocular pressure (IOP) reducing drug, brimonidine tartrate. A single intracameral (ICM) injection of the HR97-brimonidine conjugate is able to provide sustained IOP reduction in normotensive rabbits compared to ICM injection of an equivalent amount of brimonidine tartrate, or a topical dose of Alphagan® P 0.1% eye drops. Further, the maximum measured change in IOP from baseline (∆IOP) is increased with ICM injection of the HR97-brimonidine conjugate. We anticipate that engineered peptide-drug conjugates will facilitate the development of implant-free injectables for use in a variety of ophthalmic indications.
2.3 Results

2.3.1 Development of high throughput melanin binding peptide microarray methodology

To determine how peptide sequence affects melanin binding properties, we adapted a high-throughput flow-based peptide microarray system to characterize melanin binding events (Fig. 2.1a). Commercially available eumelanin was processed into nanoparticles (mNPs) to prevent sedimentation and provide reproducible surface area available for binding to peptides printed on the substrate surface. The mNPs had a mean size of 200.7 ± 5.99 nm and ζ-potential of −23.7 ± 1.39 mV (Fig. 2.1b, c). The mNPs were further biotinylated (b-mNPs) to facilitate fluorescent labeling with streptavidin DyLight680. The b-mNPs showed a slightly larger mean size of 216.0 ± 14.85 nm and ζ-potential of −21.2 ± 2.15 mV (Fig. 2.1b, c), and maintained similar spherical morphology (Fig. A.1a) and binding to small molecule drugs brimonidine tartrate and sunitinib malate (Fig. A.1b). The first microarray was printed with 119 peptides to screen flow conditions for the highest fluorescent reporter signal, which identified 500 µg/mL of biotinylated mNPs in pH 6.5 PBS buffer at room temperature as optimal (Fig. 2.1d and Fig. A.2). We then used the fluorescent reporter signals to construct a melanin binding classification random forest model. The prediction accuracy was 0.92. The permutation-based variable importance analysis [38] further revealed that the net charge, basic amino acids, and isoelectric point (pI) may contribute to distinguishing melanin binding and non-melanin binding peptides (Fig. 2.1e).

2.3.2 Training of the melanin binding regression model

A second larger peptide screen was implemented to generate melanin binding data for additional model generation (Fig. 2.2a). Specifically, we used the trained random forest model to predict melanin binding for ∼630,000 randomly generated peptides, and those classified as melanin binding were selected.
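The permutation-based variable importance reported for the pilot classification model in Section 2.3.1 measures how much prediction accuracy drops when one variable is shuffled. The following is a minimal sketch of that idea, not the pipeline used in this work: the classifier is a hypothetical threshold rule on a made-up "net charge" column rather than a random forest, and the data, variable layout, and function names are all illustrative assumptions.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows whose predicted label equals the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean decrease in accuracy after shuffling each column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between variable j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical toy data: column 0 plays the role of net charge, column 1 is noise.
X = [[c, (c * 7) % 5] for c in range(-4, 8)]
y = [row[0] > 2 for row in X]            # made-up "melanin binding" labels
predict = lambda row: row[0] > 2         # stand-in for a trained classifier

importance = permutation_importance(predict, X, y)
print(importance)  # column 0 has a positive drop; column 1 is exactly 0.0
```

A variable the model never uses shows no drop in accuracy, whereas shuffling an informative variable degrades predictions; ranking variables by this drop gives a plot like the one described for Fig. 2.1e.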
A total of 5499 peptides were printed in duplicate, and the fluorescent reporter intensities were reported as the amount of the b-mNPs that bind to the printed peptides on the microarray. Surprisingly, we identified 780 peptides displaying higher levels of fluorescent reporter intensities than any of the 16 peptides described in the literature that bound to human melanoma cells [39] and melanized C. neoformans [40], which were previously screened by the phage display technique.

Figure 2.1 Pilot 119 melanin binding peptide microarray screening with machine learning analysis. a Schematic illustration of the first peptide microarray. Peptides were anchored to a microarray, and melanin nanoparticles (mNPs) with surface biotinylation (b-mNPs) were flowed over to characterize binding events. The fluorescence intensity of the biotin was detected using DyLight 680-conjugated streptavidin to quantify melanin binding for each peptide. An initial classification model was trained using the data generated. Random peptides were then classified by the model as melanin binding or non-melanin binding. Created with BioRender.com. b,c Plot showing the sizes (b) and ζ-potential (c) of mNPs (black dots, n = 6) and b-mNPs (gray squares, n = 6). Data are presented as mean ± SD. Group means were compared using Student’s t tests (two-tailed). d The optimal interaction profiling of b-mNPs against 16 positive control peptides (peptide numbers: 1–16) and 103 random peptides (peptide numbers: 17–119). e Permutation-based variable importance analysis of the melanin binding classification random forest. The x-axis indicates the mean decrease in prediction accuracy after variable permutation. The values are shown at the end of the bars. The top 20 important variables ranked by mean decrease in accuracy are shown.

Furthermore, there were 758 peptides showing higher fluorescent values than the highest melanin binding peptides (661.5 arb.
units) from the 119-peptide microarray, demonstrating the enrichment of melanin binding properties from training the random forest model.

Next, the fluorescent reporter intensity values were used as the response variable in training a regression model. Applying a variable reduction procedure using random forest to eliminate less informative variables from the data set reduced the number of variables from 1094 to 64 (Fig. A.3a), and model performance measured by the coefficient of determination (R²) improved from 0.48 to 0.53. A wide array of machine learning models was explored and trained on the variable-reduced data set, and these were integrated with a super learning (SL) framework that combined various types of base models weighted using a meta-learner. By applying the iterative base model filtering procedure (Fig. 2.2b), the complexity of the SL was further reduced. To explore other combinations of base models in the SL ensemble, homogeneous base models consisting of models from only one algorithm family were constructed. A nested cross-validation (Fig. 2.2c) was applied to estimate an unbiased generalization performance. All SL models with base model reduction were selected as the top model in the inner loop cross-validations, and the performance evaluated in the outer loop cross-validation improved to R² = 0.54 ± 0.01 (Table A.1). The reduced SL was selected amongst 31 competitive models as the final melanin binding regression model. When training the same set of models on the whole data set, the number of base models in the SL was reduced from 907 to 38 (Fig. 2.2d). Adversarial computational control was performed, and the generalization performance was R² = −0.04 ± 0.02, indicating that the machine learning was effective in learning meaningful relationships in the melanin binding data set.
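The super learning idea described above (base models produce holdout predictions by cross-validation, and a meta-learner weights those predictions) can be illustrated with a deliberately small sketch. Everything here is an assumption for illustration: the data are synthetic, the two base learners are one-variable linear regressions, the meta-learner is an intercept-free least-squares fit, and the meta-learner is scored on the same holdout predictions it was fitted on, so the nested cross-validation and base model reduction used in this work are not reproduced.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns a predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def base_learner(j):
    """Base learner that regresses the response on variable j only."""
    def fit(X, y):
        line = fit_line([row[j] for row in X], y)
        return lambda row: line(row[j])
    return fit


def holdout_predictions(fit, X, y, k=10):
    """k-fold CV: every sample is predicted by a model that never saw it."""
    preds = [0.0] * len(y)
    for fold in range(k):
        train = [i for i in range(len(y)) if i % k != fold]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in range(fold, len(y), k):
            preds[i] = model(X[i])
    return preds


def fit_meta_weights(z0, z1, y):
    """Least-squares weights (no intercept) for two holdout-prediction columns."""
    a00 = sum(a * a for a in z0)
    a01 = sum(a * b for a, b in zip(z0, z1))
    a11 = sum(b * b for b in z1)
    b0 = sum(a * t for a, t in zip(z0, y))
    b1 = sum(b * t for b, t in zip(z1, y))
    det = a00 * a11 - a01 * a01
    return (b0 * a11 - b1 * a01) / det, (a00 * b1 - a01 * b0) / det


def r_squared(y, preds):
    """Coefficient of determination."""
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, preds))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


# Synthetic data: the response depends on both variables, so neither base
# learner alone can explain it.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [2.0 * x0 + x1 + rng.gauss(0, 0.1) for x0, x1 in X]

z0 = holdout_predictions(base_learner(0), X, y)
z1 = holdout_predictions(base_learner(1), X, y)
w0, w1 = fit_meta_weights(z0, z1, y)
stacked = [w0 * a + w1 * b for a, b in zip(z0, z1)]

r2_base0, r2_base1 = r_squared(y, z0), r_squared(y, z1)
r2_stacked = r_squared(y, stacked)
print(r2_base0, r2_base1, r2_stacked)
```

Fitting the meta-learner on holdout rather than in-sample predictions is the key design choice: it keeps a base model that overfits its training data from receiving an inflated weight.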
2.3.3 Training of cell-penetration and cytotoxicity classification models

Engineered peptides must enter cells to reach and bind to melanin within the melanosomes and should be minimally toxic to cells. Thus, the SkipCPP-Pred [31] and the ToxinPred [33] databases were used to create SL classification ensembles to engineer tri-functional peptides. Variable reduction decreased the number of variables from 1094 to 11 for the cell-penetration data set (Fig. A.3b) and from 1094 to 56 for the cytotoxicity data set (Fig. A.3c). The prediction accuracies calculated from out-of-bag samples improved from 0.91 to 0.93 and from 0.951 to 0.958 for cell-penetration and cytotoxicity, respectively. We subsequently trained base models and SL ensembles, and the generalization performances in terms of Matthews correlation coefficient (MCC), F1 (harmonic mean of precision and recall), and balanced accuracy for cell-penetration were 0.79 ± 0.01, 0.90 ± 0.01, and 0.90 ± 0.01, respectively; and those for cytotoxicity were 0.88 ± 0.004, 0.92 ± 0.002, and 0.95 ± 0.002, respectively (Tables A.2, A.3). The number of base models in the reduced SL models trained on the whole data sets decreased from 310 to 65 for cell-penetration, and from 311 to 22 for cytotoxicity (Fig. A.4). There were 300 competitive cell-penetration models and 175 competitive cytotoxicity models. A gradient boosting machine (GBM) model and the reduced SL were selected as the final predictive cell-penetration and cytotoxicity models, respectively. Similar to melanin binding, adversarial controls had decreased generalization performances, where the MCC, F1, and balanced accuracy were −0.002 ± 0.05, 0.52 ± 0.03, and 0.50 ± 0.03 for cell-penetration, and 0.001 ± 0.01, 0.05 ± 0.02, and 0.62 ± 0.04 for cytotoxicity.

Figure 2.2 Schematic of the machine learning pipeline based on the super learner framework for the melanin binding data set. a Scheme of a larger microarray, which includes 5499 peptides used to train a regression super learner. Random peptides were generated based on position-dependent amino acid frequencies calculated using the second peptide array data, and the melanin binding levels were predicted. Peptides with desired melanin binding levels were selected for further experimental validation. Created with BioRender.com. b Scheme of the super learner complexity reduction. Holdout predictions of peptides (shown as rows) were generated for each base model (shown as columns) with tenfold cross-validation (CV) on the input data set. A meta-learner (generalized linear model) was fitted on the holdout predictions with another tenfold cross-validation. The number of base models was reduced by applying an iterative reduction procedure (see Section 2.5). The final super learner ensemble was trained on the input data set with the optimal combination of the selected base models. c Scheme of the machine learning pipeline for an unbiased model performance evaluation. The nested cross-validation includes an outer loop for model evaluation and an inner loop for model selection (cyan). The outer loop generated 10 sets of train-test splits using a Monte Carlo method, and the inner loop generated 10 sets of train-test splits using a modulo method. d Plot of the base models of the final melanin binding super learner. Coefficients of determination (R²) are denoted with color and conveyed as white text on the bars or gray text adjacent to the bars. Base model coefficients are indicated at the bar ends. One model has a coefficient of zero and is not shown. See Sections 2.5 and A.1 for information about model hyperparameter details and statistics of model performance.

2.3.4 Validation of predicted peptide properties in vitro

A position-dependent amino acid frequency matrix was used to generate 127 peptides that spanned the range of low to high predicted melanin binding.
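Generating peptides from a position-dependent amino acid frequency matrix can be sketched as follows. This is an illustrative assumption of how such sampling might work, not the procedure used in this work: the reference set here is just the two 11-residue sequences quoted in this chapter (HR97 and the TAT fragment), all peptides are assumed to share one length, and the function names are hypothetical.

```python
import random
from collections import Counter


def position_frequency_matrix(peptides):
    """Per-position amino acid frequencies for equal-length peptides."""
    length = len(peptides[0])
    matrix = []
    for pos in range(length):
        counts = Counter(p[pos] for p in peptides)
        total = sum(counts.values())
        matrix.append({aa: c / total for aa, c in counts.items()})
    return matrix


def sample_peptide(matrix, rng):
    """Draw one residue per position according to that position's frequencies."""
    return "".join(
        rng.choices(list(column), weights=list(column.values()))[0]
        for column in matrix
    )


# Illustrative reference set: the two sequences quoted in this chapter.
reference = ["FSGKRRKRKPR", "YGRKKRRQRRR"]
matrix = position_frequency_matrix(reference)

rng = random.Random(0)
candidates = [sample_peptide(matrix, rng) for _ in range(5)]
print(matrix[0])    # {'F': 0.5, 'Y': 0.5}
print(candidates)   # five sampled 11-residue peptides
```

Because each position is sampled independently, the generated peptides preserve per-position composition but not correlations between positions; candidates would still need to be scored by the trained models before selection.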
Among the 127 peptide candidates, 113 peptides were classified as cell-penetrating and 117 peptides were predicted as non-toxic. To experimentally measure melanin binding in vitro, biotinylated peptides were incubated with mNPs, and the bound fraction was calculated using an avidin-based fluorescent reporter (Fig. 2.3a). The Pearson correlation coefficient was computed to compare the predicted and experimental melanin binding values, and the correlation coefficient was r = 0.84, showing a high level of correlation between the predicted and experimental values (Fig. 2.3b). We next characterized how the predicted cell-penetrating properties of the peptides affected cell uptake in a retinal pigment epithelium cell line (ARPE-19). ARPE-19 cells were cultured using standard methods (non-induced, n = 3) and using culture conditions that induce melanin production (induced, n = 3) [28]. A positive correlation was observed between the measured in vitro melanin binding of the peptides and the intracellular peptide concentrations in melanin-induced cells for cell-penetrating peptides (r = 0.77, p < 2.2 × 10⁻¹⁶) but not non-cell-penetrating peptides (r = 0.28, Fig. 2.3c, d), suggesting correlation between the two properties. Further, peptides predicted to be cell-penetrating demonstrated significantly higher intracellular concentrations (median 229.4 pmol/100 K cells) than those of non-cell-penetrating peptides (median 26.7 pmol/100 K cells) in the melanin-induced cells (p = 6.9 × 10⁻⁶, Fig. 2.3e). In contrast, the intracellular peptide concentrations were not affected by the predicted properties in non-induced cells (Fig. A.5).

2.3.5 Analysis of peptide variables that contribute to observed properties

To identify which peptide variables contributed to the properties observed in vitro, Shapley additive explanation (SHAP) analysis of the final predictive models was performed.
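The additive property behind Shapley-based explanations, namely that a prediction equals the expected prediction plus the sum of the per-variable contributions, can be checked on a toy example. The sketch below computes exact Shapley values by enumerating all variable subsets; it does not use the SHAP library or the models of this work, and the model `f`, the instance `x`, and the background rows are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, background):
    """Exact Shapley values of f at x, averaging absent variables over background."""
    n = len(x)

    def value(subset):
        # Mean prediction when variables in `subset` take x's values and the
        # remaining variables take the values found in each background row.
        total = 0.0
        for row in background:
            z = [x[i] if i in subset else row[i] for i in range(n)]
            total += f(z)
        return total / len(background)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis, value(set())


# Hypothetical model with one additive term and one interaction term.
f = lambda z: 3.0 * z[0] + z[1] * z[2]
x = [2.0, 1.0, 4.0]
background = [[0.0, 0.0, 0.0], [1.0, 2.0, 0.0], [2.0, 1.0, 2.0], [1.0, 1.0, 2.0]]

phis, expected = shapley_values(f, x, background)
print(expected, phis, expected + sum(phis), f(x))
```

Exhaustive enumeration scales exponentially with the number of variables, which is why practical SHAP analyses rely on approximations or model-specific algorithms.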
The results showed that peptide property predictions were based on contribution from multiple variables. More specifically, basic peptides and higher net charge variables had higher contributions to melanin binding predictions (Fig. 2.4a), which was consistent with the top variables identified by the random forest classification model trained on the pilot peptide microarray. Similarly, higher net charge and higher isoelectric point contributed more to cell-penetration (Fig. 2.4b), and fewer free cysteines had more influence on non-toxic predictions (Fig. A.6). To understand how reliable the interpretable results were, adversarial controls were constructed with the final predictive models using a 10-fold cross-validation. Indeed, the distributions and levels of variable contributions changed for melanin binding, cell-penetration, and cytotoxicity (Fig. A.7).

Figure 2.3 Experimental validations of final model predictions on melanin binding and cell-penetration. a Schematic showing an in vitro melanin binding assay with melanin nanoparticles (mNPs) using a biotin quantification kit. The DyLight 494-tagged avidin emitted fluorescence when the biotinylated peptides displaced the weakly interacting 4′-hydroxyazobenzene-2-carboxylic acid (HABA or H). Created with BioRender.com. b Plot of the relationship between predicted melanin binding and binding measured experimentally in vitro. The x-axis indicates melanin binding predictions from the final super learner, and the y-axis indicates the experimental melanin binding values (n = 4 for each peptide). Dots represent the mean value for peptides. The black linear trend line conveys the Pearson correlation relationship (two-tailed), and the gray area indicates the 95% confidence interval. c, d Comparison of melanin binding and cell-penetration in melanin-induced human adult retinal pigment epithelial (ARPE-19) cells. Blue triangles denote predicted non-cell-penetrating peptides (non-CPP), and magenta dots represent predicted cell-penetrating peptides (CPP). The x-axes indicate melanin binding measured in vitro (n = 4 for each peptide), and the y-axes convey intracellular peptide concentration measured from the cell uptake assay (n = 3 for each peptide). Black linear trend lines indicate Pearson correlation relationships, with 95% confidence intervals shown as shaded areas. The correlation coefficients and p-values (two-tailed) are shown. e Summary of CPP (n = 113) and non-CPP (n = 14) intracellular concentrations. Box plot conveys median (middle line), 25th and 75th percentiles (box), and the 1.5 × interquartile range (whiskers). The p value was calculated using a Mann–Whitney U test (two-tailed).

Among all the peptide candidates, HR97 (FSGKRRKRKPR) was selected based on the combination of the three peptide properties (for HR97: melanin binding = 79.1 ± 0.7%, cell uptake = 759.9 ± 19.6 pmol/100 K cells, non-toxic = 96.9%; Fig. 2.4c). HR97 had the highest intracellular concentration, which outperformed the well-characterized cell-penetrating peptide fragment of the HIV trans-activator protein (TAT₄₇₋₅₇, YGRKKRRQRRR). HR97 demonstrated increased cell uptake compared to TAT₄₇₋₅₇ in both the induced ARPE-19 cells (HR97: 759.9 ± 19.6 pmol/100 K cells; TAT₄₇₋₅₇: 457.1 ± 34.2 pmol/100 K cells) and the non-induced cell type (HR97: 82.5 ± 9.1 pmol/100 K cells; TAT₄₇₋₅₇: 68.3 ± 4.6 pmol/100 K cells). In addition, HR97 showed no sign of cytotoxicity in ARPE-19 cells at concentrations up to 5 mg/mL (Fig. A.8). HR97 predictions embodied all the properties that were the largest contributors to each functionality, including being basic (63.64% basic amino acids), possessing a high net charge (6.98) and a high isoelectric point value (12.99), and containing no cysteines (Fig. 2.4d–f).
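Two of the HR97 sequence properties quoted above, the percentage of basic amino acids and the absence of cysteines, follow directly from the sequence and can be checked in a few lines. The sketch assumes that lysine (K), arginine (R), and histidine (H) are counted as basic; net charge and isoelectric point depend on a chosen pKa scale and are not reproduced here.

```python
HR97 = "FSGKRRKRKPR"
TAT = "YGRKKRRQRRR"   # HIV trans-activator fragment quoted in the text

BASIC = set("KRH")    # assumption: lysine, arginine, and histidine count as basic


def basic_percent(sequence):
    """Percentage of residues in the sequence that are basic."""
    return 100.0 * sum(aa in BASIC for aa in sequence) / len(sequence)


def cysteine_count(sequence):
    """Number of cysteine residues in the sequence."""
    return sequence.count("C")


print(round(basic_percent(HR97), 2), cysteine_count(HR97))  # 63.64 0
print(round(basic_percent(TAT), 2), cysteine_count(TAT))    # 72.73 0
```

Seven of the eleven HR97 residues are lysine or arginine, which reproduces the 63.64% figure reported for the peptide.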
By visualizing the peptide design space defined by the sequences and variables used in training the desired functional properties, the peptide candidates with high melanin binding predictions appeared in the same cluster, showing similar sequence motifs and physicochemical properties (Fig. 2.5a, b). Further, peptides predicted to have high melanin binding were mostly predicted to be cell-penetrating, but cell-penetrating peptides may not be melanin binding (Fig. 2.5c). The results also suggest that some melanin binding peptides may be toxic (Fig. 2.5d).

Figure 2.4 Melanin binding, cell-penetration model interpretation, and variable contributions to HR97 multifunctional peptide predictions. Overall variable contributions to model predictions for (a) melanin binding and (b) cell-penetration. The top important variables analyzed using Shapley additive explanations (SHAP) are shown. Dots represent peptides from cross-validation test sets. The x-axes indicate SHAP values, indicative of variable contributions to model prediction ranging from 0 to 100. The variables were ranked based on the difference between the maximum and minimum SHAP values. The color gradient indicates the variable values normalized by percentile ranks. Higher variable values are indicated by darker magenta color and lower values by darker blue color. The minimum and maximum variable values are noted on the right of each subplot. c Scatter plot showing the in vitro melanin binding, in vitro cell-penetration, and predicted cytotoxicity values of the 127 candidate peptides. Dots represent peptides. HR97 (black dot) was selected based on the optimal multifunctional combination. d–f Variable contributions to HR97 multifunctional predictions for melanin binding, cell-penetration, and cytotoxicity. The top variables ranked by absolute SHAP values are shown. Magenta bars indicate positive contributions, and blue bars are negative contributions.
The y-axis labels convey variable names and their values for HR97. E[f(X)] denotes the expected prediction value, and f(x) is the final prediction, calculated from the sum of all SHAP values plus E[f(X)]. 2.3.6 Characterization and validation of a peptide-drug conjugate in vivo To investigate the effect of peptide conjugation on drug pharmacodynamics, we chose brimonidine tartrate, a topical IOP lowering drug prescribed for glaucoma treatment. The HR97 peptide was conjugated to brimonidine (HR97-brimonidine) via a quaternary- ammonium traceless linker system, and the structure of the intermediates and the purified conjugate were validated by NMR and MALDI-TOF (Figs. A.9–A.12). Conjugation to HR97 provided a ∼10-fold increase in the in vitro melanin binding capacity of brimonidine (5.9×107 Kd (M) vs. 5.0×10−8 Kd (M)), which brought the binding capacity closer to other drugs with high intrinsic melanin binding, such as sunitinib malate (Fig. 2.6a) [28, 41–45]. When incubated in human aqueous fluid, only ∼7% of the brimonidine was released from the HR97-brimonidine conjugate over 28 days in vitro (Fig. 2.6b). However, upon incuba- tion with supraphysiological concentrations of human cathepsin cocktails to enzymatically cleave the linker, ∼52% of the brimonidine was liberated within 48 h (Fig. 2.6c). The ef- fect of the HR97-brimondine conjugate on IOP was then evaluated in normotensive Dutch Belted rabbits. A single topical dose with the commercial brimonidine eye drop (n = 5) was found to provide a peak reduction in IOP from baseline (∆IOP) of −3.0±0.82 mmHg that 23 Figure 2.5 Visualization of the peptide design space based on sequences and physio- chemical properties. a t-distributed stochastic neighbor embedding (t-SNE, used to visualize high-dimensional data) plots showing the peptide design space defined by the combination of one-hot encoded peptide sequences and variables used in melanin binding, cell-penetration, and cytotoxicity model training. 
Dots represent control peptides from Howell et al. [39] (magenta) and Nosanchuk et al. [40] (blue); peptides used in the pilot (purple) and second (gray and yellow) melanin binding peptide microarrays; and multifunctional peptide candidates (black and yellow) used in the validation experiments. HR97 and TAT are noted. b t-SNE plot of peptides colored by melanin binding prediction. Higher melanin binding values are colored by darker magenta and lower by darker blue. c t-SNE plot of peptides colored by cell-penetration prediction. Magenta dots represent predicted cell-penetrating peptides (CPP), and blue dots are predicted non-cell- penetrating peptides (non-CPP). d t-SNE plot of peptides colored by cytotoxicity prediction. Blue dots denote predicted toxic peptides, and magenta dots indicate non-toxic peptides. 24 recovered to baseline within 8 h (Fig. 2.6d). In contrast, a single ICM injection of the HR97- brimonidine conjugate resulted in a greater peak ∆IOP compared to an ICM injection of brimonidine solution at 2 days (−4.9 ± 0.46 mmHg vs. −2.6 ± 1.65 mmHg, p < 0.05, red arrow). In a separate experiment, ICM injection of saline or HR97 (n = 5 for each) resulted in a similar decrease in IOP that returned to baseline by day 3, and ICM injection of a physical mixture of HR97 and brimonidine tartrate (n = 5) resulted in a similar IOP profile to the brimonidine solution, returning to baseline by day 8 (Fig. A.13). To ensure that the dramatic decrease in IOP with the HR97-brimonidine conjugate was not due to toxicity, a board-certified ophthalmologist evaluated the eyes injected with the HR97-brimonidine conjugate on day 7. It was observed that the lids, lashes, and conjunctiva were normal, the corneas were clear, the corneal endothelium was normal without any pigment deposition, the anterior chambers were normal depth, there was no apparent inflammation or fibrin strands, the lenses were clear, and the iris pigmentation was symmetric. 
According to the same evaluation methods, no ocular toxicity was observed upon ICM injection of saline, HR97, or a physical mixture of HR97 and brimonidine tartrate for at least 28 days (Tables A.4–A.7). The mean ∆IOP in the HR97-brimonidine conjugate group remained significantly larger than in the rabbits dosed with brimonidine solution or the physical mixture of HR97 and brimonidine tartrate for up to 14 days (Figs. 2.6d, A.13). Further, the time for the mean ∆IOP to return to baseline was 20 days in the HR97-brimonidine conjugate group compared to 8 days in both groups of rabbits dosed with brimonidine solution or the physical mixture of HR97 and brimonidine tartrate. When summing the area under the curve (AUClast) for the cumulative ∆IOP over the 20-day measurement period after ICM injection, the HR97-brimonidine conjugate showed a ∼17-fold greater AUC compared to brimonidine solution (p < 0.001) (Fig. 2.6e). A pharmacokinetic study was conducted separately to characterize the intraocular distribution of brimonidine after ICM injection of HR97-brimonidine in Dutch Belted rabbits. The brimonidine concentration remained relatively high in the pigmented iris tissue (980 ng/g) compared to less pigmented parts of the eye, such as the aqueous humor (0.4 ng/g) and the retina (8.3 ng/g) up to 28 days after a single ICM injection (Fig. 2.6f). The brimonidine concentration in the aqueous on day 7 (83.3 ng/g) was similar to what we previously reported at 2 h after a single drop of Alphagan P (0.15%) (105 ng/g), which was the time with the largest IOP reduction in that study [46]. On day 14 after ICM injection of HR97-brimonidine, the brimonidine concentration in the aqueous (3.9 ng/g) was similar to what we previously reported at 4 h after a single drop of Alphagan P (0.15%) (4 ng/g) [46].
2.4 Discussion
Chronic eye diseases such as glaucoma require continuous treatment to prevent disease progression.
Eye drops are the most common dosage form of glaucoma therapy, though low adherence to intensive drop dosage schedules is a major challenge in disease management [11, 47, 48]. One study using an electronic monitoring device found that only 64% of patients adhered to the three-times daily dosing schedule for brimonidine eye drops over a 4-week period, even though they were aware of the monitoring [49]. Sustained drug delivery systems may be an attractive alternative for the management of chronic ocular diseases like glaucoma. The first sustained-release polymer-based implant for glaucoma treatment, Durysta®, was recently approved for sustained IOP lowering for several months with a single ICM injection [17]. However, the polymer matrix typically took longer to biodegrade than the duration of drug release, and repeated injection with additional implants was associated with increased risk of corneal endothelial cell loss and other corneal adverse reactions [50].
Figure 2.6 Characterization of HR97-brimonidine in vitro and in vivo. a In vitro binding capacity and dissociation constant of HR97-biotin, HR97-brimonidine, and brimonidine characterized using a melanin nanoparticle (mNP) assay (red dots, n = 3–5). Values shown for comparison include those we previously measured for sunitinib and N-desethyl sunitinib [28], and literature values for other ophthalmic drugs [41–45]. b In vitro stability of HR97-brimonidine conjugate in human aqueous humor for 28 days. The percent remaining was normalized to the starting concentration on day 0 (n = 3). Data are shown as mean ± SD. c Cathepsin cleavage assay of the HR97-brimonidine conjugate. HR97-brimonidine (n = 3) was incubated with human cathepsin cocktails or buffer only for 48 h at 37 °C (two-tailed t-test). Data are shown as mean ± SD. d Comparison of the intraocular pressure (IOP) change from baseline (∆IOP) after a single ICM injection of HR97-brimonidine conjugate (white dots), brimonidine solution (black dots, 200 µg brimonidine equivalent), and a single drop of Alphagan P (gray dots, 0.15%) in normotensive Dutch Belted rabbits (n = 5 per group). The IOP was measured every 1–2 days until returning to the baseline. The red arrow highlights the further decrease in IOP provided by the HR97-brimonidine. Two-tailed t-test was used, ∗p < 0.05 (adjusted p values for days 2, 3, 4, 6, and 8 were 0.044, 0.007, 0.038, 0.007, 0.007, respectively). Data are presented as mean ± SEM. e Cumulative ∆IOP of brimonidine (black dots) and HR97-brimonidine (gray squares) after ICM injection. The cumulative ∆IOP was characterized by calculating the area under the curve over the 20-day measurement period (AUClast, n = 5). Two-tailed t-test was used. Data are presented as mean ± SD. f Levels of brimonidine in the iris (black dots), aqueous (gray squares), and retina (white dots, n = 3–4) over time after ICM injection of HR97-brimonidine (200 µg brimonidine equivalent). The concentrations of brimonidine measured in the aqueous after a single drop of Alphagan P (0.15%) as part of a previous study [46] at 2 h (maximal IOP lowering time point; dotted line) and 4 h (dashed line) after dosage are shown. Data are shown as mean ± SD.
In contrast to conventional polymer-based sustained drug delivery systems, the approach we describe here does not require an implant or large amounts of excipients that will remain in the eye for extended periods. By utilizing short peptide sequences that impart melanin binding to the drug conjugate, a sustained intraocular drug release system was created without the need for a polymer matrix. Ocular melanin is a biopolymer that resides within melanosomes in pigmented ocular tissues, including the iris, ciliary body, choroid, and retinal pigment epithelium (RPE) [51].
Although the amount of pheomelanin in the eye varies depending on eye color, the amount of eumelanin in ocular tissues, including the RPE, iris pigment epithelium, and pigmented ciliary epithelium is more consistent across the population [52]. It has been described that drug binding to melanin and accumulation inside cells may diminish therapeutic effect by sequestering the drug or causing ocular toxicity [26, 27]. In the case of atropine, the intrinsic melanin binding properties were shown to lead to prolonged residence time in pigmented rabbit ocular tissues [53], and a sustained miotic response in pigmented rabbits [29]. In addition, we previously demonstrated that improving the intraocular absorption of sunitinib, a drug with relatively high melanin binding capacity, with a novel gel-forming hypotonic eye drop led to prolonged therapeutic effect of up to 1 week after dosing [28]. Indeed, a recent study used machine learning methods to characterize the structural features of small molecule drugs that impact intrinsic melanin binding, leading to the development of a model that predicted intrinsic melanin binding with 91% accuracy [30]. These findings motivated us to develop engineered adaptors designed to impart tunable melanin binding properties to small molecule drugs used to treat ocular diseases. Further, as melanin is contained within cells, the engineered adaptor should additionally provide cell penetration. Here, we developed a machine learning-based methodology to engineer tri-functional peptides that displayed melanin binding, cell-penetration, and non-toxic properties. The peptide sequence that provided the optimal combination of high melanin binding, high cell-penetration, and low cytotoxicity, HR97, was then conjugated to brimonidine as a proof-of-principle.
The HR97-brimonidine conjugate provided up to 18 days of IOP lowering with a single ICM injection in normotensive rabbits, which contrasts with the 8 h effect provided by a brimonidine eye drop.
Peptides are short sequences of amino acids that can have many combinations with diverse biological functions. Compared to other aptamers and small molecule drug libraries, peptides are relatively cost effective to synthesize and are relatively easy to modify or conjugate to small molecule drugs [54]. Currently, there are more than 80 FDA-approved peptide drugs and more than 600 in clinical and preclinical trials [55–57]. Peptides optimized for a single function, either exhibiting cell-penetration or cell targeting properties, have been widely exploited as drug carriers to shuttle drugs across biological barriers [58–60]. Peptides such as TAT, penetratin, PEP-1, and polyarginine (R6 or R8) have been conjugated with various cargos for targeting the anterior and posterior segment [61–68]. For example, various fluorescein conjugated peptides were screened for the ability to cross porcine cornea ex vivo [68, 69]. Penetratin (PNT) showed an eightfold increase in permeability compared to PEP-1, though most of the peptide was found to be sequestered within cells rather than having crossed the cornea [68, 69]. In another study, TAT peptide was conjugated to human acidic fibroblast growth factor (aFGF) and applied topically to rat eyes [70]. The conjugates reached the retina with a tmax of 30–60 min, possibly via a conjunctival-scleral penetration route [70]. However, it is known that drugs can more easily reach the posterior segment with topical administration in rat and mouse eyes compared to larger eyes, such as those of rabbits [71–73].
Many peptide screening technologies have been developed for identifying novel functional peptides, including phage display, mRNA display, and peptide microarray [74–76].
Phage display and mRNA display are capable of screening a larger number of peptides (∼10^11–10^13) compared to peptide microarray (∼10^5). However, in phage display and mRNA display, the peptide sequences are randomly generated with fixed ratios of amino acids [75]. In contrast, coupling computationally generated peptide sequences with peptide microarrays has the advantage of rapidly improving peptide design through machine learning model refinement. Peptides can be computationally represented by physicochemical and structural descriptors [77] or encoded using various rules such as binary encoding and evolution-based encoding [78]. Since peptide sequence is the source of functionality, a machine learning-based approach can be employed to develop predictors that learn the relationships between peptide variables derived from the sequence and the desired functional property [79–81]. Peptide databases have also been made available for data-driven functional peptide design, including cell-penetration and toxicity [32, 33]. However, there is only a limited number of studies for, and no database of, melanin binding peptides. For example, in the two studies that reported peptide sequences characterized as melanin binding, phage display was used to identify 8 peptides that bound to melanin in human melanoma cells [39] and 8 peptides that bound to melanized C. neoformans [40]. However, in our peptide microarray, 8 of these peptides did not demonstrate detectable melanin binding, and overall, we identified 780 peptides displaying higher levels of melanin binding than any of these peptides described in the literature. Furthermore, the second peptide microarray designed using the initial machine learning model provided more potent melanin binding peptides compared to the first peptide microarray, demonstrating the rapid improvement in design by machine learning model refinement.
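The binary encoding mentioned above can be illustrated with a short sketch. This is one common form (one-hot encoding per position, zero-padded to a fixed length), offered as an assumption about what such an encoding looks like rather than as the exact scheme used in this work; the function name, the alphabet ordering, and the padding rule are all illustrative choices.

```python
# Hypothetical alphabet ordering: the 20 standard residues by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(peptide: str, max_len: int) -> list:
    """Encode `peptide` as a flat 0/1 vector of length max_len * 20.

    Position p of the peptide occupies slots [20 * p, 20 * p + 20).
    Peptides shorter than `max_len` are zero-padded (an assumed convention).
    """
    sequence = peptide.upper()
    if len(sequence) > max_len:
        raise ValueError("peptide is longer than max_len")
    vector = [0] * (max_len * len(AMINO_ACIDS))
    for position, aa in enumerate(sequence):
        if aa not in AA_INDEX:
            raise ValueError(f"non-standard residue: {aa!r}")
        vector[position * len(AMINO_ACIDS) + AA_INDEX[aa]] = 1
    return vector


encoded = one_hot_encode("YGRKKRRQRRR", max_len=12)
print(len(encoded), sum(encoded))
```

Each residue contributes exactly one non-zero entry, so the vector length grows with the maximum peptide length while the number of ones equals the actual peptide length.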
Multifunctional peptides with dual or triple pharmacological properties have also been integrated into drug delivery systems through conjugation to drugs or drug-loaded cargos [34, 82, 83]. However, it is challenging to design peptides with multiple functions contained in a single sequence. Often single function peptides are fused directly or by a linker peptide [83–85], which may increase the peptide length and reduce the desired functional properties of each component. Another approach is to optimize additional functional properties by substituting amino acids on a template peptide with a known function [35, 36], which may require extensive laboratory screening and is time-consuming. Generating multifunctional peptides with the flexibility to choose the desired functional levels is a less explored research area [86, 87]. Here, our machine learning and model interpretation approach guided the engineering of multifunctional peptides. The peptide properties were analyzed using the shared variable set, revealing mutually important variables contributing to both melanin binding and cell-penetration, where peptides with moderate to high net charge and containing more basic amino acids tend to possess both melanin binding and cell-penetrating properties. Further, we unexpectedly observed a correlation between melanin binding and cell penetration in the in vitro cell uptake assay. Thus, the highest intracellular accumulation was achieved by increasing the amount of peptide that can access intracellular melanosomes, where the peptides can then bind to melanin and provide sustained drug release.
Many machine learning models including random forest, support vector machines, and deep learning have been developed to predict how amino acid sequence governs peptide properties [88]. Super learning is an ensemble machine learning method that takes advantage of various machine learning models.
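The core of super learning can be sketched in a few lines: collect cross-validated (out-of-fold) predictions from each base learner, then fit a meta-learner that combines them. The sketch below is illustrative only and is not the implementation used in this work. The three toy base learners (`fit_mean`, `fit_line`, `fit_nearest`), the interleaved 5-fold split, the synthetic data, and the grid search over convex weights as the meta-learner are all assumptions chosen to keep the example dependency-free.

```python
import itertools
import random


def fit_mean(xs, ys):
    """Base learner 1: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Base learner 2: ordinary least-squares line in one variable."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: mean_y + slope * (x - mean_x)


def fit_nearest(xs, ys):
    """Base learner 3: 1-nearest-neighbour regression."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


BASE_LEARNERS = [fit_mean, fit_line, fit_nearest]


def out_of_fold_predictions(xs, ys, k=5):
    """Row i holds each base learner's prediction for sample i,
    produced by a model trained without sample i's fold."""
    n = len(xs)
    preds = [[0.0] * len(BASE_LEARNERS) for _ in range(n)]
    for fold in range(k):
        train = [i for i in range(n) if i % k != fold]
        held_out = [i for i in range(n) if i % k == fold]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        for j, fit in enumerate(BASE_LEARNERS):
            model = fit(train_x, train_y)
            for i in held_out:
                preds[i][j] = model(xs[i])
    return preds


def mse(weights, preds, ys):
    """Mean squared error of the weighted combination of base predictions."""
    total = 0.0
    for row, y in zip(preds, ys):
        total += (sum(w * p for w, p in zip(weights, row)) - y) ** 2
    return total / len(ys)


def fit_meta_weights(preds, ys, steps=20):
    """Meta-learner: grid search over non-negative weights that sum to 1."""
    n_models = len(preds[0])
    best_score, best_weights = None, None
    for combo in itertools.product(range(steps + 1), repeat=n_models - 1):
        if sum(combo) > steps:
            continue
        weights = [c / steps for c in combo] + [(steps - sum(combo)) / steps]
        score = mse(weights, preds, ys)
        if best_score is None or score < best_score:
            best_score, best_weights = score, weights
    return best_weights


random.seed(0)
xs = [i / 10 for i in range(60)]
ys = [0.5 * x + (x % 1.0) + random.gauss(0, 0.2) for x in xs]  # synthetic data

preds = out_of_fold_predictions(xs, ys)
weights = fit_meta_weights(preds, ys)
base_mse = [mse([1.0 if j == i else 0.0 for j in range(3)], preds, ys) for i in range(3)]
ensemble_mse = mse(weights, preds, ys)
print("weights:", weights)
print("base CV MSE:", base_mse, "ensemble CV MSE:", ensemble_mse)
```

Because the weight grid contains the three weight vectors that select a single base learner, the ensemble's cross-validated error in this sketch cannot exceed that of the best base learner on the same out-of-fold predictions. A full super learner would additionally refit every base learner on all of the data before applying the learned weights to new samples.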
The predictive performance of a super learner (SL) ensemble is asymptotically assured to be at least as accurate as the best-performing base model [89, 90]. The same model types with varying hyperparameter combinations can be included in an SL ensemble. Recently, it was described that base model hyperparameter tuning could improve overall SL model performance [91]. Based on this finding, we further developed a procedure to systematically select the optimal base model composition by iteratively filtering out models that contribute less to the SL ensemble. Indeed, we obtained better SL model performance compared to the ensemble including all base models. In this study, we explored a wide array of possible machine learning models and identified multiple competitive models through statistical analyses. SL provided a framework to integrate these explored models. Although the meta-learner may add a layer of complexity, it provided an interpretable summary of the model importance in terms of their contributions to the final predictions. In addition, the complexity of the machine learning architecture was reduced by variable reduction of the data sets and base model filtration of SL. Further, interpretable machine learning that extracts relevant information such as variable contributions to output predictions from the data relationships learned by the model is important for explaining model predictions [92, 93]. Many of the functional peptide predictors and other drug discovery tools do not have information on how and why top candidates were identified [94–96]. In this study, we showed that interpretation of machine learning models can provide insights to improve the design of multifunctional peptides. The SHAP analysis not only indicated important variables contributing most to the model prediction, but also showed the relationships between variable values and prediction outputs.
The studies described here are not without limitations.
First, while the in vitro ARPE-19 cell assay helped validate the cell-penetrating and melanin binding performance, the methodology used here did not differentiate between peptides that were free or bound to melanin or other structures within the cell. Indeed, there was a baseline level of peptide associated with non-pigmented cells, but a substantial increase in cellular localization was observed when the cells were induced to produce melanin. Second, the traceless linker conjugation yield of the HR97-brimonidine was low and requires further optimization. The cathepsin-labile linker was chosen because cathepsins are largely located intracellularly and are present in minimal amounts in extracellular fluids such as aqueous humor [97–100]. Thus, the intracamerally delivered HR97-brimonidine would be stable until it had localized within melanin-containing cells. However, the level of brimonidine measured in rabbit iris tissue remained high, suggesting that further optimization of the linker cleavage and brimonidine release rate may also extend the duration of the therapeutic effect. Finally, the duration of IOP lowering reported here (20 days) was sufficient to demonstrate the proof-of-principle in normotensive rabbits but would not be clinically translatable. Future work with more potent drugs may increase the duration of action.
The approach we described here to apply ensemble machine learning to peptide microarray enabled the efficient design of multifunctional peptides, which in this application enhanced the intraocular pharmacokinetics and pharmacodynamics of the ophthalmic drug brimonidine. Engineered HR97 peptide demonstrated increased cell-penetrating properties compared to known cell-penetrating peptides, such as TAT, and simultaneously possessed high melanin binding capacity and low cytotoxicity.
In the current context, utilizing short peptide sequences that impart melanin binding to a drug conjugate may provide an avenue for creating safe and effective implant-free sustained intraocular drug release systems. More broadly, the approach described here can be applied to generate multifunctional peptide-drug conjugates for a variety of biomedical applications.
2.5 Methods
2.5.1 Material sources
Brimonidine was purchased from TCI America. Eumelanin from Sepia officinalis, 0.22 µm Millex-GV PVDF filter, ferric ammonium citrate, bovine serum albumin (BSA), Tween 20, fetal bovine serum (FBS), trifluoroacetic acid (TFA), tert-butyl methyl ether (MTBE), thionyl chloride, tetrabutylammonium iodide, N,N-diisopropylethylamine, human cathepsins B, K, L and S, Whatman® Anotop® 0.02 µm syringe filter and Triton X-100 were purchased from Sigma Aldrich (St. Louis, MO, USA). ARPE-19 (ATCC CRL-2302, lot No. 70013110), and DMEM:F12 medium were purchased from the American Type Culture Collection (Manassas, VA, USA). EZ-Link™ Amine-PEG2-Biotin, BupH MES buffer saline pack (2-(N-morpholino)ethanesulfonic acid buffer), EDC (1-ethyl-3-(3-dimethylaminopropyl)carbodiimide hydrochloride), NHS (N-hydroxysuccinimide), Pierce™ Fluorescence Biotin Quantitation Kit, rapid equilibrium dialysis (RED) 8 K device, PrestoBlue™ HS Cell Viability Reagent, DMEM with high glucose and pyruvate, Trypsin-EDTA (0.25%) with phenol, RIPA lysis buffer, Streptavidin DyLight 680, and penicillin/streptomycin were purchased from Thermo Fisher Scientific (Waltham, MA, USA). Disposable PD-10 desalting columns were purchased from VWR. Dulbecco's Phosphate Buffered Saline (DPBS), 1 × phosphate buffered saline (PBS), 10 × PBS, high-performance liquid chromatography (HPLC) grade acetonitrile, dimethylformamide (DMF), and water were purchased from Fisher Scientific (Hampton, NH, USA). Mc-Val-Cit-PAB was purchased from Cayman Chemical (Ann Arbor, MI, USA).
Endotoxin-Free Ultra-pure Water was purchased from MilliporeSigma (Burlington, MA, USA). Hamilton 1700 Series gastight syringes (25 µL, Model 1702 RN, 27 gauge) were purchased from Hamilton Company (Reno, NV, USA). BD 1 mL TB syringes with 28 G needles were purchased from BD (San Jose, CA, USA). Isoflurane was purchased from Baxter (Deerfield, IL, USA). Reverse-action forceps were purchased from World Precision Instruments (Sarasota, FL, USA). Neomycin, polymyxin B, and bacitracin zinc ophthalmic ointment was purchased from Akorn (Lake Forest, IL, USA).
2.5.2 Melanin nanoparticle synthesis and characterization
Melanin nanoparticles (mNPs) were synthesized from the eumelanin of Sepia officinalis. In brief, 10 mg/mL of eumelanin was suspended in DPBS using an ultrasonic probe sonicator (Sonics, Vibra Cell VCX-750 with model CV334 probe, Newtown, CT, USA) by pulsing 1 s on/off at 40% amplitude for 30 min in a 4 °C water bath. The suspension was then filtered through a 0.22 µm Millex-GV PVDF filter and transferred to PD-10 desalting columns. The resulting mNP solution was lyophilized for 7 days and stored at −20 °C until further use. For mNP biotinylation (b-mNPs), mNPs were suspended in 2 mL MES buffer with 2.4 mg of EDC and 3.6 mg of NHS for 15 min at room temperature to first activate the carboxylic acid groups. To increase the buffer pH above pH 7.4 for amine reaction, 400 µL of 10 × PBS was directly added to the mixture and incubated for 5 min. Various amounts of EZ-Link™ Amine-PEG2-Biotin (5, 15, 20, 30 mg) were reacted with activated mNPs for 2 or 6 h at room temperature. Since all conditions led to a similar degree of mNP biotinylation, the reaction condition using 5 mg of amine-PEG2-biotin with 2 h incubation at room temperature was used moving forward. The reaction mixture was then transferred to PD-10 desalting columns to further collect the b-mNPs.
To transfer the b-mNPs to different solvents (water, pH 6.5 PBS, pH 7.4 PBS) for optimization of the peptide microarray, PD-10 columns were first equilibrated with buffer, and then the b-mNPs were added. Particle size and ζ-potential were determined by dynamic light scattering and laser Doppler anemometry, respectively, using a Zetasizer Nano ZS90 (Malvern Instruments). Size measurements were performed at 25 °C at a scattering angle of 173°. Samples were diluted in 10 mM NaCl solution (pH 7), and measurements were performed according to instrument instructions. Pierce™ Fluorescence Biotin Quantitation Kits were used to quantify the biotin content on the b-mNPs. B-mNPs (1 mg/mL) were diluted 1:50, 1:100, and 1:200 with 1 × PBS, and biocytin standards (10–60 pmol/10 µL) were freshly prepared for measuring the biotin concentration. Transmission electron microscopy (H7600; Hitachi High Technologies America) was conducted to determine the morphology of mNPs and b-mNPs.
2.5.3 Optimization of processing conditions for peptide microarray
A total of 119 peptides, including 8 peptides of length 7 amino acids (aa) and 8 peptides of length 10 aa from the literature [39, 40], and 103 random 15 aa peptides generated with a frequency of 5% for each of the 20 amino acids, were printed in duplicate on peptide microarrays by PEPperPRINT. The peptide microarrays contained hemagglutinin (HA) peptides (YPYDVPDYAG; 9 spots) as internal quality controls. Various screening conditions were tested on the peptide microarray. A spectrum scan of melanin nanoparticles (mNPs) and biotinylated mNPs confirmed that the autofluorescence was near background levels at emission wavelengths above 650 nm. Streptavidin DyLight 680, which was the highest wavelength (Ex = 675 nm, Em = 705 nm) that PEPperPRINT could use in their peptide microarray system, was selected to minimize detection of melanin.
Two peptide microarray copies were first pre-stained with streptavidin DyLight680 (0.2 µg/mL) and the control antibody (manufacturer: BioxCell & PEPperPrint, catalogue numbers: #RT0268, PEPperCHIP® Mouse Monoclonal anti-HA (12CA5)-DyLight800 Control; 1:2000 dilution or 0.5 µg/mL) in incubation buffer (pH 6.5 PBS with 0.005% Tween 20 and 10% Rockland blocking buffer MB-070) for 45 min at room temperature to examine background interactions and internal quality control. No background interaction of streptavidin DyLight680 or the control antibody with the 119 different peptides was observed. To screen for the optimal melanin binding condition, six different washing buffers were prepared: PBS at pH 6.5 with or without 0.005% Tween 20, PBS at pH 7.4 with or without 0.005% Tween 20, and Ultra-pure water with or without 0.005% Tween 20. The Rockland blocking buffer MB-070 was used to incubate all peptide microarrays for 30 min before the melanin binding assay. Six different incubation buffers were formulated with 10% of blocking buffer in the six different washing buffers mentioned earlier. b-mNPs (10, 100, or 500 µg/mL) in six different incubation buffers were incubated with the peptide microarray for 16 h at 4 °C or room temperature. All microarrays were subsequently washed with the same type of washing buffers and incubated with 0.2 µg/mL of streptavidin DyLight680 for 45 min in the same type of incubation buffer at room temperature for detecting the b-mNPs. The peptide microarrays were then washed for 3 × 10 s with the same type of washing buffers before quantification of spot intensity. The pilot tests suggested that 500 µg/mL of biotinylated mNPs in pH 6.5 PBS buffer at room temperature was optimal (optimal condition shown in Fig. 2.1d, remaining conditions shown in Fig. A.2). With the optimal flow conditions, 10 of the 16 peptides reported in the literature had detectable fluorescence intensities due to binding by b-mNPs.
Quantification of spot intensities and peptide annotation were based on the 16-bit gray scale Tag Image File Format files that exhibit a higher dynamic range than the 24-bit colorized Tag Image File Format files. Microarray image analysis was done with PepSlide® Analyzer, version 1.4. The software algorithm decomposed fluorescence intensities of each spot into raw, foreground and background signal, and calculated mean median foreground intensities and spot-to-spot deviations of spot duplicates. Based on mean median foreground intensities, intensity maps were generated and interactions in the peptide maps highlighted by an intensity color code with red for high and white for low spot intensities. The PEPperPRINT protocol tolerated a maximum spot-to-spot deviation of 40%; otherwise the corresponding intensity value was zeroed. We labeled the top 20% of peptides ranked by intensities as melanin binding (23 peptides), which included 10 literature-reported peptides with non-zero fluorescent signal. The remaining peptides were labeled as non-melanin binding (96 peptides).
2.5.4 Random forest classification model training with the pilot 119-peptide microarray
Random forest is an ensemble tree-based statistical machine learning model and is robust to variable noise and insensitive to variable scales [38]. Physicochemical variables and numerical representations of peptides were computed using the R packages Peptides, version 2.4.4 [101] and protr, version 1.6–2 [102]. The resulting 1094 variables include composition, transition, distribution, autocorrelation, conjoint triad, quasi-sequence-order descriptors, and pseudo-amino acid and amphiphilic pseudo-amino acid composition descriptors. The maximum value of lag was set to 6, so the minimum length of a peptide to be analyzed without generating a missing value is 7. A random forest classification model with 100,000 trees and balanced sampling was trained on the melanin binding data set.
The model was built using the R package randomForest, version 4.7–1.1 [103]. For each tree in the random forest, a bootstrap sample of ∼63.2% of the melanin binding peptides and the same amount of non-melanin binding peptides was generated to construct the tree. The remaining peptides were considered out-of-bag to the tree and were used to evaluate the performance of the random forest by calculating the aggregated out-of-bag predictions across all trees. The out-of-bag class errors were calculated and a classification threshold of 0.5 proportion of votes was used. As part of the same analysis, permutation variable importance was obtained with the importance function in the randomForest package. For each tree in the random forest, out-of-bag instances were permuted for each variable in the subset, and the decrease in accuracy was recorded. The mean decrease in accuracy for each variable was calculated over all 100,000 trees and normalized by dividing the mean by the standard error.
2.5.5 Expansion of the peptide microarray
Melanin binding candidate peptides were generated randomly with a frequency of 5% for each of the 20 amino acids. Peptides classified as melanin binding by the trained random forest model were selected, resulting in 5483 peptides of length ranging from 7 to 12 aa. Along with the 16 known melanin binding peptides from the literature, a total of 5499 peptides were printed in duplicate along with HA controls (YPYDVPDYAG; 68 spots) on custom peptide microarrays by PEPperPRINT. Pre-staining of a peptide microarray copy was done with streptavidin DyLight680 (0.2 µg/mL) and the control antibody (mouse monoclonal anti-HA (12CA5) DyLight800; 0.5 µg/mL) in incubation buffer to characterize non-specific binding.
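The random candidate generation described at the start of Section 2.5.5 can be sketched as follows. Each position is drawn uniformly from the 20 standard amino acids, which corresponds to the 5% per-residue frequency stated in the text. The uniform choice of length between 7 and 12 residues, the removal of duplicate sequences, and the function names are assumptions made for illustration; the subsequent filtering by the trained random forest classifier is not shown.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, each drawn with p = 0.05


def random_peptide(rng, min_len=7, max_len=12):
    """Draw one peptide; the length is uniform on [min_len, max_len] (assumed)."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def generate_candidates(n, seed=0):
    """Return `n` distinct random peptides, reproducible for a given seed."""
    rng = random.Random(seed)
    seen = set()
    while len(seen) < n:
        seen.add(random_peptide(rng))
    return sorted(seen)


candidates = generate_candidates(1000, seed=1)
print(candidates[:3])
```

A minimum length of 7 matches the constraint noted in Section 2.5.4, where a maximum lag of 6 means shorter peptides would produce missing descriptor values.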
Subsequent incubation of another peptide microarray with the b-mNPs at a concentration of 500 µg/mL in incubation buffer (PBS at pH 6.5 with 0.005% Tween 20 with 10% Rockland blocking buffer MB-070) was followed by staining with streptavidin DyLight680 (0.5 µg/mL) and the control antibody (0.5 µg/mL). The control staining of the HA epitopes was done simultaneously as internal quality control to confirm the assay quality and the peptide microarray integrity. Quantification of spot intensities was performed as described earlier.
2.5.6 Variable reduction of the machine learning input data
To reduce the number of variables and improve the model performance, a variable reduction procedure was applied to the machine learning input data before model training. Permutation-based variable importance was first computed on the data set with random forest (100,000 trees) using the R package ranger, version 0.14.1 [104], with balanced sam