RESEARCH ARTICLE

Evaluation of AlphaFold antibody–antigen modeling with implications for improving predictive accuracy

Rui Yin1,2 | Brian G. Pierce1,2

1University of Maryland Institute for Bioscience and Biotechnology Research, Rockville, Maryland, USA
2Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, Maryland, USA

Correspondence
Brian G. Pierce, University of Maryland Institute for Bioscience and Biotechnology Research, Rockville, MD 20850, USA.
Email: pierce@umd.edu

Funding information
National Institutes of Health, Grant/Award Number: R35 GM144083

Review Editor: Nir Ben-Tal

Abstract
High resolution antibody–antigen structures provide critical insights into immune recognition and can inform therapeutic design. The challenges of experimental structural determination and the diversity of the immune repertoire underscore the necessity of accurate computational tools for modeling antibody–antigen complexes. Initial benchmarking showed that despite overall success in modeling protein–protein complexes, AlphaFold and AlphaFold-Multimer have limited success in modeling antibody–antigen interactions. In this study, we performed a thorough analysis of AlphaFold's antibody–antigen modeling performance on 427 nonredundant antibody–antigen complex structures, identifying useful confidence metrics for predicting model quality, and features of complexes associated with improved modeling success. Notably, we found that the latest version of AlphaFold improves near-native modeling success to over 30%, versus approximately 20% for a previous version, while increased AlphaFold sampling gives approximately 50% success. With this improved success, AlphaFold can generate accurate antibody–antigen models in many cases, while additional training or other optimization may further improve performance.
KEYWORDS
antibody, antigen, AlphaFold, deep learning

Received: 28 July 2023 | Revised: 1 December 2023 | Accepted: 7 December 2023
DOI: 10.1002/pro.4865
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
© 2023 The Authors. Protein Science published by Wiley Periodicals LLC on behalf of The Protein Society.
Protein Science. 2024;33:e4865.

1 | INTRODUCTION

Antibodies are a key component of the immune system, defending the host from viruses and other pathogens through specific recognition of protein and non-protein antigens. Typically, antibodies engage their antigenic targets using the hypervariable complementarity determining region (CDR) loops within the variable domain (Chothia & Lesk, 1987), which are stabilized by the β-sandwich structure of the framework region (Sela-Culang et al., 2013). Despite sharing a conserved immunoglobulin structure, antibodies collectively exhibit a remarkable ability to recognize and bind to a wide array of antigens with high specificity. The highly specific and diverse nature of antibody–antigen interactions makes antibodies highly useful as therapeutics as well as a consideration in vaccine development efforts (Carter, 2006; Nelson et al., 2010; Rappuoli et al., 2016; Scott et al., 2012).

High resolution structures of antibody–antigen complexes have refined our knowledge of immunity (Li et al., 2003), revealed the molecular basis of antibody recognition of viral epitopes (Barnes et al., 2020; Dreyfus et al., 2012; Zhou et al., 2015), and guided the effective design of antibodies (Haidar et al., 2012; Hanf et al., 2014) and immunogens (Graham et al., 2019). However, due to the challenges of experimental structure determination, resource and time constraints, as well as the highly diverse nature of the immune repertoire (Georgiou et al., 2014; Li et al., 2004), experimental characterization of most antibody–antigen complex structures is impractical. Therefore, computational tools have been developed and applied to bridge this gap. General protein–protein docking methods have been applied to model antibody–antigen complex structures with limited success (Vreven et al., 2015), due in part to the need to account for the mobility of key CDR loops, as well as the size of certain antigens. To address this, algorithms have been developed specifically for antibody–antigen complex modeling (Ambrosetti et al., 2020; Brenke et al., 2012; Krawczyk et al., 2013; Sircar & Gray, 2010). However, accurate structural prediction of antibody–antigen complexes remains a challenge (Guest et al., 2021; Vreven et al., 2015).

Recently, the scientific community saw a major breakthrough with AlphaFold (v.2.0), which uses an end-to-end deep neural network to predict protein structures from sequence (Jumper et al., 2021a). AlphaFold iteratively infers and refines pairwise residue–residue evolutionary and geometric information from multiple sequence alignments (MSAs) and has achieved unprecedented success in protein structure prediction (Jumper et al., 2021a, 2021b).
Its capabilities were expanded by the development of AlphaFold-Multimer (Evans et al., 2021) (released in AlphaFold v.2.1), an updated implementation of AlphaFold that was designed to predict protein–protein complex structures. The overall architecture of AlphaFold-Multimer is similar to the previous version of AlphaFold, with changes including cross-chain MSA pairing, adjusted loss functions, and training on protein–protein interface residues.

Previously, our benchmarking revealed that, while generally successful in protein–protein complex structure prediction, AlphaFold was less successful in modeling antibody–antigen complexes, and adaptive immune recognition in general (Yin et al., 2022). This lack of success in antibody–antigen structure prediction was also noted by the developers of AlphaFold-Multimer (Evans et al., 2021). However, some highly accurate antibody–antigen complex models were generated by AlphaFold (Yin et al., 2022), which shows potential for success of the “fold-and-dock” approach for antibody–antigen structure prediction. While recent studies have assessed the predictive performance of AlphaFold for modeling unbound antibodies (Abanades et al., 2023; Ruffolo et al., 2023), or optimization of AlphaFold's ability to predict protein complexes in general (Bryant et al., 2022; Wallner, 2023a), studies have not focused on AlphaFold's performance in antibody–antigen recognition, particularly in light of updated versions of AlphaFold and its multimer model (v2.2, v2.3) (DeepMind, 2022), since our initial test of AlphaFold v2.1 on a set of 100 antibody–antigen complexes (Yin et al., 2022). Thus, there is a need for an updated and expanded benchmarking and analysis of AlphaFold performance on this challenging and important class of complexes.

In this study, we report a comprehensive benchmarking of AlphaFold for antibody–antigen complex structure modeling.
With a dataset of over 400 high resolution and non-redundant antibody–antigen complexes, representing a major increase over the 100 complexes that were used previously (Yin et al., 2022), we investigated factors contributing to modeling successes and failures, including antibody class and subunit accuracy. The default AlphaFold model confidence score was found to be well correlated with antibody–antigen model accuracy, while residue-level confidence for interface residues was likewise correlated with model accuracy. Interestingly, we found that recent optimization of AlphaFold led to notably higher antibody–antigen accuracy, while use of a “massive sampling” strategy with large sets of pooled AlphaFold models for each complex (Wallner, 2023a) led to even better performance. Our study presents a thorough analysis of AlphaFold's ability to predict antibody–antigen complexes, yielding valuable insights for interpreting model accuracy, identifying obstacles in the modeling process, and highlighting potential areas for improvement.

2 | RESULTS

2.1 | AlphaFold antibody–antigen complex modeling accuracy

To perform a comprehensive and detailed assessment of AlphaFold's ability to model antibody–antigen complexes, we assembled a set of over 400 nonredundant antibody–antigen complexes released after April 30, 2018 (Table S1). The date cutoff was selected to avoid overlap with the training set of the tested version of AlphaFold (v2.2.0, hereafter denoted as v2.2 for brevity). Nonredundancy and additional test case selection criteria are described in Section 4. For efficiency, we only utilized the variable domains of the antibody sequences for modeling. As all AlphaFold modeling of multimers in this study was performed with the multimer model of AlphaFold, we use the term AlphaFold (vs. AlphaFold-Multimer) to denote that protocol in this study, for brevity.
The accuracy of antibody–antigen complex predictions was evaluated using Critical Assessment of Predicted Interactions (CAPRI) criteria (Lensink et al., 2020), which classify predictions as incorrect, acceptable, medium, or high based on a combination of interface root-mean-square distance (I-RMSD), ligand root-mean-square distance (L-RMSD), and fraction of native interface residue contacts (fnat), in comparison with the experimentally determined antibody–antigen complex structure. AlphaFold generated acceptable or higher accuracy models as top-ranked predictions for 26% of the 427 test cases for which models were generated (Figure 1a). Medium or higher accuracy models, which we refer to as near-native predictions, were generated as top-ranked predictions for 18% of the cases, and high accuracy models were generated for 5% of the test cases. Success rates increased when all 25 predictions per complex were taken into consideration, leading to 37% of the cases achieving acceptable or higher accuracy predictions, 22% achieving medium or higher accuracy predictions, and 6% achieving high accuracy predictions.

Representative models generated by AlphaFold are shown in Figure 1b (PDB code 6nmv; antibody/SIRP-alpha complex) and Figure 1c (PDB code 6j15; antibody/PD-1 complex). Both models are top-ranked predictions for the respective complex. The model in Figure 1b has high CAPRI accuracy and an I-RMSD value of 0.68 Å, indicating a low level of structural deviation of this modeled antibody–antigen complex from the native complex.
Figure 1c shows an acceptable CAPRI accuracy prediction with an I-RMSD of 3.55 Å. While the antibody engages the correct site of the antigen in this example, a deviation in positioning of the antibody on the antigen, with respect to the experimentally determined structure, is observed.

We compared the AlphaFold benchmarking results with other pipelines and approaches, including AlphaFold in ColabFold (Mirdita et al., 2022). For fairness of the comparison with the full AlphaFold pipeline's results, ColabFold was modified to generate 25 predictions per complex. ColabFold's modeling success was similar to that of AlphaFold for the 426 cases for which both algorithms were able to generate models, with slightly lower success observed for ColabFold (Figure S1). The difference in success may be due to factors such as different MSAs or structural templates, as ColabFold and AlphaFold employ distinct approaches for building and pairing MSAs, and utilize different sequence and template databases. In a comparison with previously developed docking approaches, we observed that AlphaFold exhibits higher success in antibody–antigen modeling than the rigid-body docking algorithms ZDOCK (Pierce et al., 2011) and ClusPro (antibody mode) (Brenke et al., 2012) with modeled unbound structures as input (Supplementary Results, Figures S2 and S3).

FIGURE 1 Antibody–antigen modeling accuracy of AlphaFold. (a) Benchmarking of AlphaFold (v.2.2, multimer model) was performed on 427 antibody–antigen complexes. For each complex, 25 predictions were generated and ranked by AlphaFold model confidence score. Antibody–antigen predictions were evaluated for complex modeling accuracy using CAPRI criteria for high, medium, and acceptable accuracy. The success rate was calculated based on the percentage of cases that had at least one model among their top N ranked predictions that met a specified level of CAPRI accuracy. Bars are colored by CAPRI accuracy level. (b) Example of a near-native prediction by AlphaFold, in comparison with the experimentally determined structure (PDB: 6nmv; antibody/SIRP-alpha complex). This model has high CAPRI accuracy (I-RMSD = 0.68 Å) and has the highest model confidence of all 25 predictions of this complex (model confidence = 0.88). (c) An example of an acceptable accuracy complex model from AlphaFold, in comparison with the experimentally determined structure (PDB: 6j15; antibody/PD-1 complex). This model has acceptable CAPRI accuracy (I-RMSD = 3.35 Å), and has the highest model confidence of all 25 predictions of this complex (model confidence = 0.75). Complex structures in (b,c) are superposed by antigen with the model and the x-ray structure components colored separately as indicated on right.

We observed higher success in modeling antibody–antigen complexes by AlphaFold in this study versus our previous benchmarking study, in which fewer than 10% of cases had top-ranked predictions with near-native accuracy (Yin et al., 2022) (vs. 18% here, as noted above). This difference is likely due to the newer version of AlphaFold used in this study (v2.2 vs. v2.1), which uses a retrained multimer model, as well as different sets of test cases, with the current study representing a substantial expansion over the cases used previously.
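The CAPRI classification and the top-N success rate described in Section 2.1 can be illustrated with a minimal sketch. The numeric thresholds below are the commonly cited CAPRI protein–protein cutoffs and are an assumption here, since they are not stated in the text; the function names and the toy labels are hypothetical and are not taken from the benchmarking pipeline.

```python
from typing import Dict, List

# CAPRI quality levels, ordered from worst to best.
LEVELS = ["incorrect", "acceptable", "medium", "high"]


def capri_class(fnat: float, i_rmsd: float, l_rmsd: float) -> str:
    """Assign a CAPRI level from fnat, I-RMSD and L-RMSD (both in Å).

    Thresholds are the commonly cited CAPRI cutoffs (an assumption,
    not quoted from the article).
    """
    if fnat >= 0.5 and (l_rmsd <= 1.0 or i_rmsd <= 1.0):
        return "high"
    if fnat >= 0.3 and (l_rmsd <= 5.0 or i_rmsd <= 2.0):
        return "medium"
    if fnat >= 0.1 and (l_rmsd <= 10.0 or i_rmsd <= 4.0):
        return "acceptable"
    return "incorrect"


def success_rate(ranked: Dict[str, List[str]], top_n: int, min_level: str) -> float:
    """Percentage of complexes with at least one prediction at or above
    `min_level` among their first `top_n` predictions.

    Each list in `ranked` must already be sorted by model confidence,
    best first.
    """
    threshold = LEVELS.index(min_level)
    hits = sum(
        any(LEVELS.index(label) >= threshold for label in labels[:top_n])
        for labels in ranked.values()
    )
    return 100.0 * hits / len(ranked)


# Hypothetical toy benchmark: four complexes, two ranked predictions each.
toy = {
    "complex_A": ["high", "incorrect"],
    "complex_B": ["incorrect", "medium"],
    "complex_C": ["incorrect", "incorrect"],
    "complex_D": ["acceptable", "incorrect"],
}
print(success_rate(toy, top_n=1, min_level="medium"))  # near-native, top 1
print(success_rate(toy, top_n=2, min_level="medium"))  # near-native, top 2
```

In this toy set, only one complex has a near-native top-ranked prediction, while a second is recovered when both predictions are considered, which mirrors the increase in success reported when moving from top-ranked predictions to all 25 predictions.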
2.2 | Antibody–antigen modeling accuracy determinants

To identify possible factors associated with modeling outcome, we analyzed properties of the native antibody–antigen complexes in relation to predictive modeling success. As glycans are not modeled by AlphaFold and antigen glycosylation can be an important component in antibody–antigen recognition (in some cases with glycans contacted directly by antibodies) (Kappler & Hennet, 2020), the subset of complexes with antibody–antigen interface glycans in our set was identified (N = 45) to assess the impact of antigen glycosylation on modeling outcome. We identified several additional cases (N = 4) containing non-protein ligand molecules (lipids, nucleotides) at the antibody–antigen interface that were likewise included in the set. Our analysis showed that the presence of non-protein ligands and glycans at the native antibody–antigen interface is associated with lower modeling success (Figure 2a). Among those 49 cases, top-ranked predictions of medium accuracy were produced in only 8% of cases, and no high accuracy top-ranked predictions were produced. In contrast, for cases not belonging to this category, 19% had top-ranked predictions of medium or higher accuracy. Thus, the lack of explicit consideration of interface glycans and ligands may reduce modeling accuracy for some antibody–antigen complexes. Nonetheless, AlphaFold was able to accurately model a single-domain antibody–antigen complex for which the native structure contains a glycosphingolipid antigen, α-galactosylceramide (α-GalCer), in the binding interface, as shown in Figure 2b (PDB code 6v7y; single-domain antibody/CD1d α-GalCer complex).
The model, a top-ranked prediction for the complex, has medium CAPRI accuracy and an I-RMSD value of 1.02 Å, indicating that AlphaFold accurately captured the antibody–antigen docking conformation despite the absence of an explicit representation of the glycosphingolipid antigen at the binding interface.

Antigen glycosylation can be an important component in antibody–antigen recognition, with many cases of glycans contacted directly by antibodies (Kappler & Hennet, 2020). The importance of glycans, as well as the prevalence of antigen N-glycosylation in our dataset (45 out of 49 glycan/ligand interface complexes, as noted above), prompted us to examine antigen glycosylation in the set further. As some x-ray or cryo-EM structures used for analysis may lack resolved glycan atoms, or naturally occurring glycans can be removed enzymatically or via mutation to enable structural characterization, it is possible that some members of the non-glycan/ligand set (N = 378) may actually have interface-proximal glycans in vivo. Based on the analysis of antigen source organism and proximity of surface-exposed N-glycosylation motifs to the interacting antibody, a subset of N = 91 cases was identified as having possible antigen N-glycosylation near the antibody-binding site. The predicted antibody-proximal antigen glycosylation subset showed moderately lower modeling success, with medium or higher accuracy top-ranked predictions generated in 16% of cases, compared to 20% medium/high success for the cases without likely or structurally resolved interface N-glycosylation (N = 287) (Figure S4). It should be noted that factors such as varying levels of N-glycan site occupancy and cellular localization (e.g. intracellular vs.
extracellular proteins) were not considered in the computational identification of potential N-glycan sites, and experimental methods such as mass spectrometry of native proteins from organism-specific cells would be needed for more conclusive identification of antigen-linked glycans.

We also investigated whether antibody–antigen complexes containing single-chain antibodies (or nanobodies) are more successfully modeled compared to their heavy–light chain counterparts (Figure 2c). For nanobody–antigen complexes (N = 132), 27% of cases had medium or higher accuracy top-ranked predictions, versus 14% of cases with medium or higher accuracy top-ranked predictions for heavy–light chain antibody–antigen complexes (N = 295). To understand the pronounced difference in modeling nanobody–antigen complexes versus antibody–antigen complexes, we investigated the difference in MSA depth of the two types of complexes. We hypothesized that the single-chain variable domains in nanobodies may simplify construction of cross-chain MSAs for nanobody–antigen complexes, as opposed to the more complex heavy–light chain antibodies. However, after analyzing the MSA depth, we found no statistically significant difference in the number of effective sequences (Neff, a measure of the effective sequence count in an MSA [Jumper et al., 2021a]) between the two types of complexes (Figure S5). This suggests that other factors, such as fewer CDR loops and a smaller search space, may contribute to the observed difference in modeling success. Unlike heavy–light chain antibodies, which possess six CDR loops, the variable domain of nanobodies
contains only three loops; thus it is possible that the lower complexity and size of the receptor component of the complex play a role in the improved modeling performance observed for AlphaFold.

To investigate whether more favorable antibody–antigen interfaces are more successfully predicted by AlphaFold, we compared antibody–antigen interface energy, computed from the bound complex structure using Rosetta (Leman et al., 2020), with modeling success considering all 25 predictions of each case (Figure 2d). We found that more negative interface energies, indicative of more energetically favorable protein–protein interactions, are associated with higher AlphaFold modeling success. The distributions of interface energy scores differ significantly between complexes with incorrect versus medium accuracy predictions (p ≤ 0.05), and between complexes with incorrect versus high accuracy predictions (p ≤ 0.001), based on the Wilcoxon rank-sum test. Other factors including modeled antigen assembly mode, CDR loop modeling accuracy, and antigen modeling accuracy were evaluated for their influence on the success of antibody–antigen modeling (Supplementary Results, Figures S6–S8).

FIGURE 2 Properties associated with antibody–antigen modeling success. (a) Success rates based on presence of non-protein atoms (glycans or ligands) at the antibody–antigen interface. Complexes are classified as either “Yes” (N = 49) or “No” (N = 378) to indicate whether glycans/ligands are present or absent in the antibody–antigen interface. (b) An example of a medium accuracy complex model from AlphaFold for an interface ligand complex, in comparison with the experimentally determined structure (PDB: 6v7y; single-domain antibody/CD1d α-GalCer complex). This model has medium CAPRI accuracy (I-RMSD = 1.02 Å), and has the highest model confidence of all 25 predictions of this complex (model confidence = 0.85). The complex structure is superposed by antigen with the model and the x-ray structure components colored separately as indicated on right. The α-GalCer glycolipid from the x-ray structure is colored orange. (c) Success rates based on type of antibody in the complex. Complexes were classified as “Ab” (heavy–light antibody, N = 295), or “Nano” (nanobody/VHH, N = 132) based on antibody type. T1 and T25 denote AlphaFold modeling accuracy in the top 1 prediction (ranked by AlphaFold model confidence score) and in all 25 predictions of the complex. Bars are colored by CAPRI criteria. (d) Distribution of interface energy scores calculated by the Rosetta InterfaceAnalyzer (Stranges & Kuhlman, 2013) protocol (based on the Rosetta REF15 energy function [Alford et al., 2017]), grouped by AlphaFold modeling accuracy. The modeling accuracy is defined as the highest CAPRI accuracy level among all 25 predictions of the complex. Statistical significance values (Wilcoxon rank-sum test) were calculated between interface energy scores for sets of cases with incorrect versus medium and incorrect versus high CAPRI accuracy predictions, as noted at top (*p ≤ 0.05; ***p ≤ 0.001).
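The group comparisons of interface energy scores rely on the Wilcoxon rank-sum test. As a rough illustration of what that test computes, the sketch below implements the two-sided rank-sum test with the normal approximation in plain Python. It assigns average ranks to ties and applies no tie or continuity correction, so its p-values are approximate; the energy values in the example are invented for illustration and are not data from this study. In practice an established routine such as `scipy.stats.ranksums` would normally be used instead.

```python
import math
from typing import Sequence, Tuple


def rank_sum_test(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p). Ties receive average ranks; no tie or continuity
    correction is applied, so p is approximate.
    """
    # Pool both samples, remembering the group (0 = x, 1 = y) of each value.
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[k] = avg
        i = j + 1

    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)  # rank sum of x
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return z, p


# Hypothetical interface energies (Rosetta energy units) for two groups.
high_accuracy_cases = [-40.0, -38.0, -35.0, -33.0, -30.0]
incorrect_cases = [-25.0, -22.0, -20.0, -18.0, -15.0]
z, p = rank_sum_test(high_accuracy_cases, incorrect_cases)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A negative z here indicates that the first group tends to have lower (more favorable) energies than the second, which is the direction of the trend reported in Figure 2d.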
2.3 | Model confidence score comparison

The reported success of model accuracy scores produced by AlphaFold (Evans et al., 2021; Yin et al., 2022) led us to evaluate the ability of those scores, or adaptations thereof, to discriminate between accurate and incorrect antibody–antigen predictions. We assessed AlphaFold's model confidence score, which is a linear combination of pTM and ipTM (Evans et al., 2021) scores, as well as interface pLDDT (I-pLDDT), which is based on residue-level confidence scores for antibody–antigen interface residues (4 Å distance cutoff), as used in previous studies (Bryant et al., 2022; Yin et al., 2022), for discrimination of correct antibody–antigen models (Figure 3). While both exhibited significant correlations with DockQ score (Johansson-Akhe & Wallner, 2022), which is a continuous measure of complex model accuracy, I-pLDDT was marginally superior (Figure 3a,b); this was also evident for comparison of the scores with CAPRI accuracy levels (Figure 3c,d). I-pLDDT also provided outstanding discrimination between incorrect versus medium or higher accuracy models based on receiver operating characteristic (ROC) area under the curve (AUC) metrics (AUC = 0.92), which is higher than that of the model confidence (AUC = 0.88; Table 1). We also tested the individual components of the model confidence score (pTM and ipTM) (Figure S9), which did not yield improved correlations with DockQ scores versus model confidence. When excluding data points without side-chain contacts within 4 Å across the antibody–antigen interface (for which I-pLDDT was set to an arbitrary minimum value in Figure 3b and in the corresponding correlation calculation), the correlation between the interface pLDDT and DockQ increased to r = 0.57 (Figure S10a), widening the gap relative to the correlation between the model confidence and DockQ (r = 0.53; Figure S10b).
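To make the two scores compared above concrete, the following is a minimal sketch of how they could be computed from a predicted complex. Several details are assumptions rather than statements from the text: contacts are defined using any atom pair within the cutoff, interface residues from both the antibody and the antigen are averaged together, and the 0.8/0.2 weighting of ipTM and pTM is the weighting described for AlphaFold-Multimer (Evans et al., 2021), which this article summarizes only as a linear combination. The data structures and function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Atom = Tuple[float, float, float]
Residue = Tuple[float, List[Atom]]  # (pLDDT, atom coordinates in Å)


def interface_plddt(antibody: Sequence[Residue], antigen: Sequence[Residue],
                    cutoff: float = 4.0, no_contact_value: float = 30.0) -> float:
    """Mean pLDDT of residues with any atom within `cutoff` Å of the partner.

    Returns `no_contact_value` when the two partners share no contacts,
    matching the fallback value of 30 described for Figure 3 and Table 1.
    """
    def in_contact(res_a: Residue, res_b: Residue) -> bool:
        return any(math.dist(a, b) <= cutoff for a in res_a[1] for b in res_b[1])

    ab_iface, ag_iface = set(), set()
    for i, ab_res in enumerate(antibody):
        for j, ag_res in enumerate(antigen):
            if in_contact(ab_res, ag_res):
                ab_iface.add(i)
                ag_iface.add(j)

    scores = [antibody[i][0] for i in ab_iface] + [antigen[j][0] for j in ag_iface]
    return sum(scores) / len(scores) if scores else no_contact_value


def model_confidence(iptm: float, ptm: float) -> float:
    """AlphaFold-Multimer ranking score: 0.8 * ipTM + 0.2 * pTM (assumed weights)."""
    return 0.8 * iptm + 0.2 * ptm


# Hypothetical two-residue "chains" with one atom per residue.
antibody = [(90.0, [(0.0, 0.0, 0.0)]), (50.0, [(20.0, 0.0, 0.0)])]
antigen = [(80.0, [(3.0, 0.0, 0.0)]), (40.0, [(50.0, 0.0, 0.0)])]
print(interface_plddt(antibody, antigen))
print(model_confidence(iptm=0.9, ptm=0.8))
```

The all-pairs distance loop is quadratic in the number of atoms, which is acceptable for a sketch but would typically be replaced by a spatial neighbor search when scoring many models.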
One advantage of I-pLDDT over ipTM and model confidence (which primarily consists of ipTM) is that it is specifically focused on the antibody–antigen interface, whereas ipTM is calculated across all inter-chain interfaces of complex models, including heavy–light and multiple antigen chains; thus the latter scores may be influenced by less relevant elements of the complex. Overall, these results support the use of I-pLDDT as a primary metric in assessing the quality of AlphaFold antibody–antigen models.

2.4 | Progressive improvements over recycling iterations

Recycling is a critical component of the AlphaFold algorithm (Evans et al., 2021; Jumper et al., 2021a), wherein each model is input back to the system for further optimization. To improve our understanding of the impact of recycling iterations on AlphaFold modeling of antibody–antigen complexes, we modified the AlphaFold pipeline in ColabFold. ColabFold was preferable to the default AlphaFold pipeline in this context due to its speed, which enabled output and analysis of the antibody–antigen complex predictions at each recycling iteration. Our analysis demonstrates an increase in model accuracy as recycling iterations progress (Figure 4a). In fact, approximately 50% of predictions of medium or higher accuracy after the third recycling iteration were incorrect models before recycling iterations (Figure S11).

Next, we analyzed specific changes in antibody–antigen models across recycling iterations, identifying notably enhanced features and those that are unchanged. Features that were improved highlight the strengths of AlphaFold, whereas a lack of improvement may highlight areas of difficulty or suggest that these features were already optimal at the start and did not require further refinement. We analyzed both the accuracy of antibody positioning on the antigen and the quality of the highly variable CDR loop of the antibody.
Given the high variability in CDRH3 RMSD (Figure S6), compared to the RMSD of other CDR loops, we focused our analysis of CDR loops on the CDRH3. Considering all predictions, we observed a marginal yet significant improvement in both the antibody–antigen binding conformation as measured by ligand RMSD (L-RMSD) (Figure 4b, left panel) and the CDRH3 loop accuracy (Figure 4c, left panel). Upon examining the subset of cases with medium or higher accuracy at recycle 3, we observed that the antibody–antigen binding conformation score L-RMSD exhibited a pronounced and significant improvement (Figure 4b, right panel), while the improvement in CDRH3 loop RMSD was significant but not as pronounced (Figure 4c, right panel), indicating that for models to attain high accuracy at the end of the recycling iterations, it is helpful for AlphaFold to predict the CDRH3 loop relatively accurately before recycling iterations begin.

The capability of AlphaFold to perform rigid-body protein movements over recycling iterations is shown in Figure 4d (nanobody/Ricin complex). This prediction was of incorrect accuracy before recycling iterations and was improved to a medium accuracy prediction at recycle 3. Over recycling iterations, the L-RMSD of this prediction exhibited a substantial degree of improvement, from 49.95 Å before recycling to 4.25 Å at recycle 3. Unlike L-RMSD, the CDRH3 loop of this prediction was accurately predicted (CDRH3 RMSD = 1.39 Å) before the recycling iterations.
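The statement that roughly half of the final near-native predictions started out as incorrect models corresponds to a simple bookkeeping step over per-recycle accuracy labels. The sketch below shows one way to compute that fraction; the dictionary layout, prediction identifiers, and labels are hypothetical, and the actual per-recycle outputs of the modified ColabFold pipeline are not described in this section.

```python
from typing import Dict, List

NEAR_NATIVE = {"medium", "high"}


def rescued_fraction(labels_by_recycle: Dict[str, List[str]]) -> float:
    """Fraction of predictions that are near-native at the final recycle
    but were 'incorrect' before recycling (index 0).

    Each value lists the CAPRI label of one prediction at recycle 0, 1, 2, ...
    Returns 0.0 if no prediction is near-native at the final recycle.
    """
    final_good = [labels for labels in labels_by_recycle.values()
                  if labels[-1] in NEAR_NATIVE]
    if not final_good:
        return 0.0
    rescued = sum(labels[0] == "incorrect" for labels in final_good)
    return rescued / len(final_good)


# Hypothetical trajectories over recycle 0..3 for four predictions.
trajectories = {
    "pred_1": ["incorrect", "acceptable", "medium", "medium"],
    "pred_2": ["medium", "medium", "high", "high"],
    "pred_3": ["incorrect", "incorrect", "incorrect", "incorrect"],
    "pred_4": ["incorrect", "incorrect", "acceptable", "high"],
}
print(rescued_fraction(trajectories))
```

Tracking labels per recycle in this way also makes it straightforward to separate predictions that improve through large rigid-body shifts, as in the Figure 4d example, from those that were already close before recycling.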
The importance of CDRH3 loop accuracy for complex modeling success was further explored by the analysis of CDRH3 loop conformations of modeled unbound structures. Unbound antibody structures were generated with AlphaFold with a template date cutoff of April 30, 2018, and the CDR loops of the unbound antibody models were compared to those of the antibodies in the antibody–antigen complexes. The RMSD between CDR loops of the unbound models and the antibody in the bound complex is compared against the complex modeling success of top-ranked antibody–antigen models generated by AlphaFold in Figure S12. Although the relatively small numbers of high accuracy cases limit this comparison, the accuracy of the CDRH3 modeling in unbound antibody structures for high accuracy antibody–antigen models was found to be significantly higher than that of the incorrect accuracy models (p ≤ 0.05), suggesting that antibodies with unbound models that more closely resemble the bound loop conformation are likely to be more accurately modeled in the form of antibody–antigen complexes.

FIGURE 3 AlphaFold model confidence scores and model accuracy. Scatter plots compare (a) model confidence and (b) interface pLDDT score with model accuracy, with accuracy assessed by DockQ score. In the scatter plots, all 25 models for each of the 427 complexes are depicted as data points, with their colors indicating the model quality according to CAPRI criteria. The orange line represents the linear regression, and the lower right corner of the scatter plots displays the Pearson's correlation coefficients and correlation p-values. Distribution of (c) model confidence and (d) interface pLDDT score, grouped by the CAPRI criteria of AlphaFold predictions. Interface pLDDT score is defined as the mean of pLDDT scores of residues within 4 Å of the antibody–antigen interface. Complexes without contacts within 4 Å of the antibody–antigen interface are assigned an I-pLDDT score of 30. Statistical significance values (Wilcoxon rank-sum test) were calculated between model scores for sets of predictions with incorrect versus medium and incorrect versus high CAPRI accuracy, as noted at top (***p ≤ 0.001).

2.5 | Input of subunit chains in bound conformation enables higher success

To better understand the factors that can enhance the success rate of AlphaFold antibody–antigen modeling, we utilized native antibody–antigen chains as templates within the AlphaFold modeling pipeline, to gauge whether AlphaFold can better assemble the complex structures given the bound subunit chains. Modifications were made to the AlphaFold pipeline to optionally input specific selected PDB templates for each chain. To test performance, we randomly selected 100 cases from the full antibody–antigen benchmark that do not have observed glycans at the antibody–antigen interface and do not belong to the partial antigen assembly category, due to the observed change in performance for those sets of cases (Figure 2; Figure S6). On this subset of 100 cases, the use of default templates identified by the AlphaFold pipeline resulted in 18% success in generating near-native (medium or high accuracy) top-ranked predictions (Figure 5a), which is similar to the performance on the full benchmark (Figure 1a).
A substantial improvement in accuracy was observed when experimentally determined antibody–antigen chains were used as individual chain templates, in which case the success in generating near-native top-ranked predictions was 52% (Figure 5b). Analysis of the determinants of top-ranked prediction success shows that the distributions of interface energy score (Figure S13a) and change in solvent-accessible surface area (ΔSASA) for the hydrophobic part of the antibody–antigen interface (Figure S13b) are significantly different (p ≤ 0.01) between complexes that have incorrect versus high accuracy top-ranked predictions, indicating that despite using bound template structures, AlphaFold has difficulty predicting the complex structure for antibody–antigen interactions with less favorable computed interface energies and with smaller hydrophobic interface area. Using only subsets of experimentally determined antibody–antigen chains as templates, as well as use of experimentally resolved antigens bound to other antibody structures, resulted in a decrease in model accuracy, compared to using all experimentally determined chains as templates (Figures S14 and S15; Supplementary Results). Interestingly, rigid-body docking in ZDOCK with bound component inputs achieved comparable, although moderately higher, medium/high accuracy success compared to AlphaFold with bound component templates (Figure S14a), indicating that both rigid-body docking and deep learning can perform antibody–antigen complex assembly from bound components, albeit not in all cases (approximately 50%–60% medium/high accuracy success for top-ranked models).
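The success rates quoted in this section follow the definition used in the figure captions: the percentage of cases with at least one model among the top N ranked predictions that meets a specified CAPRI accuracy level. A minimal sketch of that calculation is given below. The function name, the input layout (a dictionary mapping each case to its CAPRI classes in rank order), and the toy data are illustrative assumptions, not taken from the study's analysis scripts.

```python
from typing import Dict, List

# CAPRI accuracy classes ordered from worst to best.
CAPRI_ORDER = {"incorrect": 0, "acceptable": 1, "medium": 2, "high": 3}


def top_n_success_rate(
    ranked_classes: Dict[str, List[str]], n: int, min_class: str = "medium"
) -> float:
    """Percentage of cases with at least one model among the top n ranked
    predictions whose CAPRI class is at or above min_class."""
    if not ranked_classes:
        raise ValueError("no cases provided")
    threshold = CAPRI_ORDER[min_class]
    hits = sum(
        any(CAPRI_ORDER[c] >= threshold for c in classes[:n])
        for classes in ranked_classes.values()
    )
    return 100.0 * hits / len(ranked_classes)


# Hypothetical toy data: CAPRI classes listed in model-rank order per case.
example = {
    "case_a": ["medium", "incorrect", "high"],
    "case_b": ["incorrect", "acceptable", "incorrect"],
    "case_c": ["incorrect", "incorrect", "high"],
    "case_d": ["incorrect", "incorrect", "incorrect"],
}

print(top_n_success_rate(example, n=1))  # 25.0
print(top_n_success_rate(example, n=3))  # 50.0
```

With n = 1 this corresponds to the top-ranked success rate, and with min_class set to "medium" it corresponds to the near-native (medium or high accuracy) criterion.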
2.6 | MSA provides important information for accurate prediction of complexes

We also evaluated the performance of AlphaFold without MSAs, to test the impact on complex assembly when subunit structures are known (thus MSA would not in principle be needed for subunit structure modeling), given the likely lack of direct co-evolutionary information present in antibody–antigen MSAs. The removal of MSAs was implemented through modifications to the AlphaFold pipeline, as noted in Section 4. Our results indicated a notable decrease in accuracy when MSA was disabled, as compared to the with-MSA counterparts (Figure 5; Figure S14b,c). This prompted us to investigate the possible association between the depth of MSA and the modeling outcome by AlphaFold.

TABLE 1 Area under the ROC curve (AUC) value for protein model quality classes as a function of different scoring metrics.

Score^a             Binary classification ROC AUC^b                               Multi-class classification^b
                    Incorrect versus high    Incorrect versus medium and high
Interface pLDDT     1.00                     0.92                                 0.88
Model confidence    0.99                     0.88                                 0.85
ipTM                0.99                     0.87                                 0.85
pTM                 0.99                     0.88                                 0.84

^a Scoring methods. Model confidence, ipTM, and pTM are confidence scores from AlphaFold. Interface pLDDT is the average AlphaFold pLDDT score of antibody–antigen interface residues within 4 Å distance cutoff. Models without antibody–antigen interface contacts were assigned an interface pLDDT value of 30.
^b The ROC AUC values of binary classification and multi-class classification were calculated using the R pROC (Robin et al., 2011) and multiROC (Wei & Wang, 2018) packages, with classes defined by model CAPRI accuracy, which assigned antibody–antigen models into incorrect (n = 9062), acceptable (n = 773), medium (n = 684), and high (n = 156) accuracy categories.
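To make the binary ROC AUC values in Table 1 concrete, the sketch below computes an AUC from two sets of scores using its rank-based interpretation: the probability that a randomly chosen model from the positive class (for example, high accuracy) scores above a randomly chosen model from the negative class (incorrect), with ties counted as one half. This is an illustrative Python stand-in, not the R pROC workflow used in the study, and the scores shown are invented for the example.

```python
from typing import Sequence


def roc_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Binary ROC AUC as P(pos > neg) + 0.5 * P(pos == neg),
    evaluated over all positive/negative score pairs."""
    if not pos_scores or not neg_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical interface pLDDT values for a handful of models.
high_accuracy = [85.0, 90.0, 78.0]
incorrect = [30.0, 45.0, 80.0, 52.0]

auc = roc_auc(high_accuracy, incorrect)
print(f"AUC = {auc:.3f}")
```

An AUC of 1.0 means every positive-class model outscores every negative-class model, while 0.5 corresponds to no discrimination. The pairwise loop is quadratic in the number of models, which is acceptable at the scale of Table 1 but would be replaced by a rank-based formula for larger sets.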
We investigated the impact of MSA depth on modeling success by the full AlphaFold protocol, grouping the complexes by prediction accuracy and comparing distributions of MSA depth (Neff) (Figure 6). The distributions of Neff differed significantly between incorrect and medium accuracy classes (p ≤ 0.01), and between incorrect and high accuracy classes (p ≤ 0.01). We also compared the docking model quality (DockQ score) for all cases when binned by MSA depth levels (Figure S16). A slight trend was observed indicating that a greater MSA depth is associated with higher DockQ scores (higher model accuracies), suggesting that compared to a shallow MSA, predictions with a deeper MSA are more likely to be of higher accuracy. Thus, it is possible that increasing MSA depth, particularly for antibody–antigen complexes with very shallow MSAs, could lead to some improvement in overall modeling performance.

FIGURE 4 Analysis of antibody–antigen predictions across recycling iterations. (a) Accuracy of antibody–antigen complex predictions across recycling iterations (up to three recycles, denoted by the x-axis). Success rate is defined as the proportion of predictions at a specific level of CAPRI criteria among a total of 25 predictions per complex, 426 complexes total, at the given recycle. Recycle = 0 denotes the state of the prediction before recycling iterations begin. (b) Distribution of the ligand RMSD (L-RMSD, Å) of antibody–antigen predictions at each recycling iteration (denoted by the x-axis), for all predictions (25 predictions × 426 complexes, left panel) or a subset of predictions of medium or high CAPRI accuracy at recycle = 3 (106 predictions, right panel). (c) Distribution of the CDRH3 accuracy of antibody–antigen predictions at each recycling iteration (denoted by the x-axis), for all predictions or a subset of predictions of medium or high CAPRI accuracy at recycle = 3. CDRH3 accuracy is defined as the change in RMSD of the CDRH3 region, when superposing the predicted antibody (in the antibody–antigen complex prediction) onto the experimentally resolved antibody (in the antibody–antigen complex) using the antibody framework region. Statistical significance values (Wilcoxon rank-sum test) were calculated between RMSD values for sets of predictions at the outset of recycling iterations (recycle = 0) versus at recycle = 3, as noted at top (***p ≤ 0.001). (d) Example of a prediction across recycling iterations (PDB 7kd2; nanobody/Ricin complex). This prediction's CAPRI accuracy level across recycles was incorrect at recycle = 0 (I-RMSD = 17.98 Å), incorrect at recycle = 1 (I-RMSD = 10.90 Å), acceptable at recycle = 2 (I-RMSD = 2.52 Å), and medium at recycle = 3 (I-RMSD = 1.45 Å). The CDRH3 RMSDs of the predictions across recycling iterations 0, 1, 2, and 3 were 1.39 Å, 1.19 Å, 1.27 Å, and 1.17 Å, respectively. The L-RMSDs of the predictions across recycling iterations 0, 1, 2, and 3 were 49.95 Å, 24.68 Å, 5.42 Å, and 4.25 Å, respectively. Antibody and antigen chains of the predictions and x-ray structure are colored as indicated. Predictions were generated with ColabFold due to its faster model generation speed compared to AlphaFold.

2.7 | Modeling accuracy of AlphaFold v.2.3.0

Recently, an updated version of AlphaFold (v.2.3.0, hereafter denoted as v.2.3) was released, with modifications to the pipeline and deep learning model (DeepMind, 2022). Compared with the previous version, this version was trained on PDB structures released until September 30, 2021, resulting in a 30% increase in training data. This version also increased the maximum number of recycles, from 3 recycles in v.2.2 to 20 recycles in v.2.3, with early stopping, and utilized larger interface regions (crops) and more chains during training. To benchmark its performance, we assembled a test set of 41 nonredundant antibody–antigen complexes released after the September 30, 2021 training date (Table S1). AlphaFold v.2.3 generated medium or higher accuracy models as top-ranked predictions for 36% of the test cases, notably higher than the 23% generated by v.2.2 (Figure 7), with no significant difference in antibody CDR loop accuracy. Additional benchmarking revealed that reducing the number of recycling iterations in v.2.3 to match the number of recycling iterations in v.2.2 resulted in unchanged modeling success for v.2.3 (Figure S17; Supplementary Results). This suggests that the difference in the number of recycles between v.2.3 and v.2.2 is not the main factor contributing to the increased success, and that the updated and expanded training of the deep learning model may be responsible.

A recent study demonstrated that by introducing stochastic perturbations through activating dropout during AlphaFold inference and employing extensive sampling, the modeling success of AlphaFold can be improved (Wallner, 2023a).
Using this technique, named AFsample, the Wallner group ranked among the top predictors in CASP15 for protein assembly modeling, which included five nanobody-antigen and three heavy–light antibody–antigen targets (Lensink, Brysbaert, Raouraoua, Bates, et al., 2023; Wallner, 2023b). In light of this, we applied the AFsample protocol to model our benchmarking set of antibody–antigen complexes to assess its performance on a broader dataset. On a total of 37 cases for which all models were successfully generated, AFsample generated medium or higher accuracy top-ranked predictions for 51% of the test cases, which is notably higher than 35% for AlphaFold v.2.3, and 24% for AlphaFold v.2.2 (Figure 8). When the top 25 predictions were considered, AFsample's medium or higher success rate increased to 59%. In summary, our findings indicate that massive sampling through a combination of dropout, pooling structures from different AlphaFold models and parameters, and generation of large numbers of models, provides a clear advantage over the standard protocol in the context of antibody–antigen complex modeling, although with substantially higher computational cost.

FIGURE 5 Improved subunit modeling enhances antibody–antigen complex modeling success. Antibody–antigen modeling success of AlphaFold by utilizing (a) templates identified through the default template search protocol, (b) bound antibody and antigen chains as templates, (c) bound antibody and default antigen chains (identified by the default search protocol) as templates. Benchmarking was performed on a total of 100 antibody–antigen complexes. The success rate was calculated based on the percentage of cases that had at least one model among their top N predictions that met a specified level of CAPRI accuracy. Bars are colored by CAPRI accuracy criteria.

FIGURE 6 Comparison of MSA depth and modeling success. The distribution of MSA depth (number of effective sequences, Neff), calculated using CD-Hit (Fu et al., 2012) with an identity cutoff of 80%, is shown for antibody–antigen complexes grouped by AlphaFold modeling accuracy. The modeling accuracy is defined as the highest CAPRI criteria prediction in the complex, considering all 25 predictions. Numbers of data points in incorrect, acceptable, medium and high categories are 272, 63, 65 and 26. Statistical significance values (Wilcoxon rank-sum test) were calculated between Neff values for sets of cases with incorrect versus medium and incorrect versus high CAPRI accuracy predictions, as noted at top (**p ≤ 0.01).

3 | DISCUSSION

Using a set of over 400 nonredundant antibody–antigen complexes, we benchmarked and evaluated AlphaFold's ability to model antibody–antigen complexes. On this set, we observed a limited yet higher success in the prediction of antibody–antigen structures by AlphaFold, compared to our previous benchmarking that used an older AlphaFold version accessed via ColabFold, and was based on a limited set of 100 antibody–antigen cases (Yin et al., 2022). Analyses of factors that could influence the prediction outcome showed that AlphaFold was less able to accurately predict antibody–antigen structures with glycans at the antibody–antigen interface, which highlights AlphaFold's limitation in handling complexes with post-translational modifications.
FIGURE 7 Antibody–antigen modeling success by AlphaFold v.2.3. Modeling success of (a) AlphaFold v.2.2 and (b) AlphaFold v.2.3 on 41 antibody–antigen complexes. The success rate was calculated based on the percentage of cases that had at least one model among their top N predictions that met a specified level of CAPRI accuracy. Bars are colored by CAPRI accuracy criteria. (c) Distribution of the CDR loop prediction accuracy of AlphaFold v.2.2 (denoted by salmon color) versus v.2.3 (denoted by cyan color). CDR loop accuracy is defined as the change in RMSD of the CDR regions, when superposing the predicted antibody (in the antibody–antigen complex prediction) onto the experimentally resolved antibody (in the antibody–antigen complex) using the antibody framework region.

FIGURE 8 Antibody–antigen modeling success by AlphaFold v.2.2, v.2.3 and AFsample. Modeling success of (a) AlphaFold v.2.2, (b) AlphaFold v.2.3, and (c) AFsample on 37 antibody–antigen complexes. The success rate was calculated based on the percentage of cases that had at least one model among their top N predictions that met a specified level of CAPRI accuracy. Bars are colored by CAPRI accuracy criteria.

We also found that AlphaFold is more successful at modeling nanobody–antigen complexes and has difficulty predicting the structure of larger antibody–antigen complexes. An analysis of prediction accuracy at each recycling iteration, as well as the bound antibody–antigen template tests, shows the importance of accurate subunit modeling for success in predicting the antibody–antigen complex.
Relatedly, the ability to accurately predict CDRH3 loops is important for overall docking success.

Our benchmarking also shows that the latest version of AlphaFold (v.2.3) exhibits improved success in predicting antibody–antigen structures versus the previous AlphaFold version (v.2.2), likely due at least in part to the model training on an updated and expanded set of complex structures from the PDB (DeepMind, 2022). It is possible that success can be improved further through additional optimization or other adaptations of the AlphaFold framework or model. Additionally, we observed that a recently described AlphaFold-based massive sampling approach, named AFsample (Wallner, 2023a), achieved even higher success than standard AlphaFold v.2.3. It is possible that additional sampling, or pooling with different sets of models and parameters, could improve this success further. Another potential avenue for elevating the accuracy of AlphaFold predictions is demonstrated by the recent development of fully trainable AlphaFold implementations (Gustaf et al., 2022; Motmaen et al., 2023; Ziyao et al., 2022), which enable researchers to adapt and refine the model to specific datasets or domains of interest, opening up new possibilities for customization and optimization of the AlphaFold network.

Despite the lack of explicit coevolutionary signal, our data show that the inclusion of diverse sequence information in MSAs is helpful for maintaining AlphaFold's modeling success of antibody–antigen complexes. As such, curation or optimization of MSAs could be another avenue for improving the accuracy of AlphaFold predictions.
Previous work showed that AlphaFold prediction of protein–protein complexes can be augmented with improved MSA cross-chain pairing (Bryant et al., 2022), while others have developed alternative MSA methods such as DeepMSA2 (Zheng et al., 2021), which was part of a successful pipeline in a recent CASP/CAPRI complex structure prediction round (Lensink, Brysbaert, Raouraoua, et al., 2023). Recent work leveraging protein language models shows promise in constructing diversified and informative MSAs for enhancing accuracy in AlphaFold protein complex prediction (Bo et al., 2022), while it may be possible to replace or augment the MSA in AlphaFold with language model representations, potentially building on recent language models developed for antibodies (Olsen et al., 2022; Ruffolo et al., 2021) or proteins in general (Hie et al., 2023).

Our results also demonstrate that accurate subunit prediction is associated with higher antibody–antigen complex prediction success. Recent work has shown improved accuracy in antibody prediction, particularly in the context of CDR loops, leveraging elements of AlphaFold architecture, especially the structure module, with modifications (Abanades et al., 2023; Ruffolo et al., 2023). Incorporating such advances into the prediction pipeline may enable the prediction of more accurate antibody–antigen complexes.

While it is possible or even likely that antibody–antigen modeling success may ultimately be improved in AlphaFold or related deep learning frameworks, the current success of AlphaFold, particularly when using its updated model (v.2.3) or a recently described massive sampling protocol, in conjunction with the observed confidence scoring accuracy, indicates that AlphaFold may be of practical use to researchers in modeling this important and challenging class of complexes, and can complement or assist experimental structural determination methods.
4 | METHODS

4.1 | Antibody–antigen benchmark assembly

We assembled two nonredundant sets of high resolution structures to benchmark AlphaFold, following the general protocol that we described previously (Yin et al., 2022). To obtain an initial list of antibody–antigen complexes from the PDB, we downloaded the full SAbDab (Dunbar et al., 2014) antibody structure dataset in January 2022. The antibody–antigen complex dataset for AlphaFold v.2.2 benchmarking was assembled using the following criteria: (1) structure resolution ≤3.0 Å, (2) protein antigen in the structure (based on SAbDab annotation), and (3) nonredundant with antibody–antigen complexes with structural resolution ≤9.0 Å released before April 30, 2018 (AlphaFold v.2.2 training sample cutoff date) based on sequence criteria. Sequence criteria for nonredundancy are: (1) heavy chain variable domain sequence ID <90% and full variable domain sequence ID <90%, or (2) no match between antigen chain sequences (no hit detected using BLAST (Camacho et al., 2009) with default parameters). Pairwise sequence alignments were performed using the “blastp” executable in the BLAST suite (Camacho et al., 2009).

Structural nonredundancy criteria were then applied to the set. We removed antibody–antigen structures with <5 Å heavy chain Cα atom RMSD, after superposition of antigens using the FAST structure alignment program (Zhu & Weng, 2005), and >70% identity between heavy chain variable domain, light chain variable domain, or concatenated CDR loop sequences. To avoid modeling antigen chains with large regions that are not resolved in the experimentally determined structures, we additionally removed structures with PDB “seqres” file sequence annotation and resolved region sequence length difference >70%, or sequence length difference >35% and resolved antigen length >500 aa. We also removed non-canonical antibody–antigen complex cases (e.g., with antibody-tetramerization, dimeric sdAb, or constant domain binding), and we removed cases with incomplete antigen chain annotations by SAbDab, identified through manual inspection of the PDB bioassembly structure.

To benchmark AlphaFold v.2.3, we identified a subset of 41 antibody–antigen complexes within the v.2.2 benchmarking set. These antibody–antigen complexes were released after September 30, 2021, and are not redundant with structures released before that date based on the sequence criteria detailed above. The AlphaFold v.2.2 and v.2.3 benchmarking cases are shown in Table S1.

4.2 | AlphaFold antibody–antigen modeling

Sequences input to AlphaFold were obtained from the PDB “seqres” file. Antibody sequences were processed by ANARCI to remove non-variable domain sequence regions. We downloaded and installed AlphaFold v.2.2 from GitHub (https://github.com/deepmind/alphafold) in May 2022 and v.2.3 in February 2023. Both versions of AlphaFold were installed on a local computing cluster. During the structure prediction or feature preparation step in the AlphaFold pipeline, 15 cases out of a total of 442 test cases failed to complete because of GPU and memory limitations.

For generating unbound antibody and antigen structures, we employed AlphaFold in Multimer setting when the input consisted of a heavy-light chain antibody or a multimeric antigen. Alternatively, the Monomer setting was utilized when the input was a single chain.
A template date cutoff of April 30, 2018 was applied to avoid template overlap with the benchmarking set.

To generate AlphaFold predictions without the use of MSAs (corresponding to single-sequence modeling), we modified the “all_seq_msa_features” variable of chain features to include only the query sequence. To use custom templates, we adapted the template featurization function from Motmaen et al. (2023) (https://github.com/phbradley/alphafold_finetune/blob/main/predict_utils.py).

AlphaFold modeling in ColabFold (Mirdita et al., 2022) was performed with ColabFold version 1.4.0 (commit 26de12d3afb5f85d49d0c7db1b9371f034388395), installed on a local computing cluster using scripts from GitHub (https://github.com/YoshitakaMo/localcolabfold). During ColabFold AlphaFold modeling, the MSA was built by querying the MMseqs2 MSA server using unpaired and paired MSA. To generate a total of 25 predictions per complex, modifications were made to the “load_models_and_params” function, utilizing a different random seed for each prediction, producing five predictions per AlphaFold model parameter.

Unless otherwise specified, a template date cutoff of April 30, 2018 was applied for benchmarking AlphaFold v.2.2 and ColabFold, and a template date cutoff of September 30, 2021 was applied for benchmarking of AlphaFold v.2.3, to avoid using bound structures as templates.

AlphaFold and ColabFold modeling runs were performed using NVIDIA Titan RTX and Quadro 6000 GPUs.

4.3 | Complex model accuracy assessment

We assessed antibody–antigen complex model accuracy using DockQ (Johansson-Akhe & Wallner, 2022), which was downloaded from GitHub (https://github.com/bjornwallner/DockQ). Antibody–antigen complex model accuracy was computed by DockQ using the experimentally determined antibody–antigen complex structures obtained from the PDB.
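For orientation, the relationship between the quantities reported by DockQ can be sketched as follows. The score formula below is the published DockQ definition (the mean of the fraction of native contacts and two scaled RMSD terms), and the class thresholds are the standard CAPRI protein–protein criteria. The function names are illustrative, the code is not the DockQ program itself, and the fnat value in the example is a hypothetical placeholder because only RMSD values are reported for that prediction.

```python
def scaled_rmsd(rmsd: float, d: float) -> float:
    """Map an RMSD (in Å) onto (0, 1]; 1.0 corresponds to a perfect match."""
    return 1.0 / (1.0 + (rmsd / d) ** 2)


def dockq_score(fnat: float, i_rmsd: float, l_rmsd: float) -> float:
    """DockQ score: mean of fnat, scaled interface RMSD (d = 1.5 Å),
    and scaled ligand RMSD (d = 8.5 Å)."""
    return (fnat + scaled_rmsd(i_rmsd, 1.5) + scaled_rmsd(l_rmsd, 8.5)) / 3.0


def capri_class(fnat: float, i_rmsd: float, l_rmsd: float) -> str:
    """Assign a CAPRI accuracy class, checking the strictest criteria first."""
    if fnat >= 0.5 and (l_rmsd <= 1.0 or i_rmsd <= 1.0):
        return "high"
    if fnat >= 0.3 and (l_rmsd <= 5.0 or i_rmsd <= 2.0):
        return "medium"
    if fnat >= 0.1 and (l_rmsd <= 10.0 or i_rmsd <= 4.0):
        return "acceptable"
    return "incorrect"


# RMSD values from the Figure 4d example at recycle = 3; fnat = 0.6 is a
# hypothetical value chosen only for illustration.
print(capri_class(fnat=0.6, i_rmsd=1.45, l_rmsd=4.25))  # medium
print(round(dockq_score(fnat=0.6, i_rmsd=1.45, l_rmsd=4.25), 3))
```

Checking the criteria from strictest to most permissive means that a model failing the RMSD requirement of one class falls through to the next class down, which matches the way the CAPRI classes nest.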
DockQ calculates interface backbone RMSD (I-RMSD), ligand backbone RMSD (L-RMSD), fraction of native contacts (fnat), DockQ score, as well as the Critical Assessment of PRediction of Interactions (CAPRI) accuracy level, which assigns the model into one of four discrete accuracy classes: incorrect, acceptable, medium, and high, based on the model's similarity to the native structure (Lensink et al., 2020).

4.4 | Interface pLDDT calculation

To determine the interface pLDDT (I-pLDDT), we computed the average pLDDT value for all residues at the antibody–antigen interface. Interface residues were defined as any residue with a non-hydrogen atom within 4.0 Å of the binding partner. An I-pLDDT score of 30 was assigned to predictions with no antibody–antigen interface residues.

4.5 | CDR loop accuracy analysis

The CDRs and the framework regions of antibodies were identified by AHo numbering (Honegger & Pluckthun, 2001), assigned using ANARCI software (Dunbar & Deane, 2016). The CDR loops were defined as residues 24–42 (CDR1), 57–76 (CDR2), and 107–138 (CDR3), as in previous work (Lee et al., 2022).
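The interface pLDDT defined in Section 4.4 can be sketched in a few lines. The example below is a simplified illustration rather than the analysis script used in the study: the Residue container, the function name, and the toy coordinates are assumptions, and a residue is treated as being at the interface when any of its non-hydrogen atoms lies at or within the cutoff distance of any non-hydrogen atom of the binding partner.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Tuple

Coord = Tuple[float, float, float]


@dataclass
class Residue:
    plddt: float              # per-residue pLDDT reported by AlphaFold
    heavy_atoms: List[Coord]  # non-hydrogen atom coordinates in Å


def interface_plddt(
    antibody: List[Residue],
    antigen: List[Residue],
    cutoff: float = 4.0,
    no_contact_value: float = 30.0,
) -> float:
    """Mean pLDDT over interface residues of both partners; returns
    no_contact_value when the partners have no contacts within cutoff."""

    def in_contact(res: Residue, partner: List[Residue]) -> bool:
        return any(
            dist(a, b) <= cutoff
            for other in partner
            for a in res.heavy_atoms
            for b in other.heavy_atoms
        )

    interface = [r for r in antibody if in_contact(r, antigen)]
    interface += [r for r in antigen if in_contact(r, antibody)]
    if not interface:
        return no_contact_value
    return sum(r.plddt for r in interface) / len(interface)


# Toy example with one atom per residue (hypothetical coordinates).
ab = [Residue(90.0, [(0.0, 0.0, 0.0)]), Residue(50.0, [(20.0, 0.0, 0.0)])]
ag = [Residue(70.0, [(3.0, 0.0, 0.0)]), Residue(40.0, [(40.0, 0.0, 0.0)])]

print(interface_plddt(ab, ag))          # 80.0
print(interface_plddt(ab[1:], ag[1:]))  # 30.0
```

The all-pairs distance check is adequate for a sketch; for full-size complexes a spatial index such as a k-d tree would avoid the quadratic cost.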
ProFit v 3.1 (Martin & Porter, 2009) was used to calculate backbone RMSDs between modeled and experimentally determined CDR loop structures, after superposing the modeled antibody structures onto the experimentally resolved structures by the framework residues.

4.6 | Figures and statistical analysis

PyMOL (Schrodinger, Inc.) was used to generate structural figures. The ggplot2 (Wickham, 2016) package in R (r-project.org) was utilized to generate box plots, line plots, and bar plots. Pearson correlations and their corresponding p values were calculated using the ggpubr package in R, while the Wilcoxon rank-sum test was performed using the ggsignif package in R. Binary and multi-class ROC AUC values were calculated using the pROC (Robin et al., 2011) and multiROC (Wei & Wang, 2018) packages in R, respectively.

AUTHOR CONTRIBUTIONS

Brian G. Pierce: Conceptualization; methodology; funding acquisition; writing – review and editing; supervision. Rui Yin: Conceptualization; methodology; writing – original draft; writing – review and editing; formal analysis; data curation.

ACKNOWLEDGMENTS

We are grateful to the information technology group at the Institute for Bioscience and Biotechnology Research (IBBR), including Gale Lane, as well as the University of Maryland Division of Information Technology, for assistance with AlphaFold installation on local high performance computing (HPC) clusters. Computational resources from the IBBR HPC cluster and University of Maryland Zaratan HPC cluster were used in this study. Members of the Pierce lab, including Helder Veras Ribeiro-Filho, provided insightful discussions and suggestions. We also express our gratitude to John Moult, Brandon Feng, Mike Song, Sergey Ovchinnikov (Harvard University), and John Jumper (DeepMind) for helpful discussions. This work was supported by the National Institutes of Health grant R35 GM144083 to B. G. P.
DATA AVAILABILITY STATEMENT Modified AlphaFold code and analysis scripts are avail- able on Github: https://github.com/piercelab/alphafold_ v2.2_customize. AlphaFold2.2, AlphaFold2.3, and Colab- Fold antibody–antigen models generated in this study are available for download at: https://piercelab.ibbr.umd. edu/af_abag_benchmarking.html. ORCID Rui Yin https://orcid.org/0000-0001-5330-8306 Brian G. Pierce https://orcid.org/0000-0003-4821-0368 REFERENCES Abanades B, Wong WK, Boyles F, Georges G, Bujotzek A, Deane CM. ImmuneBuilder: deep-learning models for predict- ing the structures of immune proteins. Commun Biol. 2023; 6(1):575. Alford RF, Leaver-Fay A, Jeliazkov JR, O'Meara MJ, DiMaio FP, Park H, et al. The Rosetta all-atom energy function for macro- molecular modeling and design. J Chem Theory Comput. 2017; 13(6):3031–48. Ambrosetti F, Jimenez-Garcia B, Roel-Touris J, Bonvin A. Modeling antibody-antigen complexes by information-driven docking. Structure. 2020;28(1):119–129.e2. Barnes CO, Jette CA, Abernathy ME, Dam KMA, Esswein SR, Gristick HB, et al. SARS-CoV-2 neutralizing antibody structures inform therapeutic strategies. Nature. 2020;588(7839):682–7. Bo C, Ziwei X, Jiezhong Q, Zhaofeng Y, Jinbo X, Jie T. Improve the protein complex prediction with protein language models. bioRxiv. 2022. https://doi.org/10.1101/2022.09.15.508065 Brenke R, Hall DR, Chuang GY, Comeau SR, Bohnuud T, Beglov D, et al. Application of asymmetric statistical potentials to antibody-protein docking. Bioinformatics. 2012;28(20): 2608–14. Bryant P, Pozzati G, Elofsson A. Improved prediction of protein- protein interactions using AlphaFold2. Nat Commun. 2022; 13(1):1265. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421. Carter PJ. Potent antibody therapeutics by design. Nat Rev Immu- nol. 2006;6(5):343–57. Chothia C, Lesk AM. Canonical structures for the hypervariable regions of immunoglobulins. 
J Mol Biol. 1987;196(4):901–17. DeepMind. AlphaFold v2.3.0 technical note. 2022. https://github. com/deepmind/alphafold/blob/main/docs/technical_note_v2.3. 0.md Dreyfus C, Laursen NS, Kwaks T, Zuijdgeest D, Khayat R, Ekiert DC, et al. Highly conserved protective epitopes on influ- enza B viruses. Science. 2012;337(6100):1343–8. Dunbar J, Deane CM. ANARCI: antigen receptor numbering and receptor classification. Bioinformatics. 2016;32(2):298–300. Dunbar J, Krawczyk K, Leem J, Baker T, Fuchs A, Georges G, et al. SAbDab: the structural antibody database. Nucleic Acids Res. 2014;42(Database issue):D1140–6. Evans R, O'Neill M, Pritzel A, Antropova N, Senior A, Green T, et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv. 2021. https://doi.org/10.1101/2021.10.04.463034 Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012; 28(23):3150–2. Georgiou G, Ippolito GC, Beausang J, Busse CE, Wardemann H, Quake SR. The promise and challenge of high-throughput 14 of 16 YIN and PIERCE 1469896x, 2024, 1, D ow nloaded from https://onlinelibrary.w iley.com /doi/10.1002/pro.4865, W iley O nline L ibrary on [03/12/2025]. 
See the T erm s and C onditions (https://onlinelibrary.w iley.com /term s-and-conditions) on W iley O nline L ibrary for rules of use; O A articles are governed by the applicable C reative C om m ons L icense http://r-project.org https://github.com/piercelab/alphafold_v2.2_customize https://github.com/piercelab/alphafold_v2.2_customize https://piercelab.ibbr.umd.edu/af_abag_benchmarking.html https://piercelab.ibbr.umd.edu/af_abag_benchmarking.html https://orcid.org/0000-0001-5330-8306 https://orcid.org/0000-0001-5330-8306 https://orcid.org/0000-0003-4821-0368 https://orcid.org/0000-0003-4821-0368 https://doi.org/10.1101/2022.09.15.508065 https://github.com/deepmind/alphafold/blob/main/docs/technical_note_v2.3.0.md https://github.com/deepmind/alphafold/blob/main/docs/technical_note_v2.3.0.md https://github.com/deepmind/alphafold/blob/main/docs/technical_note_v2.3.0.md https://doi.org/10.1101/2021.10.04.463034 sequencing of the antibody repertoire. Nat Biotechnol. 2014; 32(2):158–68. Graham BS, Gilman MSA, McLellan JS. Structure-based vaccine antigen design. Annu Rev Med. 2019;70:91–104. Guest JD, Vreven T, Zhou J, Moal I, Jeliazkov JR, Gray JJ, et al. An expanded benchmark for antibody-antigen docking and affinity prediction reveals insights into antibody recognition determi- nants. Structure. 2021;29(6):606–621.e5. Gustaf A, Nazim B, Sachin K, Xia Q, Gerecke W, O'Donnell TJ, et al. OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. bioRxiv. 2022 https://doi.org/ 10.1101/2022.11.20.517210 Haidar JN, Yuan QA, Zeng L, Snavely M, Luna X, Zhang H, et al. A universal combinatorial design of antibody framework to graft distinct CDR sequences: a bioinformatics approach. Proteins. 2012;80(3):896–912. Hanf KJ, Arndt JW, Chen LL, Jarpe M, Boriack-Sjodin PA, Li Y, et al. Antibody humanization by redesign of complementarity- determining region residues proximate to the acceptor frame- work. Methods. 
2014;65(1):68–76.
Hie BL, Shanker VR, Xu D, Bruun TUJ, Weidenbacher PA, Tang S, et al. Efficient evolution of human antibodies from general protein language models. Nat Biotechnol. 2023. https://doi.org/10.1038/s41587-023-01763-2
Honegger A, Pluckthun A. Yet another numbering scheme for immunoglobulin variable domains: an automatic modeling and analysis tool. J Mol Biol. 2001;309(3):657–70.
Johansson-Akhe I, Wallner B. Improving peptide-protein docking with AlphaFold-multimer using forced sampling. Front Bioinform. 2022;2:959160.
Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021a;596(7873):583–9.
Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Applying and improving AlphaFold at CASP14. Proteins. 2021b;89:1711–21.
Kappler K, Hennet T. Emergence and significance of carbohydrate-specific antibodies. Genes Immun. 2020;21(4):224–39.
Krawczyk K, Baker T, Shi J, Deane CM. Antibody i-patch prediction of the antibody binding site improves rigid local antibody-antigen docking. Protein Eng Des Sel. 2013;26(10):621–9.
Lee JH, Yin R, Ofek G, Pierce BG. Structural features of antibody-peptide recognition. Front Immunol. 2022;13:910367.
Leman JK, Weitzner BD, Lewis SM, Adolf-Bryfogle J, Alam N, Alford RF, et al. Macromolecular modeling and design in Rosetta: recent methods and frameworks. Nat Methods. 2020;17(7):665–80.
Lensink M, Brysbaert G, Raouraoua N, Bates PA, Giulini M, Honorato RV, et al. Impact of AlphaFold on structure prediction of protein complexes: the CASP15-CAPRI experiment. Proteins. 2023;91(12):1658–83.
Lensink MF, Nadzirin N, Velankar S, Wodak SJ. Modeling protein-protein, protein-peptide, and protein-oligosaccharide complexes: CAPRI 7th edition. Proteins. 2020;88(8):916–38.
Li Y, Li H, Yang F, Smith-Gill SJ, Mariuzza RA. X-ray snapshots of the maturation of an antibody response to a protein antigen. Nat Struct Biol.
2003;10(6):482–8.
Li Z, Woo CJ, Iglesias-Ussel MD, Ronai D, Scharff MD. The generation of antibody diversity through somatic hypermutation and class switch recombination. Genes Dev. 2004;18(1):1–11.
Martin AC, Porter CT. ProFit Version 3.1. 2009. Available from: http://www.bioinf.org.uk/software/profit/
Mirdita M, Schutze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: making protein folding accessible to all. Nat Methods. 2022;19(6):679–82.
Motmaen A, Dauparas J, Baek M, Abedi MH, Baker D, Bradley P. Peptide-binding specificity prediction using fine-tuned protein structure prediction networks. Proc Natl Acad Sci U S A. 2023;120(9):e2216697120.
Nelson AL, Dhimolea E, Reichert JM. Development trends for human monoclonal antibody therapeutics. Nat Rev Drug Discov. 2010;9(10):767–74.
Olsen TH, Moal IH, Deane CM. AbLang: an antibody language model for completing antibody sequences. Bioinform Adv. 2022;2(1):vbac046.
Pierce BG, Hourai Y, Weng Z. Accelerating protein docking in ZDOCK using an advanced 3D convolution library. PLoS One. 2011;6(9):e24657.
Rappuoli R, Bottomley MJ, D'Oro U, Finco O, De Gregorio E. Reverse vaccinology 2.0: human immunology instructs vaccine antigen design. J Exp Med. 2016;213(4):469–81.
Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011;12:77.
Ruffolo JA, Chu LS, Mahajan SP, Gray JJ. Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies. Nat Commun. 2023;14(1):2389.
Ruffolo JA, Gray JJ, Sulam J. Deciphering antibody affinity maturation with language models and weakly supervised learning. arXiv preprint. arXiv:2112.07782. 2021.
Scott AM, Wolchok JD, Old LJ. Antibody therapy of cancer. Nat Rev Cancer. 2012;12(4):278–87.
Sela-Culang I, Kunik V, Ofran Y. The structural basis of antibody-antigen recognition. Front Immunol. 2013;4:302.
Sircar A, Gray JJ.
SnugDock: paratope structural optimization during antibody-antigen docking compensates for errors in antibody homology models. PLoS Comput Biol. 2010;6(1):e1000644.
Stranges PB, Kuhlman B. A comparison of successful and failed protein interface designs highlights the challenges of designing buried hydrogen bonds. Protein Sci. 2013;22(1):74–82.
Vreven T, Moal IH, Vangone A, Pierce BG, Kastritis PL, Torchala M, et al. Updates to the integrated protein-protein interaction benchmarks: docking benchmark version 5 and affinity benchmark version 2. J Mol Biol. 2015;427(19):3031–41.
Wallner B. AFsample: improving multimer prediction with AlphaFold using massive sampling. Bioinformatics. 2023a;39(9):btad573.
Wallner B. Improved multimer prediction using massive sampling with AlphaFold in CASP15. Proteins. 2023b;91(12):1734–46.
Wei R, Wang J. multiROC: Calculating and Visualizing ROC and PR Curves Across Multi-Class Classifications. R package version 1.1.1; 2018.
Wickham H. ggplot2: elegant graphics for data analysis. New York: Springer-Verlag; 2016.
Yin R, Feng BY, Varshney A, Pierce BG. Benchmarking AlphaFold for protein complex modeling reveals accuracy determinants. Protein Sci. 2022;31(8):e4379.
Zheng W, Li Y, Zhang C, Zhou X, Pearce R, Bell EW, et al. Protein structure prediction using deep learning distance and hydrogen-bonding restraints in CASP14. Proteins. 2021;89(12):1734–51.
Zhou T, Lynch RM, Chen L, Acharya P, Wu X, Doria-Rose NA, et al. Structural repertoire of HIV-1-neutralizing antibodies targeting the CD4 supersite in 14 donors. Cell. 2015;161(6):1280–92.
Zhu J, Weng Z. FAST: a novel protein structure alignment algorithm. Proteins. 2005;58(3):618–27.
Ziyao L, Xuyang L, Weijie C, Shen F, Bi H, Ke G, et al. Uni-fold: an open-source platform for developing protein folding models beyond AlphaFold. bioRxiv. 2022. https://doi.org/10.1101/2022.08.04.502811

SUPPORTING INFORMATION
Additional supporting information can be found online in the Supporting Information section at the end of this article.

How to cite this article: Yin R, Pierce BG. Evaluation of AlphaFold antibody–antigen modeling with implications for improving predictive accuracy. Protein Science. 2024;33(1):e4865. https://doi.org/10.1002/pro.4865