Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions

Cited by: 52
Authors
Kim, Yohan [1]
Sidney, John [1]
Buus, Soren [2]
Sette, Alessandro [1]
Nielsen, Morten [3,4]
Peters, Bjoern [1]
Affiliations
[1] La Jolla Inst Allergy & Immunol, La Jolla, CA 92037 USA
[2] Univ Copenhagen, Dept Int Hlth Immunol & Microbiol, DK-2200 Copenhagen, Denmark
[3] Tech Univ Denmark, Ctr Biol Sequence Anal, Dept Syst Biol, DK-2800 Lyngby, Denmark
[4] Univ Nacl San Martin, Inst Invest Biotecnol, RA-1650 Buenos Aires B, DF, Argentina
Source
BMC Bioinformatics | 2014 / Volume 15
Funding
National Institutes of Health (USA)
Keywords
Benchmarking of MHC class I predictors; Epitope prediction; Sequence similarity; Cross-validation; T-CELL EPITOPES; DATABASE; IMMUNOGENICITY; IMMUNOLOGY; NETMHCPAN; MOLECULES; SEQUENCE; RESOURCE; AFFINITY;
DOI
10.1186/1471-2105-15-241
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: It is important to accurately determine the performance of peptide:MHC binding predictions, as this enables users to compare and choose between different prediction methods and provides estimates of the expected error rate. Two common approaches to determine prediction performance are cross-validation, in which all available data are iteratively split into training and testing data, and the use of blind sets generated separately from the data used to construct the predictive method. In the present study, we have compared cross-validated prediction performances generated on our last benchmark dataset from 2009 with prediction performances generated on data subsequently added to the Immune Epitope Database (IEDB), which served as a blind set.

Results: We found that cross-validated performances systematically overestimated performance on the blind set. This was found not to be due to the presence of similar peptides in the cross-validation dataset. Rather, we found that small size and low sequence/affinity diversity of either training or blind datasets were associated with large differences in cross-validated vs. blind prediction performances. We use these findings to derive quantitative rules of how large and diverse datasets need to be to provide generalizable performance estimates.

Conclusion: It has long been known that cross-validated prediction performance estimates often overestimate performance on independently generated blind set data. We here identify and quantify the specific factors contributing to this effect for MHC-I binding predictions. An increasing number of peptides for which MHC binding affinities are measured experimentally have been selected based on binding predictions and thus are less diverse than historic datasets sampling the entire sequence and affinity space, making them more difficult benchmark data sets. This has to be taken into account when comparing performance metrics between different benchmarks, and when deriving error estimates for predictions based on benchmark performance.
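The two evaluation schemes contrasted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' benchmarking code: the function names (`kfold_splits`, `auc`, `cross_validated_auc`, `blind_auc`), the use of AUC as the metric, the pooling of held-out scores across folds, and the `fit` callable (which takes training data and returns a scoring function) are all assumptions made for illustration.

```python
import random


def kfold_splits(n_items, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every item is tested exactly once."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


def auc(scores, labels):
    """Chance that a random binder outscores a random non-binder (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one binder and one non-binder")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cross_validated_auc(data, fit, k=5):
    """data: list of (peptide, is_binder). Scores each item with a model
    trained on the other folds, then pools the held-out scores into one AUC."""
    scores = [0.0] * len(data)
    for train, test in kfold_splits(len(data), k):
        model = fit([data[i] for i in train])
        for i in test:
            scores[i] = model(data[i][0])
    return auc(scores, [y for _, y in data])


def blind_auc(train_data, blind_data, fit):
    """Train once on all benchmark data, then score a separately generated set."""
    model = fit(train_data)
    return auc([model(p) for p, _ in blind_data], [y for _, y in blind_data])
```

Comparing `cross_validated_auc(benchmark, fit)` with `blind_auc(benchmark, later_data, fit)` for the same `fit` gives the kind of cross-validated vs. blind gap the abstract discusses; the dataset size and diversity effects reported by the authors are not modeled here.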
Pages: 9