Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions

Cited by: 52
Authors
Kim, Yohan [1]
Sidney, John [1]
Buus, Soren [2]
Sette, Alessandro [1]
Nielsen, Morten [3,4]
Peters, Bjoern [1]
Affiliations
[1] La Jolla Inst Allergy & Immunol, La Jolla, CA 92037 USA
[2] Univ Copenhagen, Dept Int Hlth Immunol & Microbiol, DK-2200 Copenhagen, Denmark
[3] Tech Univ Denmark, Ctr Biol Sequence Anal, Dept Syst Biol, DK-2800 Lyngby, Denmark
[4] Univ Nacl San Martin, Inst Invest Biotecnol, RA-1650 Buenos Aires B, DF, Argentina
Source
BMC Bioinformatics | 2014 / Vol. 15
Funding
US National Institutes of Health
Keywords
Benchmarking of MHC class I predictors; Epitope prediction; Sequence similarity; Cross-validation; T-CELL EPITOPES; DATABASE; IMMUNOGENICITY; IMMUNOLOGY; NETMHCPAN; MOLECULES; SEQUENCE; RESOURCE; AFFINITY;
DOI
10.1186/1471-2105-15-241
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: It is important to accurately determine the performance of peptide:MHC binding predictions, as this enables users to compare and choose between different prediction methods and provides estimates of the expected error rate. Two common approaches to determine prediction performance are cross-validation, in which all available data are iteratively split into training and testing data, and the use of blind sets generated separately from the data used to construct the predictive method. In the present study, we compared cross-validated prediction performances generated on our last benchmark dataset from 2009 with prediction performances generated on data subsequently added to the Immune Epitope Database (IEDB), which served as a blind set.
Results: We found that cross-validated performances systematically overestimated performance on the blind set. This was found not to be due to the presence of similar peptides in the cross-validation dataset. Rather, we found that small size and low sequence/affinity diversity of either training or blind datasets were associated with large differences in cross-validated vs. blind prediction performances. We use these findings to derive quantitative rules of how large and diverse datasets need to be to provide generalizable performance estimates.
Conclusion: It has long been known that cross-validated prediction performance estimates often overestimate performance on independently generated blind set data. We here identify and quantify the specific factors contributing to this effect for MHC-I binding predictions. An increasing number of peptides for which MHC binding affinities are measured experimentally have been selected based on binding predictions and thus are less diverse than historic datasets sampling the entire sequence and affinity space, making them more difficult benchmark datasets. This has to be taken into account when comparing performance metrics between different benchmarks, and when deriving error estimates for predictions based on benchmark performance.
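The two evaluation protocols contrasted in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the (peptide, affinity) tuple layout, the toy mean-affinity model, and the mean absolute error metric are all assumptions made for the example, since the abstract does not specify them.

```python
import random
from typing import Callable, Iterator, List, Sequence, Tuple

# Hypothetical data layout: (peptide sequence, measured binding affinity).
Example = Tuple[str, float]
Model = Callable[[str], float]


def k_fold_splits(
    data: Sequence[Example], k: int, seed: int = 0
) -> Iterator[Tuple[List[Example], List[Example]]]:
    """Yield (train, test) pairs so that each example is held out exactly once."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        train = [data[i] for i in idx if i not in held]
        test = [data[i] for i in held_out]
        yield train, test


def cross_validated_score(data, fit, metric, k=5, seed=0) -> float:
    """Pool held-out predictions over all folds, then score them once."""
    preds, truths = [], []
    for train, test in k_fold_splits(data, k, seed):
        model = fit(train)
        preds.extend(model(pep) for pep, _ in test)
        truths.extend(y for _, y in test)
    return metric(preds, truths)


def blind_score(train, blind, fit, metric) -> float:
    """Train on all benchmark data, score on separately generated blind data."""
    model = fit(train)
    return metric([model(pep) for pep, _ in blind], [y for _, y in blind])


def fit_mean(train: Sequence[Example]) -> Model:
    """Toy placeholder model (assumption): always predicts the training mean."""
    mean = sum(y for _, y in train) / len(train)
    return lambda pep: mean


def mean_abs_error(preds: Sequence[float], truths: Sequence[float]) -> float:
    """Toy placeholder metric (assumption): mean absolute error, lower is better."""
    return sum(abs(p - t) for p, t in zip(preds, truths)) / len(truths)
```

Comparing `cross_validated_score(benchmark, fit, metric)` with `blind_score(benchmark, blind, fit, metric)` for the same `fit` and `metric` gives the kind of cross-validated vs. blind gap the abstract discusses; any real predictor and performance measure would replace the toy placeholders.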
Pages: 9