Cross-protein transfer learning substantially improves disease variant prediction

Cited by: 18
Authors
Jagota, Milind [1]
Ye, Chengzhong [2]
Albors, Carlos [1]
Rastogi, Ruchir [1]
Koehl, Antoine [2]
Ioannidis, Nilah [1, 3, 4]
Song, Yun S. [1, 2, 4]
Affiliations
[1] Univ Calif Berkeley, Comp Sci Div, Berkeley, CA 94720 USA
[2] Univ Calif Berkeley, Dept Stat, Berkeley, CA 94720 USA
[3] Chan Zuckerberg Biohub, San Francisco, CA 94158 USA
[4] Univ Calif Berkeley, Ctr Computat Biol, Berkeley, CA 94720 USA
Keywords
DESCRIPTORS; SEQUENCE; PEPTIDES; DESIGN; IMPACT; SCALE; SET; MAP;
DOI
10.1186/s13059-023-03024-6
Chinese Library Classification number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology];
Discipline classification code
071005; 0836; 090102; 100705;
Abstract
Background: Genetic variation in the human genome is a major determinant of individual disease risk, but the vast majority of missense variants have unknown etiological effects. Here, we present a robust learning framework for leveraging saturation mutagenesis experiments to construct accurate computational predictors of proteome-wide missense variant pathogenicity.
Results: We train cross-protein transfer (CPT) models using deep mutational scanning (DMS) data from only five proteins and achieve state-of-the-art performance on clinical variant interpretation for unseen proteins across the human proteome. We also improve predictive accuracy on DMS data from held-out proteins. High sensitivity is crucial for clinical applications and our model CPT-1 particularly excels in this regime. For instance, at 95% sensitivity of detecting human disease variants annotated in ClinVar, CPT-1 improves specificity to 68%, from 27% for ESM-1v and 55% for EVE. Furthermore, for genes not used to train REVEL, a supervised method widely used by clinicians, we show that CPT-1 compares favorably with REVEL. Our framework combines predictive features derived from general protein sequence models, vertebrate sequence alignments, and AlphaFold structures, and it is adaptable to the future inclusion of other sources of information. We find that vertebrate alignments, albeit rather shallow with only 100 genomes, provide a strong signal for variant pathogenicity prediction that is complementary to recent deep learning-based models trained on massive amounts of protein sequence data. We release predictions for all possible missense variants in 90% of human genes.
Conclusions: Our results demonstrate the utility of mutational scanning data for learning properties of variants that transfer to unseen proteins.
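The abstract reports specificity at a fixed 95% sensitivity. As a rough illustration of how such a number can be computed, the sketch below picks the strictest score threshold that still calls the required fraction of pathogenic variants and then measures the fraction of benign variants falling below it. This is not the authors' evaluation code: the function name `specificity_at_sensitivity`, the convention that higher scores mean "more likely pathogenic", and the toy scores are all assumptions made for illustration.

```python
import math


def specificity_at_sensitivity(scores, labels, target_sensitivity=0.95):
    """Return (threshold, sensitivity, specificity) at a fixed sensitivity.

    Assumptions (hypothetical, not from the paper):
      - scores: higher means more likely pathogenic
      - labels: 1 = pathogenic, 0 = benign
      - a variant is called pathogenic when score >= threshold
    """
    if not 0.0 < target_sensitivity <= 1.0:
        raise ValueError("target_sensitivity must be in (0, 1]")

    pos = sorted(s for s, y in zip(scores, labels) if y == 1)
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one pathogenic and one benign variant")

    # Smallest number of pathogenic variants that must be detected.
    # The small epsilon guards against floating-point overshoot in the product.
    n_required = max(1, math.ceil(target_sensitivity * len(pos) - 1e-9))

    # Strictest (highest) threshold that still detects n_required positives.
    threshold = pos[len(pos) - n_required]

    sensitivity = sum(s >= threshold for s in pos) / len(pos)
    specificity = sum(s < threshold for s in neg) / len(neg)
    return threshold, sensitivity, specificity


if __name__ == "__main__":
    # Toy data: four pathogenic and four benign variants (made-up scores).
    toy_scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3, 0.6, 0.75]
    toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
    print(specificity_at_sensitivity(toy_scores, toy_labels, 0.75))
    # prints (0.7, 0.75, 0.75)
```

With tied scores the achieved sensitivity can exceed the target, which is why the function returns it alongside the specificity.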
Pages: 19