Cross-protein transfer learning substantially improves disease variant prediction

Cited by: 18
Authors
Jagota, Milind [1]
Ye, Chengzhong [2]
Albors, Carlos [1]
Rastogi, Ruchir [1]
Koehl, Antoine [2]
Ioannidis, Nilah [1,3,4]
Song, Yun S. [1,2,4]
Affiliations
[1] Univ Calif Berkeley, Comp Sci Div, Berkeley, CA 94720 USA
[2] Univ Calif Berkeley, Dept Stat, Berkeley, CA 94720 USA
[3] Chan Zuckerberg Biohub, San Francisco, CA 94158 USA
[4] Univ Calif Berkeley, Ctr Computat Biol, Berkeley, CA 94720 USA
Keywords
DESCRIPTORS; SEQUENCE; PEPTIDES; DESIGN; IMPACT; SCALE; SET; MAP
DOI
10.1186/s13059-023-03024-6
Chinese Library Classification (CLC) number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology]
Discipline classification codes
071005; 0836; 090102; 100705
Abstract
Background: Genetic variation in the human genome is a major determinant of individual disease risk, but the vast majority of missense variants have unknown etiological effects. Here, we present a robust learning framework for leveraging saturation mutagenesis experiments to construct accurate computational predictors of proteome-wide missense variant pathogenicity.
Results: We train cross-protein transfer (CPT) models using deep mutational scanning (DMS) data from only five proteins and achieve state-of-the-art performance on clinical variant interpretation for unseen proteins across the human proteome. We also improve predictive accuracy on DMS data from held-out proteins. High sensitivity is crucial for clinical applications and our model CPT-1 particularly excels in this regime. For instance, at 95% sensitivity of detecting human disease variants annotated in ClinVar, CPT-1 improves specificity to 68%, from 27% for ESM-1v and 55% for EVE. Furthermore, for genes not used to train REVEL, a supervised method widely used by clinicians, we show that CPT-1 compares favorably with REVEL. Our framework combines predictive features derived from general protein sequence models, vertebrate sequence alignments, and AlphaFold structures, and it is adaptable to the future inclusion of other sources of information. We find that vertebrate alignments, albeit rather shallow with only 100 genomes, provide a strong signal for variant pathogenicity prediction that is complementary to recent deep learning-based models trained on massive amounts of protein sequence data. We release predictions for all possible missense variants in 90% of human genes.
Conclusions: Our results demonstrate the utility of mutational scanning data for learning properties of variants that transfer to unseen proteins.
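The Results compare models by their specificity at a fixed 95% sensitivity. As an illustration of how such a figure can be computed from per-variant scores, here is a minimal Python sketch. It is not the authors' evaluation code: the function name `specificity_at_sensitivity`, the convention that higher scores mean "more likely pathogenic", and the synthetic scores in the demo are all assumptions made for this example.

```python
import numpy as np


def specificity_at_sensitivity(scores, labels, target_sensitivity=0.95):
    """Return (threshold, sensitivity, specificity) at a target sensitivity.

    Assumed conventions (hypothetical, for illustration only):
    - scores: higher value means more likely pathogenic
    - labels: 1 = pathogenic, 0 = benign
    A variant is called pathogenic when score >= threshold.
    """
    if not 0.0 < target_sensitivity <= 1.0:
        raise ValueError("target_sensitivity must be in (0, 1]")
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = np.sort(scores[labels == 1])  # pathogenic scores, ascending
    neg = scores[labels == 0]           # benign scores
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one pathogenic and one benign variant")

    # Smallest number of pathogenic variants that must be detected.
    n_required = int(np.ceil(target_sensitivity * pos.size))
    # Highest threshold that still detects n_required pathogenic variants.
    threshold = pos[pos.size - n_required]

    sensitivity = float(np.mean(pos >= threshold))
    specificity = float(np.mean(neg < threshold))
    return float(threshold), sensitivity, specificity


# Demo on synthetic scores (NOT data from the paper).
rng = np.random.default_rng(0)
demo_labels = np.concatenate([np.ones(500, dtype=int), np.zeros(500, dtype=int)])
demo_scores = np.concatenate([rng.normal(2.0, 1.0, 500), rng.normal(0.0, 1.0, 500)])
thr, sens, spec = specificity_at_sensitivity(demo_scores, demo_labels)
print(f"threshold={thr:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

Fixing sensitivity first and then reading off specificity mirrors the clinical setting described in the abstract, where missing a disease variant is costly and models are compared on how many benign variants they can still rule out.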
Pages: 19