Cross-protein transfer learning substantially improves disease variant prediction

Cited by: 18
Authors
Jagota, Milind [1]
Ye, Chengzhong [2]
Albors, Carlos [1]
Rastogi, Ruchir [1]
Koehl, Antoine [2]
Ioannidis, Nilah [1,3,4]
Song, Yun S. [1,2,4]
Affiliations
[1] Univ Calif Berkeley, Comp Sci Div, Berkeley, CA 94720 USA
[2] Univ Calif Berkeley, Dept Stat, Berkeley, CA 94720 USA
[3] Chan Zuckerberg Biohub, San Francisco, CA 94158 USA
[4] Univ Calif Berkeley, Ctr Computat Biol, Berkeley, CA 94720 USA
Keywords
DESCRIPTORS; SEQUENCE; PEPTIDES; DESIGN; IMPACT; SCALE; SET; MAP
DOI
10.1186/s13059-023-03024-6
Chinese Library Classification (CLC) number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology]
Discipline classification codes
071005; 0836; 090102; 100705
Abstract
Background: Genetic variation in the human genome is a major determinant of individual disease risk, but the vast majority of missense variants have unknown etiological effects. Here, we present a robust learning framework for leveraging saturation mutagenesis experiments to construct accurate computational predictors of proteome-wide missense variant pathogenicity. Results: We train cross-protein transfer (CPT) models using deep mutational scanning (DMS) data from only five proteins and achieve state-of-the-art performance on clinical variant interpretation for unseen proteins across the human proteome. We also improve predictive accuracy on DMS data from held-out proteins. High sensitivity is crucial for clinical applications and our model CPT-1 particularly excels in this regime. For instance, at 95% sensitivity of detecting human disease variants annotated in ClinVar, CPT-1 improves specificity to 68%, from 27% for ESM-1v and 55% for EVE. Furthermore, for genes not used to train REVEL, a supervised method widely used by clinicians, we show that CPT-1 compares favorably with REVEL. Our framework combines predictive features derived from general protein sequence models, vertebrate sequence alignments, and AlphaFold structures, and it is adaptable to the future inclusion of other sources of information. We find that vertebrate alignments, albeit rather shallow with only 100 genomes, provide a strong signal for variant pathogenicity prediction that is complementary to recent deep learning-based models trained on massive amounts of protein sequence data. We release predictions for all possible missense variants in 90% of human genes. Conclusions: Our results demonstrate the utility of mutational scanning data for learning properties of variants that transfer to unseen proteins.
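The abstract reports specificity at a fixed 95% sensitivity for detecting ClinVar disease variants. The following is a minimal sketch of how such a metric can be computed from predictor scores; it is not the authors' evaluation code. The function name `specificity_at_sensitivity`, the convention that higher scores mean "more likely pathogenic", and the toy data are all assumptions made for illustration.

```python
import numpy as np


def specificity_at_sensitivity(scores, labels, target_sensitivity=0.95):
    """Return (threshold, sensitivity, specificity) at a target sensitivity.

    Assumptions (not from the paper): higher score = more likely pathogenic;
    labels use 1 for pathogenic and 0 for benign; a variant is called
    pathogenic when score >= threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = np.sort(scores[labels == 1])  # pathogenic scores, ascending
    neg = scores[labels == 0]           # benign scores
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one pathogenic and one benign variant")

    # Smallest number of pathogenic variants that must be detected.
    # The small epsilon guards against floating-point round-up in the product.
    n_required = max(1, int(np.ceil(target_sensitivity * pos.size - 1e-9)))

    # Strictest threshold that still detects n_required pathogenic variants.
    threshold = pos[pos.size - n_required]

    sensitivity = float(np.mean(pos >= threshold))
    specificity = float(np.mean(neg < threshold))
    return threshold, sensitivity, specificity


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    toy_scores = [0.1, 0.5, 0.6, 0.7, 0.8, 0.9, 0.91, 0.92, 0.93, 0.94,
                  0.05, 0.2, 0.3, 0.55, 0.65]
    toy_labels = [1] * 10 + [0] * 5
    print(specificity_at_sensitivity(toy_scores, toy_labels, 0.9))
```

Fixing sensitivity first and then reading off specificity mirrors the clinical setting described in the abstract, where missing a pathogenic variant is the more costly error. Tied scores at the threshold can push the achieved sensitivity above the target, which is why the function returns the realized value as well.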
Pages: 19
Journal
Genome Biology, 24