Structure-informed protein language models are robust predictors for variant effects

Cited by: 2
Authors
Sun, Yuanfei [1]
Shen, Yang [1, 2, 3]
Affiliations
[1] Texas A&M Univ, Dept Elect & Comp Engn, College Stn, TX 77843 USA
[2] Texas A&M Univ, Dept Comp Sci & Engn, College Stn, TX 77843 USA
[3] Texas A&M Univ, Inst Biosci & Technol, Dept Translat Med Sci, Houston, TX 77030 USA
Keywords
MUTATION;
DOI
10.1007/s00439-024-02695-w
Chinese Library Classification (CLC) number
Q3 [Genetics];
Discipline classification code
071007 ; 090102 ;
Abstract
Protein language models (pLMs), an emerging class of variant effect predictors, learn the evolutionary distribution of functional sequences to capture the fitness landscape. Considering that variant effects are manifested through biological contexts beyond sequence (such as structure), we first assess how much structural context sequence-only pLMs learn and how it affects variant effect prediction. We then establish a need to inject protein structural context into pLMs purposely and controllably. We thus introduce a framework of structure-informed pLMs (SI-pLMs) by extending masked sequence denoising to cross-modality denoising for both sequence and structure. Numerical results over deep mutagenesis scanning benchmarks show that our SI-pLMs, even when using smaller models and less data, are robustly top performers against competing methods including other pLMs, which shows that introducing biological context can be more effective at capturing the fitness landscape than simply using larger models or more data. Case studies reveal that, compared to sequence-only pLMs, SI-pLMs can be better at capturing the fitness landscape because (a) learned embeddings of low- and high-fitness sequences can be more separable and (b) learned amino-acid distributions of functionally and evolutionarily conserved residues can have much lower entropy, and thus be much more conserved, than those of other residues. Our SI-pLMs are applicable to revising any sequence-only pLM through model architecture and training objectives. They do not require structure data as model inputs for variant effect prediction and use structures only as a context provider and model regularizer during training.
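Point (b) of the abstract treats low entropy of a position's amino-acid distribution as a sign of conservation. A minimal sketch of that quantity follows, assuming the model's output at one residue position is available as a list of 20 probabilities. The function name `positional_entropy` and both example distributions are hypothetical illustrations, not taken from the paper or its code.

```python
import math


def positional_entropy(probs):
    """Shannon entropy, in bits, of one position's amino-acid distribution.

    `probs` is a hypothetical list of non-negative weights over the 20
    amino acids; it is renormalised here so small rounding errors do not matter.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    entropy = 0.0
    for p in probs:
        q = p / total
        if q > 0:  # the term 0 * log(0) is taken as 0
            entropy -= q * math.log2(q)
    return entropy


# Illustrative (made-up) distributions for two positions.
uniform = [1 / 20] * 20           # no preference: maximal entropy, log2(20) bits
peaked = [0.962] + [0.002] * 19   # one dominant residue: low entropy

print(positional_entropy(uniform))
print(positional_entropy(peaked))
```

Under this reading, a position whose predicted distribution resembles `peaked` would count as more conserved than one resembling `uniform`.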
Pages: 209-225
Page count: 17