Protein language models can capture protein quaternary state

Cited by: 5
Authors
Avraham O. [1]
Tsaban T. [1]
Ben-Aharon Z. [1]
Tsaban L. [2,3]
Schueler-Furman O. [1]
Affiliations
[1] Department of Microbiology and Molecular Genetics, Faculty of Medicine, Institute for Biomedical Research Israel-Canada, The Hebrew University of Jerusalem, Jerusalem
[2] Gaffin Center for Neuro-Oncology, Sharett Institute for Oncology, Hadassah Medical Center and Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem
[3] The Wohl Institute for Translational Medicine, Hadassah Medical Center and Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem
Funding
Israel Science Foundation
Keywords
Deep learning; Multilayer perceptron; Natural language processing; Oligomeric state prediction; Protein language models; Protein quaternary state;
DOI
10.1186/s12859-023-05549-w
Abstract
Background: Determining a protein's quaternary state, i.e. the number of monomers in a functional unit, is a critical step in protein characterization. Many proteins form multimers for their activity, and over 50% are estimated to naturally form homomultimers. Experimental quaternary state determination can be challenging and require extensive work. To complement these efforts, a number of computational tools have been developed for quaternary state prediction, often utilizing experimentally validated structural information. Recently, dramatic advances have been made in the field of deep learning for predicting protein structure and other characteristics. Protein language models such as ESM-2, which apply natural-language modeling techniques to protein sequences, successfully capture secondary structure, protein cell localization, and other characteristics from a single sequence. Here we hypothesize that information about the protein quaternary state may be contained within protein sequences as well, allowing us to benefit from these novel approaches in the context of quaternary state prediction. Results: We generated ESM-2 embeddings for a large dataset of proteins with quaternary state labels from the curated QSbio dataset. We trained a model for quaternary state classification and assessed it on a non-overlapping set of distinct folds (ECOD family level). Our model, named QUEEN (QUaternary state prediction using dEEp learNing), performs worse than approaches that include information from solved crystal structures. However, it successfully learns to distinguish multimers from monomers, and predicts the specific quaternary state with moderate success, better than simple sequence-similarity-based annotation transfer. Our results demonstrate that complex, quaternary-state-related information is encoded in such embeddings. Conclusions: QUEEN is the first method to investigate the power of protein language model embeddings for the prediction of the quaternary state of proteins.
As such, it lays out the strengths, as well as the limitations, of a sequence-based protein language model approach compared to structure-based approaches. Since it does not require any structural information and is fast, we anticipate that it will be of wide use both for in-depth investigation of specific systems and for studies of large sets of protein sequences. A simple colab implementation is available at: https://colab.research.google.com/github/Furman-Lab/QUEEN/blob/main/QUEEN_prediction_notebook.ipynb. © 2023, The Author(s).
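The pipeline described in the abstract — per-protein language model embeddings fed to a learned classifier over quaternary state labels — can be sketched as follows. This is a minimal illustration, not the authors' implementation: real embeddings would come from ESM-2 (e.g. mean-pooled per-residue representations, 1280-dimensional for the 650M-parameter model, via the `facebookresearch/esm` package), and the labels would come from QSbio. Here random vectors stand in for embeddings, and a small scikit-learn multilayer perceptron stands in for QUEEN's classifier, purely to show the shape of the approach.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Stand-in for ESM-2 embeddings: in the real pipeline each protein's
# per-residue representations are mean-pooled into one fixed-size vector
# (1280-dim for esm2_t33_650M_UR50D). Random vectors are used here only
# to illustrate the data shapes.
rng = np.random.default_rng(0)
n_proteins, emb_dim = 200, 1280
X = rng.normal(size=(n_proteins, emb_dim)).astype(np.float32)

# Toy quaternary state labels (monomer=1, dimer=2, tetramer=4) —
# a hypothetical subset of the states annotated in QSbio.
y = rng.choice([1, 2, 4], size=n_proteins)

# A small MLP classifier over the embeddings, analogous in spirit to
# QUEEN's multilayer perceptron (architecture here is illustrative).
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=100, random_state=0)
clf.fit(X, y)

# Predict quaternary states for a few "new" proteins.
preds = clf.predict(X[:5])
print(preds)
```

In the actual study, train/test splits are made at the ECOD family level so that no fold appears in both sets, which is what makes the reported generalization meaningful; a random split like the one implied above would leak homologous sequences across the split.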