Using protein language models for protein interaction hot spot prediction with limited data

Cited by: 3
Authors
Sargsyan, Karen [1]
Lim, Carmay [1]
Affiliations
[1] Acad Sinica, Inst Biomed Sci, Taipei 115, Taiwan
Keywords
Protein language models; ESM-2; Protein-protein interaction; PPI-hotspot; Small datasets; Feature selection; BINDING; CONSURF
DOI
10.1186/s12859-024-05737-2
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Background: Protein language models, inspired by the success of large language models in deciphering human language, have emerged as powerful tools for unraveling the intricate code of life inscribed within protein sequences. They have gained significant attention for their promising applications across various areas, including the sequence-based prediction of secondary and tertiary protein structure, the discovery of new functional protein sequences/folds, and the assessment of mutational impact on protein fitness. However, their utility in learning to predict protein residue properties based on scant datasets, such as protein-protein interaction (PPI)-hotspots whose mutations significantly impair PPIs, remained unclear. Here, we explore the feasibility of using protein language-learned representations as features for machine learning to predict PPI-hotspots using a dataset containing 414 experimentally confirmed PPI-hotspots and 504 PPI-nonhot spots.
Results: Our findings showcase the capacity of unsupervised learning with protein language models in capturing critical functional attributes of protein residues derived from the evolutionary information encoded within amino acid sequences. We show that methods relying on protein language models can compete with methods employing sequence and structure-based features to predict PPI-hotspots from the free protein structure. We observed an optimal number of features for model precision, suggesting a balance between information and overfitting.
Conclusions: This study underscores the potential of transformer-based protein language models to extract critical knowledge from sparse datasets, exemplified here by the challenging realm of predicting PPI-hotspots. These models offer a cost-effective and time-efficient alternative to traditional experimental methods for predicting certain residue properties. However, the challenge of explaining why specific features are important for determining certain residue properties remains.
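The abstract reports an optimal number of features for model precision. The sketch below illustrates that kind of sweep in plain Python. It is not the authors' pipeline: the synthetic Gaussian data (a stand-in for per-residue embeddings), the mean-difference ranking score, the nearest-centroid classifier, and all function names are assumptions made for illustration only.

```python
import random
import statistics


def make_data(n_per_class=60, n_features=40, n_informative=5, shift=1.0, seed=0):
    """Synthetic stand-in for embeddings (assumption): only the first
    n_informative features differ between the two classes."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(shift * label if j < n_informative else 0.0, 1.0)
                   for j in range(n_features)]
            X.append(row)
            y.append(label)
    return X, y


def rank_features(X, y):
    """Rank feature indices by |difference of class means| / pooled std."""
    scores = []
    for j in range(len(X[0])):
        pos = [x[j] for x, t in zip(X, y) if t == 1]
        neg = [x[j] for x, t in zip(X, y) if t == 0]
        spread = statistics.pstdev(pos + neg) or 1.0
        scores.append(abs(statistics.fmean(pos) - statistics.fmean(neg)) / spread)
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def centroid_predict(X_train, y_train, X_test, cols):
    """Nearest-centroid classifier restricted to the selected columns."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = [statistics.fmean(r[j] for r in rows) for j in cols]
    preds = []
    for x in X_test:
        dist = {lab: sum((x[j] - c) ** 2 for j, c in zip(cols, cen))
                for lab, cen in centroids.items()}
        preds.append(min(dist, key=dist.get))
    return preds


def cv_precision(X, y, k_features, n_folds=5, seed=0):
    """Cross-validated precision for the positive class. Features are
    re-ranked inside each fold so held-out rows never guide selection."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    tp = fp = 0
    for f in range(n_folds):
        test = idx[f::n_folds]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        cols = rank_features(X_tr, y_tr)[:k_features]
        preds = centroid_predict(X_tr, y_tr, [X[i] for i in test], cols)
        for p, i in zip(preds, test):
            if p == 1:
                tp += y[i] == 1
                fp += y[i] == 0
    return tp / (tp + fp) if tp + fp else 0.0


if __name__ == "__main__":
    X, y = make_data()
    for k in (1, 2, 5, 10, 20, 40):
        print(k, round(cv_precision(X, y, k), 3))
```

Ranking features inside each fold, rather than once on the full dataset, avoids selection leakage, which matters most when the dataset is small, as with the 918 labeled residues described above.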
Pages: 10
Related papers
50 in total
  • [21] Improved the heterodimer protein complex prediction with protein language models
    Chen, Bo
    Xie, Ziwei
    Qiu, Jiezhong
    Ye, Zhaofeng
    Xu, Jinbo
    Tang, Jie
    BRIEFINGS IN BIOINFORMATICS, 2023, 24 (04)
  • [22] Single-sequence protein structure prediction using supervised transformer protein language models
    Wang, Wenkai
    Peng, Zhenling
    Yang, Jianyi
    NATURE COMPUTATIONAL SCIENCE, 2022, 2 (12) : 804 - 814
  • [23] Single-sequence protein structure prediction using supervised transformer protein language models
    Wenkai Wang
    Zhenling Peng
    Jianyi Yang
    Nature Computational Science, 2022, 2 : 804 - 814
  • [24] SpatialPPIv2: Enhancing protein-protein interaction prediction through graph neural networks with protein language models
    Hu, Wenxing
    Ohue, Masahito
    COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL, 2025, 27 : 508 - 518
  • [25] ParaAntiProt provides paratope prediction using antibody and protein language models
    Kalemati, Mahmood
    Noroozi, Alireza
    Shahbakhsh, Aref
    Koohi, Somayyeh
    SCIENTIFIC REPORTS, 2024, 14 (01)
  • [26] Boosting Prediction Performance of Protein-Protein Interaction Hot Spots by Using Structural Neighborhood Properties
    Deng, Lei
    Guan, Jihong
    Wei, Xiaoming
    Yi, Yuan
    Zhang, Qiangfeng Cliff
    Zhou, Shuigeng
    JOURNAL OF COMPUTATIONAL BIOLOGY, 2013, 20 (11) : 878 - 891
  • [27] Prediction of protein complexes based on protein interaction data and functional annotation data using kernel methods
    Zhang, Shi-Hua
    Ning, Xue-Mei
    Liu, Hong-Wei
    Zhang, Xiang-Sun
    COMPUTATIONAL INTELLIGENCE AND BIOINFORMATICS, PT 3, PROCEEDINGS, 2006, 4115 : 514 - 524
  • [28] A novel method for protein-protein interaction site prediction using phylogenetic substitution models
    La, David
    Kihara, Daisuke
    PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS, 2012, 80 (01) : 126 - 141
  • [29] A binding free energy hot spot in the ankyrin repeat protein GABPβ mediated protein-protein interaction
    Desrosiers, DC
    Peng, ZY
    JOURNAL OF MOLECULAR BIOLOGY, 2005, 354 (02) : 375 - 384
  • [30] Protein–protein contact prediction by geometric triangle-aware protein language models
    Lin P.
    Tao H.
    Li H.
    Huang S.-Y.
    Nature Machine Intelligence, 2023, 5 (11) : 1275 - 1284