Using protein language models for protein interaction hot spot prediction with limited data

Cited by: 3
Authors
Sargsyan, Karen [1]
Lim, Carmay [1]
Affiliations
[1] Acad Sinica, Inst Biomed Sci, Taipei 115, Taiwan
Keywords
Protein language models; ESM-2; Protein-protein interaction; PPI-hotspot; Small datasets; Feature selection; BINDING; CONSURF;
DOI
10.1186/s12859-024-05737-2
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Protein language models, inspired by the success of large language models in deciphering human language, have emerged as powerful tools for unraveling the intricate code of life inscribed within protein sequences. They have gained significant attention for their promising applications across various areas, including the sequence-based prediction of secondary and tertiary protein structure, the discovery of new functional protein sequences/folds, and the assessment of mutational impact on protein fitness. However, their utility in learning to predict protein residue properties based on scant datasets, such as protein-protein interaction (PPI)-hotspots whose mutations significantly impair PPIs, remained unclear. Here, we explore the feasibility of using protein language-learned representations as features for machine learning to predict PPI-hotspots using a dataset containing 414 experimentally confirmed PPI-hotspots and 504 PPI-nonhot spots.
Results: Our findings showcase the capacity of unsupervised learning with protein language models in capturing critical functional attributes of protein residues derived from the evolutionary information encoded within amino acid sequences. We show that methods relying on protein language models can compete with methods employing sequence and structure-based features to predict PPI-hotspots from the free protein structure. We observed an optimal number of features for model precision, suggesting a balance between information and overfitting.
Conclusions: This study underscores the potential of transformer-based protein language models to extract critical knowledge from sparse datasets, exemplified here by the challenging realm of predicting PPI-hotspots. These models offer a cost-effective and time-efficient alternative to traditional experimental methods for predicting certain residue properties. However, the challenge of explaining why specific features are important for determining certain residue properties remains.
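The trade-off described in the Results, where precision depends on how many embedding features are kept, can be illustrated with a minimal sketch. This is not the authors' pipeline. The "embeddings" are synthetic Gaussian vectors rather than ESM-2 output, the ranking score (absolute difference of class means) and the nearest-centroid classifier are stand-ins chosen so the example needs only the Python standard library, and every function name and parameter value is hypothetical. Only the class sizes (414 and 504) come from the abstract.

```python
import random
import statistics


def make_synthetic(n_hot=414, n_non=504, dim=40, n_informative=6, seed=0):
    """Synthetic stand-in for per-residue embeddings (NOT real ESM-2 output).

    The first n_informative dimensions are shifted for the hotspot class;
    all remaining dimensions are pure noise.
    """
    rng = random.Random(seed)
    rows, labels = [], []
    for label, count in ((1, n_hot), (0, n_non)):
        for _ in range(count):
            row = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            if label == 1:
                for j in range(n_informative):
                    row[j] += 0.8
            rows.append(row)
            labels.append(label)
    order = list(range(len(labels)))
    rng.shuffle(order)
    return [rows[i] for i in order], [labels[i] for i in order]


def class_means(X, y, label):
    """Per-feature mean over the rows that carry the given label."""
    selected = [x for x, t in zip(X, y) if t == label]
    return [statistics.fmean(col) for col in zip(*selected)]


def rank_features(X, y):
    """Rank features by |mean(hot) - mean(nonhot)|, largest first."""
    m1, m0 = class_means(X, y, 1), class_means(X, y, 0)
    scores = [abs(a - b) for a, b in zip(m1, m0)]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def precision_top_k(X_tr, y_tr, X_te, y_te, ranking, k):
    """Precision (TP / (TP + FP)) of a nearest-centroid rule on the top-k features."""
    cols = ranking[:k]
    m1, m0 = class_means(X_tr, y_tr, 1), class_means(X_tr, y_tr, 0)
    tp = fp = 0
    for x, t in zip(X_te, y_te):
        d1 = sum((x[j] - m1[j]) ** 2 for j in cols)
        d0 = sum((x[j] - m0[j]) ** 2 for j in cols)
        if d1 < d0:  # predicted hotspot
            if t == 1:
                tp += 1
            else:
                fp += 1
    return tp / (tp + fp) if (tp + fp) else 0.0


X, y = make_synthetic()
split = int(0.8 * len(y))  # simple hold-out split; the data are already shuffled
X_tr, y_tr, X_te, y_te = X[:split], y[:split], X[split:], y[split:]

# Rank on the training portion only, so the held-out residues do not leak in.
ranking = rank_features(X_tr, y_tr)
results = {k: precision_top_k(X_tr, y_tr, X_te, y_te, ranking, k)
           for k in (1, 2, 4, 6, 10, 20, 40)}
for k, p in results.items():
    print(f"top {k:2d} features: precision = {p:.3f}")
```

With real data, a single hold-out split on fewer than a thousand residues would be noisy, so repeated or cross-validated splits would be the more cautious choice when looking for the number of features at which precision peaks.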
Pages: 10