Fast and accurate protein function prediction from sequence through pretrained language model and homology-based label diffusion

Citations: 24
Authors
Yuan, Qianmu [1]
Xie, Junjie [1]
Xie, Jiancong [1]
Zhao, Huiying [3, 4]
Yang, Yuedong [1, 2]
Affiliations
[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou 510000, Peoples R China
[2] Sun Yat Sen Univ, Key Lab Machine Intelligence & Adv Comp MOE, Guangzhou 510000, Peoples R China
[3] Sun Yat Sen Univ, Sun Yat Sen Mem Hosp, Basic & Translat Med Res Ctr, Guangzhou 510000, Peoples R China
[4] Sun Yat Sen Univ, Sun Yat Sen Mem Hosp, Guangzhou, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
sequence-based; protein function prediction; pretrained language model; label diffusion; LARGE-SCALE; ONTOLOGY; NETWORK; TOOL;
DOI
10.1093/bib/bbad117
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Protein function prediction is an essential task in bioinformatics that benefits disease mechanism elucidation and drug target discovery. Due to the explosive growth of proteins in sequence databases and the diversity of their functions, it remains challenging to predict protein functions quickly and accurately from sequences alone. Although many methods have integrated protein structures, biological networks or literature information to improve performance, these extra features are unavailable for most proteins. Here, we propose SPROF-GO, a Sequence-based alignment-free PROtein Function predictor, which leverages a pretrained language model to efficiently extract informative sequence embeddings and employs self-attention pooling to focus on important residues. The prediction is further improved by exploiting homology information and accounting for the overlapping communities of proteins with related functions through a label diffusion algorithm. SPROF-GO was shown to surpass state-of-the-art sequence-based and even network-based approaches by more than 14.5%, 27.3% and 10.1% in area under the precision-recall curve on the three sub-ontology test sets, respectively. Our method was also demonstrated to generalize well to non-homologous proteins and unseen species. Finally, visualization based on the attention mechanism indicated that SPROF-GO is able to capture sequence domains useful for function prediction. The datasets, source code and trained models of SPROF-GO are publicly available, and the SPROF-GO web server is freely accessible.
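The abstract names label diffusion over homology information but does not give the update rule, so the following is only a minimal sketch of a generic label-diffusion step, not the SPROF-GO formulation. Everything in it is an assumption made for illustration: the function names `row_normalize` and `label_diffusion`, the row-normalized similarity matrix, the mixing weight `alpha`, and the toy similarity values. The idea it shows is that each protein's score for a GO term is repeatedly blended with the scores of its homologous neighbours while staying anchored to its initial prediction.

```python
def row_normalize(weights):
    """Scale each row of a non-negative similarity matrix to sum to 1."""
    normalized = []
    for row in weights:
        total = sum(row)
        normalized.append([w / total if total > 0 else 0.0 for w in row])
    return normalized


def label_diffusion(weights, initial_scores, alpha=0.5, max_iter=100, tol=1e-9):
    """Generic iterative label diffusion (illustrative, not the paper's rule).

    weights        -- n x n homology similarities (hypothetical values)
    initial_scores -- n x k initial scores per protein and GO term
    alpha          -- assumed share of the score taken from neighbours
    """
    transition = row_normalize(weights)
    n_proteins = len(initial_scores)
    n_terms = len(initial_scores[0])
    scores = [row[:] for row in initial_scores]
    for _ in range(max_iter):
        updated = []
        for i in range(n_proteins):
            if sum(transition[i]) == 0:
                # A protein without homologs keeps its initial scores.
                updated.append(initial_scores[i][:])
                continue
            row = []
            for t in range(n_terms):
                neighbour_avg = sum(
                    transition[i][j] * scores[j][t] for j in range(n_proteins)
                )
                row.append(alpha * neighbour_avg + (1 - alpha) * initial_scores[i][t])
            updated.append(row)
        delta = max(
            abs(updated[i][t] - scores[i][t])
            for i in range(n_proteins)
            for t in range(n_terms)
        )
        scores = updated
        if delta < tol:  # stop once the scores no longer change
            break
    return scores


# Toy network: protein 0 is the query; proteins 1 and 2 are annotated.
weights = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.0],
    [0.1, 0.0, 0.0],
]
initial = [[0.2], [1.0], [0.0]]  # one GO term; 1.0 means annotated with it
diffused = label_diffusion(weights, initial)
print(round(diffused[0][0], 3))
```

In this toy case the query's score rises from 0.2 to about 0.43 because its closest homolog carries the term. Note that the sketch also lets the annotated proteins' scores drift. A real pipeline might instead clamp known labels, which is a design choice the abstract does not specify.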
Pages: 11