Fast and accurate protein function prediction from sequence through pretrained language model and homology-based label diffusion

Cited by: 24
Authors
Yuan, Qianmu [1]
Xie, Junjie [1]
Xie, Jiancong [1]
Zhao, Huiying [3, 4]
Yang, Yuedong [1, 2]
Affiliations
[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou 510000, Peoples R China
[2] Sun Yat Sen Univ, Key Lab Machine Intelligence & Adv Comp MOE, Guangzhou 510000, Peoples R China
[3] Sun Yat Sen Univ, Sun Yat Sen Mem Hosp, Basic & Translat Med Res Ctr, Guangzhou 510000, Peoples R China
[4] Sun Yat Sen Univ, Sun Yat Sen Mem Hosp, Guangzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
sequence-based; protein function prediction; pretrained language model; label diffusion; LARGE-SCALE; ONTOLOGY; NETWORK; TOOL;
DOI
10.1093/bib/bbad117
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Protein function prediction is an essential task in bioinformatics that benefits disease mechanism elucidation and drug target discovery. Due to the explosive growth of proteins in sequence databases and the diversity of their functions, it remains challenging to predict protein functions quickly and accurately from sequences alone. Although many methods have integrated protein structures, biological networks or literature information to improve performance, these extra features are unavailable for most proteins. Here, we propose SPROF-GO, a Sequence-based alignment-free PROtein Function predictor, which leverages a pretrained language model to efficiently extract informative sequence embeddings and employs self-attention pooling to focus on important residues. The prediction is further improved by exploiting homology information and accounting for the overlapping communities of proteins with related functions through a label diffusion algorithm. SPROF-GO was shown to surpass state-of-the-art sequence-based and even network-based approaches by more than 14.5, 27.3 and 10.1% in area under the precision-recall curve on the three sub-ontology test sets, respectively. Our method was also demonstrated to generalize well to non-homologous proteins and unseen species. Finally, visualization based on the attention mechanism indicated that SPROF-GO is able to capture sequence domains useful for function prediction. The datasets, source code and trained models of SPROF-GO are publicly available, and the SPROF-GO web server is freely accessible.
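The abstract does not spell out the label diffusion step, so the following is only a minimal, generic sketch of the idea and not the authors' implementation. It assumes a small homology similarity matrix and repeatedly mixes each protein's initial per-term scores with the scores of its neighbours. The function names, the update rule F <- alpha * S * F + (1 - alpha) * Y, the value of alpha and the toy data are all illustrative assumptions.

```python
def row_normalize(weights):
    """Scale each row of a non-negative matrix so that it sums to 1."""
    normalized = []
    for row in weights:
        total = sum(row)
        normalized.append([w / total if total > 0 else 0.0 for w in row])
    return normalized


def diffuse_labels(similarity, initial_scores, alpha=0.5, n_iter=50):
    """Generic label diffusion: F <- alpha * S * F + (1 - alpha) * Y.

    similarity:     n x n non-negative homology weights (hypothetical input)
    initial_scores: n x k initial per-term scores Y (annotations or predictions)
    alpha:          share of each score taken from neighbours (assumed value)
    """
    s = row_normalize(similarity)
    n, k = len(initial_scores), len(initial_scores[0])
    scores = [row[:] for row in initial_scores]
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            new_row = []
            for t in range(k):
                # Weighted average of the neighbours' current scores for term t.
                neighbour = sum(s[i][j] * scores[j][t] for j in range(n))
                new_row.append(alpha * neighbour + (1 - alpha) * initial_scores[i][t])
            updated.append(new_row)
        scores = updated
    return scores


# Toy data (made up): proteins 0 and 1 are annotated, protein 2 is the query.
similarity = [
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.2],
    [0.9, 0.2, 0.0],
]
initial = [
    [1.0, 0.0],  # protein 0 carries term A
    [0.0, 1.0],  # protein 1 carries term B
    [0.3, 0.3],  # query: weak, undecided initial prediction
]
result = diffuse_labels(similarity, initial)
```

In this toy graph the query is far more similar to protein 0 than to protein 1, so after diffusion its score for term A ends up above its score for term B. Because each update is a convex combination of values in [0, 1], the scores stay in that range.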
Pages: 11