Fast and accurate protein function prediction from sequence through pretrained language model and homology-based label diffusion

Cited by: 24
Authors
Yuan, Qianmu [1]
Xie, Junjie [1]
Xie, Jiancong [1]
Zhao, Huiying [3,4]
Yang, Yuedong [1,2]
Affiliations
[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou 510000, Peoples R China
[2] Sun Yat Sen Univ, Key Lab Machine Intelligence & Adv Comp MOE, Guangzhou 510000, Peoples R China
[3] Sun Yat Sen Univ, Sun Yat Sen Mem Hosp, Basic & Translat Med Res Ctr, Guangzhou 510000, Peoples R China
[4] Sun Yat Sen Univ, Sun Yat Sen Mem Hosp, Guangzhou, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
sequence-based; protein function prediction; pretrained language model; label diffusion; LARGE-SCALE; ONTOLOGY; NETWORK; TOOL;
DOI
10.1093/bib/bbad117
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010 ; 081704 ;
Abstract
Protein function prediction is an essential task in bioinformatics that benefits disease mechanism elucidation and drug target discovery. Owing to the explosive growth of proteins in sequence databases and the diversity of their functions, it remains challenging to predict protein functions quickly and accurately from sequences alone. Although many methods integrate protein structures, biological networks or literature information to improve performance, these extra features are unavailable for most proteins. Here, we propose SPROF-GO, a Sequence-based alignment-free PROtein Function predictor, which leverages a pretrained language model to efficiently extract informative sequence embeddings and employs self-attention pooling to focus on important residues. The prediction is further improved by exploiting homology information and accounting for the overlapping communities of proteins with related functions through a label diffusion algorithm. SPROF-GO surpassed state-of-the-art sequence-based and even network-based approaches by more than 14.5%, 27.3% and 10.1% in area under the precision-recall curve on the three sub-ontology test sets, respectively. Our method also generalized well to non-homologous proteins and unseen species. Finally, visualization based on the attention mechanism indicated that SPROF-GO is able to capture sequence domains useful for function prediction. The datasets, source code and trained models of SPROF-GO, as well as the SPROF-GO web server, are freely available.
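The two mechanisms named in the abstract, attention pooling over residue embeddings and label diffusion over a homology graph, can be illustrated with a minimal NumPy sketch. This is a generic illustration under stated assumptions, not the authors' implementation: the function names, the single scoring vector `w`, the mixing weight `alpha`, the fixed iteration count and the toy inputs are all hypothetical, and the diffusion shown is a standard iterative neighbor-averaging scheme that may differ from the formulation used in SPROF-GO.

```python
import numpy as np


def attention_pool(residue_emb, w):
    """Pool per-residue embeddings of shape (L, d) into one vector (d,).

    `w` is a hypothetical scoring vector of shape (d,). Each residue gets
    a softmax weight, so higher-scoring residues contribute more.
    """
    scores = residue_emb @ w                 # one score per residue, shape (L,)
    scores = scores - scores.max()           # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ residue_emb, weights


def label_diffusion(similarity, initial, alpha=0.5, n_iter=50):
    """Smooth label scores over a homology graph (generic sketch).

    similarity: (n, n) non-negative pairwise similarity between proteins.
    initial:    (n, k) starting scores in [0, 1] for k function labels.
    alpha:      assumed weight given to neighbors versus the starting scores.
    """
    S = np.array(similarity, dtype=float)
    np.fill_diagonal(S, 0.0)                 # ignore self-similarity
    isolated = np.flatnonzero(S.sum(axis=1) == 0)
    S[isolated, isolated] = 1.0              # proteins without homologs keep their own scores
    P = S / S.sum(axis=1, keepdims=True)     # row-normalize into averaging weights
    init = np.array(initial, dtype=float)
    Y = init.copy()
    for _ in range(n_iter):
        Y = (1.0 - alpha) * init + alpha * (P @ Y)
    return Y


# Toy example: proteins 0 and 1 are homologs, protein 2 has no homologs.
sim = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 0.0, 0.0]])
start = np.array([[1.0], [0.2], [0.5]])      # one label, three proteins
smoothed = label_diffusion(sim, start)
```

In the toy example, the score of protein 1 is pulled toward that of its confidently annotated homolog, while the protein without homologs keeps its starting score. Because each update is a convex combination of values in [0, 1], the diffused scores stay in that range.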
Pages: 11