Alignment-free metal ion-binding site prediction from protein sequence through pretrained language model and multi-task learning

Cited by: 14
Authors
Yuan, Qianmu [1]
Chen, Sheng [1]
Wang, Yu [3]
Zhao, Huiying [2]
Yang, Yuedong [1]
Affiliations
[1] Sun Yat Sen Univ, Sch Comp Sci & Engn, Guangzhou 510000, Peoples R China
[2] Sun Yat Sen Univ, Sun Yat Sen Mem Hosp, Guangzhou 510000, Peoples R China
[3] Peng Cheng Nat Lab Shenzhen, Shenzhen, Peoples R China
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China;
Keywords
metal ion-binding site; alignment-free; pretrained language model; multi-task learning; RECOGNITION; GENERATION; DATABASES;
DOI
10.1093/bib/bbac444
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
More than one-third of the proteins in the Protein Data Bank contain metal ions. Correct identification of metal ion-binding residues is important for understanding protein functions and designing novel drugs. Because metal ions are small and highly versatile, it remains challenging to computationally predict their binding sites from protein sequence. Existing sequence-based methods have low accuracy because they lack structural information, and they are time-consuming because they rely on multiple sequence alignment. Here, we propose LMetalSite, an alignment-free sequence-based predictor for binding sites of the four most frequently seen metal ions in BioLiP (Zn2+, Ca2+, Mg2+ and Mn2+). LMetalSite leverages a pretrained language model to rapidly generate informative sequence representations and employs a transformer to capture long-range dependencies. Multi-task learning is adopted to compensate for the scarcity of training data and to capture the intrinsic similarities between different metal ions. LMetalSite was shown to surpass state-of-the-art structure-based methods by more than 19.7, 14.4, 36.8 and 12.6% in area under the precision-recall curve on the four independent tests, respectively. Further analyses indicated that the self-attention modules are effective at learning the structural contexts of residues from protein sequence. We provide the data sets, source code and trained models of LMetalSite at https://github.com/biomed-AI/LMetalSite.
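The abstract reports results as area under the precision-recall curve, computed separately for each metal ion. As a rough illustration of that metric only, the sketch below computes the step-wise form (average precision) for per-residue labels and scores. The function name `average_precision` and the toy labels and scores are assumptions made for this example; this is not the LMetalSite evaluation code, and the paper's exact protocol is not stated in this record.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve (average precision).

    labels: 1 for a binding residue, 0 otherwise (hypothetical toy input).
    scores: predicted binding probability for each residue.
    Tied scores keep their input order; a real evaluation may treat ties differently.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive (binding) residue")

    # Rank residues from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    true_pos = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            # Precision at the rank where recall increases by 1 / n_pos.
            total += true_pos / rank
    return total / n_pos


# Toy example: 8 residues, 2 of which bind the ion (made-up numbers).
toy_labels = [0, 1, 0, 0, 1, 0, 0, 0]
toy_scores = [0.10, 0.90, 0.80, 0.20, 0.70, 0.30, 0.05, 0.40]
toy_ap = average_precision(toy_labels, toy_scores)
```

Binding residues are usually a small minority of all residues, so a precision-recall summary is more sensitive to false positives than an ROC-based one, which is one plausible reason to report it for this task.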
Pages: 10
Related papers
2 in total
  • [1] LMPhosSite: A Deep Learning-Based Approach for General Protein Phosphorylation Site Prediction Using Embeddings from the Local Window Sequence and Pretrained Protein Language Model
    Pakhrin, Subash C.
    Pokharel, Suresh
    Pratyush, Pawel
    Chaudhari, Meenal
    Ismail, Hamid D.
    Dukka, B. K. C. B.
    JOURNAL OF PROTEOME RESEARCH, 2023, 22 (08) : 2548 - 2557
  • [2] Fast and accurate protein function prediction from sequence through pretrained language model and homology-based label diffusion
    Yuan, Qianmu
    Xie, Junjie
    Xie, Jiancong
    Zhao, Huiying
    Yang, Yuedong
    BRIEFINGS IN BIOINFORMATICS, 2023, 24 (03)