Identification of Thermophilic Proteins Based on Sequence-Based Bidirectional Representations from Transformer-Embedding Features

Cited by: 15
Authors
Pei, Hongdi [1]
Li, Jiayu [2]
Ma, Shuhan [1]
Jiang, Jici [1]
Li, Mingxin [1]
Zou, Quan [3,4]
Lv, Zhibin [1]
Affiliations
[1] Sichuan Univ, Coll Biomed Engn, Chengdu 610065, Peoples R China
[2] Sichuan Univ, Coll Life Sci, Chengdu 610065, Peoples R China
[3] Univ Elect Sci & Technol China, Inst Fundamental & Frontier Sci, Chengdu 610054, Peoples R China
[4] Univ Elect Sci & Technol China, Yangtze Delta Reg Inst Quzhou, Quzhou 324000, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2023 / Vol. 13 / Issue 05
Funding
National Natural Science Foundation of China;
Keywords
thermophilic proteins; BERT; machine learning; imbalanced dataset; deep learning; AMINO-ACID-COMPOSITION; FEATURE-SELECTION; PREDICTION; CLASSIFICATION; LANGUAGE;
DOI
10.3390/app13052858
Chinese Library Classification number
O6 [Chemistry];
Discipline classification code
0703;
Abstract
Thermophilic proteins have great potential as biocatalysts in biotechnology. Machine learning algorithms are increasingly used to identify such enzymes, reducing or even eliminating the need for experimental studies. While most previous machine learning methods relied on manually designed features, we developed BertThermo, a model that uses Bidirectional Encoder Representations from Transformers (BERT) as an automatic feature extraction tool. The method combines a variety of machine learning algorithms and feature engineering methods while relying on a single feature encoding derived from the protein sequence alone as model input. BertThermo achieved accuracies of 96.97% and 97.51% in 5-fold cross-validation and independent testing, respectively, identifying thermophilic proteins more reliably than any previously described predictive algorithm. Additionally, BertThermo was tested on a balanced dataset, an imbalanced dataset, and a dataset with homologous sequences, and the results show that BertThermo was the most robust compared with state-of-the-art methods. The source code of BertThermo is available.
Pages: 16