Identification of Thermophilic Proteins Based on Sequence-Based Bidirectional Representations from Transformer-Embedding Features

Cited by: 15
Authors
Pei, Hongdi [1]
Li, Jiayu [2]
Ma, Shuhan [1]
Jiang, Jici [1]
Li, Mingxin [1]
Zou, Quan [3, 4]
Lv, Zhibin [1]
Affiliations
[1] Sichuan Univ, Coll Biomed Engn, Chengdu 610065, Peoples R China
[2] Sichuan Univ, Coll Life Sci, Chengdu 610065, Peoples R China
[3] Univ Elect Sci & Technol China, Inst Fundamental & Frontier Sci, Chengdu 610054, Peoples R China
[4] Univ Elect Sci & Technol China, Yangtze Delta Reg Inst Quzhou, Quzhou 324000, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2023 / Vol. 13 / Issue 05
Funding
National Natural Science Foundation of China;
Keywords
thermophilic proteins; BERT; machine learning; imbalanced dataset; deep learning; AMINO-ACID-COMPOSITION; FEATURE-SELECTION; PREDICTION; CLASSIFICATION; LANGUAGE;
DOI
10.3390/app13052858
Chinese Library Classification number
O6 [Chemistry];
Discipline classification code
0703;
Abstract
Thermophilic proteins have great potential as biocatalysts in biotechnology. Machine learning algorithms are increasingly used to identify such enzymes, reducing or even eliminating the need for experimental studies. While most previously used machine learning methods were based on manually designed features, we developed BertThermo, a model that uses Bidirectional Encoder Representations from Transformers (BERT) as an automatic feature extraction tool. The method combines a variety of machine learning algorithms and feature engineering methods, while relying on a single feature encoding derived from the protein sequence alone as model input. BertThermo achieved accuracies of 96.97% and 97.51% in 5-fold cross-validation and in independent testing, respectively, identifying thermophilic proteins more reliably than any previously described predictive algorithm. Additionally, BertThermo was tested on a balanced dataset, an imbalanced dataset, and a dataset with homologous sequences, and the results show that BertThermo was the most robust when compared with state-of-the-art methods. The source code of BertThermo is available.
Pages: 16