A novel sequence-based predictor for identifying and characterizing thermophilic proteins using estimated propensity scores of dipeptides

Authors
Phasit Charoenkwan
Warot Chotpatiwetchkul
Vannajan Sanghiran Lee
Chanin Nantasenamat
Watshara Shoombuatong
Affiliations
[1] Chiang Mai University, Modern Management and Information Technology, College of Arts, Media and Technology
[2] King Mongkut’s Institute of Technology Ladkrabang, Applied Computational Chemistry Research Unit, Department of Chemistry, School of Science
[3] University of Malaya, Department of Chemistry, Centre of Theoretical and Computational Physics, Faculty of Science
[4] Mahidol University, Center of Data Mining and Biomedical Informatics, Faculty of Medical Technology
Abstract
Owing to their ability to maintain a thermodynamically stable fold at extremely high temperatures, thermophilic proteins (TPPs) play a critical role in basic research and in a variety of applications in the food industry. As a result, computational models that rapidly and accurately identify novel TPPs from a large number of uncharacterized protein sequences are desirable. Although computational models for characterizing thermophilic proteins already exist, their performance and interpretability remain unsatisfactory. We present a novel sequence-based thermophilic protein predictor, termed SCMTPP, for improving model predictability and interpretability. First, an up-to-date and high-quality dataset consisting of 1853 TPPs and 3233 non-TPPs was compiled from published literature. Second, the SCMTPP predictor was created by combining the scoring card method (SCM) with estimated propensity scores of g-gap dipeptides. Benchmarking experiments revealed that SCMTPP had a cross-validation accuracy of 0.883, which was comparable to that of a support vector machine-based predictor (0.906–0.910) and 2–17% higher than that of commonly used machine learning models. Furthermore, SCMTPP outperformed the state-of-the-art approach (ThermoPred) on the independent test dataset, with accuracy and MCC of 0.865 and 0.731, respectively. Finally, the SCMTPP-derived propensity scores were used to elucidate the critical physicochemical properties for protein thermostability enhancement. In terms of interpretability and generalizability, comparative results showed that SCMTPP was effective for identifying and characterizing TPPs. We have implemented the proposed predictor as a user-friendly online web server at http://pmlabstack.pythonanywhere.com/SCMTPP to allow easy access to the model.
SCMTPP is expected to be a powerful tool for facilitating community-wide efforts to identify TPPs on a large scale and guiding experimental characterization of TPPs.
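
The abstract describes scoring a sequence by combining g-gap dipeptide composition with per-dipeptide propensity scores. The following is a minimal sketch of that idea, not the authors' implementation. It assumes that a g-gap dipeptide pairs residue i with residue i+g+1, that initial propensity scores are the difference in mean dipeptide composition between the two classes rescaled to [0, 1000], and that a sequence score is the composition-weighted sum of propensities compared against a threshold. All function names are hypothetical, and any further optimization of the scores used in the actual SCM is omitted.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of the 20 standard amino acids.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def gap_dipeptide_composition(seq, g=0):
    """Frequency of each g-gap dipeptide (residue i paired with residue i+g+1)."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - g - 1):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return {d: 0.0 for d in DIPEPTIDES}
    return {d: c / total for d, c in counts.items()}


def initial_propensity_scores(positives, negatives, g=0):
    """Assumed initial scores: class difference in mean composition, scaled to [0, 1000]."""
    def mean_composition(seqs):
        comps = [gap_dipeptide_composition(s, g) for s in seqs]
        return {d: sum(c[d] for c in comps) / len(comps) for d in DIPEPTIDES}

    pos = mean_composition(positives)
    neg = mean_composition(negatives)
    diff = {d: pos[d] - neg[d] for d in DIPEPTIDES}
    lo, hi = min(diff.values()), max(diff.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all differences are equal
    return {d: 1000.0 * (diff[d] - lo) / span for d in DIPEPTIDES}


def scm_score(seq, scores, g=0):
    """Composition-weighted sum of dipeptide propensity scores."""
    comp = gap_dipeptide_composition(seq, g)
    return sum(comp[d] * scores[d] for d in DIPEPTIDES)


def predict(seq, scores, threshold, g=0):
    """Label a sequence as a TPP when its score exceeds the (assumed) threshold."""
    return scm_score(seq, scores, g) > threshold
```

With toy inputs such as `initial_propensity_scores(["KEKEKE", "EKEKEK"], ["QSQSQS", "SQSQSQ"])`, dipeptides enriched in the first group receive high scores and those enriched in the second receive low ones, which is what makes the resulting score table directly inspectable. In practice the threshold would be tuned on training data.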