A novel sequence-based predictor for identifying and characterizing thermophilic proteins using estimated propensity scores of dipeptides

Cited by: 24
Authors
Charoenkwan, Phasit [1]
Chotpatiwetchkul, Warot [2]
Lee, Vannajan Sanghiran [3]
Nantasenamat, Chanin [4]
Shoombuatong, Watshara [4]
Affiliations
[1] Chiang Mai Univ, Coll Arts Media & Technol, Modern Management & Informat Technol, Chiang Mai 50200, Thailand
[2] King Mongkuts Inst Technol Ladkrabang, Sch Sci, Dept Chem, Appl Computat Chem Res Unit, Bangkok 10520, Thailand
[3] Univ Malaya, Fac Sci, Ctr Theoret & Computat Phys, Dept Chem, Kuala Lumpur 50603, Malaysia
[4] Mahidol Univ, Fac Med Technol, Ctr Data Min & Biomed Informat, Bangkok 10700, Thailand
Keywords
AMINO-ACID; FEATURE-SELECTION; WEB SERVER; THERMOSTABILITY; DISCRIMINATION; ELUCIDATION; STABILITY; ENZYMES
DOI
10.1038/s41598-021-03293-w
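Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09
Abstract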
Because they maintain a thermodynamically stable fold at extremely high temperatures, thermophilic proteins (TPPs) play a critical role in basic research and in a variety of food-industry applications. Computational models that rapidly and accurately identify novel TPPs among large numbers of uncharacterized protein sequences are therefore desirable. Although computational models for characterizing thermophilic proteins already exist, their performance and interpretability remain unsatisfactory. We present a novel sequence-based thermophilic protein predictor, termed SCMTPP, that improves model predictability and interpretability. First, an up-to-date, high-quality dataset consisting of 1853 TPPs and 3233 non-TPPs was compiled from the published literature. Second, the SCMTPP predictor was created by combining the scoring card method (SCM) with estimated propensity scores of g-gap dipeptides. Benchmarking experiments showed that SCMTPP achieved a cross-validation accuracy of 0.883, which was comparable to that of a support vector machine-based predictor (0.906-0.910) and 2-17% higher than that of commonly used machine learning models. Furthermore, SCMTPP outperformed the state-of-the-art approach (ThermoPred) on the independent test dataset, with an accuracy of 0.865 and an MCC of 0.731. Finally, the SCMTPP-derived propensity scores were used to elucidate the physicochemical properties that are critical for enhancing protein thermostability. Comparative results showed that SCMTPP was effective for identifying and characterizing TPPs in terms of both interpretability and generalizability. We have implemented the proposed predictor as a user-friendly online web server at http://pmlastack.pythonanywhere.com/SCMTPP to allow easy access to the model. SCMTPP is expected to be a powerful tool for facilitating community-wide efforts to identify TPPs on a large scale and for guiding their experimental characterization.
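The abstract names two ingredients, g-gap dipeptides and a scoring card of propensity scores, without giving formulas. The sketch below illustrates one common way these are combined: a sequence is scored as the composition-weighted sum of per-dipeptide propensity scores and compared against a threshold. This is a minimal illustration under stated assumptions, not the SCMTPP implementation. The function names, the example scores, and the threshold are all hypothetical, and the step that estimates and optimizes the propensity scores is not shown.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs over the 20 standard amino acids.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def g_gap_dipeptide_composition(sequence, g=0):
    """Frequency of each residue pair separated by g intervening positions.

    g=0 gives ordinary (adjacent) dipeptides. Pairs containing a
    non-standard residue are skipped.
    """
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - g - 1):
        first, second = seq[i], seq[i + g + 1]
        if first in AMINO_ACIDS and second in AMINO_ACIDS:
            counts[first + second] += 1
    total = sum(counts.values())
    if total == 0:
        return {dp: 0.0 for dp in DIPEPTIDES}
    return {dp: counts[dp] / total for dp in DIPEPTIDES}


def scoring_card_score(sequence, propensity, g=0):
    """Composition-weighted sum of dipeptide propensity scores.

    `propensity` maps dipeptides to scores; missing entries count as 0.
    (Assumption: the abstract does not state the scoring formula.)
    """
    composition = g_gap_dipeptide_composition(sequence, g)
    return sum(composition[dp] * propensity.get(dp, 0.0) for dp in DIPEPTIDES)


def predict_thermophilic(sequence, propensity, threshold, g=0):
    """Label a sequence as thermophilic when its score exceeds the threshold."""
    return scoring_card_score(sequence, propensity, g) > threshold


# Placeholder scores and threshold for illustration only (not from the paper).
toy_propensity = {"AC": 300.0, "CA": 600.0}
toy_score = scoring_card_score("ACAC", toy_propensity)
toy_label = predict_thermophilic("ACAC", toy_propensity, threshold=350.0)
```

Because the score is a plain weighted sum, each dipeptide's contribution to a prediction can be read directly off the score table, which is the kind of interpretability the abstract emphasizes.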
Pages: 15