A novel sequence-based predictor for identifying and characterizing thermophilic proteins using estimated propensity scores of dipeptides

Cited by: 24
Authors
Charoenkwan, Phasit [1]
Chotpatiwetchkul, Warot [2]
Lee, Vannajan Sanghiran [3]
Nantasenamat, Chanin [4]
Shoombuatong, Watshara [4]
Affiliations
[1] Chiang Mai Univ, Coll Arts Media & Technol, Modern Management & Informat Technol, Chiang Mai 50200, Thailand
[2] King Mongkuts Inst Technol Ladkrabang, Sch Sci, Dept Chem, Appl Computat Chem Res Unit, Bangkok 10520, Thailand
[3] Univ Malaya, Fac Sci, Ctr Theoret & Computat Phys, Dept Chem, Kuala Lumpur 50603, Malaysia
[4] Mahidol Univ, Fac Med Technol, Ctr Data Min & Biomed Informat, Bangkok 10700, Thailand
Keywords
AMINO-ACID; FEATURE-SELECTION; WEB SERVER; THERMOSTABILITY; DISCRIMINATION; ELUCIDATION; STABILITY; ENZYMES
DOI
10.1038/s41598-021-03293-w
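Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]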
Discipline classification code
07; 0710; 09
Abstract
Owing to their ability to maintain a thermodynamically stable fold at extremely high temperatures, thermophilic proteins (TPPs) play a critical role in basic research and a variety of applications in the food industry. As a result, the development of computational models for rapidly and accurately identifying novel TPPs from a large number of uncharacterized protein sequences is desirable. Although computational models have already been developed for characterizing thermophilic proteins, their performance and interpretability remain unsatisfactory. We present a novel sequence-based thermophilic protein predictor, termed SCMTPP, for improving model predictability and interpretability. First, an up-to-date and high-quality dataset consisting of 1853 TPPs and 3233 non-TPPs was compiled from published literature. Second, the SCMTPP predictor was created by combining the scoring card method (SCM) with estimated propensity scores of g-gap dipeptides. Benchmarking experiments revealed that SCMTPP had a cross-validation accuracy of 0.883, which was comparable to that of a support vector machine-based predictor (0.906-0.910) and 2-17% higher than that of commonly used machine learning models. Furthermore, SCMTPP outperformed the state-of-the-art approach (ThermoPred) on the independent test dataset, with accuracy and MCC of 0.865 and 0.731, respectively. Finally, the SCMTPP-derived propensity scores were used to elucidate the critical physicochemical properties for protein thermostability enhancement. In terms of interpretability and generalizability, comparative results showed that SCMTPP was effective for identifying and characterizing TPPs. We have implemented the proposed predictor as a user-friendly online web server at http://pmlastack.pythonanywhere.com/SCMTPP to allow easy access to the model. SCMTPP is expected to be a powerful tool for facilitating community-wide efforts to identify TPPs on a large scale and for guiding experimental characterization of TPPs.
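
The abstract states only that SCMTPP combines the scoring card method with propensity scores of g-gap dipeptides. The following is a minimal illustrative sketch of that general idea, not the authors' implementation. It assumes that a sequence is scored as the composition-weighted sum of dipeptide propensity scores, and that starting scores can be estimated from the difference in mean dipeptide composition between the two classes, rescaled to 0..1000. The function names, the 0..1000 range, and the fixed threshold are all hypothetical, and any score optimization step the published method may use is omitted.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def gap_dipeptide_composition(seq, g=0):
    """Frequency of each residue pair whose members are separated by g positions."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - g - 1):
        pair = seq[i] + seq[i + g + 1]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return {d: 0.0 for d in DIPEPTIDES}
    return {d: c / total for d, c in counts.items()}


def initial_propensity_scores(positives, negatives, g=0):
    """Assumed starting scores: mean composition difference scaled to 0..1000."""
    def mean_comp(seqs):
        comps = [gap_dipeptide_composition(s, g) for s in seqs]
        return {d: sum(c[d] for c in comps) / len(comps) for d in DIPEPTIDES}

    pos, neg = mean_comp(positives), mean_comp(negatives)
    diff = {d: pos[d] - neg[d] for d in DIPEPTIDES}
    lo, hi = min(diff.values()), max(diff.values())
    span = (hi - lo) or 1.0  # avoid division by zero if all differences are equal
    return {d: 1000.0 * (diff[d] - lo) / span for d in DIPEPTIDES}


def scm_score(seq, scores, g=0):
    """Composition-weighted sum of dipeptide propensity scores."""
    comp = gap_dipeptide_composition(seq, g)
    return sum(scores[d] * comp[d] for d in DIPEPTIDES)


def predict(seq, scores, threshold, g=0):
    """Label a sequence as positive when its score exceeds the threshold."""
    return scm_score(seq, scores, g) > threshold
```

With toy inputs such as `initial_propensity_scores(["AAAA"], ["CCCC"])`, the dipeptide enriched in the positive class receives the highest score, which is what makes a score table of this kind readable: each of the 400 values can be inspected directly. In practice the threshold would be chosen on training data rather than fixed by hand.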
Pages: 15
Related papers
50 in total
  • [31] CLPred: a sequence-based protein crystallization predictor using BLSTM neural network
    Xuan, Wenjing
    Liu, Ning
    Huang, Neng
    Li, Yaohang
    Wang, Jianxin
    BIOINFORMATICS, 2020, 36 : I709 - I717
  • [32] iTIS-PseTNC: A sequence-based predictor for identifying translation initiation site in human genes using pseudo trinucleotide composition
    Chen, Wei
    Feng, Peng-Mian
    Deng, En-Ze
    Lin, Hao
    Chou, Kuo-Chen
    ANALYTICAL BIOCHEMISTRY, 2014, 462 : 76 - 83
  • [33] A Sequence-based Predictor for Identifying DNase Hypersensitive Sites Via Physical-chemical Property Matrix
    Qiu, Wang-Ren
    Zou, Guo-Ying
    Xu, Zhao-Chun
    2015 INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING AND INFORMATION SYSTEM (SEIS 2015), 2015, : 379 - 385
  • [34] A sequence-based two-layer predictor for identifying enhancers and their strength through enhanced feature extraction
    Amilpur, Santhosh
    Bhukya, Raju
    JOURNAL OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY, 2022, 20 (02)
  • [35] Sc-ncDNAPred: A Sequence-Based Predictor for Identifying Non-coding DNA in Saccharomyces cerevisiae
    He, Wenying
    Ju, Ying
    Zeng, Xiangxiang
    Liu, Xiangrong
    Zou, Quan
    FRONTIERS IN MICROBIOLOGY, 2018, 9
  • [36] iAMY-SCM: Improved prediction and analysis of amyloid proteins using a scoring card method with propensity scores of dipeptides
    Charoenkwan, Phasit
    Kanthawong, Sakawrat
    Nantasenamat, Chanin
    Hasan, Md Mehedi
    Shoombuatong, Watshara
    GENOMICS, 2021, 113 (01) : 689 - 698
  • [37] Identification of Helicobacter pylori Membrane Proteins Using Sequence-Based Features
    Liu, Mujiexin
    Chen, Hui
    Gao, Dong
    Ma, Cai-Yi
    Zhang, Zhao-Yue
    COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE, 2022, 2022
  • [38] Novel sequence-based method for identifying transcription factor binding sites in prokaryotic genomes
    Sahota, Gurmukh
    Stormo, Gary D.
    BIOINFORMATICS, 2010, 26 (21) : 2672 - 2677
  • [39] Characterizing embryonic gene expression patterns in the mouse using nonredundant sequence-based selection
    Sousa-Nunes, R
    Rana, AA
    Kettleborough, R
    Brickman, JM
    Clements, M
    Forrest, A
    Grimmond, S
    Avner, P
    Smith, JC
    Dunwoodie, SL
    GENOME RESEARCH, 2003, 13 (12) : 2609 - 2620
  • [40] M6APred-EL: A Sequence-Based Predictor for Identifying N6-methyladenosine Sites Using Ensemble Learning
    Wei, Leyi
    Chen, Huangrong
    Su, Ran
    MOLECULAR THERAPY-NUCLEIC ACIDS, 2018, 12 : 635 - 644