Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naive Bayes

Cited by: 141
Authors
Lou, Wangchao [1]
Wang, Xiaoqing [1]
Chen, Fan [1]
Chen, Yixiao [1]
Jiang, Bo [1]
Zhang, Hua [1]
Affiliations
[1] Zhejiang Gongshang University, School of Computer and Information Engineering, Hangzhou, Zhejiang, People's Republic of China
Source
PLOS ONE | 2014 / Vol. 9 / Issue 01
Funding
National Natural Science Foundation of China;
Keywords
RIBOSOMAL-RNA-BINDING; SECONDARY STRUCTURE; EVOLUTIONARY CONSERVATION; FOLD RECOGNITION; IDENTIFICATION; COVARIANCE; RESOLUTION; ACCURATE; RECEPTORS; DOMAINS;
DOI
10.1371/journal.pone.0086703
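Chinese Library Classification (CLC) number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Discipline classification code
07; 0710; 09;
Abstract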
Developing an efficient method for identifying DNA-binding proteins is highly desirable: these proteins play vital roles in gene regulation, and such a method would advance our understanding of protein function. In this study, we propose a new method for predicting DNA-binding proteins that ranks features using random forest and then performs wrapper-based feature selection with a forward best-first search strategy. The features draw on information from the primary sequence, predicted secondary structure, predicted relative solvent accessibility, and the position-specific scoring matrix. The proposed method, called DBPPred, uses Gaussian naive Bayes as the underlying classifier because it outperformed five other classifiers: decision tree, logistic regression, k-nearest neighbor, support vector machine with polynomial kernel, and support vector machine with radial basis function kernel. DBPPred achieves the highest average accuracy (0.791) and average MCC (0.583) in ten runs of five-fold cross-validation on the training benchmark dataset PDB594. In blind tests on the independent dataset PDB186, comparing the model trained on the entire PDB594 dataset against five existing methods (iDNA-Prot, DNA-Prot, DNAbinder, DNABIND, and DBD-Threader), DBPPred yielded the highest accuracy (0.769), MCC (0.538), and AUC (0.790). Independent tests of DBPPred on a large dataset consisting entirely of non-DNA-binding proteins and on two RNA-binding protein datasets also showed improved or comparable quality relative to the relevant prediction methods. Moreover, we observed that for the majority of the features selected by the proposed method, the mean feature values differ statistically significantly between DNA-binding and non-DNA-binding proteins. Together, the experimental results indicate that DBPPred is a promising alternative predictor for large-scale identification of DNA-binding proteins.
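The selection loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' DBPPred implementation: it assumes the random forest ranking is already available as a list of column indices (`ranked`, hypothetical), replaces the forward best-first search with a simpler single greedy pass, uses a small hand-written Gaussian naive Bayes, and runs on made-up toy data rather than PDB594. All function names are illustrative.

```python
import math
from statistics import mean, pvariance

def fit_gnb(X, y):
    """Fit per-class prior, feature means and feature variances."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        cols = list(zip(*rows))
        model[label] = (
            len(rows) / len(X),
            [mean(c) for c in cols],
            [pvariance(c) + 1e-9 for c in cols],  # small floor avoids zero variance
        )
    return model

def predict_gnb(model, x):
    """Return the class with the highest Gaussian log-posterior for sample x."""
    def log_post(prior, mus, variances):
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, mus, variances)
        )
    return max(model, key=lambda c: log_post(*model[c]))

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def cv_mcc(X, y, cols, k=5):
    """Mean MCC over k interleaved folds, using only the columns in `cols`."""
    Xs = [[row[c] for c in cols] for row in X]
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, len(Xs), k))
        train = [(x, t) for i, (x, t) in enumerate(zip(Xs, y)) if i not in test_idx]
        test = [(x, t) for i, (x, t) in enumerate(zip(Xs, y)) if i in test_idx]
        model = fit_gnb([x for x, _ in train], [t for _, t in train])
        preds = [predict_gnb(model, x) for x, _ in test]
        scores.append(mcc([t for _, t in test], preds))
    return mean(scores)

def forward_select(X, y, ranked):
    """Greedy wrapper: keep a ranked feature only if it raises the CV score."""
    selected, best = [], -1.0
    for f in ranked:
        score = cv_mcc(X, y, selected + [f])
        if score > best:
            selected, best = selected + [f], score
    return selected, best

# Toy data (made up): column 0 separates the classes, column 1 carries no signal.
y = [i % 2 for i in range(20)]
X = [[2.0 * y[i] + 0.1 * (i % 3), 0.3 * (i % 5)] for i in range(20)]
selected, best = forward_select(X, y, ranked=[0, 1])
print(selected, best)
```

On the toy data the wrapper keeps only the informative column, since adding the uninformative one cannot raise the cross-validated MCC. A true best-first search would also keep a queue of candidate subsets and allow backtracking, which this single pass omits.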
Pages: 10