Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naive Bayes

Cited by: 141
Authors
Lou, Wangchao [1]
Wang, Xiaoqing [1]
Chen, Fan [1]
Chen, Yixiao [1]
Jiang, Bo [1]
Zhang, Hua [1]
Affiliations
[1] Zhejiang Gongshang Univ, Sch Comp & Informat Engn, Hangzhou, Zhejiang, Peoples R China
Source
PLOS ONE | 2014 / Vol. 9 / Issue 01
Funding
National Natural Science Foundation of China;
Keywords
RIBOSOMAL-RNA-BINDING; SECONDARY STRUCTURE; EVOLUTIONARY CONSERVATION; FOLD RECOGNITION; IDENTIFICATION; COVARIANCE; RESOLUTION; ACCURATE; RECEPTORS; DOMAINS;
DOI
10.1371/journal.pone.0086703
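Chinese Library Classification number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Discipline classification code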
07 ; 0710 ; 09 ;
Abstract
DNA-binding proteins play vital roles in gene regulation, so an efficient method for identifying them is highly desirable and would advance our understanding of protein function. In this study, we propose a new method for predicting DNA-binding proteins that ranks features with random forest and then applies wrapper-based feature selection with a forward best-first search strategy. The features draw on the primary sequence, predicted secondary structure, predicted relative solvent accessibility, and the position-specific scoring matrix. The proposed method, called DBPPred, uses Gaussian naive Bayes as the underlying classifier because it outperformed five other classifiers: decision tree, logistic regression, k-nearest neighbor, support vector machine with polynomial kernel, and support vector machine with radial basis function kernel. DBPPred yields the highest average accuracy (0.791) and average MCC (0.583) in ten runs of five-fold cross-validation on the training benchmark dataset PDB594. In blind tests on the independent dataset PDB186, the model trained on the entire PDB594 dataset was compared with five existing methods (iDNA-Prot, DNA-Prot, DNAbinder, DNABIND, and DBD-Threader) and yielded the highest accuracy (0.769), MCC (0.538), and AUC (0.790). Independent tests of DBPPred on a large non-DNA-binding protein dataset and on two RNA-binding protein datasets also showed improved or comparable quality relative to the relevant prediction methods. Moreover, for the majority of the features selected by the proposed method, the mean feature values differ statistically significantly between DNA-binding and non-DNA-binding proteins.
Together, these experimental results indicate that DBPPred can serve as an alternative predictor for large-scale determination of DNA-binding proteins.
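The selection loop described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the ranked feature list is taken as an input (in the paper it comes from random forest), the best-first search is simplified to a single greedy forward pass, and all function names (`mcc`, `gnb_fit`, `gnb_predict`, `cv_mcc`, `forward_select`) are hypothetical.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def gnb_fit(X, y, cols):
    """Estimate class prior plus per-feature mean and variance (Gaussian naive Bayes)."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        stats = []
        for c in cols:
            vals = [r[c] for r in rows]
            mean = sum(vals) / len(vals)
            # Small constant avoids division by zero for constant features.
            var = sum((v - mean) ** 2 for v in vals) / len(vals) + 1e-9
            stats.append((mean, var))
        model[label] = (len(rows) / len(X), stats)
    return model


def gnb_predict(model, x, cols):
    """Return the label with the highest log posterior for one sample."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for c, (mean, var) in zip(cols, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (x[c] - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cv_mcc(X, y, cols, k=5):
    """Mean MCC over k interleaved folds (sample i goes to fold i % k)."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        model = gnb_fit([X[i] for i in train], [y[i] for i in train], cols)
        preds = [gnb_predict(model, X[i], cols) for i in test]
        scores.append(mcc([y[i] for i in test], preds))
    return sum(scores) / k


def forward_select(X, y, ranked):
    """Greedy forward pass over a pre-ranked list (assumed given, e.g. from
    random forest importance): keep a feature only if it improves CV MCC."""
    selected, best = [], -1.0
    for c in ranked:
        score = cv_mcc(X, y, selected + [c])
        if score > best:
            selected, best = selected + [c], score
    return selected, best
```

Calling `forward_select(X, y, ranked)` with `X` as a list of feature rows, `y` as 0/1 labels, and `ranked` as column indices in descending importance returns the kept columns and their cross-validated MCC. A true best-first search would also keep a queue of alternative subsets and backtrack, which this sketch omits.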
Pages: 10
Related papers
50 in total
  • [31] Prediction Method of Type 2 Diabetes Mellitus Based on a Combination of Hybrid Feature Selection and Random Forest
    Wang, Yunming
    Hu, Jiangang
    Fan, Xinru
    Gao, Xiue
    Liu, Changzheng
    WEB INFORMATION SYSTEMS AND APPLICATIONS, WISA 2024, 2024, 14883 : 439 - 450
  • [32] Sequence-based predictor of ATP-binding residues using random forest and mRMR-IFS feature selection
    Ma, Xin
    Sun, Xiao
    JOURNAL OF THEORETICAL BIOLOGY, 2014, 360 : 59 - 66
  • [33] Prediction of DNA-binding residues from protein sequence information using random forests
    Liangjiang Wang
    Mary Qu Yang
    Jack Y Yang
    BMC Genomics, 10
  • [34] Prediction of DNA-binding residues from protein sequence information using random forests
    Wang, Liangjiang
    Yang, Mary Qu
    Yang, Jack Y.
    BMC GENOMICS, 2009, 10
  • [35] StackDPPred: a stacking based prediction of DNA-binding protein from sequence
    Mishra, Avdesh
    Pokhrel, Pujan
    Hoque, Md Tamjidul
    BIOINFORMATICS, 2019, 35 (03) : 433 - 441
  • [36] Hybrid Feature Selection and Peptide Binding Affinity Prediction using an EDA based Algorithm
    Shelke, Kalpesh
    Jayaraman, Srikant
    Ghosh, Shameek
    Valadi, Jayaraman
    2013 IEEE CONGRESS ON EVOLUTIONARY COMPUTATION (CEC), 2013, : 2384 - 2389
  • [37] Feature selection algorithm based on random forest
    Yao, Deng-Ju
    Yang, Jing
    Zhan, Xiao-Juan
    Jilin Daxue Xuebao (Gongxueban)/Journal of Jilin University (Engineering and Technology Edition), 2014, 44 (01) : 137 - 141
  • [38] SELECTION OF HIGH-AFFINITY BINDING-SITES FOR SEQUENCE-SPECIFIC, DNA-BINDING PROTEINS FROM RANDOM SEQUENCE OLIGONUCLEOTIDES
    PIERROU, S
    ENERBACK, S
    CARLSSON, P
    ANALYTICAL BIOCHEMISTRY, 1995, 229 (01) : 99 - 105
  • [39] Divergence-Based Feature Selection for Naive Bayes Text Classification
    Wang, Huizhen
    Zhu, Jingbo
    Su, Keh-Yih
    IEEE NLP-KE 2008: PROCEEDINGS OF INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND KNOWLEDGE ENGINEERING, 2008, : 209 - +
  • [40] The Feature selection and Comparison performance of Student's academic between Random Forest, Naive bayes and XGboost
    Thanarat, Preut
    Kiatjindarat, Waranyoo
    Jareanpon, Chatklaw
    2023 IEEE INTERNATIONAL CONFERENCE ON TEACHING, ASSESSMENT AND LEARNING FOR ENGINEERING, TALE, 2023, : 636 - 641