Identification of protein functions using a machine-learning approach based on sequence-derived properties

Cited by: 38
Authors
Lee, Bum Ju [2]
Shin, Moon Sun [3]
Oh, Young Joon [2]
Oh, Hae Seok [4]
Ryu, Keun Ho [1]
Affiliations
[1] Chungbuk Natl Univ, Sch Elect & Comp Engn, Cheongju 361763, Chungbuk, South Korea
[2] Jungwon Univ, Ctr Ind Res, Goesan Gun 367805, Chungbuk, South Korea
[3] Konkuk Univ, Dept Comp Sci, Chungju Si 380701, Chungbuk, South Korea
[4] Kyungwon Univ, Dept Comp Sci, Songnam 461701, Gyeonggi Do, South Korea
Keywords
SUPPORT VECTOR MACHINES; PREDICTING ENZYME CLASS; MEMBRANE-PROTEIN; BINDING-PROTEINS; CHARGED RESIDUE; CLASSIFICATION; SVM; ACCURACY; FAMILY; LOCALIZATION;
DOI
10.1186/1477-5956-7-27
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification code
071010 ; 081704 ;
Abstract
Background: Predicting the function of an unknown protein is an essential goal in bioinformatics. Sequence similarity-based approaches are widely used for function prediction; however, they are often inadequate in the absence of similar sequences or when the sequence similarity among known protein sequences is statistically weak. This study aimed to develop an accurate prediction method for identifying protein function, irrespective of sequence and structural similarities.
Results: A highly accurate prediction method capable of identifying protein function, based solely on protein sequence properties, is described. This method analyses and identifies specific features of the protein sequence that are highly correlated with certain protein functions and determines the combination of protein sequence features that best characterises protein function. Thirty-three features that represent subtle differences in local regions and full regions of the protein sequences were introduced. On the basis of 484 features extracted solely from the protein sequence, models were built to predict the functions of 11 different proteins from a broad range of cellular components, molecular functions, and biological processes. The accuracy of protein function prediction using random forests with feature selection ranged from 94.23% to 100%. The local sequence information was found to have a broad range of applicability in predicting protein function.
Conclusion: We present an accurate prediction method using a machine-learning approach based solely on protein sequence properties. The primary contribution of this paper is to propose new PNPRD features representing global and/or local differences in sequences, based on positively and/or negatively charged residues, to assist in predicting protein function. In addition, we identified a compact and useful feature subset for predicting the function of various proteins. Our results indicate that sequence-based classifiers can provide good results among a broad range of proteins, that the proposed features are useful in predicting several functions, and that the combination of our and traditional features may support the creation of a discriminative feature set for specific protein functions.
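The abstract does not define the PNPRD features, so the following is only a minimal sketch of the general idea of charged-residue features computed over the full sequence and over local regions. The function names, the choice of K/R as positive and D/E as negative residues, the terminal window length, and the difference feature are all assumptions made for illustration, not the authors' definitions.

```python
# Illustrative only: the source does not define PNPRD, so every feature
# below is a hypothetical stand-in for "charged-residue differences".
POSITIVE = set("KR")  # lysine, arginine (histidine left out by assumption)
NEGATIVE = set("DE")  # aspartate, glutamate


def charge_fractions(segment):
    """Return (positive, negative) residue fractions for a sequence segment."""
    if not segment:
        return 0.0, 0.0
    n = len(segment)
    pos = sum(1 for aa in segment if aa in POSITIVE)
    neg = sum(1 for aa in segment if aa in NEGATIVE)
    return pos / n, neg / n


def charged_residue_features(sequence, window=10):
    """Global and local (N-/C-terminal) charged-residue features for one protein.

    `window` is a hypothetical local-region length, not a value from the paper.
    """
    seq = sequence.upper()
    regions = {
        "full": seq,               # global view of the sequence
        "n_term": seq[:window],    # local view: first `window` residues
        "c_term": seq[-window:],   # local view: last `window` residues
    }
    features = {}
    for name, segment in regions.items():
        pos, neg = charge_fractions(segment)
        features[f"{name}_pos"] = pos
        features[f"{name}_neg"] = neg
        features[f"{name}_diff"] = pos - neg  # positive minus negative fraction
    return features
```

A dictionary like this, computed per protein, could serve as one row of a feature table passed to a classifier such as a random forest; the feature selection step mentioned in the abstract is not reproduced here.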
Pages: 19
Related papers
50 in total
  • [1] Identification of protein functions using a machine-learning approach based on sequence-derived properties
    Bum Ju Lee
    Moon Sun Shin
    Young Joon Oh
    Hae Seok Oh
    Keun Ho Ryu
    [J]. Proteome Science, 7
  • [2] A machine learning approach for the identification of odorant binding proteins from sequence-derived properties
    Ganesan Pugalenthi
    Ke Tang
    PN Suganthan
    G Archunan
    R Sowdhamini
    [J]. BMC Bioinformatics, 8
  • [3] A machine learning approach for the identification of odorant binding proteins from sequence-derived properties
    Pugalenthi, Ganesan
    Tang, Ke
    Suganthan, P. N.
    Archunan, G.
    Sowdhamini, R.
    [J]. BMC BIOINFORMATICS, 2007, 8 (1)
  • [4] Prediction of Bacterial sRNAs Using Sequence-Derived Features and Machine Learning
    Jha, Tony
    Mendel, Jovinna
    Cho, Hyuk
    Choudhary, Madhusudan
    [J]. BIOINFORMATICS AND BIOLOGY INSIGHTS, 2022, 16
  • [6] An Evaluation of Machine Learning Approaches for the Prediction of Essential Genes in Eukaryotes Using Protein Sequence-Derived Features
    Campos, Tulio L.
    Korhonen, Pasi K.
    Gasser, Robin B.
    Young, Neil D.
    [J]. COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL, 2019, 17 : 785 - 796
  • [7] Transmembrane region prediction by using sequence-derived features and machine learning methods
    Yan, Renxiang
    Wang, Xiaofeng
    Huang, Lanqing
    Tian, Yarong
    Cai, Weiwen
    [J]. RSC ADVANCES, 2017, 7 (46) : 29200 - 29211
  • [8] A Machine Learning Approach to Identify DNA Replication Proteins from Sequence-Derived Features
    Yang, Runtao
    Zhang, Chengjin
    Gao, Rui
    Zhang, Lina
    [J]. 2015 IEEE 28TH CANADIAN CONFERENCE ON ELECTRICAL AND COMPUTER ENGINEERING (CCECE), 2015, : 13 - 18
  • [9] SortPred: The first machine learning based predictor to identify bacterial sortases and their classes using sequence-derived information
    Malik, Adeel
    Subramaniyam, Sathiyamoorthy
    Kim, Chang-Bae
    Manavalan, Balachandran
    [J]. COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL, 2022, 20 : 165 - 174
  • [10] Protein fold recognition using sequence-derived predictions
    Fischer, D
    Eisenberg, D
    [J]. PROTEIN SCIENCE, 1996, 5 (05) : 947 - 955