Identification of protein functions using a machine-learning approach based on sequence-derived properties

Cited by: 38
Authors
Lee, Bum Ju [2]
Shin, Moon Sun [3]
Oh, Young Joon [2]
Oh, Hae Seok [4]
Ryu, Keun Ho [1]
Affiliations
[1] Chungbuk Natl Univ, Sch Elect & Comp Engn, Cheongju 361763, Chungbuk, South Korea
[2] Jungwon Univ, Ctr Ind Res, Goesan Gun 367805, Chungbuk, South Korea
[3] Konkuk Univ, Dept Comp Sci, Chungju Si 380701, Chungbuk, South Korea
[4] Kyungwon Univ, Dept Comp Sci, Songnam 461701, Gyeonggi Do, South Korea
Keywords
SUPPORT VECTOR MACHINES; PREDICTING ENZYME CLASS; MEMBRANE-PROTEIN; BINDING-PROTEINS; CHARGED RESIDUE; CLASSIFICATION; SVM; ACCURACY; FAMILY; LOCALIZATION;
DOI
10.1186/1477-5956-7-27
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Predicting the function of an unknown protein is an essential goal in bioinformatics. Sequence similarity-based approaches are widely used for function prediction; however, they are often inadequate in the absence of similar sequences or when the sequence similarity among known protein sequences is statistically weak. This study aimed to develop an accurate prediction method for identifying protein function, irrespective of sequence and structural similarities.
Results: A highly accurate prediction method capable of identifying protein function, based solely on protein sequence properties, is described. This method analyses and identifies specific features of the protein sequence that are highly correlated with certain protein functions and determines the combination of protein sequence features that best characterises protein function. Thirty-three features that represent subtle differences in local regions and full regions of the protein sequences were introduced. On the basis of 484 features extracted solely from the protein sequence, models were built to predict the functions of 11 different proteins from a broad range of cellular components, molecular functions, and biological processes. The accuracy of protein function prediction using random forests with feature selection ranged from 94.23% to 100%. The local sequence information was found to have a broad range of applicability in predicting protein function.
Conclusion: We present an accurate prediction method using a machine-learning approach based solely on protein sequence properties. The primary contribution of this paper is to propose new PNPRD features representing global and/or local differences in sequences, based on positively and/or negatively charged residues, to assist in predicting protein function. In addition, we identified a compact and useful feature subset for predicting the function of various proteins. Our results indicate that sequence-based classifiers can provide good results among a broad range of proteins, that the proposed features are useful in predicting several functions, and that the combination of our and traditional features may support the creation of a discriminative feature set for specific protein functions.
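The abstract does not define the PNPRD features, so the following is only a rough sketch of the general idea it describes: numeric features derived from positively and negatively charged residues over the full sequence and over local regions. The function names, the choice of terminal windows as "local regions", the window size, and the residue sets are all assumptions made for illustration, not the authors' definitions.

```python
# Illustrative sketch only. Names, window size, and feature definitions are
# assumptions; they are NOT the PNPRD features as defined in the paper.
POSITIVE = set("KR")  # assumption: Lys and Arg counted as positively charged
NEGATIVE = set("DE")  # assumption: Asp and Glu counted as negatively charged


def charge_fractions(segment):
    """Return (positive fraction, negative fraction) for a sequence segment."""
    if not segment:
        return 0.0, 0.0
    pos = sum(1 for aa in segment if aa in POSITIVE)
    neg = sum(1 for aa in segment if aa in NEGATIVE)
    return pos / len(segment), neg / len(segment)


def charge_features(sequence, window=10):
    """Build a small feature dict from the full sequence and both termini."""
    seq = sequence.upper()
    regions = {
        "full": seq,                                # global region
        "n_term": seq[:window],                     # hypothetical local region
        "c_term": seq[max(0, len(seq) - window):],  # hypothetical local region
    }
    features = {}
    for name, segment in regions.items():
        pos, neg = charge_fractions(segment)
        features[f"{name}_pos"] = pos
        features[f"{name}_neg"] = neg
        features[f"{name}_diff"] = pos - neg  # positive minus negative fraction
    return features
```

A dictionary like this, computed per protein, could serve as one row of the feature matrix passed to a classifier such as the random forest mentioned in the abstract; the feature selection step would then decide which of these columns to keep.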
Pages: 19
Related papers
50 in total
  • [41] Identification of human drug targets using machine-learning algorithms
    Kumari, Priyanka
    Nath, Abhigyan
    Chaube, Radha
    [J]. COMPUTERS IN BIOLOGY AND MEDICINE, 2015, 56 : 175 - 181
  • [42] Identification of amnestic mild cognitive impairment by structural and functional MRI using a machine-learning approach
    Hwang, Hyunyoung
    Kim, Si Eun
    Lee, Ho-Joon
    Lee, Dong Ah
    Park, Kang Min
    [J]. CLINICAL NEUROLOGY AND NEUROSURGERY, 2024, 238
  • [43] VALIDATING A MACHINE-LEARNING APPROACH TO CANCER STAGE IDENTIFICATION USING MEDICARE CLAIMS AND SEER DATA
    Smith, R.
    Miller-Wilson, L. A.
    Ho, N.
    Carter, Cuyun G.
    Fayyaz, I
    Pope, A.
    Pelizzari, P.
    Pyenson, B.
    [J]. VALUE IN HEALTH, 2023, 26 (06) : S283 - S283
  • [44] Identification of deleterious non-synonymous single nucleotide polymorphisms using sequence-derived information
    Hu, Jing
    Yan, Changhui
    [J]. BMC BIOINFORMATICS, 2008, 9 (1)
  • [45] EXAMINING THE CLINICAL UTILITY OF INSIGHT: A MACHINE-LEARNING APPROACH TO SEPSIS IDENTIFICATION
    Topiwala, Raj
    Patel, Kanak
    Meisenberg, Barry
    [J]. CRITICAL CARE MEDICINE, 2019, 47
  • [46] Identification of groundwater potential zones of Idukki district using remote sensing and GIS-based machine-learning approach
    Khan, Zohaib Ahmed
    Jhamnani, Bharat
    [J]. WATER SUPPLY, 2023, 23 (06) : 2426 - 2446
  • [47] Identification of deleterious non-synonymous single nucleotide polymorphisms using sequence-derived information
    Jing Hu
    Changhui Yan
    [J]. BMC Bioinformatics, 9
  • [48] VacPred: Sequence-based prediction of plant vacuole proteins using machine-learning techniques
    Arvind Kumar Yadav
    Deepak Singla
    [J]. Journal of Biosciences, 2020, 45
  • [49] VacPred: Sequence-based prediction of plant vacuole proteins using machine-learning techniques
    Yadav, Arvind Kumar
    Singla, Deepak
    [J]. JOURNAL OF BIOSCIENCES, 2020, 45 (01)
  • [50] A novel approach to fold recognition using sequence-derived properties from sets of structurally similar local fragments of proteins
    Hvidsten, Torgeir R.
    Kryshtafovych, Andriy
    Komorowski, Jan
    Fidelis, Krzysztof
    [J]. BIOINFORMATICS, 2003, 19 : II81 - II91