Hum-mPLoc 3.0: prediction enhancement of human protein subcellular localization through modeling the hidden correlations of gene ontology and functional domain features

Cited by: 78
Authors
Zhou, Hang [1, 2]
Yang, Yang [3, 4]
Shen, Hong-Bin [1, 2]
Affiliations
[1] Shanghai Jiao Tong Univ, Inst Image Proc & Pattern Recognit, Shanghai 200240, Peoples R China
[2] Minist Educ China, Key Lab Syst Control & Informat Proc, Shanghai 200240, Peoples R China
[3] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Shanghai 200240, Peoples R China
[4] Key Lab Shanghai Educ Commiss Intelligent Interac, Shanghai 200240, Peoples R China
Funding
Natural Science Foundation of Shanghai;
Keywords
AMINO-ACID-COMPOSITION; TERMINAL TARGETING SEQUENCES; SUPPORT VECTOR MACHINES; WEB-SERVER; SEMANTIC SIMILARITY; ENSEMBLE CLASSIFIER; EUKARYOTIC PROTEINS; MULTIPLE SITES; PSI-BLAST; GO TERMS;
DOI
10.1093/bioinformatics/btw723
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification code
071010 ; 081704 ;
Abstract
Motivation: Protein subcellular localization prediction has been an important research topic in computational biology over the last decade. Various automatic methods have been proposed to predict locations for large-scale protein datasets, where statistical machine learning algorithms are widely used for model construction. A key step in these predictors is encoding the amino acid sequences into feature vectors. Many studies have shown that features extracted from biological domains, such as gene ontology and functional domains, can be very useful for improving the prediction accuracy. However, domain knowledge usually results in redundant features and high-dimensional feature spaces, which may degrade the performance of machine learning models.
Results: In this paper, we propose a new amino acid sequence-based human protein subcellular location prediction approach, Hum-mPLoc 3.0, which covers 12 human subcellular localizations. The sequences are represented by multi-view complementary features, i.e., context vocabulary annotation-based gene ontology (GO) terms, peptide-based functional domains, and residue-based statistical features. To systematically reflect the structural hierarchy of the domain knowledge bases, we propose a novel feature representation protocol denoted as HCM (Hidden Correlation Modeling), which creates more compact and discriminative feature vectors by modeling the hidden correlations between annotation terms. Experimental results on four benchmark datasets show that HCM improves prediction accuracy by 5-11% and F1 by 8-19% compared with conventional GO-based methods. A large-scale application of Hum-mPLoc 3.0 to the whole human proteome reveals protein co-localization preferences in the cell.
Pages: 843 / 853
Page count: 11
Related papers
3 in total
  • [1] A top-down approach to enhance the power of predicting human protein subcellular localization: Hum-mPLoc 2.0
    Shen, Hong-Bin
    Chou, Kuo-Chen
    [J]. ANALYTICAL BIOCHEMISTRY, 2009, 394 (02) : 269 - 274
  • [2] Hum-mPLoc: An ensemble classifier for large-scale human protein subcellular location prediction by incorporating samples with multiple sites
    Shen, Hong-Bin
    Chou, Kuo-Chen
    [J]. BIOCHEMICAL AND BIOPHYSICAL RESEARCH COMMUNICATIONS, 2007, 355 (04) : 1006 - 1011
  • [3] CELLO2GO: A Web Server for Protein subCELlular LOcalization Prediction with Functional Gene Ontology Annotation
    Yu, Chin-Sheng
    Cheng, Chih-Wen
    Su, Wen-Chi
    Chang, Kuei-Chung
    Huang, Shao-Wei
    Hwang, Jenn-Kang
    Lu, Chih-Hao
    [J]. PLOS ONE, 2014, 9 (06):