Identifying subcellular localizations of mammalian protein complexes based on graph theory with a random forest algorithm

Cited by: 6
Authors
Li, Zhan-Chao [1 ]
Lai, Yan-Hua [2 ]
Chen, Li-Li [2 ]
Chen, Chao [3 ]
Xie, Yun [1 ]
Dai, Zong [2 ]
Zou, Xiao-Yong [2 ]
Affiliations
[1] Guangdong Pharmaceut Univ, Sch Chem & Chem Engn, Guangzhou 510006, Guangdong, Peoples R China
[2] Sun Yat Sen Univ, Sch Chem & Chem Engn, Guangzhou 510275, Guangdong, Peoples R China
[3] Guangdong Pharmaceut Univ, Sch Tradit Chinese Med, Guangzhou 510006, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
PREDICTION; SINGLE; SITES; PLANT;
DOI
10.1039/c3mb25451h
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
In the post-genome era, one of the most important and challenging tasks is to identify the subcellular localizations of protein complexes and to further elucidate their functions in human health, with applications to understanding disease mechanisms, diagnosis and therapy. Although various experimental approaches have been developed and employed to identify the subcellular localizations of protein complexes, laboratory technologies fall far behind the rapid accumulation of protein complexes. It is therefore highly desirable to develop a computational method that identifies the subcellular localizations of protein complexes rapidly and reliably. In this study, a novel method is proposed for predicting subcellular localizations of mammalian protein complexes based on graph theory with a random forest algorithm. Protein complexes are modeled as weighted graphs containing nodes and edges, where nodes represent proteins, edges represent protein-protein interactions and weights are descriptors of protein primary structures. Several topological structure features based on graph theory are proposed and adopted to characterize protein complexes. Random forest is employed to construct a model and predict the subcellular localizations of protein complexes. Accuracies on the training set in a 10-fold cross-validation test for predicting plasma membrane/membrane attached, cytoplasm and nucleus are 84.78%, 71.30% and 82.00%, respectively. Accuracies on the independent test set are 81.31%, 69.95% and 81.00%, respectively. These high prediction accuracies indicate state-of-the-art performance of the current method. It is anticipated that the proposed method may become a useful high-throughput tool and play a complementary role to existing experimental techniques in identifying subcellular localizations of mammalian protein complexes. The Matlab source code and the dataset can be obtained freely on request from the authors.
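The graph model described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Matlab code: the function name `complex_features`, the single numeric descriptor per protein, and the particular features computed (density, degree statistics, mean product of endpoint descriptors) are assumptions chosen for illustration, since the abstract does not list the actual descriptors or topological features.

```python
from statistics import mean


def complex_features(node_weights, edges):
    """Return simple topological descriptors for one protein complex.

    node_weights: dict mapping a protein id to one numeric descriptor
                  (hypothetical stand-in for a primary-structure descriptor).
    edges:        iterable of (protein_a, protein_b) interaction pairs.
    """
    nodes = list(node_weights)
    # Undirected adjacency map; self-loops and duplicate edges are ignored.
    adj = {n: set() for n in nodes}
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)

    n = len(nodes)
    m = sum(len(nb) for nb in adj.values()) // 2  # each edge counted twice
    degrees = [len(adj[v]) for v in nodes]
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0

    # One weighted term per edge: product of the two endpoint descriptors.
    edge_terms = [
        node_weights[a] * node_weights[b]
        for a in nodes
        for b in adj[a]
        if a < b  # visit each undirected edge once
    ]

    return {
        "n_nodes": n,
        "n_edges": m,
        "density": density,
        "mean_degree": mean(degrees) if degrees else 0.0,
        "max_degree": max(degrees, default=0),
        "mean_weight": mean(list(node_weights.values())) if n else 0.0,
        "mean_edge_product": mean(edge_terms) if edge_terms else 0.0,
    }


# Toy complex with three proteins and two interactions (made-up values).
example = complex_features(
    {"P1": 0.5, "P2": 1.0, "P3": 2.0},
    [("P1", "P2"), ("P2", "P3")],
)
print(example)
```

Each complex yields one fixed-length feature vector, which is the form a random forest classifier expects as input. Training and 10-fold cross-validation are left out of the sketch.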
Pages: 658 - 667
Number of pages: 10
Related papers
50 in total
  • [1] Identifying functions of protein complexes based on topology similarity with random forest
    Li, Zhan-Chao
    Lai, Yan-Hua
    Chen, Li-Li
    Xie, Yun
    Dai, Zong
    Zou, Xiao-Yong
    MOLECULAR BIOSYSTEMS, 2014, 10 (03) : 514 - 525
  • [2] Graph Random Forest: A Graph Embedded Algorithm for Identifying Highly Connected Important Features
    Tian, Leqi
    Wu, Wenbin
    Yu, Tianwei
    BIOMOLECULES, 2023, 13 (07)
  • [3] Scheduling Algorithm Based on Logistics Random Graph Theory
    Li, Jing
    Peng, Haiyun
    INTERNATIONAL JOURNAL OF GRID AND DISTRIBUTED COMPUTING, 2016, 9 (07) : 243 - 254
  • [4] Feature selection algorithm based on graph theory and random forests for protein secondary structure prediction
    Altun, Gulsah
    Hu, Hae-Jin
    Gremalschi, Stefan
    Harrison, Robert W.
    Pan, Yi
    BIOINFORMATICS RESEARCH AND APPLICATIONS, PROCEEDINGS, 2007, 4463 : 590 - +
  • [5] An improved graph entropy-based method for identifying protein complexes
    Chen, Bolin
    Yan, Yan
    Shi, Jinhong
    Zhang, Shenggui
    Wu, Fang-Xiang
    2011 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM 2011), 2011, : 123 - 126
  • [6] An algorithm for identifying protein complexes based on maximal clique extension
    Li, Min
    Wang, Jian-Xin
    Liu, Bin-Bin
    Chen, Jian-Er
    Zhongnan Daxue Xuebao (Ziran Kexue Ban)/Journal of Central South University (Science and Technology), 2010, 41 (02) : 560 - 565
  • [7] On protein complexes identifying algorithm based on the novel modularity function
    Guo, Maozu
    Dai, Qiguo
    Xu, Liqiu
    Liu, Xiaoyan
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2014, 51 (10) : 2178 - 2186
  • [8] Advances in spatial proteomics: Mapping proteome architecture from protein complexes to subcellular localizations
    Breckels, Lisa M.
    Hutchings, Charlotte
    Ingole, Kishor D.
    Kim, Suyeon
    Lilley, Kathryn S.
    Makwana, Mehul V.
    Mccaskie, Kieran J. A.
    Villanueva, Eneko
    CELL CHEMICAL BIOLOGY, 2024, 31 (09) : 1665 - 1687
  • [9] Mining of Protein Subcellular Localizations based on a Syntactic Dependency Tree and WordNet
    Kim, Mi-Young
    KNOWLEDGE-BASED SOFTWARE ENGINEERING, 2008, 180 : 373 - +
  • [10] Identification of protein complexes algorithm based on random walk model
    Dong Xuantong
    Lin Zhijie
    Ren Yuan
    2014 2ND INTERNATIONAL CONFERENCE ON SYSTEMS AND INFORMATICS (ICSAI), 2014, : 383 - 388