Graph Random Forest: A Graph Embedded Algorithm for Identifying Highly Connected Important Features

Cited by: 11
Authors
Tian, Leqi [1, 2]
Wu, Wenbin [1]
Yu, Tianwei [1, 2, 3]
Affiliations
[1] Chinese Univ Hong Kong, Sch Data Sci, Shenzhen 518172, Peoples R China
[2] Shenzhen Res Inst Big Data, Shenzhen 518172, Peoples R China
[3] Guangdong Prov Key Lab Big Data Comp, Shenzhen 518172, Peoples R China
Keywords
feature selection; random forest; gene network; cancer
DOI
10.3390/biom13071153
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Random Forest (RF) is a widely used machine learning method that performs well on classification and regression tasks. It works well when sample sizes are small, which benefits applications in biology. For example, gene expression data often involve far more features (p) than samples (n). Although the predictive accuracy of RF is often high, selecting important genes with RF is problematic. The important genes selected by RF are usually scattered across the gene network, which conflicts with the biological assumption of functional consistency between effective features. To improve feature selection by incorporating external topological information between genes, we propose the Graph Random Forest (GRF), which identifies highly connected important features by using the known biological network when constructing the forest. The algorithm can identify effective features that form highly connected sub-graphs while achieving classification accuracy equivalent to that of RF. To evaluate the proposed method, we conducted simulation experiments and applied the method to two real datasets: non-small cell lung cancer RNA-seq data from The Cancer Genome Atlas, and a human embryonic stem cell RNA-seq dataset (GSE93593). The resulting high classification accuracy, connectivity of selected sub-graphs, and interpretable feature selection results suggest the method is a helpful addition to graph-based classification models and feature selection procedures.
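The abstract does not spell out how the network enters forest construction or how sub-graph connectivity is scored, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes that one plausible way to use a gene network is to restrict a tree's candidate features to the k-hop neighborhood of a chosen gene, and that "scattered" versus "highly connected" can be quantified as the fraction of selected genes falling in the largest connected component of the induced sub-graph. The names `k_hop_neighborhood`, `largest_component_fraction`, and `toy_adj` are hypothetical.

```python
from collections import deque


def k_hop_neighborhood(adj, start, k):
    """Return all nodes within k edges of `start` (breadth-first search).

    Assumption: a tree could draw its candidate features from this set.
    """
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for nb in adj.get(node, ()):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, depth + 1))
    return seen


def largest_component_fraction(adj, selected):
    """Fraction of `selected` nodes in the largest connected component
    of the sub-graph induced by `selected` (1.0 = fully connected)."""
    selected = set(selected)
    if not selected:
        return 0.0
    unvisited = set(selected)
    best = 0
    while unvisited:
        root = unvisited.pop()
        size = 1
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for nb in adj.get(node, ()):
                if nb in unvisited:  # only follow edges inside the selection
                    unvisited.remove(nb)
                    size += 1
                    queue.append(nb)
        best = max(best, size)
    return best / len(selected)


# Hypothetical toy network: a path A-B-C-D plus a separate edge E-F.
toy_adj = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
    "E": {"F"},
    "F": {"E"},
}

neighborhood = k_hop_neighborhood(toy_adj, "A", 2)
connected_score = largest_component_fraction(toy_adj, ["A", "B", "C"])
scattered_score = largest_component_fraction(toy_adj, ["A", "C", "E"])
```

On this toy network, a selection such as A, B, C forms a single connected sub-graph, whereas A, C, E shares no edges and scores lower, which mirrors the contrast the abstract draws between connected and scattered feature sets.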
Pages: 14