Analysis of Ensemble Feature Selection for Correlated High-Dimensional RNA-Seq Cancer Data

被引:2
|
作者
Polewko-Klim, Aneta [1 ]
Rudnicki, Witold R. [1 ,2 ,3 ]
机构
[1] Univ Bialystok, Inst Informat, Bialystok, Poland
[2] Univ Bialystok, Computat Ctr, Bialystok, Poland
[3] Univ Warsaw, Interdisciplinary Ctr Math & Computat Modelling, Warsaw, Poland
来源
关键词
Random forest; RNA; Feature selection; Ensemble learning; COMPREHENSIVE GENOMIC CHARACTERIZATION;
D O I
10.1007/978-3-030-50420-5_39
中图分类号
TP39 [计算机的应用];
学科分类号
081203 ; 0835 ;
摘要
Discovery of diagnostic and prognostic molecular markers is important and actively pursued the research field in cancer research. For complex diseases, this process is often performed using Machine Learning. The current study compares two approaches for the discovery of relevant variables: by application of a single feature selection algorithm, versus by an ensemble of diverse algorithms. These approaches are used to identify variables that are relevant discerning of four cancer types using RNA-seq profiles from the Cancer Genome Atlas. The comparison is carried out in two directions: evaluating the predictive performance of models and monitoring the stability of selected variables. The most informative features are identified using a four feature selection algorithms, namely U-test, ReliefF, and two variants of the MDFS algorithm. Discerning normal and tumor tissues is performed using the Random Forest algorithm. The highest stability of the feature set was obtained when Utest was used. Unfortunately, models built on feature sets obtained from the ensemble of feature selection algorithms were no better than for models developed on feature sets obtained from individual algorithms. On the other hand, the feature selectors leading to the best classification results varied between data sets.
引用
收藏
页码:525 / 538
页数:14
相关论文
共 50 条
  • [21] Feature selection for high-dimensional imbalanced data
    Yin, Liuzhi
    Ge, Yong
    Xiao, Keli
    Wang, Xuehua
    Quan, Xiaojun
    [J]. NEUROCOMPUTING, 2013, 105 : 3 - 11
  • [22] Feature selection for high-dimensional data in astronomy
    Zheng, Hongwen
    Zhang, Yanxia
    [J]. ADVANCES IN SPACE RESEARCH, 2008, 41 (12) : 1960 - 1964
  • [23] A filter feature selection for high-dimensional data
    Janane, Fatima Zahra
    Ouaderhman, Tayeb
    Chamlal, Hasna
    [J]. JOURNAL OF ALGORITHMS & COMPUTATIONAL TECHNOLOGY, 2023, 17
  • [24] Feature Selection with High-Dimensional Imbalanced Data
    Van Hulse, Jason
    Khoshgoftaar, Taghi M.
    Napolitano, Amri
    Wald, Randall
    [J]. 2009 IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW 2009), 2009, : 507 - 514
  • [25] Feature selection for high-dimensional temporal data
    Michail Tsagris
    Vincenzo Lagani
    Ioannis Tsamardinos
    [J]. BMC Bioinformatics, 19
  • [26] Feature selection for high-dimensional temporal data
    Tsagris, Michail
    Lagani, Vincenzo
    Tsamardinos, Ioannis
    [J]. BMC BIOINFORMATICS, 2018, 19
  • [27] Unsupervised feature selection algorithm for multiclass cancer classification of gene expression RNA-Seq data
    Garcia-Diaz, Pilar
    Sanchez-Berriel, Isabel
    Martinez-Rojas, Juan A.
    Diez-Pascual, Ana M.
    [J]. GENOMICS, 2020, 112 (02) : 1916 - 1925
  • [28] Fuzzy Forests: Extending Random Forest Feature Selection for Correlated, High-Dimensional Data
    Conn, Daniel
    Ngun, Tuck
    Li, Gang
    Ramirez, Christina M.
    [J]. JOURNAL OF STATISTICAL SOFTWARE, 2019, 91 (09):
  • [29] An empirical threshold of selection probability for analysis of high-dimensional correlated data
    Kim, Kipoong
    Koo, Jajoon
    Sun, Hokeun
    [J]. JOURNAL OF STATISTICAL COMPUTATION AND SIMULATION, 2020, 90 (09) : 1606 - 1617
  • [30] Bayesian Hyper-LASSO Classification for Feature Selection with Application to Endometrial Cancer RNA-seq Data
    Lai Jiang
    Celia M. T. Greenwood
    Weixin Yao
    Longhai Li
    [J]. Scientific Reports, 10