Analysis of Ensemble Feature Selection for Correlated High-Dimensional RNA-Seq Cancer Data

Cited by: 2
Authors
Polewko-Klim, Aneta [1]
Rudnicki, Witold R. [1, 2, 3]
Affiliations
[1] Univ Bialystok, Inst Informat, Bialystok, Poland
[2] Univ Bialystok, Computat Ctr, Bialystok, Poland
[3] Univ Warsaw, Interdisciplinary Ctr Math & Computat Modelling, Warsaw, Poland
Source
Keywords
Random forest; RNA; Feature selection; Ensemble learning; Comprehensive genomic characterization
DOI
10.1007/978-3-030-50420-5_39
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification codes
081203; 0835;
Abstract
Discovery of diagnostic and prognostic molecular markers is an important and actively pursued field in cancer research. For complex diseases, this process is often performed using machine learning. The current study compares two approaches to the discovery of relevant variables: applying a single feature selection algorithm versus applying an ensemble of diverse algorithms. Both approaches are used to identify variables that are relevant for discerning four cancer types using RNA-seq profiles from The Cancer Genome Atlas. The comparison is carried out in two directions: evaluating the predictive performance of models and monitoring the stability of the selected variables. The most informative features are identified using four feature selection algorithms, namely the U-test, ReliefF, and two variants of the MDFS algorithm. Normal and tumor tissues are discerned using the Random Forest algorithm. The highest stability of the feature set was obtained when the U-test was used. Unfortunately, models built on feature sets obtained from the ensemble of feature selection algorithms were no better than models developed on feature sets obtained from individual algorithms. On the other hand, the feature selectors leading to the best classification results varied between data sets.
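The comparison described in the abstract (single selector versus ensemble, judged partly by the stability of the selected variables) can be illustrated with a minimal sketch. It is not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the mean-rank rule used to combine selectors, the average pairwise Jaccard index used as the stability measure, and the simple mean-gap scorer that stands in for ReliefF and MDFS, which are not implemented here. Only the U-test scorer corresponds to a method named in the abstract, and it uses the raw U statistic rather than a p-value.

```python
from itertools import combinations
from statistics import mean


def split_by_class(values, labels):
    """Split one feature's values into tumor (1) and normal (0) groups."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return pos, neg


def u_test_score(values, labels):
    """Mann-Whitney U statistic rescaled to [0, 0.5]; larger means better separation."""
    pos, neg = split_by_class(values, labels)
    u = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return abs(u / (len(pos) * len(neg)) - 0.5)


def mean_gap_score(values, labels):
    """Hypothetical stand-in scorer (NOT ReliefF or MDFS): absolute gap of class means."""
    pos, neg = split_by_class(values, labels)
    return abs(mean(pos) - mean(neg))


def rank_features(X, y, scorer):
    """Return feature indices ordered from most to least relevant."""
    n_features = len(X[0])
    scores = [scorer([row[j] for row in X], y) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: -scores[j])


def ensemble_top_k(X, y, scorers, k):
    """Assumed combination rule: sum each feature's rank over all selectors."""
    rank_sum = [0] * len(X[0])
    for scorer in scorers:
        for rank, j in enumerate(rank_features(X, y, scorer)):
            rank_sum[j] += rank
    return sorted(range(len(rank_sum)), key=lambda j: rank_sum[j])[:k]


def jaccard(a, b):
    """Jaccard index of two feature sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def stability(feature_sets):
    """Average pairwise Jaccard index over feature sets from different resamples."""
    pairs = list(combinations(feature_sets, 2))
    if not pairs:
        raise ValueError("need at least two feature sets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy data (made up): 6 samples, 3 features; label 1 = tumor, 0 = normal.
    toy_X = [[0.1, 5.0, 2.0], [0.3, 3.0, 2.1], [0.2, 4.5, 1.9],
             [2.0, 4.0, 2.2], [2.4, 6.0, 2.0], [1.9, 3.5, 2.3]]
    toy_y = [0, 0, 0, 1, 1, 1]
    scorers = [u_test_score, mean_gap_score]
    selected = []
    for left_out in range(len(toy_X)):  # leave-one-out resampling
        X = [r for i, r in enumerate(toy_X) if i != left_out]
        y = [v for i, v in enumerate(toy_y) if i != left_out]
        selected.append(ensemble_top_k(X, y, scorers, k=2))
    print("ensemble stability:", stability(selected))
```

Passing a single scorer in the `scorers` list gives the single-selector baseline, so the same `stability` call can compare both approaches on identical resamples. The classification step with Random Forest is left out of this sketch.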
Pages: 525-538
Number of pages: 14