Multistage feature selection approach for high-dimensional cancer data

Cited by: 20
Authors
Alkuhlani, Alhasan [1]
Nassef, Mohammad [1]
Farag, Ibrahim [1]
Affiliations
[1] Cairo Univ, Fac Comp & Informat, Dept Comp Sci, Giza, Egypt
Keywords
DNA methylation (DNAm); CpG sites; Feature selection; Genetic algorithms; Support vector machine (SVM); Incremental feature selection (IFS); Enrichment analysis; DNA METHYLATION; BREAST-CANCER; GENE SELECTION; CLASSIFICATION; ARRAY; IDENTIFICATION; RESISTANCE; ALGORITHM; LOCUS; RISK;
DOI
10.1007/s00500-016-2439-9
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Cancer is a serious disease and a major cause of death worldwide. DNA methylation (DNAm) is an epigenetic mechanism that regulates gene expression and is useful for the early detection of cancer. The challenge with DNA methylation microarray datasets is that the number of CpG sites is far larger than the number of samples. Recent research has attempted to reduce this high dimensionality with various feature selection techniques. This article proposes a multistage feature selection approach to select the optimal CpG sites from three DNAm cancer datasets (breast, colon and lung). The proposed approach combines three filter feature selection methods: the Fisher criterion, the t-test and the area under the ROC curve (AUC). In addition, as a wrapper feature selection method, we apply genetic algorithms with Support Vector Machine Recursive Feature Elimination (SVM-RFE) as the fitness function and SVM as the evaluator. Using the Incremental Feature Selection (IFS) strategy, subsets of 24, 13 and 27 optimal CpG sites are selected for the breast, colon and lung cancer datasets, respectively. Under fivefold cross-validation on the training datasets, these subsets achieved classification accuracies of 100%, 100% and 97.67%, respectively. Moreover, testing on three independent cancer datasets with these final subsets yielded accuracies of 96.02%, 98.81% and 94.51%, respectively. The experimental results demonstrate high classification performance with small feature subsets. Finally, the biological significance of the genes corresponding to these feature subsets is validated using enrichment analysis.
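The filter and IFS stages described in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. Several choices are assumptions made only for illustration: the three filter scores are combined by averaging ranks (the abstract does not say how they are combined), a nearest-centroid classifier stands in for the SVM evaluator, the genetic algorithm and SVM-RFE stage is omitted, and the data are synthetic. All function names are hypothetical.

```python
import random
from statistics import mean, pvariance, variance

def fisher_score(pos, neg):
    # Fisher criterion: squared mean difference over the sum of class variances.
    return (mean(pos) - mean(neg)) ** 2 / (pvariance(pos) + pvariance(neg) + 1e-12)

def t_statistic(pos, neg):
    # Absolute Welch t-statistic (unequal class variances).
    se = (variance(pos) / len(pos) + variance(neg) / len(neg)) ** 0.5
    return abs(mean(pos) - mean(neg)) / (se + 1e-12)

def auc_score(pos, neg):
    # AUC as P(positive value > negative value), ties counted as 0.5,
    # folded so that 0.5 means uninformative in either direction.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return max(auc, 1 - auc)

def rank_features(X, y):
    # ASSUMPTION: combine the three filters by summing per-filter ranks.
    n_feat = len(X[0])
    cols = [([r[j] for r, c in zip(X, y) if c == 1],
             [r[j] for r, c in zip(X, y) if c == 0]) for j in range(n_feat)]
    total = [0] * n_feat
    for score in (fisher_score, t_statistic, auc_score):
        vals = [score(p, n) for p, n in cols]
        order = sorted(range(n_feat), key=lambda j: -vals[j])
        for rank, j in enumerate(order):
            total[j] += rank
    return sorted(range(n_feat), key=lambda j: total[j])

def cv_accuracy(X, y, feats, k=5):
    # k-fold CV accuracy of a nearest-centroid classifier
    # (ASSUMPTION: a stand-in for the SVM evaluator in the abstract).
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        cents = {}
        for c in (0, 1):
            rows = [X[i] for i in train if y[i] == c]
            cents[c] = [mean([r[j] for r in rows]) for j in feats]
        for i in test:
            dist = {c: sum((X[i][j] - m) ** 2 for j, m in zip(feats, cents[c]))
                    for c in cents}
            correct += min(dist, key=dist.get) == y[i]
    return correct / len(X)

def incremental_feature_selection(X, y, ranked, max_feats=10):
    # IFS: add features in ranked order and keep the smallest prefix
    # that reaches the best cross-validated accuracy.
    best_acc, best_k = -1.0, 0
    for k in range(1, max_feats + 1):
        acc = cv_accuracy(X, y, ranked[:k])
        if acc > best_acc:
            best_acc, best_k = acc, k
    return ranked[:best_k], best_acc

def make_data(n=60, n_feat=30, informative=(0, 1, 2), seed=0):
    # Synthetic stand-in for methylation values: only a few features differ by class.
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(3.0 * c if j in informative else 0.0, 1.0)
          for j in range(n_feat)] for c in y]
    return X, y

X, y = make_data()
ranked = rank_features(X, y)
selected, acc = incremental_feature_selection(X, y, ranked)
print("selected features:", selected, "CV accuracy:", round(acc, 3))
```

In this sketch the features are ranked on all samples before cross-validation, which is acceptable for a toy demonstration but would overestimate accuracy on real data. A careful evaluation would rank features inside each training fold or, as in the abstract, report accuracy on independent test datasets.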
Pages: 6895-6906
Page count: 12
Source: Soft Computing, 2017, 21: 6895-6906