Feature selection for high-dimensional class-imbalanced data sets using Support Vector Machines

Cited by: 218
Authors
Maldonado, Sebastian [1]
Weber, Richard [2]
Famili, Fazel [3]
Affiliations
[1] Univ Los Andes, Santiago, Chile
[2] Univ Chile, Dept Ind Engn, Santiago, Chile
[3] Natl Res Council Canada, Ottawa, ON, Canada
Keywords
Feature selection; Imbalanced data set; Dimensionality reduction; Support Vector Machine; Data mining; GENE SELECTION; CLASSIFICATION; CARCINOMAS; SURVIVAL
DOI
10.1016/j.ins.2014.07.015
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Feature selection and classification of imbalanced data sets are two of the most interesting machine learning challenges, attracting growing attention from both industry and academia. Feature selection addresses the dimensionality reduction problem by determining a subset of available features to build a good model for classification or prediction, while the class-imbalance problem arises when the class distribution is too skewed. Both issues have been studied independently in the literature, and a plethora of methods to address high dimensionality as well as class imbalance has been proposed. The aim of this work is to explore both issues simultaneously, proposing a family of methods that select those attributes that are relevant for the identification of the target class in binary classification. We propose a backward elimination approach based on successive holdout steps, whose contribution measure is based on a balanced loss function obtained on an independent subset. Our experiments are based on six highly imbalanced microarray data sets, comparing our methods with well-known feature selection techniques, and obtaining a better prediction with consistently fewer relevant features. (C) 2014 Elsevier Inc. All rights reserved.
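The abstract only outlines the procedure, so the following is a minimal sketch of one plausible reading of "backward elimination based on successive holdout steps with a balanced loss", not the authors' implementation. Several things are assumptions made for illustration: the linear SVM fitted by plain subgradient descent stands in for whatever SVM solver the paper uses, zeroing a weight is used as a cheap proxy for removing a feature without retraining, the balanced loss is taken to be the mean of the per-class error rates, and all function names and the synthetic data are hypothetical.

```python
import numpy as np

def balanced_error(y_true, y_pred):
    """Mean of the per-class error rates (labels are +1 / -1)."""
    pos, neg = y_true == 1, y_true == -1
    err_pos = np.mean(y_pred[pos] != 1) if pos.any() else 0.0
    err_neg = np.mean(y_pred[neg] != -1) if neg.any() else 0.0
    return 0.5 * (err_pos + err_neg)

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Stand-in linear SVM: subgradient descent on the L2-regularised hinge loss."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        active = y * (X @ w + b) < 1          # samples violating the margin
        grad_w = lam * w - (y[active][:, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def stratified_split(y, holdout_frac, rng):
    """Random train/holdout index split that keeps both classes in each part."""
    train, hold = [], []
    for label in (-1, 1):
        idx = rng.permutation(np.flatnonzero(y == label))
        k = max(1, int(round(holdout_frac * len(idx))))
        hold.extend(idx[:k])
        train.extend(idx[k:])
    return np.array(train), np.array(hold)

def holdout_backward_elimination(X, y, n_keep, holdout_frac=0.3, seed=0):
    """Drop one feature per round, using a fresh holdout split each round."""
    rng = np.random.default_rng(seed)
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        tr, ho = stratified_split(y, holdout_frac, rng)
        w, b = train_linear_svm(X[tr][:, remaining], y[tr])
        X_hold = X[ho][:, remaining]
        losses = []
        for j in range(len(remaining)):
            w_drop = w.copy()
            w_drop[j] = 0.0                   # proxy for removing feature j
            pred = np.where(X_hold @ w_drop + b >= 0, 1, -1)
            losses.append(balanced_error(y[ho], pred))
        # Eliminate the feature whose removal hurts the balanced loss least.
        remaining.pop(int(np.argmin(losses)))
    return remaining

# Synthetic, imbalanced toy data: 10 positives, 90 negatives, 8 features.
rng = np.random.default_rng(42)
y = np.array([1] * 10 + [-1] * 90)
X = rng.normal(size=(100, 8))
X[y == 1, 0] += 3.0                           # only feature 0 carries class signal

selected = holdout_backward_elimination(X, y, n_keep=3)
print("Remaining feature indices:", selected)
```

Scoring candidate removals with a balanced loss on data the model was not trained on is what keeps the majority class from dominating the ranking; a plain error rate on this toy set would look good even for a model that never predicts the positive class.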
Pages: 228 - 246
Number of pages: 19
Related papers
50 in total
  • [1] Online feature selection for high-dimensional class-imbalanced data
    Zhou, Peng
    Hu, Xuegang
    Li, Peipei
    Wu, Xindong
    [J]. KNOWLEDGE-BASED SYSTEMS, 2017, 136 : 187 - 199
  • [2] Class-imbalanced classifiers for high-dimensional data
    Lin, Wei-Jiun
    Chen, James J.
    [J]. BRIEFINGS IN BIOINFORMATICS, 2013, 14 (01) : 13 - 26
  • [3] SMOTE for high-dimensional class-imbalanced data
    Rok Blagus
    Lara Lusa
    [J]. BMC Bioinformatics, 14
  • [4] SMOTE for high-dimensional class-imbalanced data
    Blagus, Rok
    Lusa, Lara
    [J]. BMC BIOINFORMATICS, 2013, 14
  • [5] Class prediction for high-dimensional class-imbalanced data
    Blagus, Rok
    Lusa, Lara
    [J]. BMC BIOINFORMATICS, 2010, 11 : 523
  • [6] Class prediction for high-dimensional class-imbalanced data
    Rok Blagus
    Lara Lusa
    [J]. BMC Bioinformatics, 11
  • [7] Online Streaming Feature Selection for High-Dimensional and Class-Imbalanced Data Based on Neighborhood Rough Set
    Chen X.
    Lin Y.
    Wang C.
    [J]. Moshi Shibie yu Rengong Zhineng/Pattern Recognition and Artificial Intelligence, 2019, 32 (08) : 726 - 735
  • [8] Hellinger distance-based stable sparse feature selection for high-dimensional class-imbalanced data
    Fu, Guang-Hui
    Wu, Yuan-Jiao
    Zong, Min-Jie
    Pan, Jianxin
    [J]. BMC BIOINFORMATICS, 2020, 21 (01)
  • [9] Hellinger distance-based stable sparse feature selection for high-dimensional class-imbalanced data
    Guang-Hui Fu
    Yuan-Jiao Wu
    Min-Jie Zong
    Jianxin Pan
    [J]. BMC Bioinformatics, 21
  • [10] Dealing with high-dimensional class-imbalanced datasets: Embedded feature selection for SVM classification
    Maldonado, Sebastian
    Lopez, Julio
    [J]. APPLIED SOFT COMPUTING, 2018, 67 : 94 - 105