An Embedded Feature Selection Method for Imbalanced Data Classification

Cited by: 216
Authors
Liu, Haoyue [1 ]
Zhou, MengChu [1 ,2 ]
Liu, Qing [1 ]
Affiliations
[1] New Jersey Inst Technol, Dept Elect & Comp Engn, Newark, NJ 07102 USA
[2] Macau Univ Sci & Technol, Inst Syst Engn, Taipa 999078, Macau, Peoples R China
Funding
National Science Foundation (USA)
Keywords
Classification and regression tree; feature selection; imbalanced data; weighted Gini index (WGI)
DOI
10.1109/JAS.2019.1911447
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Imbalanced datasets are frequently found in real-world applications, e.g., fraud detection and cancer diagnosis. For such datasets, improving the accuracy with which the minority class is identified is a critically important issue. Feature selection is one method to address this issue. An effective feature selection method can choose a subset of features that favors accurate identification of the minority class. A decision tree is a classifier that can be built by using different splitting criteria. Its advantage is the ease of detecting which feature is used at a splitting node. Thus, it is possible to use a decision tree splitting criterion as a feature selection method. In this paper, we propose an embedded feature selection method based on a weighted Gini index (WGI). Comparisons with the Chi2, F-statistic, and Gini index feature selection methods show that F-statistic and Chi2 perform best when only a few features are selected. As the number of selected features increases, the proposed method has the highest probability of achieving the best performance. The area under a receiver operating characteristic curve (ROC AUC) and F-measure are used as evaluation criteria. Experimental results with two datasets show that ROC AUC can be high even if only a few features are selected and used, and that it changes only slightly as more features are selected. However, F-measure reaches excellent performance only if 20% or more of the features are chosen. The results can help practitioners select a proper feature selection method when facing a practical problem.
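The abstract does not give the WGI formula, so the following is only a minimal sketch of the general idea of using a decision tree splitting criterion to rank features. It assumes that "weighted" means each sample counts with a class weight inversely proportional to its class frequency, which may differ from the paper's actual definition. The function names `weighted_gini`, `best_split_gain`, and `rank_features`, and the toy data, are hypothetical.

```python
from collections import Counter


def weighted_gini(labels, weights):
    """Gini impurity in which each sample counts with its class weight."""
    total = sum(weights[y] for y in labels)
    if total == 0:
        return 0.0
    mass = Counter()
    for y in labels:
        mass[y] += weights[y]
    return 1.0 - sum((m / total) ** 2 for m in mass.values())


def best_split_gain(values, labels, weights):
    """Largest weighted-impurity decrease over all thresholds of one feature."""
    parent = weighted_gini(labels, weights)
    total = sum(weights[y] for y in labels)
    best = 0.0
    # Candidate thresholds: every distinct value except the largest (x <= t).
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        w_left = sum(weights[y] for y in left)
        w_right = total - w_left
        child = (w_left * weighted_gini(left, weights)
                 + w_right * weighted_gini(right, weights)) / total
        best = max(best, parent - child)
    return best


def rank_features(X, y):
    """Return feature indices sorted by best single-split gain, plus the gains."""
    counts = Counter(y)
    n = len(y)
    # Assumption: "balanced" weights, so every class carries equal total mass.
    weights = {c: n / (len(counts) * k) for c, k in counts.items()}
    gains = [best_split_gain([row[j] for row in X], y, weights)
             for j in range(len(X[0]))]
    order = sorted(range(len(gains)), key=lambda j: gains[j], reverse=True)
    return order, gains


# Toy imbalanced data: 8 majority samples (0) and 2 minority samples (1).
# Feature 0 separates the classes; feature 1 does not.
X = [[0.1, 5], [0.2, 3], [0.3, 4], [0.4, 1], [0.5, 2],
     [0.6, 5], [0.7, 3], [0.8, 1], [2.0, 4], [2.1, 2]]
y = [0] * 8 + [1] * 2

order, gains = rank_features(X, y)
print(order, gains)
```

With equal total mass per class, a split that isolates the minority class yields a large impurity decrease even though the minority class has few samples, which is the kind of behavior a weighted criterion is meant to encourage. Selecting the top-ranked features from `order` would then serve as the feature subset.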
Pages: 703-715
Page count: 13
Source: IEEE/CAA Journal of Automatica Sinica, 2019, 6(3)