An empirical study on the joint impact of feature selection and data resampling on imbalance classification

Cited by: 27
Authors
Zhang, Chongsheng [1]
Soda, Paolo [2, 3]
Bi, Jingjun [1]
Fan, Gaojuan [1]
Almpanidis, George [1]
Garcia, Salvador [4]
Ding, Weiping [5]
Affiliations
[1] Henan Univ, Henan Key Lab Big Data Anal & Proc, Kaifeng, Henan, Peoples R China
[2] Univ Campus Biomed Rome, Dept Engn, Rome, Italy
[3] Umea Univ, Dept Radiat Sci, Biomed Engn, Radiat Phys, Umea, Sweden
[4] Univ Granada, DaSCI Andalusian Res Inst, Granada, Spain
[5] Nantong Univ, Sch Informat Sci & Technol, Nantong, Peoples R China
Keywords
Imbalanced classification; Feature selection; Data selection; Resampling; SMOTE
DOI
10.1007/s10489-022-03772-1
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Many real-world datasets exhibit imbalanced distributions, in which the majority classes have sufficient samples, whereas the minority classes often have a very small number of samples. Data resampling has proven to be effective in alleviating such imbalanced settings, while feature selection is a commonly used technique for improving classification performance. However, the joint impact of feature selection and data resampling on two-class imbalance classification has rarely been addressed before. This work investigates the performance of two opposite imbalanced classification frameworks in which feature selection is applied before or after data resampling. We conduct a large-scale empirical study with a total of 9225 experiments on 52 publicly available datasets. The results show that both frameworks should be considered for finding the best performing imbalanced classification model. We also study the impact of classifiers, the ratio between the number of majority and minority samples (IR), and the ratio between the number of samples and features (SFR) on the performance of imbalance classification. Overall, this work provides a new reference value for researchers and practitioners in imbalance learning.
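The two quantities named in the abstract (IR and SFR) and the two orderings of feature selection and resampling can be illustrated with a minimal Python sketch. This is not the authors' code: random oversampling stands in for SMOTE-style resamplers, a mean-difference filter stands in for a real feature selector, and all function names and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def imbalance_ratio(y):
    """IR: majority-class sample count divided by minority-class sample count."""
    counts = Counter(y)
    return max(counts.values()) / min(counts.values())


def samples_to_features_ratio(X):
    """SFR: number of samples divided by number of features."""
    return len(X) / len(X[0])


def random_oversample(X, y, seed=0):
    """Duplicate random minority rows until both classes have equal size.
    Assumption: a simple stand-in for resamplers such as SMOTE."""
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    deficit = max(counts.values()) - counts[minority]
    pool = [row for row, label in zip(X, y) if label == minority]
    extra = [rng.choice(pool) for _ in range(deficit)]
    return X + extra, y + [minority] * deficit


def top_k_features(X, y, k):
    """Toy filter: score each feature by the absolute difference of the two
    class means and return the indices of the k highest-scoring features."""
    a, b = sorted(set(y))  # two-class setting only

    def class_mean(j, c):
        vals = [row[j] for row, label in zip(X, y) if label == c]
        return sum(vals) / len(vals)

    n_features = len(X[0])
    scores = [abs(class_mean(j, a) - class_mean(j, b)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def project(X, cols):
    """Keep only the selected feature columns."""
    return [[row[j] for j in cols] for row in X]


def fs_then_resample(X, y, k, seed=0):
    """Ordering 1: select features on the original data, then resample."""
    cols = top_k_features(X, y, k)
    Xr, yr = random_oversample(project(X, cols), y, seed)
    return Xr, yr, cols


def resample_then_fs(X, y, k, seed=0):
    """Ordering 2: resample first, then select features on the balanced data."""
    Xr, yr = random_oversample(X, y, seed)
    cols = top_k_features(Xr, yr, k)
    return project(Xr, cols), yr, cols


# Hypothetical toy data: 6 majority samples (label 0), 2 minority (label 1).
X = [[0.0, 1.0, 5.0], [0.1, 1.1, 4.0], [0.2, 0.9, 6.0], [0.0, 1.0, 5.5],
     [0.1, 1.2, 4.5], [0.3, 1.0, 5.0], [1.0, 1.0, 5.0], [1.2, 1.1, 5.2]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

print("IR before resampling:", imbalance_ratio(y))
print("SFR:", samples_to_features_ratio(X))
for pipeline in (fs_then_resample, resample_then_fs):
    Xr, yr, cols = pipeline(X, y, k=1)
    print(pipeline.__name__, "selected", cols, "IR after:", imbalance_ratio(yr))
```

On real data the two orderings can select different features, because the feature scores are computed on different class distributions; this is consistent with the abstract's conclusion that both frameworks should be tried when searching for the best model.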
Pages: 5449-5461
Page count: 13