Comprehensive empirical investigation for prioritizing the pipeline of using feature selection and data resampling techniques

Cited: 0
Authors
Tyagi, Pooja [1 ]
Singh, Jaspreeti [1 ]
Gosain, Anjana [1 ]
Affiliations
[1] Guru Gobind Singh Indraprastha Univ, Univ Sch Informat Commun & Technol, New Delhi, India
Keywords
Imbalanced data; feature selection; machine learning; oversampling; undersampling; CLASS-IMBALANCED DATASETS; CLASSIFICATION METHOD; PREDICTION; SMOTE; CLASSIFIERS; TESTS;
DOI
10.3233/JIFS-233511
CLC classification: TP18 [Artificial Intelligence Theory]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract
Contemporary real-world datasets often suffer from class imbalance as well as high dimensionality. Data resampling is a commonly used approach for combating class imbalance, whereas feature selection is used for tackling high dimensionality. Both problems have been studied extensively as independent problems in the literature, but the possible synergy between them is still not clear. This paper studies the effect of addressing both issues in conjunction, using combinations of resampling and feature selection techniques for binary-class imbalanced classification. In particular, the primary goal of this study is to prioritize the sequence, or pipeline, in which these techniques are applied, and to analyze the performance of the two opposite pipelines that apply feature selection before or after resampling, i.e., F + S or S + F. To this end, a comprehensive empirical study is carried out, comprising a total of 34,560 tests on 30 publicly available datasets, combining 12 resampling techniques for class imbalance with 12 feature selection methods and evaluating performance on 4 different classifiers. From the experiments we conclude that neither pipeline proves consistently better than the other, and both pipelines should be considered for obtaining the best classification results on high-dimensional imbalanced data. Additionally, with Decision Tree (DT) or Random Forest (RF) as the base learner, S + F predominates over F + S, whereas for Support Vector Machine (SVM) and Logistic Regression (LR), F + S outperforms S + F in most cases.
According to the mean ranking obtained from the Friedman test, the best combinations of resampling and feature selection techniques are SMOTE + RFE (Synthetic Minority Oversampling Technique followed by Recursive Feature Elimination) for DT, LASSO (Least Absolute Shrinkage and Selection Operator) + SMOTE for SVM, SMOTE + embedded feature selection using RF for LR, and SMOTE + RFE for RF.
Pages: 6019-6040
Page count: 22
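The two pipelines compared in the abstract (F + S: feature selection before resampling; S + F: resampling before feature selection) can be sketched as follows. This is a minimal illustration, not the paper's experimental code: it uses random oversampling as a simple stand-in for SMOTE, a mean-difference filter as a stand-in for the feature selection methods studied, and synthetic toy data; all function names and parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced, high-dimensional data: 100 majority vs 10 minority samples.
X = rng.normal(size=(110, 20))
y = np.array([0] * 100 + [1] * 10)
X[y == 1, :3] += 2.0  # only the first 3 features are informative

def oversample(X, y):
    """Random oversampling of the minority class (a simple stand-in for SMOTE)."""
    minority = np.flatnonzero(y == 1)
    extra = rng.choice(minority, size=(y == 0).sum() - minority.size)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

def select_features(X, y, k=3):
    """Filter-style selection: keep the k features whose class means differ most."""
    score = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(score)[::-1][:k]

# Pipeline F + S: select features on the original data, then resample.
cols_fs = select_features(X, y)
X_fs, y_fs = oversample(X[:, cols_fs], y)

# Pipeline S + F: resample first, then select features on the balanced data.
X_r, y_r = oversample(X, y)
cols_sf = select_features(X_r, y_r)
X_sf, y_sf = X_r[:, cols_sf], y_r

# With strongly informative features both orderings typically agree;
# the paper's point is that on real data they often do not.
print(sorted(cols_fs.tolist()), sorted(cols_sf.tolist()))
```

Note the design difference the study hinges on: in S + F the selector scores features on a class-balanced sample, while in F + S it scores them on the original skewed distribution, so the two orderings can select different feature subsets.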