Comparison of resampling methods for dealing with imbalanced data in binary classification problem

Cited by: 2
Authors
Park, Geun U. [1]
Jung, Inkyung [1]
Affiliations
[1] Yonsei Univ, Div Biostat, Dept Biomed Syst Informat, Coll Med, 50-1 Yonsei Ro, Seoul 03722, South Korea
Keywords
imbalanced-learn; imbalanced binary data; under-sampling; over-sampling; NEIGHBOR; SMOTE;
DOI
10.5351/KJAS.2019.32.3.349
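Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics];
Discipline classification codes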
020208 ; 070103 ; 0714 ;
Abstract
A class imbalance problem arises when one class greatly outnumbers the other in binary data. Approaches such as transforming the training data have been studied to address this problem. In this study, we compared resampling methods, one family of approaches for handling imbalance in classification, with the aim of detecting the minority class more effectively. Through simulation, we compared a total of 20 over-sampling, under-sampling, and combined over- and under-sampling methods. Logistic regression, support vector machine, and random forest models, which are commonly used in classification problems, served as classifiers. The simulation results showed that the random under-sampling (RUS) method had the highest sensitivity with an accuracy over 0.5. The next most sensitive method was adaptive synthetic sampling, an over-sampling approach. These results indicate that RUS is suitable for detecting minority class instances. Results from applying the methods to several real data sets were similar to those of the simulation.
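The two quantities the abstract relies on, random under-sampling and sensitivity, can be illustrated with a minimal sketch. This is not the authors' code: the keywords suggest the study used the imbalanced-learn package, whereas this sketch uses only the Python standard library, and the function names `random_under_sample` and `sensitivity` as well as the toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def random_under_sample(X, y, seed=0):
    """Randomly drop rows from larger classes until every class
    has as many rows as the smallest (minority) class."""
    rng = random.Random(seed)
    counts = Counter(y)
    minority_size = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        # Sample without replacement; the minority class is kept whole.
        keep.extend(rng.sample(idx, minority_size))
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]


def sensitivity(y_true, y_pred, positive=1):
    """Fraction of true positive-class rows that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical toy data: 95 majority rows (label 0), 5 minority rows (label 1).
X = list(range(100))
y = [0] * 95 + [1] * 5

X_res, y_res = random_under_sample(X, y, seed=42)
class_counts = Counter(y_res)  # both classes now have 5 rows
```

In a workflow like the one the abstract describes, only the training split would be resampled; sensitivity and accuracy would then be computed on an untouched test split.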
Pages: 349-374
Page count: 26