An efficient method to determine sample size in oversampling based on classification complexity for imbalanced data

Cited by: 17
Authors
Lee, Dohyun [1]
Kim, Kyoungok [2]
Affiliations
[1] Seoul Natl Univ Sci & Technol Seoul, Dept Data Sci, 232 Gongreungno, Seoul 01811, South Korea
[2] Seoul Natl Univ Sci & Technol Seoul, Dept Ind Engn, 232 Gongreungno, Seoul 01811, South Korea
Funding
National Research Foundation of Singapore;
Keywords
Class imbalance; Oversampling; Sampling size; Adaptive boosting; Ensemble learning; DATA-SETS; SMOTE; ENSEMBLES;
DOI
10.1016/j.eswa.2021.115442
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Resampling, one of the approaches to handling class imbalance, is widely used alone or in combination with other approaches, such as cost-sensitive learning and ensemble learning, because of its simplicity and its independence from the learning algorithm. Oversampling methods, in particular, alleviate class imbalance by increasing the size of the minority class. However, previous studies on oversampling have generally focused on where to add new samples, how to generate them, and how to prevent noise, and they have rarely investigated how much sampling is sufficient. In many cases, the oversampling size is set so that the minority class has the same size as the majority class. This setting considers only the class sizes when determining the sample size, and the balanced training set can induce overfitting through the addition of too many minority samples. Moreover, the effectiveness of oversampling can be improved by adding synthetic samples at appropriate locations. To address this issue, this study proposes a method that sets the oversampling size below the size needed to balance the classes, considering not only the absolute imbalance but also the difficulty of classification in a dataset, as measured by classification complexity. The effectiveness of the proposed sample size is evaluated using several boosting algorithms with different oversampling methods on 16 imbalanced datasets. The results show that the proposed sample size achieves better classification performance than the sample size that attains class balance.
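To make the two sample sizes contrasted in the abstract concrete, the following minimal sketch computes the conventional balancing size (the majority count minus the minority count) and a smaller size obtained by scaling it down. The abstract does not give the paper's formula, so the `difficulty` score and the linear scaling rule are purely hypothetical stand-ins for a classification-complexity measure, and both function names are invented for illustration.

```python
from collections import Counter


def balancing_oversample_size(labels):
    """Synthetic minority samples needed to match the majority class size."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("this sketch handles binary labels only")
    # most_common returns the classes sorted by count, largest first.
    (_, n_major), (_, n_minor) = counts.most_common(2)
    return n_major - n_minor


def scaled_oversample_size(labels, difficulty):
    """Hypothetical rule: scale the balancing size by a score in [0, 1].

    `difficulty` stands in for a classification-complexity measure. The
    linear scaling is an assumption, NOT the formula from the paper.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must lie in [0, 1]")
    return int(difficulty * balancing_oversample_size(labels))


# Toy dataset: 90 majority samples and 10 minority samples.
labels = [0] * 90 + [1] * 10
full = balancing_oversample_size(labels)        # size that equalizes the classes
reduced = scaled_oversample_size(labels, 0.5)   # smaller, difficulty-scaled size
```

Under this assumed rule, an easier dataset (lower score) receives fewer synthetic samples, and the result never exceeds the balancing size, which matches the abstract's stated constraint but not necessarily the paper's actual computation.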
Pages: 10
Related papers
50 in total
  • [21] Hyperspectral Image Classification with Imbalanced Data Based on Oversampling and Convolutional Neural Network
    Cai, Lei
    Zhang, Geng
    AI IN OPTICS AND PHOTONICS (AOPC 2019), 2019, 11342
  • [22] A new boundary-degree-based oversampling method for imbalanced data
    Yueqi Chen
    Witold Pedrycz
    Jie Yang
    Applied Intelligence, 2023, 53 : 26518 - 26541
  • [23] Local distribution-based adaptive minority oversampling for imbalanced data classification
    Wang, Xinyue
    Xu, Jian
    Zeng, Tieyong
    Jing, Liping
    NEUROCOMPUTING, 2021, 422 : 200 - 213
  • [24] A new boundary-degree-based oversampling method for imbalanced data
    Chen, Yueqi
    Pedrycz, Witold
    Yang, Jie
    APPLIED INTELLIGENCE, 2023, 53 (22) : 26518 - 26541
  • [25] An oversampling method for wafer map defect pattern classification considering small and imbalanced data
    Kim, Eun-Su
    Choi, Seung-Hyun
    Lee, Dong-Hee
    Kim, Kwang-Jae
    Bae, Young-Mok
    Oh, Young-Chan
    COMPUTERS & INDUSTRIAL ENGINEERING, 2021, 162
  • [26] SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data
    Basgall, Maria Jose
    Hasperue, Waldo
    Naiouf, Marcelo
    Fernandez, Alberto
    Herrera, Francisco
    JOURNAL OF COMPUTER SCIENCE & TECHNOLOGY, 2018, 18 (03) : 203 - 209
  • [27] Model-Based Oversampling for Imbalanced Sequence Classification
    Gong, Zhichen
    Chen, Huanhuan
    CIKM'16: PROCEEDINGS OF THE 2016 ACM CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, 2016 : 1009 - 1018
  • [28] Counterfactual-based minority oversampling for imbalanced classification
    Wang, Shu
    Luo, Hao
    Huang, Shanshan
    Li, Qingsong
    Liu, Li
    Su, Guoxin
    Liu, Ming
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2023, 122
  • [29] An oversampling framework for imbalanced classification based on Laplacian eigenmaps
    Ye, Xiucai
    Li, Hongmin
    Imakura, Akira
    Sakurai, Tetsuya
    NEUROCOMPUTING, 2020, 399 : 107 - 116
  • [30] Imbalanced Learning with Oversampling based on Classification Contribution Degree
    Jiang, Zhenhao
    Yang, Jie
    Liu, Yan
    ADVANCED THEORY AND SIMULATIONS, 2021, 4 (05)