A theoretical distribution analysis of synthetic minority oversampling technique (SMOTE) for imbalanced learning

Cited by: 60
Authors
Elreedy, Dina [1]
Atiya, Amir F. [1]
Kamalov, Firuz [2]
Affiliations
[1] Cairo Univ, Comp Engn Dept, Giza 12613, Egypt
[2] Canadian Univ Dubai, Dept Elect Engn, Dubai 117781, U Arab Emirates
Keywords
SMOTE; Class imbalance; Distribution density; Over-sampling; Minority class; Sampling approach; Classification
DOI
10.1007/s10994-022-06296-4
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Class imbalance occurs when the classes are not equally represented: one class (the minority class) is under-represented, while the other (the majority class) has significantly more samples. The problem is prevalent in many real-world applications, and the under-represented minority class is generally the class of interest. The synthetic minority over-sampling technique (SMOTE) is considered the most prominent method for handling imbalanced data. SMOTE generates new synthetic data patterns by linear interpolation between minority class samples and their K nearest neighbors. However, the SMOTE-generated patterns do not necessarily conform to the original minority class distribution. This paper develops a novel theoretical analysis of SMOTE by deriving the probability distribution of the SMOTE-generated samples. To the best of our knowledge, this is the first work deriving a mathematical formulation for the probability distribution of the SMOTE patterns. This allows us to compare the density of the generated samples with the true underlying class-conditional density, in order to assess how representative the generated samples are. The derived formula is verified by evaluating it on a number of densities and comparing the results against empirically estimated densities.
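The interpolation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' code or any library's API: the function name `smote_sketch`, its parameters, the brute-force neighbor search, and the uniform choice of base sample, neighbor, and interpolation weight are all assumptions made for illustration.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Hypothetical helper: create n_new synthetic minority points.

    Each synthetic point lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbors.
    Requires at least two minority samples.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    k = min(k, n - 1)  # a sample cannot be its own neighbor

    # Brute-force pairwise squared Euclidean distances (fine for small n).
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # exclude self-matches
    nn_idx = np.argsort(d2, axis=1)[:, :k]  # k nearest neighbors per sample

    base = rng.integers(0, n, size=n_new)   # index of the base sample
    pick = rng.integers(0, k, size=n_new)   # which of its k neighbors to use
    neigh = nn_idx[base, pick]
    w = rng.random((n_new, 1))              # interpolation weight in [0, 1)

    # Linear interpolation between the base sample and the chosen neighbor.
    return X_min[base] + w * (X_min[neigh] - X_min[base])


if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(30, 2))
    S = smote_sketch(X, n_new=100, k=5)
    print(S.shape)
```

Because every synthetic point is a convex combination of two existing minority samples, the generated set stays inside the coordinate-wise range of the original samples. This makes it plausible that its density can differ from the true class-conditional density, which is the comparison the abstract describes.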
Pages: 4903-4923
Page count: 21