Efficient hybrid oversampling and intelligent undersampling for imbalanced big data classification

Cited by: 2
Authors
Vairetti, Carla [1, 3]
Assadi, Jose Luis
Maldonado, Sebastian [2, 3]
Affiliations
[1] Univ Los Andes, Fac Ingn & Ciencias Aplicadas, Los Andes, Chile
[2] Univ Chile, Sch Econ & Business, Dept Management Control & Informat Syst, Santiago, Chile
[3] Inst Sistemas Complejos Ingenieri ISCI, Santiago, Chile
Keywords
Imbalanced classification; SMOTE; Big data; Intelligent undersampling; MapReduce; OUTCOMES; MACHINE; INSIGHT
DOI
10.1016/j.eswa.2024.123149
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Imbalanced classification is a well-known challenge in many real-world applications. It occurs when the distribution of the target variable is skewed, which biases predictions toward the majority class. With the arrival of the Big Data era, there is a pressing need for efficient solutions to this problem. In this work, we present a novel resampling method called SMOTENN that combines intelligent undersampling and oversampling using a MapReduce framework. Both procedures are performed in the same pass over the data, which makes the technique efficient. The SMOTENN method is complemented by an efficient implementation of the neighborhoods of the minority samples. Our experimental results show the virtues of this approach: it outperforms alternative resampling techniques on small- and medium-sized datasets and achieves positive results on large datasets with reduced running times.
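The idea of doing oversampling and undersampling in one pass over the minority neighborhoods can be sketched as follows. This is only an illustration, not the authors' SMOTENN: the function name `resample_one_pass`, the rule for which majority points to drop, the brute-force neighbor search, and all parameters are assumptions. The record does not describe the actual algorithm or its MapReduce distribution.

```python
import math
import random


def resample_one_pass(X, y, minority=1, k=3, seed=0):
    """Hypothetical sketch: SMOTE-style oversampling plus neighborhood-based
    undersampling, both driven by a single loop over minority samples."""
    rng = random.Random(seed)
    synthetic = []  # newly generated minority points
    drop = set()    # indices of majority points flagged for removal

    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority:
            continue
        # Brute-force k nearest neighbors of xi among all other samples
        # (assumption: a real big-data implementation would avoid this).
        others = [j for j in range(len(X)) if j != i]
        others.sort(key=lambda j: math.dist(xi, X[j]))
        neigh = others[:k]

        # Undersampling step (assumed rule): majority points that fall
        # inside a minority neighborhood are removed.
        drop.update(j for j in neigh if y[j] != minority)

        # Oversampling step: standard SMOTE interpolation toward one
        # randomly chosen minority neighbor, if any exists.
        same = [j for j in neigh if y[j] == minority]
        if same:
            xj = X[rng.choice(same)]
            gap = rng.random()
            synthetic.append([a + gap * (b - a) for a, b in zip(xi, xj)])

    X_out = [x for j, x in enumerate(X) if j not in drop] + synthetic
    y_out = [c for j, c in enumerate(y) if j not in drop]
    y_out += [minority] * len(synthetic)
    return X_out, y_out
```

Because the same neighbor list drives both steps, each minority neighborhood is computed only once. In a MapReduce setting one could imagine running such a loop per data partition, but that is a guess rather than something stated in the abstract.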
Pages: 11
Related papers
50 in total
  • [41] Combining oversampling and undersampling techniques for imbalanced classification: A comparative study using credit card fraudulent transaction dataset
    Shamsudin, Haziqah
    Yusof, Umi Kalsom
    Jayalakshmi, Andal
    Khalid, Mohd Nor Akmal
    [J]. 2020 IEEE 16TH INTERNATIONAL CONFERENCE ON CONTROL & AUTOMATION (ICCA), 2020, : 803 - 808
  • [42] Using Area Under the Precision Recall Curve to Assess the Effect of Random Undersampling in the Classification of Imbalanced Medicare Big Data
    Hancock III, John T.
    Khoshgoftaar, Taghi M.
    Johnson, Justin M.
    [J]. INTERNATIONAL JOURNAL OF RELIABILITY QUALITY AND SAFETY ENGINEERING, 2024, 31 (01)
  • [43] A non-parameter oversampling approach for imbalanced data classification based on hybrid natural neighbors
    Junyue Lin
    Lu Liang
    [J]. Applied Intelligence, 2025, 55 (5)
  • [44] SOUL: Scala Oversampling and Undersampling Library for imbalance classification
    Rodriguez, Nestor
    Lopez, David
    Fernandez, Alberto
    Garcia, Salvador
    Herrera, Francisco
    [J]. SOFTWAREX, 2021, 15
  • [45] Classification of Imbalanced Data by Oversampling in Kernel Space of Support Vector Machines
    Mathew, Josey
    Pang, Chee Khiang
    Luo, Ming
    Leong, Weng Hoe
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2018, 29 (09) : 4065 - 4076
  • [46] A NOVEL RULE-BASED OVERSAMPLING APPROACH FOR IMBALANCED DATA CLASSIFICATION
    Zhang, Xiao
    Paz, Ivan
    Nebot, Angela
    [J]. 37TH ANNUAL EUROPEAN SIMULATION AND MODELLING CONFERENCE 2023, ESM 2023, 2023, : 208 - 212
  • [47] A Combined Priori and Purity Gaussian OverSampling Algorithm for Imbalanced Data Classification
    Tao, Liangliang
    Zhu, Huping
    Wang, Qingya
    Liang, Yage
    Deng, Xiaozheng
    [J]. IEEE ACCESS, 2023, 11 : 130688 - 130696
  • [48] Grouping-based Oversampling in Kernel Space for Imbalanced Data Classification
    Ren, Jinjun
    Wang, Yuping
    Cheung, Yiu-ming
    Gao, Xiao-Zhi
    Guo, Xiaofang
    [J]. PATTERN RECOGNITION, 2023, 133
  • [49] Binary imbalanced data classification based on diversity oversampling by generative models
    Zhai, Junhai
    Qi, Jiaxing
    Shen, Chu
    [J]. INFORMATION SCIENCES, 2022, 585 : 313 - 343
  • [50] Combining Random Subspace Approach with SMOTE Oversampling for Imbalanced Data Classification
    Ksieniewicz, Pawel
    [J]. HYBRID ARTIFICIAL INTELLIGENT SYSTEMS, HAIS 2019, 2019, 11734 : 660 - 673