A Hybrid Sampling Approach for Imbalanced Binary and Multi-Class Data Using Clustering Analysis

Cited by: 8
Authors
Palli, Abdul Sattar [1, 2]
Jaafar, Jafreezal [1, 3]
Hashmani, Manzoor Ahmed [1, 4]
Gomes, Heitor Murilo [5, 6]
Gilal, Abdul Rehman [1, 7]
Affiliations
[1] Univ Teknol PETRONAS, Dept Comp & Informat Sci, Seri Iskandar 32610, Perak, Malaysia
[2] Minist Narcot Control, Antinarcot Force, Islamabad 46000, Pakistan
[3] UTP, Ctr Res Data Sci, Seri Iskandar 32610, Perak, Malaysia
[4] UTP, High Performance Cloud Comp Ctr HPC3, Seri Iskandar 32610, Perak, Malaysia
[5] Univ Waikato Wellington, AI Inst, Hamilton 3240, New Zealand
[6] Victoria Univ Wellington, Sch Engn & Comp Sci, Wellington 6012, New Zealand
[7] Sukkur IBA Univ, Dept Comp Sci, Sukkur 65200, Sindh, Pakistan
Source
IEEE ACCESS | 2022 / Vol. 10
Keywords
Class imbalance; classification; clustering analysis; binary class; multi-class; CLASSIFIERS; SMOTE;
DOI
10.1109/ACCESS.2022.3218463
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
An unequal distribution of data among classes usually causes a class imbalance problem. Under class imbalance, classification models become biased toward the majority class and misclassify the minority class. The problem becomes more complex when it occurs in multi-class data. The most common way to handle class imbalance is data resampling, which involves either over-sampling minority class instances or under-sampling majority class instances. Under-sampling risks losing crucial information, whereas over-sampling can cause overfitting. Therefore, we propose a novel Cluster-based Hybrid Sampling for Imbalance Data (CBHSID) approach to address these issues. CBHSID calculates the mean number of data observations per class and uses this mean as a threshold to separate majority classes from minority classes. It then applies affinity propagation cluster analysis to each class to create sub-clusters and calculates the distance of each data item from the centroid (mean) of its sub-cluster. During under-sampling, CBHSID removes data observations that lie far from the center of their sub-cluster. During over-sampling, it generates synthetic samples from data observations near the center of their sub-cluster. We compared CBHSID with several state-of-the-art data balancing methods on 12 binary and 4 multi-class benchmark datasets. Based on Geometric-Mean (G-Mean), Recall, and F1-score, our method outperformed the compared methods on 14 of the 16 datasets. The results also show that CBHSID is suitable for addressing class imbalance in both binary and multi-class classification. So far, we have validated CBHSID only on stationary data streams. CBHSID can therefore be tested further on non-stationary data streams in online learning environments.
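The resampling steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: the threshold is taken as the total number of observations divided by the number of classes, every class is resampled to that threshold, and synthetic samples are produced by linear interpolation between two near-center points. For brevity the affinity propagation step is omitted and each class is treated as a single cluster. All function names are hypothetical.

```python
import math
import random
from collections import defaultdict


def split_by_mean_threshold(labels):
    """Threshold = mean class size (assumed: n_samples / n_classes)."""
    counts = defaultdict(int)
    for y in labels:
        counts[y] += 1
    threshold = len(labels) / len(counts)
    majority = sorted(c for c, n in counts.items() if n > threshold)
    minority = sorted(c for c, n in counts.items() if n < threshold)
    return threshold, majority, minority


def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def sort_by_distance(points):
    """Order points from nearest to farthest from their centroid."""
    c = centroid(points)
    return sorted(points, key=lambda p: math.dist(p, c))


def undersample(points, target):
    """Drop the points farthest from the centroid, keeping `target` of them."""
    return sort_by_distance(points)[:target]


def oversample(points, target, rng):
    """Add synthetic points interpolated between near-center points.

    Interpolation is an assumption of this sketch; the abstract only says
    that synthetic samples come from observations near the center.
    """
    ranked = sort_by_distance(points)
    core = ranked[: max(2, len(ranked) // 2)]  # nearest half, at least two
    out = list(points)
    while len(out) < target:
        a, b = rng.choice(core), rng.choice(core)
        t = rng.random()
        out.append([x + t * (y - x) for x, y in zip(a, b)])
    return out


def resample(X, y, seed=0):
    """Balance every class to the (rounded) mean class size."""
    rng = random.Random(seed)
    threshold, majority, minority = split_by_mean_threshold(y)
    target = round(threshold)
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    X_out, y_out = [], []
    for c, pts in by_class.items():
        if c in majority:
            pts = undersample(pts, target)
        elif c in minority:
            pts = oversample(pts, target, rng)
        X_out.extend(pts)
        y_out.extend([c] * len(pts))
    return X_out, y_out
```

With six observations in one class and two in another, the assumed threshold is 4, so the larger class loses its two most distant points and the smaller class gains two interpolated points. A fuller version would run the same distance-based selection inside each affinity propagation sub-cluster rather than per class.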
Pages: 118639 - 118653
Number of pages: 15
Related papers
50 in total
  • [1] A Dynamic Sampling Framework for Multi-Class Imbalanced Data
    Debowski, B.
    Areibi, S.
    Grewal, G.
    Tempelman, J.
    [J]. 2012 11TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2012), VOL 2, 2012, : 113 - 118
  • [2] MULTI-CLASS DATA CLASSIFICATION FOR IMBALANCED DATA SET USING COMBINED SAMPLING APPROACHES
    Prachuabsupakij, Wanthanee
    Soonthornphisaj, Nuanwan
    [J]. KDIR 2011: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND INFORMATION RETRIEVAL, 2011, : 166 - 171
  • [3] Multi-class Boosting for Imbalanced Data
    Fernandez-Baldera, Antonio
    Buenaposada, Jose M.
    Baumela, Luis
    [J]. PATTERN RECOGNITION AND IMAGE ANALYSIS (IBPRIA 2015), 2015, 9117 : 57 - 64
  • [4] Multi-class WHMBoost: An ensemble algorithm for multi-class imbalanced data
    Zhao, Jiakun
    Jin, Ju
    Zhang, Yibo
    Zhang, Ruifeng
    Chen, Si
    [J]. INTELLIGENT DATA ANALYSIS, 2022, 26 (03) : 599 - 614
  • [5] Hybrid Sampling and Dynamic Weighting-Based Classification Method for Multi-Class Imbalanced Data Stream
    Han, Meng
    Li, Ang
    Gao, Zhihui
    Mu, Dongliang
    Liu, Shujuan
    [J]. APPLIED SCIENCES-BASEL, 2023, 13 (10):
  • [6] Combating Mutuality with Difficulty Factors in Multi-class Imbalanced Data: A Similarity-based Hybrid Sampling
    Zheng, Zhong
    Yan, Yuanting
    Zhang, Yiwen
    Zhang, Yanping
    [J]. 2022 IEEE 9TH INTERNATIONAL CONFERENCE ON DATA SCIENCE AND ADVANCED ANALYTICS (DSAA), 2022, : 387 - 396
  • [7] Improved multi-class classification approach for imbalanced big data on spark
    Tinku Singh
    Riya Khanna
    Manish Satakshi
    [J]. The Journal of Supercomputing, 2023, 79 : 6583 - 6611
  • [8] Improved multi-class classification approach for imbalanced big data on spark
    Singh, Tinku
    Khanna, Riya
    Satakshi
    Kumar, Manish
    [J]. JOURNAL OF SUPERCOMPUTING, 2023, 79 (06): : 6583 - 6611
  • [9] Evaluating Difficulty of Multi-class Imbalanced Data
    Lango, Mateusz
    Napierala, Krystyna
    Stefanowski, Jerzy
    [J]. FOUNDATIONS OF INTELLIGENT SYSTEMS, ISMIS 2017, 2017, 10352 : 312 - 322
  • [10] Survey on Highly Imbalanced Multi-class Data
    Hamid, Hakim Abdul
    Yusoff, Marina
    Mohamed, Azlinah
    [J]. INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2022, 13 (06) : 211 - 229