A Hybrid Sampling Approach for Imbalanced Binary and Multi-Class Data Using Clustering Analysis

Cited by: 8
Authors
Palli, Abdul Sattar [1, 2]
Jaafar, Jafreezal [1, 3]
Hashmani, Manzoor Ahmed [1, 4]
Gomes, Heitor Murilo [5, 6]
Gilal, Abdul Rehman [1, 7]
Affiliations
[1] Univ Teknol PETRONAS, Dept Comp & Informat Sci, Seri Iskandar 32610, Perak, Malaysia
[2] Minist Narcot Control, Antinarcot Force, Islamabad 46000, Pakistan
[3] UTP, Ctr Res Data Sci, Seri Iskandar 32610, Perak, Malaysia
[4] UTP, High Performance Cloud Comp Ctr HPC3, Seri Iskandar 32610, Perak, Malaysia
[5] Univ Waikato Wellington, AI Inst, Hamilton 3240, New Zealand
[6] Victoria Univ Wellington, Sch Engn & Comp Sci, Wellington 6012, New Zealand
[7] Sukkur IBA Univ, Dept Comp Sci, Sukkur 65200, Sindh, Pakistan
Source
IEEE ACCESS | 2022 / Vol. 10
Keywords
Class imbalance; classification; clustering analysis; binary class; multi-class; CLASSIFIERS; SMOTE;
DOI
10.1109/ACCESS.2022.3218463
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Unequal data distribution among classes usually causes a class imbalance problem. Because of this imbalance, classification models become biased toward the majority class and misclassify the minority class. The problem becomes more complex in multi-class data. The most common way to handle class imbalance is data resampling, which involves either over-sampling minority class instances or under-sampling majority class instances. Under-sampling risks losing crucial information, whereas over-sampling can cause overfitting. Therefore, we propose a novel Cluster-based Hybrid Sampling for Imbalance Data (CBHSID) approach to address these issues. CBHSID calculates the mean number of data observations per class and uses this mean as a threshold to separate majority and minority classes. It then applies affinity propagation cluster analysis to each class to create sub-clusters and calculates the distance of each data item in a sub-cluster from the sub-cluster centroid (mean). During under-sampling, CBHSID removes data observations that are far from the center of their sub-cluster. During over-sampling, it generates synthetic samples from data observations near the center of their sub-cluster. We compared CBHSID with a few state-of-the-art data balancing methods on 12 binary and 4 multi-class benchmark datasets. Based on Geometric Mean (G-Mean), Recall, and F1-score, our method outperformed the compared methods on 14 of the 16 datasets. The results also show that CBHSID is suitable for addressing class imbalance in both binary and multi-class classification. So far, we have validated CBHSID only on stationary data streams; it can be tested further on non-stationary data streams in online learning environments.
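The resampling steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the threshold is the total number of observations divided by the number of classes, and it assumes the sub-clusters have already been produced by a clustering step such as affinity propagation, which is omitted here. The abstract does not say how synthetic samples are generated, so the interpolation toward the centroid below is an assumption, as are all function names and the `near_fraction` parameter.

```python
import math
import random
from collections import Counter


def split_by_mean_threshold(labels):
    """Split classes into majority/minority using the mean class size.

    Assumption: threshold = total observations / number of classes.
    Classes whose size equals the threshold land in neither group.
    """
    counts = Counter(labels)
    threshold = len(labels) / len(counts)
    majority = sorted(c for c, n in counts.items() if n > threshold)
    minority = sorted(c for c, n in counts.items() if n < threshold)
    return threshold, majority, minority


def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def sort_by_distance_to_centroid(points):
    """Return points ordered from nearest to farthest from the centroid."""
    c = centroid(points)
    return sorted(points, key=lambda p: math.dist(p, c))


def undersample_cluster(points, keep):
    """Under-sample one sub-cluster: drop the points farthest from the center."""
    return sort_by_distance_to_centroid(points)[:keep]


def oversample_cluster(points, n_new, near_fraction=0.5, seed=0):
    """Over-sample one sub-cluster from points near its center.

    Hypothetical generation rule: pick a near-center point and move it a
    random fraction of the way toward the centroid.
    """
    rng = random.Random(seed)
    ranked = sort_by_distance_to_centroid(points)
    near = ranked[:max(1, int(len(ranked) * near_fraction))]
    c = centroid(points)
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(near)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([pi + t * (ci - pi) for pi, ci in zip(p, c)])
    return synthetic
```

For example, with eight observations of class "a" and two of class "b", `split_by_mean_threshold` gives a threshold of 5.0, so "a" is treated as a majority class and "b" as a minority class. In a full pipeline, `undersample_cluster` would be applied to each sub-cluster of a majority class and `oversample_cluster` to each sub-cluster of a minority class.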
Pages: 118639-118653
Number of pages: 15