Clustering-based undersampling in class-imbalanced data

Cited by: 474
Authors
Lin, Wei-Chao [1]
Tsai, Chih-Fong [2]
Hu, Ya-Han [3]
Jhang, Jing-Shang [2]
Affiliations
[1] Asia Univ, Dept Comp Sci & Informat Engn, Taichung, Taiwan
[2] Natl Cent Univ, Dept Informat Management, Taoyuan, Taiwan
[3] Natl Chung Cheng Univ, Dept Informat Management, Chiayi, Taiwan
Keywords
Class imbalance; Imbalanced data; Machine learning; Clustering; Classifier ensembles; CLASSIFICATION; PREDICTION;
DOI
10.1016/j.ins.2017.05.008
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Class imbalance is often a problem in real-world data sets, where one class (i.e. the minority class) contains a small number of data points and the other (i.e. the majority class) contains a large number of data points. It is notably difficult to develop an effective model using current data mining and machine learning algorithms without data preprocessing to balance the imbalanced data sets. Random undersampling and oversampling have been used in numerous studies to ensure that the different classes contain the same number of data points. A classifier ensemble (i.e. a structure containing several classifiers) can be trained on several different balanced data sets for later classification purposes. In this paper, we introduce two undersampling strategies in which a clustering technique is used during the data preprocessing step. Specifically, the number of clusters in the majority class is set to be equal to the number of data points in the minority class. The first strategy uses the cluster centers to represent the majority class, whereas the second strategy uses the nearest neighbors of the cluster centers. A further study was conducted to examine the effect on performance of the addition or deletion of 5 to 10 cluster centers in the majority class. The experimental results obtained using 44 small-scale and 2 large-scale data sets revealed that the clustering-based undersampling approach with the second strategy outperformed five state-of-the-art approaches. Specifically, this approach combined with a single multilayer perceptron classifier and C4.5 decision tree classifier ensembles delivered optimal performance over both small- and large-scale data sets. (C) 2017 Elsevier Inc. All rights reserved.
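The two undersampling strategies described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the clustering algorithm, so plain Lloyd's k-means is assumed here, and the function names (`sq_dist`, `kmeans`, `undersample_majority`), the `strategy` parameter, and the toy data are all hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns k cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialise from k distinct positions
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its members;
        # an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def undersample_majority(majority, minority, strategy="nearest", seed=0):
    """Reduce the majority class to len(minority) representatives.

    strategy="centers": use the cluster centers themselves (first strategy).
    strategy="nearest": use the real majority point closest to each center
                        (second strategy). In edge cases two centers may
                        share the same nearest point, giving duplicates.
    """
    k = len(minority)  # number of clusters = size of the minority class
    centers = kmeans(majority, k, seed=seed)
    if strategy == "centers":
        return centers
    return [min(majority, key=lambda p: sq_dist(p, c)) for c in centers]


if __name__ == "__main__":
    # Toy data: 12 majority points, 3 minority points (hypothetical).
    majority = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3),
                (5.0, 5.0), (5.1, 5.2), (5.2, 4.9), (4.9, 5.1),
                (9.0, 0.0), (9.1, 0.2), (8.9, 0.1), (9.2, 0.3)]
    minority = [(2.0, 2.0), (2.1, 2.2), (1.9, 2.1)]
    print(undersample_majority(majority, minority, strategy="centers"))
    print(undersample_majority(majority, minority, strategy="nearest"))
```

With the first strategy the reduced majority set consists of synthetic points (cluster means), while the second returns only points that actually occur in the data. The balanced training set is then the reduced majority set plus the unchanged minority set.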
Pages: 17-26
Number of pages: 10
Related papers
50 in total
  • [1] An Incremental Clustering-Based Fault Detection Algorithm for Class-Imbalanced Process Data
    Kwak, Jueun
    Lee, Taehyung
    Kim, Chang Ouk
    [J]. IEEE TRANSACTIONS ON SEMICONDUCTOR MANUFACTURING, 2015, 28 (03) : 318 - 328
  • [2] Consensus Clustering-Based Undersampling Approach to Imbalanced Learning
    Onan, Aytug
    [J]. SCIENTIFIC PROGRAMMING, 2019, 2019
  • [3] Subclass-based Undersampling for Class-imbalanced Image Classification
    Lehmann, Daniel
    Ebner, Marc
    [J]. PROCEEDINGS OF THE 17TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS (VISAPP), VOL 5, 2022, : 493 - 500
  • [4] Exploring of clustering algorithm on class-imbalanced data
    Li Xuan
    Chen Zhigang
    Yang Fan
    [J]. PROCEEDINGS OF THE 2013 8TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE & EDUCATION (ICCSE 2013), 2013, : 89 - 93
  • [5] An Earth mover's distance-based undersampling approach for handling class-imbalanced data
    Rekha, Gillala
    Krishna Reddy, V.
    Tyagi, Amit Kumar
    [J]. International Journal of Intelligent Information and Database Systems, 2020, 13 (2-4) : 376 - 392
  • [6] A Scalable Exemplar-Based Subspace Clustering Algorithm for Class-Imbalanced Data
    You, Chong
    Li, Chi
    Robinson, Daniel P.
    Vidal, Rene
    [J]. COMPUTER VISION - ECCV 2018, PT IX, 2018, 11213 : 68 - 85
  • [7] Clustering-based Binary-class Classification for Imbalanced Data Sets
    Chen, Chao
    Shyu, Mei-Ling
    [J]. 2011 IEEE INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION (IRI), 2011, : 384 - 389
  • [8] EUSC: A clustering-based surrogate model to accelerate evolutionary undersampling in imbalanced classification
    Hoang Lam Le
    Landa-Silva, Dario
    Galar, Mikel
    Garcia, Salvador
    Triguero, Isaac
    [J]. APPLIED SOFT COMPUTING, 2021, 101
  • [9] Imbalanced credit card fraud detection data: A solution based on hybrid neural network and clustering-based undersampling technique
    Huang, Huajie
    Liu, Bo
    Xue, Xiaoyu
    Cao, Jiuxin
    Chen, Xinyi
    [J]. APPLIED SOFT COMPUTING, 2024, 154
  • [10] Novel fuzzy clustering-based undersampling framework for class imbalance problem
    Vibha Pratap
    Amit Prakash Singh
    [J]. International Journal of System Assurance Engineering and Management, 2023, 14 : 967 - 976