A density-based oversampling approach for class imbalance and data overlap

Citations: 0
Authors
Zhang, Ruizhi [1]
Lu, Shaowu [1]
Yan, Baokang [1]
Yu, Puliang [1]
Tang, Xiaoqi [2]
Affiliations
[1] Wuhan Univ Sci & Technol, Sch Informat Sci & Engn, Heping Rd, Wuhan, Peoples R China
[2] Huazhong Univ Sci & Technol, Sch Mech Sci & Engn, Luoyu Rd, Wuhan, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Class imbalance; Data overlap; Synthetic minority oversampling technique; Kernel density estimation; Neighbor density selection; SAMPLING TECHNIQUE; SMOTE;
DOI
10.1016/j.cie.2023.109747
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification codes
081203 ; 0835 ;
Abstract
In data mining classification, class imbalance means that the classes differ markedly in their number of samples. Most classifiers assume a balanced class distribution or assign equal misclassification costs to all classes, so training directly on imbalanced data degrades classification performance. Oversampling algorithms can restore balance by synthesizing new samples, but the uncontrolled positions of the synthetic samples may aggravate data overlap and further degrade classification performance. To tackle this challenge, this paper proposes an improved synthetic minority oversampling technique based on kernel density estimation and neighbor density selection (KDENDS_SMOTE). First, each sample is mapped into a high-dimensional space, which avoids having to choose a window width and overcomes the limitation posed by nonlinearly separable data. Kernel density estimation is then used to derive the density ratio, which serves as a measure of the degree of data overlap. Next, the stability degree of the density ratio is calculated from neighbor information, and a scoring mechanism that combines the density ratio with its stability degree is proposed to assess the fitness of selected samples. Neighbor density selection based on this scoring mechanism then guides SMOTE to generate new samples within a safe and stable region, away from areas of data overlap. Finally, in a comparison with six advanced oversampling methods on fifteen real-world datasets, KDENDS_SMOTE effectively mitigates data overlap and improves classification performance.
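The pipeline described in the abstract (density ratio, stability of that ratio among neighbors, a combined score, then SMOTE restricted to high-scoring samples) can be illustrated with a minimal sketch. This is not the authors' KDENDS_SMOTE: the abstract does not give the formulas, so everything here is an assumption made for illustration. In particular, the sketch uses a Gaussian kernel with a fixed bandwidth `h` in the input space (the paper instead maps samples into a high-dimensional space to avoid choosing a window width), defines the density ratio as the minority share of local density, and uses a hypothetical score `ratio * (1 - spread)`. The function names and the `keep` parameter are likewise hypothetical.

```python
import math
import random

def kde(point, samples, h=1.0):
    """Gaussian kernel density estimate at `point` (fixed bandwidth h: an assumption)."""
    d = len(point)
    norm = len(samples) * (h * math.sqrt(2 * math.pi)) ** d
    total = sum(math.exp(-math.dist(point, s) ** 2 / (2 * h * h)) for s in samples)
    return total / norm

def density_ratio(point, minority, majority, h=1.0):
    """Assumed overlap measure: minority share of the local density (near 1 = safe)."""
    p_min = kde(point, minority, h)
    p_maj = kde(point, majority, h)
    return p_min / (p_min + p_maj + 1e-12)

def score_minority(minority, majority, k=3, h=1.0):
    """Hypothetical score: density ratio damped by how much it varies over k neighbors."""
    ratios = [density_ratio(p, minority, majority, h) for p in minority]
    scores = []
    for i, p in enumerate(minority):
        nbrs = sorted((j for j in range(len(minority)) if j != i),
                      key=lambda j: math.dist(p, minority[j]))[:k]
        # Mean absolute difference in ratio to the neighbors; small means stable.
        spread = sum(abs(ratios[i] - ratios[j]) for j in nbrs) / max(len(nbrs), 1)
        scores.append(ratios[i] * (1.0 - spread))
    return scores

def oversample(minority, majority, n_new, k=3, h=1.0, keep=0.8, seed=0):
    """SMOTE-style interpolation restricted to the top-scoring minority samples."""
    rng = random.Random(seed)
    scores = score_minority(minority, majority, k, h)
    order = sorted(range(len(minority)), key=lambda i: scores[i], reverse=True)
    safe = order[:max(2, int(keep * len(minority)))]  # indices of retained seeds
    synthetic = []
    for _ in range(n_new):
        i = rng.choice(safe)
        nbrs = sorted((j for j in safe if j != i),
                      key=lambda j: math.dist(minority[i], minority[j]))[:k]
        j = rng.choice(nbrs)
        t = rng.random()  # interpolation factor in [0, 1)
        a, b = minority[i], minority[j]
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

# Toy data: the last minority point sits inside the majority cluster.
minority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (2.0, 2.0)]
majority = [(2.0, 2.1), (2.2, 1.9), (1.9, 2.2), (2.1, 2.0), (1.8, 1.8), (2.3, 2.2)]
new_points = oversample(minority, majority, n_new=10, k=3, h=0.5, keep=0.8)
```

On this toy data the overlapping minority point receives the lowest score and is excluded from the seed set, so every synthetic point is interpolated between samples of the isolated minority cluster. Plain SMOTE, by contrast, could interpolate toward the overlapping point and place new samples between the two clusters.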
Pages: 14