A density-based oversampling approach for class imbalance and data overlap

Authors
Zhang, Ruizhi [1]
Lu, Shaowu [1]
Yan, Baokang [1]
Yu, Puliang [1]
Tang, Xiaoqi [2]
Affiliations
[1] Wuhan Univ Sci & Technol, Sch Informat Sci & Engn, Heping Rd, Wuhan, Peoples R China
[2] Huazhong Univ Sci & Technol, Sch Mech Sci & Engn, Luoyu Rd, Wuhan, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Class imbalance; Data overlap; Synthetic minority oversampling technique; Kernel density estimation; Neighbor density selection; SAMPLING TECHNIQUE; SMOTE;
DOI
10.1016/j.cie.2023.109747
Chinese Library Classification number
TP39 [Computer applications]
Subject classification codes
081203; 0835
Abstract
In data mining classification, class imbalance means that the classes differ markedly in their number of samples. Most classifiers assume a balanced class distribution or assign equal misclassification costs to all classes, so training directly on imbalanced data degrades classification performance. Oversampling algorithms restore balance by synthesizing new samples, but the uncontrolled positions of the synthetic samples may aggravate data overlap and further degrade classification performance. To tackle this challenge, this paper proposes an improved synthetic minority oversampling technique based on kernel density estimation and neighbor density selection (KDENDS_SMOTE). First, each sample is mapped into a high-dimensional space, which avoids having to choose a window width and overcomes the limitation of nonlinear separability. Kernel density estimation is then used to derive the density ratio, which serves as a measure of the degree of data overlap. Subsequently, the stability degree of the density ratio is calculated using neighbor information, and a scoring mechanism combining the density ratio and its stability degree is proposed to assess the fitness of selected samples. Neighbor density selection based on this scoring mechanism then guides SMOTE to generate new samples within safe and stable regions, away from areas of data overlap. Finally, in a comparison with six advanced oversampling methods on fifteen real-world datasets, KDENDS_SMOTE effectively mitigates data overlap and improves classification performance.
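The pipeline outlined in the abstract (density ratio, stability from neighbors, combined score, SMOTE restricted to safe samples) can be illustrated with a rough sketch. This is not the authors' KDENDS_SMOTE: the abstract gives no formulas, so the density ratio, the stability measure, the score, and every name and parameter below (`density_guided_smote`, `bandwidth`, `keep`, and so on) are assumptions chosen for illustration. In particular, the sketch uses a fixed-bandwidth Gaussian kernel in the input space, whereas the paper maps samples into a high-dimensional space to avoid choosing a window width.

```python
import numpy as np


def kernel_density(points, data, bandwidth):
    """Mean Gaussian kernel value of `data` at each row of `points`.

    The normalizing constant is omitted; it cancels in the ratio below
    because both classes share the same bandwidth and dimension.
    """
    diff = points[:, None, :] - data[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2)).mean(axis=1)


def knn_indices(X, k):
    """Indices of the k nearest other rows for every row of X."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]


def density_guided_smote(X_min, X_maj, n_new, k=5, bandwidth=1.0,
                         keep=0.5, seed=0):
    """Illustrative density-guided SMOTE (assumed scoring, not the paper's).

    Requires at least two minority samples.
    """
    rng = np.random.default_rng(seed)
    n = len(X_min)

    # Assumed density ratio: share of the local density owned by the
    # minority class. Values near 1 suggest little overlap.
    p_min = kernel_density(X_min, X_min, bandwidth)
    p_maj = kernel_density(X_min, X_maj, bandwidth)
    ratio = p_min / (p_min + p_maj)

    # Assumed stability: low spread of the ratio among minority neighbors.
    nbrs = knn_indices(X_min, min(k, n - 1))
    stability = 1.0 / (1.0 + ratio[nbrs].std(axis=1))

    # Assumed score: product of ratio and stability; keep the top fraction.
    score = ratio * stability
    n_keep = min(n, max(2, int(np.ceil(keep * n))))
    safe = X_min[np.argsort(score)[::-1][:n_keep]]

    # Standard SMOTE interpolation, restricted to the retained samples.
    k_safe = min(k, n_keep - 1)
    safe_nbrs = knn_indices(safe, k_safe)
    base = rng.integers(n_keep, size=n_new)
    mate = safe_nbrs[base, rng.integers(k_safe, size=n_new)]
    gap = rng.random((n_new, 1))
    return safe[base] + gap * (safe[mate] - safe[base])


# Toy data: two overlapping Gaussian blobs (made up for this example).
data_rng = np.random.default_rng(42)
X_maj = data_rng.normal(0.0, 1.0, size=(200, 2))
X_min = data_rng.normal(2.0, 1.0, size=(20, 2))
X_new = density_guided_smote(X_min, X_maj, n_new=180)
print(X_new.shape)  # (180, 2)
```

Because every synthetic point is a convex combination of two retained minority samples, the new points stay inside the region spanned by the samples that scored as safe. The `keep` fraction controls how aggressively low-scoring (likely overlapping) minority samples are excluded from seeding.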
Pages: 14