Tackling Class Imbalance Problem in Software Defect Prediction Through Cluster-Based Over-Sampling With Filtering

Cited by: 34
Authors
Gong, Lina [1 ,2 ,3 ]
Jiang, Shujuan [1 ,2 ]
Jiang, Li [1 ,2 ]
Affiliations
[1] China Univ Min & Technol, Sch Comp Sci & Technol, Xuzhou 221116, Jiangsu, Peoples R China
[2] China Univ Min & Technol, Minist Educ, Mine Digitizat Engn Res Ctr, Xuzhou 221116, Jiangsu, Peoples R China
[3] Zaozhuang Univ, Dept Informat Sci & Engn, Zaozhuang 277160, Peoples R China
Source
IEEE ACCESS | 2019 / Vol. 7
Keywords
Software defect prediction; over-sampling; class imbalance; K-means; noise filtering; SMOTE; QUALITY; MODELS; NOISY;
DOI
10.1109/ACCESS.2019.2945858
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
In practice, Software Defect Prediction (SDP) models often suffer from highly imbalanced data, which makes it difficult for classifiers to identify defective instances. Many techniques have recently been proposed to tackle this problem, and over-sampling is one of the best-known methods for addressing class imbalance. It balances the number of defective and non-defective instances by generating new defective instances. However, existing over-sampling approaches tend to generate non-diverse synthetic instances and, at the same time, many unnecessary noise instances. Motivated by this, we propose a cluster-based over-sampling with noise filtering (KMFOS) approach to tackle the class imbalance problem in SDP. KMFOS first divides the defective instances into K clusters and then generates new defective instances by interpolating between instances from each pair of clusters, so that the new defective instances spread diversely over the space of the defective data. We then extend this cluster-based over-sampling with Closest List Noise Identification (CLNI) to clean out the noise instances. We conduct extensive experiments on 24 projects to compare KMFOS with over-sampling approaches such as SMOTE, Borderline-SMOTE, ADASYN, random over-sampling (ROS), K-means SMOTE, SMOTE + IPF, SMOTE + ENN and SMOTE + Tomek Links, using five prediction classifiers. We also compare KMFOS with other state-of-the-art class-imbalance methods, including the balance bagging classifier, the RUS boost classifier, Instance Hardness Threshold and cost-sensitive methods. The experimental results indicate that KMFOS obtains better Recall and bal values than the other over-sampling methods and the other class-imbalance methods compared. Hence, KMFOS is an efficient approach for generating balanced data for SDP, and it improves the performance of prediction models.
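The two stages described in the abstract (cluster the defective instances, interpolate between instances drawn from two different clusters, then filter noisy synthetic instances) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the plain Lloyd's k-means, the uniform interpolation weight, and the parameters `n_neighbors` and `max_other` are all assumptions made for illustration, and the filter is a simplified single-pass nearest-neighbour check standing in for CLNI, not the CLNI procedure itself.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means (assumed here); returns one label per point."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_oversample(defective, n_new, k=3, seed=0):
    """Create n_new synthetic defective points by interpolating between
    two instances taken from two different clusters."""
    rng = random.Random(seed)
    labels = kmeans(defective, k, rng)
    clusters = [[p for p, lab in zip(defective, labels) if lab == c]
                for c in range(k)]
    clusters = [c for c in clusters if c]  # drop empty clusters
    if len(clusters) < 2:
        raise ValueError("need at least two non-empty clusters")
    synthetic = []
    while len(synthetic) < n_new:
        first, second = rng.sample(clusters, 2)  # two distinct clusters
        x, y = rng.choice(first), rng.choice(second)
        t = rng.random()  # interpolation weight in [0, 1); an assumption
        synthetic.append([xi + t * (yi - xi) for xi, yi in zip(x, y)])
    return synthetic


def filter_noise(synthetic, defective, clean, n_neighbors=5, max_other=0.6):
    """Simplified stand-in for CLNI: drop a synthetic defective point when
    the share of non-defective points among its nearest original neighbours
    reaches max_other."""
    labelled = [(p, 1) for p in defective] + [(p, 0) for p in clean]
    kept = []
    for s in synthetic:
        nearest = sorted(labelled, key=lambda item: sq_dist(s, item[0]))
        nearest = nearest[:n_neighbors]
        other = sum(1 for _, y in nearest if y == 0) / len(nearest)
        if other < max_other:
            kept.append(s)
    return kept


if __name__ == "__main__":
    # Toy 2-D data, invented purely for illustration.
    defective = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    clean = [[5, 5], [5, 6], [6, 5], [4, 5], [5, 4]]
    new_points = cluster_oversample(defective, n_new=8, k=2)
    print(filter_noise(new_points, defective, clean, n_neighbors=3))
```

Because each synthetic point lies on a segment joining two different clusters, some points can land in regions dominated by non-defective instances, which is why a filtering step follows the over-sampling step in this sketch.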
Pages: 145725 - 145737
Number of pages: 13