SOFTWARE DEFECT PREDICTION: ANALYSIS OF CLASS IMBALANCE AND PERFORMANCE STABILITY

Authors
Balogun, Abdullateef O. [1, 2, 5]
Basri, Shuib [1, 5]
Abdulkadir, Said J. [1, 6]
Adeyemo, Victor E. [3]
Imam, Abdullahi A. [1, 4, 5]
Bajeh, Amos O. [2]
Affiliations
[1] Univ Teknol Petronas, Dept Comp & Informat Sci, Seri Iskandar 32610, Perak, Malaysia
[2] Univ Ilorin, Dept Comp Sci, Ilorin, Nigeria
[3] Taylors Univ, Sch Comp & IT, Subang Jaya, Selangor, Malaysia
[4] Ahmadu Bello Univ, Dept Comp Sci, Zaria, Nigeria
[5] Univ Teknol Petronas, Software Qual & Qual Engn SQ2E Res Cluster, Seri Iskandar 32610, Perak, Malaysia
[6] Univ Teknol Petronas, Ctr Res Data Sci CERDAS, Seri Iskandar 32610, Perak, Malaysia
Keywords
Class imbalance; Data quality; Software defect prediction; SELECTION; CLASSIFICATION; FRAMEWORK; MODEL
DOI
Not available
Chinese Library Classification (CLC) number
T [Industrial Technology]
Discipline classification code
08
Abstract
The performance of prediction models in software defect prediction (SDP) depends on the quality of the datasets used to train them. Class imbalance is one of the data quality problems that affect prediction models. It has drawn the attention of researchers, and many approaches have been developed to address it. This study presents an extensive empirical evaluation of the performance stability of prediction models in SDP. Ten software defect datasets from the NASA and PROMISE repositories, with varying imbalance ratio (IR) values, were used as the original datasets. IR is the ratio of defective instances to non-defective instances in a dataset. New datasets with different IR values were generated from the original datasets using undersampling (Random Under-Sampling, RUS) and oversampling (Synthetic Minority Oversampling Technique, SMOTE). Sampling proceeded in equal proportions (100%), incrementing the minority class (SMOTE) or decrementing the majority class (RUS) until each dataset was balanced. Each newly generated dataset was randomized before prediction models were applied. Nine standard prediction models were applied to the newly generated datasets. Model performance was measured using the Area Under the Curve (AUC), and the Coefficient of Variation (CV) was used to determine performance stability. First, the experimental results showed that class imbalance had a negative effect on the performance of prediction models and that oversampling (SMOTE) improved it. Second, oversampling was a better way to balance datasets than undersampling, which performed poorly because it randomly deletes useful instances. Finally, among the prediction models studied, Logistic Regression (LR) (CV with RUS: 30.05; SMOTE: 33.51), Naive Bayes (NB) (RUS: 34.18; SMOTE: 33.05), and Random Forest (RF) (RUS: 29.24; SMOTE: 64.25) appeared to be the more stable models and worked well with imbalanced datasets.
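The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the 0/1 label encoding (1 = defective, assumed to be the minority class), the fixed seed, and the use of the population standard deviation for CV are all assumptions made for illustration; the abstract does not specify them. SMOTE is omitted because it requires synthesizing new instances, which goes beyond a short standard-library example.

```python
import random
import statistics


def imbalance_ratio(labels):
    """IR as defined in the abstract: defective / non-defective instances.

    Assumes labels are encoded as 1 (defective) and 0 (non-defective).
    """
    defective = sum(1 for y in labels if y == 1)
    non_defective = len(labels) - defective
    return defective / non_defective


def coefficient_of_variation(scores):
    """CV in percent: standard deviation divided by the mean, times 100.

    Assumption: population standard deviation (pstdev); the abstract does
    not say which variant the study used.
    """
    return statistics.pstdev(scores) / statistics.mean(scores) * 100


def random_undersample(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes are equal in size.

    Assumes class 1 (defective) is the minority. The result is shuffled,
    mirroring the randomization step mentioned in the abstract.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Hypothetical toy dataset: 20 defective and 80 non-defective modules.
labels = [1] * 20 + [0] * 80
rows = list(range(100))  # placeholder feature rows

ir_before = imbalance_ratio(labels)
balanced_rows, balanced_labels = random_undersample(rows, labels)
ir_after = imbalance_ratio(balanced_labels)

# Hypothetical AUC values of one model across datasets with different IRs.
cv = coefficient_of_variation([0.70, 0.75, 0.80])
```

Under this reading, a lower CV across datasets with different IR values means the model's AUC varies less, which is what the abstract calls performance stability.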
Pages: 3294-3308
Page count: 15