On the relative value of data resampling approaches for software defect prediction

Cited by: 0
Authors
Kwabena Ebo Bennin
Jacky W. Keung
Akito Monden
Affiliations
[1] City University of Hong Kong,Department of Computer Science
[2] Okayama University,Graduate School of Natural Science and Technology
Source
Empirical Software Engineering, 2019, 24 (02)
Keywords
Software defect prediction; Imbalanced data; Data resampling approaches; Class imbalance; Empirical study
DOI
Not available
Abstract
Software defect data sets are typically characterized by an unbalanced class distribution in which defective modules are fewer than non-defective modules. The prediction performance of defect prediction models is detrimentally affected by this skewed distribution, in which faulty modules form the minority, since most algorithms assume that both classes in the data set are equally balanced. Resampling approaches address this concern by modifying the class distribution to balance the minority and majority classes. However, very little is known about the best distribution for attaining high performance, especially in a more practical scenario. Results remain inconclusive regarding the suitable ratio of defective to clean instances (Pfp), the statistical and practical impacts of resampling approaches on prediction performance, and which resampling approach is more stable across several performance measures. To assess the impact of resampling approaches, we investigated the bias and effect of commonly used resampling approaches on prediction accuracy in software defect prediction. Analyses of six resampling approaches on 40 releases of 20 open-source projects across five performance measures and five imbalance rates were performed. The experimental results indicate that there were statistical differences between the prediction results with and without resampling methods when evaluated with the geometric mean (g-mean), recall (pd), probability of false alarm (pf) and balance performance measures. However, resampling methods could not improve the AUC values across all prediction models, implying that resampling methods can help in defect classification but not in defect prioritization. A stable Pfp rate was dependent on the performance measure used: lower Pfp rates are required for lower pf values, while higher Pfp values are required for higher pd values.
Random Under-Sampling and Borderline-SMOTE proved to be the more stable resampling methods across several performance measures among those studied. The performance of resampling methods is dependent on the imbalance ratio, the evaluation measure and, to some extent, the prediction model. Newer oversampling methods should aim at generating relevant and informative data samples rather than merely increasing the number of minority samples.
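The abstract does not spell out how the measures or the Pfp rate are computed. As a rough illustration only, the sketch below uses the definitions of pd, pf, g-mean and balance that are common in the defect prediction literature, and assumes that Pfp means the fraction of defective instances in the resampled data. The function names `defect_measures` and `random_under_sample` are hypothetical and are not taken from the paper.

```python
import math
import random


def defect_measures(tp, fn, fp, tn):
    """Compute pd, pf, g-mean and balance from confusion-matrix counts.

    Assumed (commonly used) definitions, not quoted from the paper:
      pd      = TP / (TP + FN)          recall on defective modules
      pf      = FP / (FP + TN)          probability of false alarm
      g-mean  = sqrt(pd * (1 - pf))
      balance = 1 - sqrt((0 - pf)^2 + (1 - pd)^2) / sqrt(2)
    """
    pd = tp / (tp + fn) if (tp + fn) else 0.0
    pf = fp / (fp + tn) if (fp + tn) else 0.0
    g_mean = math.sqrt(pd * (1.0 - pf))
    balance = 1.0 - math.sqrt((0.0 - pf) ** 2 + (1.0 - pd) ** 2) / math.sqrt(2.0)
    return {"pd": pd, "pf": pf, "g_mean": g_mean, "balance": balance}


def random_under_sample(labels, target_pfp, seed=0):
    """Return indices kept after randomly dropping clean (label 0) instances.

    Assumption: target_pfp is the desired fraction of defective (label 1)
    instances after resampling, with 0 < target_pfp < 1. All defective
    instances are kept; clean instances are sampled without replacement.
    """
    if not 0.0 < target_pfp < 1.0:
        raise ValueError("target_pfp must lie strictly between 0 and 1")
    defective = [i for i, y in enumerate(labels) if y == 1]
    clean = [i for i, y in enumerate(labels) if y == 0]
    # Number of clean instances that yields the requested defective fraction.
    n_clean_target = round(len(defective) * (1.0 - target_pfp) / target_pfp)
    n_keep = min(len(clean), n_clean_target)
    rng = random.Random(seed)
    kept_clean = rng.sample(clean, n_keep)
    return sorted(defective + kept_clean)


if __name__ == "__main__":
    labels = [1] * 10 + [0] * 90            # 10% defective before resampling
    kept = random_under_sample(labels, 0.5)  # aim for 50% defective
    print(len(kept), sum(labels[i] for i in kept))
    print(defect_measures(tp=5, fn=5, fp=0, tn=10))
```

Under these assumptions, raising the target Pfp removes more clean instances from the training data, which is the kind of setting the abstract associates with higher pd at the cost of higher pf.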
Pages: 602 - 636
Page count: 34
Related papers
50 in total
  • [1] On the relative value of data resampling approaches for software defect prediction
    Bennin, Kwabena Ebo
    Keung, Jacky W.
    Monden, Akito
    [J]. EMPIRICAL SOFTWARE ENGINEERING, 2019, 24 (02) : 602 - 636
  • [2] An empirical study on the effectiveness of data resampling approaches for cross-project software defect prediction
    Bennin, Kwabena Ebo
    Tahir, Amjed
    MacDonell, Stephen G.
    Borstler, Jurgen
    [J]. IET SOFTWARE, 2022, 16 (02) : 185 - 199
  • [3] Applying novel resampling strategies to software defect prediction
    Pelayo, Lourdes
    Dick, Scott
    [J]. NAFIPS 2007 - 2007 ANNUAL MEETING OF THE NORTH AMERICAN FUZZY INFORMATION PROCESSING SOCIETY, 2007, : 69 - +
  • [4] Software Defect Prediction with Bayesian Approaches
    Hernandez-Molinos, Maria Jose
    Sanchez-Garcia, Angel J.
    Barrientos-Martinez, Rocio Erandi
    Perez-Arriaga, Juan Carlos
    Ocharan-Hernandez, Jorge Octavio
    [J]. MATHEMATICS, 2023, 11 (11)
  • [5] Progress on approaches to software defect prediction
    Li, Zhiqiang
    Jing, Xiao-Yuan
    Zhu, Xiaoke
    [J]. IET SOFTWARE, 2018, 12 (03) : 161 - 175
  • [6] Impact of the Distribution Parameter of Data Sampling Approaches on Software Defect Prediction Models
    Bennin, Kwabena Ebo
    Keung, Jacky
    Monden, Akito
    [J]. 2017 24TH ASIA-PACIFIC SOFTWARE ENGINEERING CONFERENCE (APSEC 2017), 2017, : 630 - 635
  • [7] Software Defect Prediction with Skewed Data
    Seliya, Naeem
    Khoshgoftaar, Taghi M.
    [J]. 16TH ISSAT INTERNATIONAL CONFERENCE ON RELIABILITY AND QUALITY IN DESIGN, 2010, : 403 - +
  • [8] On the relative value of cross-company and within-company data for defect prediction
    Burak Turhan
    Tim Menzies
    Ayşe B. Bener
    Justin Di Stefano
    [J]. Empirical Software Engineering, 2009, 14 : 540 - 578
  • [9] On the relative value of cross-company and within-company data for defect prediction
    Turhan, Burak
    Menzies, Tim
    Bener, Ayse B.
    Di Stefano, Justin
    [J]. EMPIRICAL SOFTWARE ENGINEERING, 2009, 14 (05) : 540 - 578
  • [10] On the Value of Oversampling for Deep Learning in Software Defect Prediction
    Yedida, Rahul
    Menzies, Tim
    [J]. IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2022, 48 (08) : 3103 - 3116