The Significant Effects of Data Sampling Approaches on Software Defect Prioritization and Classification

Cited by: 25
Authors
Bennin, Kwabena Ebo [1]
Keung, Jacky [1]
Monden, Akito [2]
Phannachitta, Passakorn [3]
Mensah, Solomon [1]
Affiliations
[1] City Univ Hong Kong, Dept Comp Sci, Hong Kong, Hong Kong, Peoples R China
[2] Okayama Univ, Grad Sch Nat Sci & Technol, Okayama, Japan
[3] Chiang Mai Univ, Coll Arts Media & Technol, Chiang Mai, Thailand
Keywords
Imbalanced data; Defect prediction; Sampling methods; Statistical significance; Empirical software engineering; Static code attributes; Prediction; SMOTE
DOI
10.1109/ESEM.2017.50
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Context: Recent studies have shown that the performance of defect prediction models can be affected when data sampling approaches are applied to the imbalanced training data used to build them. However, the magnitude (degree and power) of the effect of these sampling methods on the classification and prioritization performance of defect prediction models is still unknown. Goal: To investigate the statistical and practical significance of using resampled data for constructing defect prediction models. Method: We examine the practical effects of six data sampling methods on the performance of five defect prediction models. The prediction performance of models trained on the default datasets (no sampling method) is compared with that of models trained on resampled datasets (sampling methods applied). To decide whether the performance changes are significant, robust statistical tests are performed and effect sizes are computed. Twenty releases of ten open-source projects extracted from the PROMISE repository are considered and evaluated using the AUC, pd, pf and G-mean performance measures. Results: There are statistically significant differences and practical effects on classification performance (pd, pf and G-mean) between models trained on resampled datasets and those trained on the default datasets. However, sampling methods have no statistically or practically significant effect on defect prioritization performance (AUC), with small or no effect sizes obtained from the models trained on the resampled datasets. Conclusions: Existing sampling methods can properly set the threshold between buggy and clean samples, but they cannot improve the prediction of defect-proneness itself. Sampling methods are highly recommended for defect classification purposes when all faulty modules are to be considered for testing.
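The distinction the abstract draws between classification measures (pd, pf, G-mean) and the prioritization measure (AUC) can be illustrated with a minimal Python sketch. It is not taken from the paper: the labels and scores are made up, the function names are hypothetical, and it assumes the definitions commonly used in the defect prediction literature (pd as recall on defective modules, pf as the false alarm rate, G-mean as sqrt(pd * (1 - pf)), AUC as the probability that a defective module is ranked above a clean one). Moving the decision threshold stands in here for the threshold-shifting effect the conclusions attribute to sampling.

```python
import math


def classification_measures(y_true, y_pred):
    """Return (pd, pf, g_mean) for binary labels: 1 = defective, 0 = clean."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pd = tp / (tp + fn) if (tp + fn) else 0.0  # probability of detection
    pf = fp / (fp + tn) if (fp + tn) else 0.0  # probability of false alarm
    g_mean = math.sqrt(pd * (1.0 - pf))        # assumed G-mean definition
    return pd, pf, g_mean


def auc(y_true, scores):
    """Rank-based AUC: share of (defective, clean) pairs ordered correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ordering
    return wins / (len(pos) * len(neg))


def predict(scores, threshold):
    """Label a module defective when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


# Made-up, imbalanced example: 4 defective and 6 clean modules.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.35, 0.45, 0.3, 0.2, 0.2, 0.1, 0.05]

# Same scores, two thresholds: the classification measures change ...
high = classification_measures(y_true, predict(scores, 0.5))
low = classification_measures(y_true, predict(scores, 0.3))

# ... while AUC depends only on the ranking of the scores.
ranking_quality = auc(y_true, scores)

print("threshold 0.5 -> pd=%.3f pf=%.3f G-mean=%.3f" % high)
print("threshold 0.3 -> pd=%.3f pf=%.3f G-mean=%.3f" % low)
print("AUC=%.3f" % ranking_quality)
```

In this toy example, lowering the threshold raises pd and pf together and changes G-mean, whereas the AUC is unaffected because the ordering of the scores is unchanged.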
Pages: 364-373
Number of pages: 10