The Significant Effects of Data Sampling Approaches on Software Defect Prioritization and Classification

Cited by: 25
Authors
Bennin, Kwabena Ebo [1]
Keung, Jacky [1]
Monden, Akito [2]
Phannachitta, Passakorn [3]
Mensah, Solomon [1]
Affiliations
[1] City Univ Hong Kong, Dept Comp Sci, Hong Kong, Hong Kong, Peoples R China
[2] Okayama Univ, Grad Sch Nat Sci & Technol, Okayama, Japan
[3] Chiang Mai Univ, Coll Arts Media & Technol, Chiang Mai, Thailand
Keywords
Imbalanced data; Defect prediction; Sampling methods; Statistical significance; Empirical software engineering; Static code attributes; Prediction; SMOTE
DOI
10.1109/ESEM.2017.50
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Context: Recent studies have shown that the performance of defect prediction models can be affected when data sampling approaches are applied to the imbalanced training data used to build them. However, the magnitude (degree and power) of the effect of these sampling methods on the classification and prioritization performance of defect prediction models is still unknown.
Goal: To investigate the statistical and practical significance of using resampled data for constructing defect prediction models.
Method: We examine the practical effects of six data sampling methods on the performance of five defect prediction models. The prediction performance of models trained on default datasets (no sampling method) is compared with that of models trained on resampled datasets (sampling methods applied). To decide whether the performance changes are significant, robust statistical tests are performed and effect sizes are computed. Twenty releases of ten open-source projects extracted from the PROMISE repository are considered and evaluated using the AUC, pd, pf, and G-mean performance measures.
Results: There are statistically significant differences and practical effects on classification performance (pd, pf, and G-mean) between models trained on resampled datasets and those trained on the default datasets. However, sampling methods have no statistical or practical effect on defect prioritization performance (AUC): small or no effect sizes are obtained from the models trained on the resampled datasets.
Conclusions: Existing sampling methods can properly set the threshold between buggy and clean samples, but they cannot improve the prediction of defect-proneness itself. Sampling methods are highly recommended for defect classification purposes when all faulty modules are to be considered for testing.
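The abstract's distinction between classification measures (pd, pf, G-mean) and the prioritization measure (AUC) can be made concrete with a small sketch. It is not taken from the paper: the function names `classification_measures` and `rank_auc`, the toy labels and scores, and the two thresholds are all hypothetical. It assumes the usual definitions pd = TP / (TP + FN) and pf = FP / (FP + TN), and it assumes G-mean is computed as sqrt(pd * (1 - pf)), a common choice that the abstract itself does not spell out.

```python
import math


def classification_measures(y_true, y_pred):
    """Return (pd, pf, g_mean) for binary labels where 1 means defective."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    pd = tp / (tp + fn) if (tp + fn) else 0.0  # probability of detection
    pf = fp / (fp + tn) if (fp + tn) else 0.0  # probability of false alarm
    g_mean = math.sqrt(pd * (1.0 - pf))        # assumed G-mean definition
    return pd, pf, g_mean


def rank_auc(y_true, scores):
    """AUC as the share of (defective, clean) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one module of each class")
    # A tie between a defective and a clean module counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced toy data: 2 defective modules, 6 clean ones.
labels = [1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.45, 0.30, 0.40, 0.20, 0.10, 0.35, 0.05, 0.15]

for threshold in (0.5, 0.25):
    predicted = [int(s >= threshold) for s in scores]
    pd, pf, g_mean = classification_measures(labels, predicted)
    print(f"threshold={threshold}: pd={pd:.2f} pf={pf:.2f} G-mean={g_mean:.2f}")

# The AUC depends only on how the scores rank the modules, not on a threshold.
print(f"AUC={rank_auc(labels, scores):.2f}")
```

On this toy data, lowering the cut-off from 0.5 to 0.25 moves pd from 0 to 1 and G-mean from 0 to about 0.82, while the AUC stays at 10/12 because the ranking of modules does not change. This only illustrates why threshold-sensitive measures can shift while a ranking measure does not; it says nothing about how any particular sampling method behaves.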
Pages: 364-373
Number of pages: 10