On the Use of Evaluation Measures for Defect Prediction Studies

Cited by: 14
Authors:
Moussa, Rebecca [1 ]
Sarro, Federica [1 ]
Affiliation:
[1] UCL, London, England
Funding:
European Research Council
Keywords:
Software Defect Prediction; Evaluation Measures; Static Code Attributes
DOI:
10.1145/3533767.3534405
CLC number:
TP31 [computer software]
Discipline code:
081202; 0835
Abstract:
Software defect prediction research has adopted various evaluation measures to assess the performance of prediction models. In this paper, we stress the importance of choosing appropriate measures in order to correctly assess the strengths and weaknesses of a given defect prediction model, especially since most defect prediction tasks suffer from data imbalance. Investigating 111 previous studies published between 2010 and 2020, we found that over half either use only one evaluation measure, which alone cannot capture all the characteristics of model performance in the presence of imbalanced data, or a set of binary measures which are prone to bias when used to assess models, especially those trained on imbalanced data. We also unveil, for the first time, the magnitude of the impact of assessing popular defect prediction models with several evaluation measures, based on both statistical significance tests and effect size analyses. Our results reveal that the evaluation measures produce a different ranking of the classification models in 82% and 85% of the cases studied according to the Wilcoxon statistical significance test and the Â12 effect size, respectively. Further, we observe a very high rank disruption (between 64% and 92% on average) for each of the measures investigated. This signifies that, in the majority of cases, a prediction technique believed to be better than others under a given evaluation measure becomes worse under a different one. We conclude by providing recommendations for selecting appropriate evaluation measures based on factors specific to the problem at hand, such as the class distribution of the training data and the way in which the model has been built and will be used. Moreover, we recommend including in the set of evaluation measures at least one able to capture the full picture of the confusion matrix, such as MCC.
This will enable researchers to assess whether proposals made in previous work can be applied for purposes different from those originally intended. Besides, we recommend reporting, whenever possible, the raw confusion matrix, so that other researchers can compute any measure of interest, thereby making it feasible to draw meaningful comparisons across different studies.
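As a minimal illustration of the abstract's two key quantities, the following Python sketch computes MCC from raw confusion-matrix counts and the Vargha-Delaney Â12 effect size from two score samples. The function names and the example counts in the comments are illustrative, not taken from the paper:

```python
from math import sqrt

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from raw confusion-matrix counts.

    Uses all four cells, so a trivial majority-class predictor on
    imbalanced data (e.g. tp=0, fp=0) scores 0 rather than looking good.
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def a12(xs, ys):
    """Vargha-Delaney A-hat-12: estimated probability that a value drawn
    from xs exceeds one drawn from ys, counting ties as one half."""
    gt = sum(1 for x in xs for y in ys if x > y)
    eq = sum(1 for x in xs for y in ys if x == y)
    return (gt + 0.5 * eq) / (len(xs) * len(ys))
```

Reporting the four raw counts (tp, fp, fn, tn), as the abstract recommends, lets any reader recompute MCC or any other confusion-matrix-based measure after the fact.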
Pages: 101-113
Number of pages: 13