Analyzing data sets with missing data: An empirical evaluation of imputation methods and likelihood-based methods

Cited by: 152

Authors:

Myrtveit, I [1]

Stensrud, E [1]

Olsson, UH [1]

Affiliations:

[1] Norwegian Sch Management, N-1301 Sandvika, Norway

Source:

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING | 2001 / Vol. 27 / No. 11

Keywords:

software effort prediction; cost estimation; missing data; imputation methods; listwise deletion; mean imputation; similar response pattern imputation; full information maximum likelihood; log-log regression; ERP;

DOI:

10.1109/32.965340

Chinese Library Classification (CLC) number:

TP31 [Computer software];

Discipline classification code:

081202; 0835;

Abstract:

Missing data are often encountered in data sets used to construct effort prediction models. Thus far, the common practice has been to ignore observations with missing data. This may result in biased prediction models. In this paper, we evaluate four missing data techniques (MDTs) in the context of software cost modeling: listwise deletion (LD), mean imputation (MI), similar response pattern imputation (SRPI), and full information maximum likelihood (FIML). We apply the MDTs to an ERP data set, and thereafter construct regression-based prediction models using the resulting data sets. The evaluation suggests that only FIML is appropriate when the data are not missing completely at random (MCAR). Unlike FIML, prediction models constructed on LD, MI and SRPI data sets will be biased unless the data are MCAR. Furthermore, compared to LD, MI and SRPI seem appropriate only if the resulting LD data set is too small to enable the construction of a meaningful regression-based prediction model.


Pages: 999-1013

Page count: 15

