Analyzing data sets with missing data: An empirical evaluation of imputation methods and likelihood-based methods

Cited by: 152
Authors
Myrtveit, I [1]
Stensrud, E [1]
Olsson, UH [1]
Affiliation
[1] Norwegian Sch Management, N-1301 Sandvika, Norway
Keywords
software effort prediction; cost estimation; missing data; imputation methods; listwise deletion; mean imputation; similar response pattern imputation; full information maximum likelihood; log-log regression; ERP
DOI
10.1109/32.965340
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Missing data are often encountered in data sets used to construct effort prediction models. Thus far, the common practice has been to ignore observations with missing data. This may result in biased prediction models. In this paper, we evaluate four missing data techniques (MDTs) in the context of software cost modeling: listwise deletion (LD), mean imputation (MI), similar response pattern imputation (SRPI), and full information maximum likelihood (FIML). We apply the MDTs to an ERP data set, and thereafter construct regression-based prediction models using the resulting data sets. The evaluation suggests that only FIML is appropriate when the data are not missing completely at random (MCAR). Unlike FIML, prediction models constructed on LD, MI, and SRPI data sets will be biased unless the data are MCAR. Furthermore, compared to LD, MI and SRPI seem appropriate only if the resulting LD data set is too small to enable the construction of a meaningful regression-based prediction model.
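To make the two simplest techniques named in the abstract concrete, the following is a minimal Python sketch of listwise deletion and mean imputation, each followed by a log-log regression fit. It is an illustration under stated assumptions, not the authors' implementation: the `projects` records are made-up toy values (not the ERP data set), the function names are hypothetical, and the model uses a single predictor. SRPI and FIML are not shown.

```python
import math

# Toy records (made-up values, NOT the paper's ERP data): (size, effort).
# None marks a missing size measurement.
projects = [
    (10.0, 100.0),
    (None, 180.0),
    (40.0, 400.0),
    (20.0, 200.0),
    (None, 90.0),
    (80.0, 800.0),
]


def listwise_deletion(rows):
    """Listwise deletion: drop every observation with at least one missing value."""
    return [r for r in rows if all(v is not None for v in r)]


def mean_imputation(rows):
    """Mean imputation: replace each missing value with its column's observed mean."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [
        tuple(means[j] if r[j] is None else r[j] for j in range(n_cols))
        for r in rows
    ]


def fit_log_log(rows):
    """Ordinary least squares fit of ln(effort) = a + b * ln(size); returns (a, b)."""
    xs = [math.log(size) for size, _ in rows]
    ys = [math.log(effort) for _, effort in rows]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


ld_rows = listwise_deletion(projects)   # keeps only the complete observations
mi_rows = mean_imputation(projects)     # keeps all observations, sizes filled in

a_ld, b_ld = fit_log_log(ld_rows)
a_mi, b_mi = fit_log_log(mi_rows)
print("LD:", len(ld_rows), "rows, slope", b_ld)
print("MI:", len(mi_rows), "rows, slope", b_mi)
```

In this toy data the complete observations lie exactly on a line in log-log space, so the two fits differ only because of the imputed rows. That shows how the choice of technique can change the fitted model, but it says nothing about which technique is less biased on real data.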
Pages: 999-1013
Number of pages: 15