Categorical missing data imputation for software cost estimation by multinomial logistic regression

Cited by: 44
Authors
Sentas, P [1]
Angelis, L [1]
Affiliations
[1] Aristotle Univ Thessaloniki, Dept Informat, Thessaloniki 54124, Greece
Keywords
software effort prediction; cost estimation; missing data; imputation; multinomial logistic regression
DOI
10.1016/j.jss.2005.02.026
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
A common problem in software cost estimation is the handling of incomplete or missing data in databases used for the development of prediction models. In such cases, the most popular and simple method of handling missing data is to ignore either the projects or the attributes with missing observations. This technique causes the loss of valuable information and therefore may lead to inaccurate cost estimation models. On the other hand, there are various imputation methods used to estimate the missing values in a data set. These methods are applied mainly to numerical data and produce continuous estimates. However, it is well known that the majority of cost data sets contain software projects with mostly categorical attributes, many of which have missing values. It is therefore reasonable to use an estimation method that produces categorical rather than continuous values. The purpose of this paper is to investigate the possibility of using such a method for estimating categorical missing values in software cost databases. Specifically, the method known as multinomial logistic regression (MLR) is suggested for imputation and is applied to projects of the ISBSG multi-organizational software database. Comparisons of MLR with other techniques for handling missing data, such as listwise deletion (LD), mean imputation (MI), expectation maximization (EM) and regression imputation (RI), under different patterns and percentages of missing data, show the high efficiency of the proposed method. © 2005 Elsevier Inc. All rights reserved.
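The idea in the abstract, imputing a missing categorical attribute with multinomial logistic regression, can be illustrated with a minimal sketch. This is not the authors' implementation: the field names (`platform`, `x1`, `x2`), the toy data, the gradient-descent fitting and the choice to fill each gap with the most probable category are all assumptions made for illustration. The model is fitted only on rows where the target is observed, then used to predict the target for rows where it is missing.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(weights, x):
    """Linear score per class; index 0 of each weight vector is the intercept."""
    row = [1.0] + list(x)
    return [sum(w * v for w, v in zip(wk, row)) for wk in weights]

def fit_mlr(X, y, classes, lr=0.5, epochs=2000):
    """Fit multinomial logistic regression by plain batch gradient descent."""
    n_cols = len(X[0]) + 1
    weights = [[0.0] * n_cols for _ in classes]
    index = {c: k for k, c in enumerate(classes)}
    for _ in range(epochs):
        grad = [[0.0] * n_cols for _ in classes]
        for x, label in zip(X, y):
            row = [1.0] + list(x)
            probs = softmax(class_scores(weights, x))
            for k in range(len(classes)):
                # Gradient of the cross-entropy loss: predicted minus observed.
                err = probs[k] - (1.0 if k == index[label] else 0.0)
                for j, v in enumerate(row):
                    grad[k][j] += err * v
        for k in range(len(classes)):
            for j in range(n_cols):
                weights[k][j] -= lr * grad[k][j] / len(X)
    return weights

def impute_categorical(rows, target, predictors):
    """Return copies of rows; missing `target` values get the most probable category."""
    complete = [r for r in rows if r[target] is not None]
    classes = sorted({r[target] for r in complete})
    X = [[r[p] for p in predictors] for r in complete]
    y = [r[target] for r in complete]
    weights = fit_mlr(X, y, classes)
    result = []
    for r in rows:
        filled = dict(r)
        if filled[target] is None:
            scores = class_scores(weights, [filled[p] for p in predictors])
            best = max(range(len(classes)), key=lambda k: scores[k])
            filled[target] = classes[best]
        result.append(filled)
    return result

# Hypothetical toy projects: x1 and x2 are made-up, pre-scaled numeric predictors.
projects = [
    {"x1": 0.0, "x2": 0.1, "platform": "PC"},
    {"x1": 0.1, "x2": 0.0, "platform": "PC"},
    {"x1": 0.0, "x2": 0.0, "platform": "PC"},
    {"x1": 1.0, "x2": 0.0, "platform": "MR"},
    {"x1": 0.9, "x2": 0.1, "platform": "MR"},
    {"x1": 1.0, "x2": 0.1, "platform": "MR"},
    {"x1": 0.0, "x2": 1.0, "platform": "MF"},
    {"x1": 0.1, "x2": 0.9, "platform": "MF"},
    {"x1": 0.1, "x2": 1.0, "platform": "MF"},
    {"x1": 0.95, "x2": 0.05, "platform": None},  # missing category
    {"x1": 0.05, "x2": 0.95, "platform": None},  # missing category
    {"x1": 0.05, "x2": 0.05, "platform": None},  # missing category
]

imputed = impute_categorical(projects, target="platform", predictors=["x1", "x2"])
print([r["platform"] for r in imputed[-3:]])
```

Unlike mean or regression imputation, the filled-in value is always one of the observed categories, which is the property the abstract argues for. A real study would also need to encode categorical predictors (for example as dummy variables) and might draw the imputed category from the predicted probabilities rather than always taking the most probable one.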
Pages: 404-414
Page count: 11