Impact of imputation of missing values on classification error for discrete data

Cited by: 223
Authors
Farhangfar, Alireza [2 ]
Kurgan, Lukasz [1 ]
Dy, Jennifer [3 ]
Affiliations
[1] Univ Alberta, ECERF, Dept Elect & Comp Engn, Edmonton, AB T6G 2V4, Canada
[2] Univ Alberta, Dept Comp Sci, Edmonton, AB T6G 2V4, Canada
[3] Northeastern Univ, Dept Elect & Comp Engn, Boston, MA 02115 USA
Keywords
missing values; classification; imputation of missing values; single imputation; multiple imputations;
DOI
10.1016/j.patcog.2008.05.019
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Numerous industrial and research databases include missing values. It is not uncommon to encounter databases that have up to a half of the entries missing, making it very difficult to mine them using data analysis methods that can work only with complete data. A common way of dealing with this problem is to impute (fill in) the missing values. This paper evaluates how the choice of imputation method affects the performance of classifiers that are subsequently used with the imputed data. The experiments focus on discrete data. The paper studies the effect of missing data imputation using five single imputation methods (a mean method, a hot deck method, a Naive-Bayes method, and the latter two methods with a recently proposed imputation framework) and one multiple imputation method (a polytomous regression based method) on classification accuracy for six popular classifiers (RIPPER, C4.5, K-nearest-neighbor, support vector machine with polynomial and RBF kernels, and Naive-Bayes) on 15 datasets. This experimental study shows that imputation with the tested methods on average improves classification accuracy when compared to classification without imputation. Although the results show that there is no universally best imputation method, Naive-Bayes imputation is shown to give the best results for the RIPPER classifier for datasets with high amounts (i.e., 40% and 50%) of missing data, polytomous regression imputation is shown to be the best for the support vector machine classifier with polynomial kernel, and the application of the imputation framework is shown to be superior for the support vector machine with RBF kernel and K-nearest-neighbor. The analysis of the quality of the imputation with respect to varying amounts of missing data (i.e., between 5% and 50%) shows that all imputation methods, except for the mean imputation, reduce classification error for data with more than 10% of missing data. Finally, some classifiers such as C4.5 and Naive-Bayes were found to be missing data resistant, i.e., they can produce accurate classification in the presence of missing data, while other classifiers such as K-nearest-neighbor, SVMs and RIPPER benefit from the imputation. © 2008 Elsevier Ltd. All rights reserved.
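To make the idea of single imputation on discrete data concrete, the following is a minimal Python sketch of two of the simpler strategies named in the abstract. It is not the authors' implementation. It assumes that the "mean method" is replaced by its common discrete analogue (the most frequent value per attribute) and that hot deck picks the complete record agreeing on the most observed attributes. The names `MISSING`, `mode_impute`, `hot_deck_impute` and the toy `rows` data are hypothetical.

```python
from collections import Counter

MISSING = None  # hypothetical marker for a missing entry


def mode_impute(rows):
    """Fill each missing entry with the most frequent observed value of its column.

    Assumes every column has at least one observed value.
    """
    n_cols = len(rows[0])
    modes = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not MISSING]
        modes.append(Counter(observed).most_common(1)[0][0])
    return [[modes[j] if v is MISSING else v for j, v in enumerate(r)]
            for r in rows]


def hot_deck_impute(rows):
    """Fill missing entries from the most similar complete record (the donor).

    Similarity is the number of observed attributes on which the donor matches;
    ties go to the first such donor. Assumes at least one complete record.
    """
    donors = [r for r in rows if MISSING not in r]
    filled = []
    for r in rows:
        if MISSING not in r:
            filled.append(list(r))
            continue
        best = max(
            donors,
            key=lambda d: sum(a == b for a, b in zip(r, d) if a is not MISSING),
        )
        filled.append([b if a is MISSING else a for a, b in zip(r, best)])
    return filled


# Toy discrete dataset (illustrative only, not from the paper).
rows = [
    ["red", "small", "yes"],
    ["blue", "large", "no"],
    ["red", "small", "no"],
    ["red", MISSING, "yes"],
    [MISSING, "large", "no"],
    ["blue", "small", "yes"],
]

by_mode = mode_impute(rows)
by_hot_deck = hot_deck_impute(rows)
```

On this toy data the two strategies disagree on the last incomplete record: the mode fills its first attribute with the globally most common value, while hot deck copies it from the one complete record that matches on both observed attributes. Differences of this kind are what the downstream classifier comparison measures.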
Pages: 3692 - 3705
Number of pages: 14
Related papers
50 in total
  • [1] Simple data imputation for missing feature values in binary classification
    Chatterjee, Avishek
    Woodruff, Henry
    Vallieres, Martin
    Seuntjens, Jan
    [J]. MEDICAL PHYSICS, 2019, 46 (11) : 5378 - 5378
  • [2] Effectiveness of Simple Data Imputation for Missing Feature Values in Binary Classification
    Chatterjee, A.
    Woodruff, H.
    Lobbes, M.
    van Wijk, Y.
    Beuque, M.
    Seuntjens, J.
    Lambin, P.
    [J]. MEDICAL PHYSICS, 2020, 47 (06) : E609 - E609
  • [3] Impact of missing data imputation methods on gene expression clustering and classification
    de Souto, Marcilio C. P.
    Jaskowiak, Pablo A.
    Costa, Ivan G.
    [J]. BMC BIOINFORMATICS, 2015, 16
  • [4] The impact of heterogeneous distance functions on missing data imputation and classification performance
    Santos, Miriam Seoane
    Abreu, Pedro Henriques
    Fernandez, Alberto
    Luengo, Julian
    Santos, Joao
    [J]. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2022, 111
  • [5] Imputation of continuous missing values in profile data
    Yang, Luo
    Wang, Kaibo
    [J]. QUALITY AND RELIABILITY ENGINEERING INTERNATIONAL, 2022, 38 (07) : 3644 - 3662
  • [6] Impact of missing data imputation methods on gene expression clustering and classification
    Marcilio CP de Souto
    Pablo A Jaskowiak
    Ivan G Costa
    [J]. BMC Bioinformatics, 16
  • [7] Impact of Missing Data on Correlation Coefficient Values: Deletion and Imputation Methods for Data Preparation
    Shantal, Mohammed
    Othman, Zalinda
    Abu Bakar, Azuraliza
    [J]. MALAYSIAN JOURNAL OF FUNDAMENTAL AND APPLIED SCIENCES, 2023, 19 (06): : 1052 - 1067
  • [8] Adaptive imputation of missing values for incomplete pattern classification
    Liu, Zhun-ga
    Pan, Quan
    Dezert, Jean
    Martin, Arnaud
    [J]. PATTERN RECOGNITION, 2016, 52 : 85 - 95
  • [9] Impact of Imputation of Missing Values on Genetic Programming based Multiple Feature Construction for Classification
    Cao Truong Tran
    Andreae, Peter
    Zhang, Mengjie
    [J]. 2015 IEEE CONGRESS ON EVOLUTIONARY COMPUTATION (CEC), 2015, : 2398 - 2405
  • [10] Evaluating the Impact of Missing Data Imputation
    Pantanowitz, Adam
    Marwala, Tshildzi
    [J]. ADVANCED DATA MINING AND APPLICATIONS, PROCEEDINGS, 2009, 5678 : 577 - 586