Effect of Missing Data Types and Imputation Methods on Supervised Classifiers: An Evaluation Study

Cited by: 3
Authors
Gabr, Menna Ibrahim [1]
Helmy, Yehia Mostafa [1]
Elzanfaly, Doaa Saad [2, 3]
Affiliations
[1] Helwan Univ, Fac Commerce & Business Adm, Dept Business Informat Syst BIS, Cairo 11795, Egypt
[2] Helwan Univ, Fac Comp & Artificial Intelligence, Dept Informat Syst, Cairo 11795, Egypt
[3] British Univ Egypt, Fac Informat Comp Sci, Dept Informat Syst, Cairo 11837, Egypt
Keywords
data quality; data completeness; missing patterns; imputation techniques; supervised; classifiers; performance measures; NEURAL-NETWORK; CLASSIFICATION; PERFORMANCE; IMPACT;
DOI
10.3390/bdcc7010055
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data completeness is one of the most common challenges that hinder the performance of data analytics platforms. Different studies have assessed the effect of missing values on different classification models based on a single evaluation metric, namely, accuracy. However, accuracy on its own is a misleading measure of classifier performance because it does not consider unbalanced datasets. This paper presents an experimental study that assesses the effect of incomplete datasets on the performance of five classification models. The analysis was conducted with different ratios of missing values in six datasets that vary in size, type, and balance. Moreover, for unbiased analysis, the performance of the classifiers was measured using three different metrics, namely, the Matthews correlation coefficient (MCC), the F1-score, and accuracy. The results show that the sensitivity of the supervised classifiers to missing data differs according to a set of factors. The most significant factor is the missing data pattern and ratio, followed by the imputation method, and then the type, size, and balance of the dataset. The sensitivity of the classifiers when data are missing due to the Missing Completely At Random (MCAR) pattern is less than their sensitivity when data are missing due to the Missing Not At Random (MNAR) pattern. Furthermore, using the MCC as an evaluation measure better reflects the variation in the sensitivity of the classifiers to the missing data.
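The abstract's point that accuracy alone can mislead on unbalanced data can be illustrated with a minimal sketch. The toy labels below (95 negatives, 5 positives, and a classifier that always predicts the majority class) are an assumption made for illustration and are not taken from the paper's datasets; the helper names are likewise hypothetical.

```python
import math

def confusion(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, tn, fp, fn):
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: MCC is 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical unbalanced toy data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A trivial classifier that always predicts the majority (negative) class.
y_pred = [0] * 100

counts = confusion(y_true, y_pred)
scores = {
    "accuracy": accuracy(*counts),
    "f1": f1_score(*counts),
    "mcc": mcc(*counts),
}
print(scores)
```

In this sketch the trivial classifier reaches 95% accuracy while its F1-score and MCC are both 0, which is the kind of gap that motivates reporting more than one metric.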
Pages: 19
Related papers
50 in total
  • [1] An extensive analysis of the interaction between missing data types, imputation methods, and supervised classifiers
    Garciarena, Unai
    Santana, Roberto
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2017, 89 : 52 - 65
  • [2] Missing Data Imputation for Supervised Learning
    Poulos, Jason
    Valle, Rafael
    [J]. APPLIED ARTIFICIAL INTELLIGENCE, 2018, 32 (02) : 186 - 196
  • [3] Imputation methods for missing data in educational diagnostic evaluation
    Fernandez-Alonso, Ruben
    Suarez-Alvarez, Javier
    Muniz, Jose
    [J]. PSICOTHEMA, 2012, 24 (01) : 167 - 175
  • [4] Missing Data and Imputation Methods
    Schober, Patrick
    Vetter, Thomas R.
    [J]. ANESTHESIA AND ANALGESIA, 2020, 131 (05) : 1419 - 1420
  • [5] CLASSIFIERS ACCURACY IMPROVEMENT BASED ON MISSING DATA IMPUTATION
    Jordanov, Ivan
    Petrov, Nedyalko
    Petrozziello, Alessio
    [J]. JOURNAL OF ARTIFICIAL INTELLIGENCE AND SOFT COMPUTING RESEARCH, 2018, 8 (01) : 31 - 48
  • [6] Evaluation of missing data imputation methods for human osteometric measurements
    Liu, Xiaoming
    Pang, Jinyong
    [J]. AMERICAN JOURNAL OF BIOLOGICAL ANTHROPOLOGY, 2024, 183 : 103 - 104
  • [7] An evaluation of methods for imputation of missing trace element data in groundwaters
    Dickson, Bruce L.
    Giblin, Angela M.
    [J]. GEOCHEMISTRY-EXPLORATION ENVIRONMENT ANALYSIS, 2007, 7 : 173 - 178
  • [8] A quantitative study of the effect of missing data in classifiers
    Liu, P
    Lei, L
    Wu, NJ
    [J]. FIFTH INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION TECHNOLOGY - PROCEEDINGS, 2005 : 28 - 32
  • [9] A Comparative Study of Missing Value Imputation Methods for Education Data
    Keerin, Phimmarin
    [J]. 29TH INTERNATIONAL CONFERENCE ON COMPUTERS IN EDUCATION (ICCE 2021), VOL II, 2021 : 109 - 117
  • [10] Missing data and imputation methods in partition of variables
    da Silva, AL
    Saporta, G
    Bacelar-Nicolau, H
    [J]. CLASSIFICATION, CLUSTERING, AND DATA MINING APPLICATIONS, 2004 : 631 - 637