Dealing with Missing Data and Uncertainty in the Context of Data Mining

Cited by: 10
Authors
Aleryani, Aliya [1, 2]
Wang, Wenjia [1 ]
De La Iglesia, Beatriz [1 ]
Affiliations
[1] Univ East Anglia, Norwich NR4 7TJ, Norfolk, England
[2] King Khalid Univ, Abha 61421, Saudi Arabia
Funding
Economic and Social Research Council (UK);
Keywords
Missing data; Classification algorithms; Complete case analysis; Single imputation; CLASSIFICATION; IMPUTATION;
DOI
10.1007/978-3-319-92639-1_24
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Missing data is an issue in many real-world datasets, yet robust methods for handling it appropriately still need development. In this paper we investigate how some methods for handling missing data perform as the uncertainty increases. Using benchmark datasets from the UCI Machine Learning Repository, we generate experimental datasets with increasing amounts of data Missing Completely At Random (MCAR), both at the attribute level and at the record level. We then apply four classification algorithms: C4.5, Random Forest, Naive Bayes and Support Vector Machines (SVMs). We measure the performance of each classifier under complete case analysis and simple imputation, and then study the performance of the algorithms that can handle missing data themselves. We find that complete case analysis has a detrimental effect because it renders many datasets infeasible as the amount of missing data increases, particularly for high-dimensional data. We find that increasing missing data does have a negative effect on the performance of all the algorithms tested, but the different approaches, whether preprocessing by simple imputation or handling the missing data within the algorithm, do not differ significantly in performance.
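The preprocessing steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' experimental code: the function names are hypothetical, the data are synthetic numeric values, missingness is injected independently per cell (a simplification of the attribute-level and record-level MCAR settings described above), and column-mean imputation stands in for "simple imputation".

```python
import random


def inject_mcar(rows, rate, seed=0):
    """Return a copy of rows in which each cell is set to None with probability `rate`.

    Assumption: missingness is drawn independently per cell, so it does not
    depend on any observed or unobserved value (the MCAR condition).
    """
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def complete_cases(rows):
    """Complete case analysis: keep only the rows that have no missing cell."""
    return [row for row in rows if all(v is not None for v in row)]


def mean_impute(rows):
    """Simple (single) imputation: replace each missing cell with its column mean."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 when a column has no observed values at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


if __name__ == "__main__":
    rng = random.Random(42)
    for n_cols in (5, 50):
        # Synthetic stand-in for a benchmark dataset: 200 records of numeric attributes.
        data = [[rng.gauss(0, 1) for _ in range(n_cols)] for _ in range(200)]
        damaged = inject_mcar(data, rate=0.1, seed=1)
        kept = complete_cases(damaged)
        imputed = mean_impute(damaged)
        print(f"{n_cols} attributes: {len(kept)} complete cases, {len(imputed)} imputed rows")
```

Under these assumptions a record with d attributes survives complete case analysis with probability (1 - rate)^d, which shrinks quickly as d grows. This is consistent with the abstract's observation that complete case analysis becomes infeasible for high-dimensional data, whereas imputation keeps every record.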
Pages: 289 - 301
Number of pages: 13
Related papers
50 in total
  • [1] Dealing With Missing Data
    Sainani, Kristin L.
    [J]. PM&R, 2015, 7 (09) : 990 - 994
  • [2] Special issue on dealing with uncertainty in data mining and information extraction
    Chen, GQ
    Xu, Y
    [J]. INFORMATION SCIENCES, 2005, 173 (04) : 277 - 279
  • [3] Dealing with deficient and missing data
    Dohoo, Ian R.
    [J]. PREVENTIVE VETERINARY MEDICINE, 2015, 122 (1-2) : 221 - 228
  • [4] Innovations in dealing with missing data or missing reports
    Meng, Xiao-Li
    [J]. STATISTICA SINICA, 2006, 16 (04) : 1061 - 1070
  • [5] Data mining of missing persons data
    Blackmore, K
    Bossomaier, T
    Foy, S
    Thomson, D
    [J]. CLASSIFICATION AND CLUSTERING FOR KNOWLEDGE DISCOVERY, 2005, 4 : 305 - 314
  • [6] Data mining and the impact of missing data
    Brown, ML
    Kros, JF
    [J]. INDUSTRIAL MANAGEMENT & DATA SYSTEMS, 2003, 103 (8-9) : 611 - 621
  • [7] Missing Data in Collaborative Data Mining
    Anton, Carmen Ana
    Matei, Oliviu
    Avram, Anca
    [J]. COMPUTATIONAL STATISTICS AND MATHEMATICAL MODELING METHODS IN INTELLIGENT SYSTEMS, VOL. 2, 2019, 1047 : 100 - 109
  • [8] ORGANIZING DATA AND DEALING WITH UNCERTAINTY
    BELL, CB
    [J]. AMERICAN STATISTICIAN, 1980, 34 (04) : 236 - 236
  • [9] Dealing with missing software project data
    Cartwright, MH
    Shepperd, MJ
    Song, Q
    [J]. NINTH INTERNATIONAL SOFTWARE METRICS SYMPOSIUM, PROCEEDINGS, 2003 : 154 - 165
  • [10] Dealing with missing data: Part II
    Walczak, B
    Massart, DL
    [J]. CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS, 2001, 58 (01) : 29 - 42