A survey on missing data in machine learning

Cited by: 8
Authors
Emmanuel, Tlamelo [1]
Maupong, Thabiso [1]
Mpoeleng, Dimane [1]
Semong, Thabo [1]
Mphago, Banyatsang [1]
Tabona, Oteng [1]
Affiliation
[1] Botswana Int Univ Sci & Technol, Dept Comp Sci & Informat Syst, Palapye, Botswana
Keywords
Missing data; Imputation; Machine learning; ABSOLUTE ERROR MAE; DATA IMPUTATION; MULTIPLE IMPUTATION; HOT DECK; CLASSIFICATION; VALUES; OPTIMIZATION; ALGORITHMS; REGRESSION; PREDICTION;
DOI
10.1186/s40537-021-00516-9
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Machine learning has been a cornerstone of analysing and extracting information from data, and missing values are a problem it often encounters. Missing values are typically categorised as missing completely at random, missing at random, or missing not at random. All of these may result from system malfunction during data collection or human error during data pre-processing. Nevertheless, it is important to deal with missing values before analysing data, since ignoring or omitting them may result in biased or misinformed analysis. The literature contains several proposals for handling missing values. In this paper, we aggregate some of the literature on missing data, focusing particularly on machine learning techniques. We also give insight into how the machine learning approaches work by highlighting the key features of missing-value imputation techniques, how they perform, their limitations, and the kind of data they are most suitable for. We propose and evaluate two methods: k nearest neighbor and an iterative imputation method (missForest) based on the random forest algorithm. Evaluation is performed on the Iris data set and a novel power plant fan data set, with missing values induced at missingness rates of 5% to 20%. We show that both missForest and k nearest neighbor can successfully handle missing values, and we offer some possible directions for future research.
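The evaluation protocol the abstract describes (hide known values, impute them, measure the error) can be illustrated with a minimal sketch. It covers only the k nearest neighbor side; missForest is not shown. Everything here is an assumption made for illustration and is not the authors' code: the function names `induce_mcar`, `knn_impute` and `mean_absolute_error` are hypothetical, the data are synthetic rather than the Iris or power plant fan data, values are hidden completely at random, and distances are computed only over features that both rows observe.

```python
import math
import random


def induce_mcar(rows, rate, seed=0):
    """Hide about `rate` of all cells completely at random (set them to None)."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    hidden = rng.sample(cells, round(rate * len(cells)))
    masked = [list(r) for r in rows]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden


def distance(a, b):
    """Euclidean distance over features observed in both rows,
    rescaled to compensate for the features that were skipped."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))


def knn_impute(rows, k=3):
    """Fill each None with the mean of that feature over the k nearest
    rows that observe it; fall back to the column mean if none qualify."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            donors = [(d, v) for d, v in donors if d != math.inf][:k]
            if donors:
                filled[i][j] = sum(v for _, v in donors) / len(donors)
            else:
                observed = [r[j] for r in rows if r[j] is not None]
                filled[i][j] = sum(observed) / len(observed)
    return filled


def mean_absolute_error(truth, filled, hidden):
    """MAE between true and imputed values, over the hidden cells only."""
    return sum(abs(truth[i][j] - filled[i][j]) for i, j in hidden) / len(hidden)


# Synthetic, correlated three-feature data (hypothetical, not from the paper).
rng = random.Random(42)
data = []
for _ in range(60):
    x = rng.uniform(0, 10)
    data.append([x, 2 * x + rng.gauss(0, 0.5), 10 - x + rng.gauss(0, 0.5)])

# Missingness rates within the 5% to 20% range used in the abstract.
for rate in (0.05, 0.10, 0.20):
    masked, hidden = induce_mcar(data, rate, seed=1)
    filled = knn_impute(masked, k=3)
    print(f"rate={rate:.2f}  MAE={mean_absolute_error(data, filled, hidden):.3f}")
```

Because the hidden cells are recorded before masking, the error is measured only where values were actually imputed. Mean absolute error is used here since it appears among the record's keywords.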
Pages: 37
Related papers
50 in total
  • [1] A survey on missing data in machine learning
    Tlamelo Emmanuel
    Thabiso Maupong
    Dimane Mpoeleng
    Thabo Semong
    Banyatsang Mphago
    Oteng Tabona
    [J]. Journal of Big Data, 8
  • [2] Data Management for Machine Learning: A Survey
    Chai, Chengliang
    Wang, Jiayi
    Luo, Yuyu
    Niu, Zeping
    Li, Guoliang
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2023, 35 (05) : 4646 - 4667
  • [3] A survey on machine learning for data fusion
    Meng, Tong
    Jing, Xuyang
    Yan, Zheng
    Pedrycz, Witold
    [J]. INFORMATION FUSION, 2020, 57 : 115 - 129
  • [4] Regularized extreme learning machine for regression with missing data
    Yu, Qi
    Miche, Yoan
    Eirola, Emil
    van Heeswijk, Mark
    Severin, Eric
    Lendasse, Amaury
    [J]. NEUROCOMPUTING, 2013, 102 : 45 - 51
  • [5] Analysis of Machine Learning Based Imputation of Missing Data
    Rizvi, Syed Tahir Hussain
    Latif, Muhammad Yasir
    Amin, Muhammad Saad
    Telmoudi, Achraf Jabeur
    Shah, Nasir Ali
    [J]. CYBERNETICS AND SYSTEMS, 2023,
  • [6] Missing Data Imputation using Machine Learning Algorithm for Supervised Learning
    Cenitta, D.
    Arjunan, R. Vijaya
    Prema, K., V
    [J]. 2021 INTERNATIONAL CONFERENCE ON COMPUTER COMMUNICATION AND INFORMATICS (ICCCI), 2021,
  • [7] Extreme learning machine for missing data using multiple imputations
    Sovilj, Dusan
    Eirola, Emil
    Miche, Yoan
    Bjork, Kaj-Mikael
    Nian, Rui
    Akusok, Anton
    Lendasse, Amaury
    [J]. NEUROCOMPUTING, 2016, 174 : 220 - 231
  • [8] Machine Learning Based Missing Data Imputation in Categorical Datasets
    Ishaq, Muhammad
    Zahir, Sana
    Iftikhar, Laila
    Bulbul, Mohammad Farhad
    Rho, Seungmin
    Lee, Mi Young
    [J]. IEEE ACCESS, 2024, 12 : 88332 - 88344
  • [9] Sample-Based Extreme Learning Machine with Missing Data
    Gao, Hang
    Liu, Xin-Wang
    Peng, Yu-Xing
    Jian, Song-Lei
    [J]. MATHEMATICAL PROBLEMS IN ENGINEERING, 2015, 2015
  • [10] Machine-learning-based particle identification with missing data
    Kasak, Milosz
    Deja, Kamil
    Karwowska, Maja
    Jakubowska, Monika
    Graczykowski, Lukasz
    Janik, Malgorzata
    [J]. EUROPEAN PHYSICAL JOURNAL C, 2024, 84 (07):