Imputation of data Missing Not at Random: Artificial generation and benchmark analysis

Authors
Pereira, Ricardo Cardoso [1]
Abreu, Pedro Henriques [1]
Rodrigues, Pedro Pereira [2]
Figueiredo, Mario A. T. [3, 4]
Affiliations
[1] Univ Coimbra, Ctr Informat & Syst, Dept Informat Engn, P-3030290 Coimbra, Portugal
[2] Univ Porto, Fac Med MEDCIDS, Ctr Hlth Technol & Serv Res, P-4200319 Porto, Portugal
[3] Univ Lisbon, Inst Super Tecn, P-1049001 Lisbon, Portugal
[4] Inst Telecomunicacoes, P-1049001 Lisbon, Portugal
Keywords
Missing data; Missing Not at Random; Imputation; Artificial generation; Benchmark analysis; Autoencoders; Tutorial
DOI
10.1016/j.eswa.2024.123654
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Experimental assessments of missing data imputation methods often compute error rates between the original values and the estimated ones. This experimental setup relies on complete datasets into which missing values are injected. The injection process is straightforward for the Missing Completely At Random and Missing At Random mechanisms; the Missing Not At Random mechanism, however, poses a major challenge, since few artificial generation strategies are available for it. Furthermore, studies focused on this latter mechanism tend not to include a comprehensive baseline of state-of-the-art imputation methods. In this work, both challenges are addressed: four new Missing Not At Random generation strategies are introduced, and a benchmark study is conducted to compare six imputation methods in an experimental setup that covers 10 datasets and five missingness levels (10% to 80%). The overall findings are that, for most missing rates and datasets, the best imputation method for Missing Not At Random values is Multiple Imputation by Chained Equations, whereas for higher missingness rates autoencoders show promising results.
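The experimental setup described in the abstract (inject missing values into a complete dataset, impute them, then measure the error against the original values) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the upper-tail self-masking rule is a generic Missing Not At Random scheme and is not claimed to be one of the four strategies introduced in the paper, mean imputation stands in for the six benchmarked methods, and the function names `inject_mnar_upper_tail`, `mean_impute`, and `mae_on_missing` are hypothetical.

```python
import numpy as np


def inject_mnar_upper_tail(X, col, rate):
    """Hide the largest `rate` fraction of values in column `col`.

    Missingness depends on the unobserved value itself, which is what makes
    this a (generic, illustrative) Missing Not At Random mechanism.
    """
    X_miss = X.astype(float).copy()
    n_rows = X.shape[0]
    n_missing = int(round(rate * n_rows))
    # Indices of the n_missing largest values in the chosen column.
    idx = np.argsort(X[:, col])[n_rows - n_missing:]
    X_miss[idx, col] = np.nan
    return X_miss


def mean_impute(X_miss):
    """Replace each missing entry with the mean of its column's observed values."""
    col_means = np.nanmean(X_miss, axis=0)
    return np.where(np.isnan(X_miss), col_means, X_miss)


def mae_on_missing(X_true, X_miss, X_imp):
    """Mean absolute error computed only over the entries that were hidden."""
    mask = np.isnan(X_miss)
    return float(np.mean(np.abs(X_true[mask] - X_imp[mask])))


# Synthetic complete dataset (assumption: any fully observed numeric matrix works).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

X_miss = inject_mnar_upper_tail(X, col=0, rate=0.2)  # 20% missingness in column 0
X_imp = mean_impute(X_miss)
error = mae_on_missing(X, X_miss, X_imp)
print(f"MAE on injected entries: {error:.3f}")
```

Because the hidden values are exactly the largest ones, the observed column mean underestimates them, so a naive imputer is biased here in a way it would not be under Missing Completely At Random. Swapping `mean_impute` for another imputer while keeping the injection and error functions fixed gives the kind of comparison the abstract describes.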
Pages: 14