Imputation of data Missing Not at Random: Artificial generation and benchmark analysis

Cited by: 0
Authors
Pereira, Ricardo Cardoso [1]
Abreu, Pedro Henriques [1]
Rodrigues, Pedro Pereira [2]
Figueiredo, Mario A. T. [3, 4]
Affiliations
[1] Univ Coimbra, Ctr Informat & Syst, Dept Informat Engn, P-3030290 Coimbra, Portugal
[2] Univ Porto, Fac Med MEDCIDS, Ctr Hlth Technol & Serv Res, P-4200319 Porto, Portugal
[3] Univ Lisbon, Inst Super Tecn, P-1049001 Lisbon, Portugal
[4] Inst Telecomunicacoes, P-1049001 Lisbon, Portugal
Keywords
Missing data; Missing Not at Random; Imputation; Artificial generation; Benchmark analysis; AUTOENCODERS; TUTORIAL;
DOI
10.1016/j.eswa.2024.123654
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Experimental assessments of missing data imputation methods often compute error rates between the original values and the estimated ones. This experimental setup relies on complete datasets that are injected with missing values. The injection process is straightforward for the Missing Completely At Random and Missing At Random mechanisms; however, the Missing Not At Random mechanism poses a major challenge, since the available artificial generation strategies are limited. Furthermore, studies focused on this latter mechanism tend to disregard a comprehensive baseline of state-of-the-art imputation methods. In this work, both challenges are addressed: four new Missing Not At Random generation strategies are introduced, and a benchmark study is conducted to compare six imputation methods in an experimental setup that covers 10 datasets and five missingness levels (10% to 80%). The overall findings are that, for most missing rates and datasets, the best imputation method for Missing Not At Random values is Multiple Imputation by Chained Equations, whereas for higher missingness rates autoencoders show promising results.
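The experimental setup described in the abstract (inject missing values into a complete dataset, impute, then measure the error against the original values) can be illustrated with a minimal sketch. It is not one of the four generation strategies introduced in the paper, which the abstract does not describe. It assumes a simple self-masking rule in which the largest values of one feature are removed, so missingness depends on the unobserved value itself, and it uses mean imputation as a stand-in baseline. The helper names, the synthetic data, and the intermediate missing rates are all assumptions made for illustration.

```python
import numpy as np


def inject_mnar_upper_tail(X, column, missing_rate):
    """Hypothetical helper: set the largest values of one column to NaN.

    Missingness depends on the value that goes missing, which is the
    defining property of Missing Not At Random (MNAR).
    """
    X_missing = X.astype(float).copy()
    mask = np.zeros(X.shape[0], dtype=bool)
    n_missing = int(round(missing_rate * X.shape[0]))
    if n_missing > 0:
        # Row indices of the n_missing largest values in the chosen column.
        largest = np.argsort(X[:, column])[-n_missing:]
        mask[largest] = True
    X_missing[mask, column] = np.nan
    return X_missing, mask


def mean_impute(X_missing):
    """Replace every NaN with the mean of the observed values in its column."""
    col_means = np.nanmean(X_missing, axis=0)
    X_imputed = X_missing.copy()
    rows, cols = np.where(np.isnan(X_imputed))
    X_imputed[rows, cols] = col_means[cols]
    return X_imputed


def mae_on_missing(X_true, X_imputed, mask, column):
    """Mean absolute error between original and imputed values at masked positions."""
    diff = X_true[mask, column] - X_imputed[mask, column]
    return float(np.mean(np.abs(diff)))


# Synthetic complete dataset (assumption: 200 rows, 3 Gaussian features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Five missing rates between 10% and 80%; the intermediate values are assumed.
errors = {}
for rate in (0.1, 0.2, 0.4, 0.6, 0.8):
    X_missing, mask = inject_mnar_upper_tail(X, column=0, missing_rate=rate)
    X_imputed = mean_impute(X_missing)
    errors[rate] = mae_on_missing(X, X_imputed, mask, column=0)

for rate, err in errors.items():
    print(f"missing rate {rate:.0%}: MAE = {err:.3f}")
```

Because the removed values all lie above every observed value, the observed mean systematically underestimates them. This is the kind of bias that makes Missing Not At Random harder to handle than the other two mechanisms.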
Pages: 14
Related papers
50 in total
  • [1] From Missing Data Imputation to Data Generation
    Neves, Diogo Telmo
    Alves, Joao
    Naik, Marcel Ganesh
    Proenca, Alberto Jose
    Prasser, Fabian
    [J]. JOURNAL OF COMPUTATIONAL SCIENCE, 2022, 61
  • [2] Multiple imputation of ordinal missing not at random data
    Hammon, Angelina
    [J]. ASTA-ADVANCES IN STATISTICAL ANALYSIS, 2023, 107 (04) : 671 - 692
  • [3] Multiple imputation of ordinal missing not at random data
    Angelina Hammon
    [J]. AStA Advances in Statistical Analysis, 2023, 107 : 671 - 692
  • [4] Siamese Autoencoder Architecture for the Imputation of Data Missing Not at Random
    Pereira, Ricardo Cardoso
    Abreu, Pedro Henriques
    Rodrigues, Pedro Pereira
    [J]. JOURNAL OF COMPUTATIONAL SCIENCE, 2024, 78
  • [5] Identifiable Generative Models for Missing Not at Random Data Imputation
    Ma, Chao
    Zhang, Cheng
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [6] Deep Generative Imputation Model for Missing Not At Random Data
    Chen, Jialei
    Xu, Yuanbo
    Wang, Pengyang
    Yang, Yongjian
    [J]. PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2023, 2023, : 316 - 325
  • [7] Multiple imputation of binary multilevel missing not at random data
    Hammon, Angelina
    Zinn, Sabine
    [J]. JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES C-APPLIED STATISTICS, 2020, 69 (03) : 547 - 564
  • [8] Efficient random imputation for missing data in complex surveys
    Chen, J
    Rao, JNK
    Sitter, RR
    [J]. STATISTICA SINICA, 2000, 10 (04) : 1153 - 1169
  • [9] Missing data analysis and imputation via latent Gaussian Markov random fields
    Department of Mathematics, School of Industrial Engineering, Albacete, Universidad de Castilla-La Mancha, Spain
    Unknown
    Unknown
    [J]. SORT, 2 (217-243):
  • [10] Imputation of missing well log data by random forest and its uncertainty analysis
    Feng, Runhai
    Grana, Dario
    Balling, Niels
    [J]. COMPUTERS & GEOSCIENCES, 2021, 152