From Missing Data Imputation to Data Generation

Cited by: 22
Authors
Neves, Diogo Telmo [1, 2]
Alves, Joao [2]
Naik, Marcel Ganesh [1]
Proenca, Alberto Jose [2]
Prasser, Fabian [1]
Affiliations
[1] Charite Univ Med Berlin, Berlin Inst Hlth, Berlin, Germany
[2] Univ Minho, Ctr ALGORITMI, Braga, Portugal
Keywords
Tabular Data; Missing Data; Data Imputation; Data Generation; Generative Adversarial Networks (GANs); NETWORKS;
DOI
10.1016/j.jocs.2022.101640
Chinese Library Classification (CLC) number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
Real datasets often lack values, compromising the quality of data analyses. Adequate data may be synthetically imputed to replace missing values - a technique known as missing data imputation - avoiding deletion of incomplete observations. Several data imputation methods have been proposed, and generative methods based on Artificial Neural Networks (ANN) are successful alternatives to discriminative methods. In this extended version of our work presented at the International Conference on Computational Science (Neves et al., 2021), we propose three novel data imputation methods based on Generative Adversarial Networks (GAN): SGAIN, WSGAIN-CP, and WSGAIN-GP. We further studied how data imputation methods can be used to generate fully synthetic datasets. Among other benefits, the generation of synthetic data can help to mitigate legal, ethical, and data privacy issues, as well as to augment original data. In this context, we introduce tabulator, a novel meta-method for synthetic data generation that uses the data imputation methods as back-end engines for tabular data generation. We evaluated our data imputation methods using datasets with different amputation rates following the Missing Completely At Random (MCAR) setting. The results show that our methods are on par with or outperform state-of-the-art imputation methods in terms of response time and the quality of imputed data. We further evaluated and compared our data generation methods, which were derived from tabulator, with a state-of-the-art approach, the Conditional Tabular GAN (CTGAN). The evaluation results show that our tabulator methods outperform CTGAN in many cases, for example regarding the accuracy of machine learning tasks (e.g., prediction or classification) performed on the synthetic output data.
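The MCAR evaluation setting mentioned in the abstract can be illustrated with a minimal sketch. It assumes NumPy and uses hypothetical helper names (`ampute_mcar`, `mean_impute`, `rmse_on_missing`) that do not come from the paper. Column-mean imputation stands in as a simple baseline; it is not SGAIN, WSGAIN-CP, or WSGAIN-GP, and the RMSE score is only one possible quality measure.

```python
import numpy as np

def ampute_mcar(data, rate, seed=0):
    """Hide roughly `rate` of the cells; return the amputed copy and the mask."""
    rng = np.random.default_rng(seed)
    # MCAR: every cell is dropped with the same probability,
    # independently of any observed or unobserved value.
    missing = rng.random(data.shape) < rate
    amputed = data.astype(float)  # astype returns a copy by default
    amputed[missing] = np.nan
    return amputed, missing

def mean_impute(amputed):
    """Baseline stand-in (assumption): fill each gap with its column mean."""
    col_means = np.nanmean(amputed, axis=0)
    return np.where(np.isnan(amputed), col_means, amputed)

def rmse_on_missing(original, imputed, missing):
    """Error measured only on the cells that were hidden."""
    diff = (original - imputed)[missing]
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy data (synthetic, for illustration only).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))

X_missing, mask = ampute_mcar(X, rate=0.2)
X_filled = mean_impute(X_missing)
score = rmse_on_missing(X, X_filled, mask)
```

Repeating the last three lines for several values of `rate` gives the kind of comparison across amputation rates that the abstract describes, with any imputation method swapped in for `mean_impute`.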
Pages: 16