From Missing Data Imputation to Data Generation

Cited by: 22

Authors:

Neves, Diogo Telmo [1,2]

Alves, Joao [2]

Naik, Marcel Ganesh [1]

Proenca, Alberto Jose [2]

Prasser, Fabian [1]

Affiliations:

[1] Charite Univ Med Berlin, Berlin Inst Hlth, Berlin, Germany

[2] Univ Minho, Ctr ALGORITMI, Braga, Portugal

Source:

JOURNAL OF COMPUTATIONAL SCIENCE | 2022 / Vol. 61

Keywords:

Tabular Data; Missing Data; Data Imputation; Data Generation; Generative Adversarial Networks (GANs); NETWORKS;

DOI:

10.1016/j.jocs.2022.101640

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

Real datasets often lack values, compromising the quality of data analyses. Adequate data may be synthetically imputed to replace missing values - a technique known as missing data imputation - avoiding deletion of incomplete observations. Several data imputation methods have been proposed, and generative methods based on Artificial Neural Networks (ANN) are successful alternatives to discriminative methods. In this extended version of our work presented at the International Conference on Computational Science (Neves et al., 2021), we propose three novel data imputation methods based on Generative Adversarial Networks (GAN): SGAIN, WSGAIN-CP, and WSGAIN-GP.

We further studied how data imputation methods can be used to generate fully synthetic datasets. Among other benefits, the generation of synthetic data can help to mitigate legal, ethical, and data privacy issues, as well as to augment original data. In this context, we introduce tabulator, a novel meta-method for synthetic data generation that uses the data imputation methods as back-end engines for tabular data generation.

We evaluated our data imputation methods using datasets with different amputation rates following the Missing Completely At Random (MCAR) setting. The results show that our methods are on par with or outperform state-of-the-art imputation methods in terms of response time and the quality of imputed data. We further evaluated and compared our data generation methods, which were derived from tabulator, with a state-of-the-art approach, the Conditional Tabular GAN (CTGAN). The evaluation results show that our tabulator methods outperform CTGAN in many cases, for example regarding the accuracy of machine learning tasks (e.g., prediction or classification) performed on the synthetic output data.
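The evaluation setting mentioned in the abstract (amputating a complete dataset under MCAR, imputing it, then scoring the imputed cells) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names are hypothetical, column-mean imputation is only a simple stand-in for the paper's GAN-based methods (SGAIN, WSGAIN-CP, WSGAIN-GP), and RMSE is one plausible quality measure, not necessarily the one the authors used.

```python
import math
import random


def amputate_mcar(rows, rate, seed=0):
    """Hide each cell independently with probability `rate` (MCAR)."""
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def impute_column_means(rows):
    """Stand-in imputer (NOT the paper's method): fill gaps with column means."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def rmse_on_missing(original, amputated, imputed):
    """RMSE computed only over the cells that were amputated."""
    errors = [
        (o - i) ** 2
        for o_row, a_row, i_row in zip(original, amputated, imputed)
        for o, a, i in zip(o_row, a_row, i_row)
        if a is None
    ]
    return math.sqrt(sum(errors) / len(errors)) if errors else 0.0


# Synthetic complete dataset: 200 rows, 2 numeric columns (illustrative only).
data_rng = random.Random(42)
original = [[data_rng.gauss(0, 1), data_rng.gauss(5, 2)] for _ in range(200)]

amputated = amputate_mcar(original, rate=0.2, seed=1)
imputed = impute_column_means(amputated)
score = rmse_on_missing(original, amputated, imputed)
print(f"RMSE on amputated cells: {score:.3f}")
```

Because the mask is drawn independently of the data values, the missingness carries no information about the hidden cells, which is what distinguishes MCAR from the Missing At Random and Missing Not At Random settings. Repeating the run with several values of `rate` would mirror the "different amputation rates" in the abstract.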


Pages: 16
