Learning with Missing Data

Cited by: 2

Authors:

Escobar, Carlos A. [1]

Arinez, Jorge [1]

Macias, Daniela [2]

Morales-Menendez, Ruben [2]

Affiliations:

[1] Gen Motors, Global Res & Dev, Warren, MI 48092 USA

[2] Tecnol Monterrey, Escuela Ingn & Ciencias, Monterrey, NL, Mexico

Source:

2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA) | 2020

Keywords:

machine learning; incomplete data; preprocessing; manufacturing; MULTIPLE IMPUTATION;

DOI:

10.1109/BigData50022.2020.9377785

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Many real-world data sets contain missing values; learning with incomplete data sets is therefore a common challenge faced by data scientists. Handling missing values intelligently is important for developing robust data models, since there is no perfect approach to compensate for them. Deleting the rows with empty cells is a commonly used approach, but this naive method may lead to estimates with larger standard errors due to the reduced sample size. Imputing the missing records is a better approach, but it should be used with great caution, as it relies on specific and often unrealistic assumptions that can bias results. In this paper, a new greedy-like algorithm is proposed to maximize the number of records. The algorithm can be used to generate various maximized sub-sets by varying the number of columns (features) that can be used for learning. It salvages more records than the naive method and avoids the bias induced by imputation, so learning algorithms can learn from real sub-sets rather than from artificial data. Finally, the proposed algorithm is applied to a case study: the COVID-19 Open Research data set (CORD-19), which was prepared and posted by The White House and a coalition of leading research groups as a call to action to the world's artificial intelligence experts to answer high-priority scientific questions. This data set contains missing records; the maximized sub-sets resulting from this analysis can therefore be further investigated by the research community.
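
The abstract does not spell out the steps of the proposed algorithm, so the following is only an illustrative sketch of the general idea (trading columns for complete records), not the authors' method. It assumes a simple backward-elimination heuristic: starting from all columns, repeatedly drop the column whose removal leaves the most rows without missing values, and record the resulting sub-set at each column count. The names `complete_rows` and `greedy_subsets`, the use of `None` to mark a missing cell, and the toy data are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Row = Dict[str, Optional[float]]


def complete_rows(rows: Sequence[Row], columns: Sequence[str]) -> int:
    """Count rows that have no missing (None) value in the given columns."""
    return sum(all(row[c] is not None for c in columns) for row in rows)


def greedy_subsets(
    rows: Sequence[Row], columns: Sequence[str]
) -> Dict[int, Tuple[Tuple[str, ...], int]]:
    """Hypothetical greedy backward elimination (an assumption, not the paper's algorithm).

    Returns a mapping: number of kept columns -> (kept columns, complete-row count).
    """
    kept: List[str] = list(columns)
    results = {len(kept): (tuple(kept), complete_rows(rows, kept))}
    while len(kept) > 1:
        # Greedy step: drop the column whose removal leaves the most complete rows.
        # Ties are broken in favor of the earliest column in `kept`.
        to_drop = max(
            kept,
            key=lambda c: complete_rows(rows, [k for k in kept if k != c]),
        )
        kept.remove(to_drop)
        results[len(kept)] = (tuple(kept), complete_rows(rows, kept))
    return results


# Toy data: None marks a missing cell.
rows = [
    {"a": 1.0, "b": None, "c": 3.0},
    {"a": 2.0, "b": None, "c": None},
    {"a": 3.0, "b": 5.0, "c": 6.0},
    {"a": None, "b": None, "c": 7.0},
]
subsets = greedy_subsets(rows, ["a", "b", "c"])
for k in sorted(subsets, reverse=True):
    cols, n = subsets[k]
    print(f"{k} columns {cols}: {n} complete rows")
```

With all three columns, this is equivalent to the naive row-deletion approach; each subsequent entry shows how many more records are salvaged when one fewer column is required. A greedy choice of this kind is not guaranteed to find the best column sub-set for every column count.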


Pages: 5037-5045

Page count: 9
