Learning with Missing Data

Cited by: 2
Authors
Escobar, Carlos A. [1]
Arinez, Jorge [1]
Macias, Daniela [2]
Morales-Menendez, Ruben [2]
Affiliations
[1] Gen Motors, Global Res & Dev, Warren, MI 48092 USA
[2] Tecnol Monterrey, Escuela Ingn & Ciencias, Monterrey, NL, Mexico
Keywords
machine learning; incomplete data; preprocessing; manufacturing; multiple imputation
DOI
10.1109/BigData50022.2020.9377785
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Many real-world data sets contain missing values; learning with incomplete data sets is therefore a common challenge faced by data scientists. Handling missing values intelligently is important for developing robust data models, since there is no perfect approach to compensate for them. Deleting the rows with empty cells is a commonly used approach, but this naive method may lead to estimates with larger standard errors because it reduces the sample size. Imputing the missing records is a better approach, but it should be used with great caution, as it relies on specific, often unrealistic assumptions that can bias results. In this paper, a new greedy-like algorithm is proposed to maximize the number of records. The algorithm can be used to generate various maximized sub-sets by varying the number of columns (features) that can be used for learning. It salvages more records than the naive method, and it avoids the bias induced by imputation, so learning algorithms can learn from real sub-sets rather than from artificial data. Finally, the proposed algorithm is applied to a case study: the COVID-19 Open Research data set (CORD-19), which was prepared and posted by The White House and a coalition of leading research groups as a call to action to the world's artificial intelligence experts to answer high-priority scientific questions. This data set contains missing records; therefore, the maximized sub-sets resulting from this analysis can be further investigated by the research community.
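The abstract does not spell out the steps of the proposed algorithm, so the following is only an illustrative sketch of the general idea it describes: trading columns for complete rows. It assumes a simple greedy backward elimination (repeatedly drop the column whose removal leaves the most complete rows), which may differ from the authors' actual method. The names `complete_rows` and `greedy_subset`, the use of `None` to mark a missing cell, and the toy data are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Row = Sequence[Optional[float]]  # None marks a missing cell (assumption)


def complete_rows(rows: Sequence[Row], cols: Sequence[int]) -> List[int]:
    """Return indices of rows with no missing value in the given columns."""
    return [i for i, row in enumerate(rows)
            if all(row[c] is not None for c in cols)]


def greedy_subset(rows: Sequence[Row], n_keep: int) -> Tuple[List[int], List[int]]:
    """Greedy backward elimination (illustrative, not the paper's algorithm).

    Starting from all columns, repeatedly drop the single column whose
    removal leaves the largest number of complete rows, until only
    n_keep columns remain. Returns (kept column indices, complete row indices).
    """
    n_cols = len(rows[0]) if rows else 0
    if not 0 < n_keep <= n_cols:
        raise ValueError("n_keep must be between 1 and the number of columns")
    cols = list(range(n_cols))
    while len(cols) > n_keep:
        # Score each candidate by the rows that survive once it is dropped.
        to_drop = max(
            cols,
            key=lambda c: len(complete_rows(rows, [k for k in cols if k != c])),
        )
        cols.remove(to_drop)
    return cols, complete_rows(rows, cols)


# Toy data: 4 records, 3 features; only the last record is fully observed.
data = [
    [1.0, None, 3.0],
    [2.0, 5.0, None],
    [3.0, None, 1.0],
    [4.0, 6.0, 2.0],
]

naive = complete_rows(data, [0, 1, 2])    # listwise deletion over all columns
kept_cols, kept_rows = greedy_subset(data, n_keep=2)
```

On this toy input, listwise deletion over all three columns keeps a single record, while giving up one column keeps three. Calling `greedy_subset` with different values of `n_keep` produces the family of sub-sets the abstract alludes to, each containing only observed values. Because the choice at each step is greedy, the result is not guaranteed to be the best possible column subset.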
Pages: 5037-5045
Number of pages: 9