When Can We Ignore Missing Data in Model Training?

Authors
Zhen, Cheng [1]
Chabada, Amandeep Singh [1]
Termehchy, Arash [1]
Affiliations
[1] Oregon State Univ, Corvallis, OR 97331 USA
Keywords
data cleaning; machine learning; irrelevant and redundant data
DOI
10.1145/3595360.3595854
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Imputing missing data is typically expensive, and as a result, people seek to avoid it when possible. To address this issue, we introduce a method that determines when data cleaning is unnecessary for machine learning (ML). If a model can minimize the loss function regardless of the missing data's actual values, then data cleaning is not required. We offer efficient algorithms for checking this condition in multiple ML problems, and by analyzing the algorithms, we show that data cleaning is unnecessary when dealing with irrelevant and redundant data. Our preliminary experiments demonstrate that our algorithms can significantly reduce cleaning costs compared to a benchmark method, without incurring much computational overhead in many cases.
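The condition stated in the abstract can be made concrete for one simple case. The sketch below is an illustration for ordinary least-squares regression with missing feature values, derived from the stated condition rather than taken from the paper's algorithms. The function name `certain_model_exists`, the use of NaN to mark missing cells, and the tolerance `tol` are assumptions made for this example. The idea: a model that is optimal for every possible imputation cannot depend on the missing cells, so it must give zero weight to incomplete columns, and the loss gradient with respect to those columns must vanish whatever values are filled in.

```python
import numpy as np


def certain_model_exists(X, y, tol=1e-8):
    """Check whether one least-squares model is optimal for every imputation.

    X: 2-D float array, with np.nan marking missing entries (assumed encoding).
    y: 1-D float array of fully observed labels.
    Returns (exists, w), where w is the shared optimal model or None.
    """
    missing = np.isnan(X)
    incomplete = missing.any(axis=0)  # columns with at least one missing cell
    Xc = X[:, ~incomplete]            # fully observed columns

    # Fit on the complete columns only; incomplete columns get weight 0,
    # so predictions do not depend on the missing values.
    wc, *_ = np.linalg.lstsq(Xc, y, rcond=None)
    r = Xc @ wc - y                   # residual, identical for every imputation

    for j in np.flatnonzero(incomplete):
        rows = missing[:, j]
        # The gradient w.r.t. w_j is the dot product of column j with r.
        # It vanishes for every fill-in only if r is zero on the missing
        # rows and the observed part of column j is orthogonal to r.
        if np.any(np.abs(r[rows]) > tol):
            return False, None
        if abs(X[~rows, j] @ r[~rows]) > tol:
            return False, None

    w = np.zeros(X.shape[1])
    w[~incomplete] = wc
    return True, w
```

For instance, if `y` is exactly twice the first column and the second column has a missing cell, the check succeeds and the returned weights are close to `[2, 0]`: the incomplete column is irrelevant, so imputing it would not change the trained model. If the residual is nonzero on a row with a missing cell, the check fails and cleaning may be needed.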
Pages: 4