When Can We Ignore Missing Data in Model Training?

Cited by: 0
Authors
Zhen, Cheng [1]
Chabada, Amandeep Singh [1]
Termehchy, Arash [1]
Affiliations
[1] Oregon State Univ, Corvallis, OR 97331 USA
Keywords
data cleaning; machine learning; irrelevant and redundant data
DOI
10.1145/3595360.3595854
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Imputing missing data is typically expensive, and as a result, people seek to avoid it when possible. To address this issue, we introduce a method that determines when data cleaning is unnecessary for machine learning (ML). If a model can minimize the loss function regardless of the missing data's actual values, then data cleaning is not required. We offer efficient algorithms for checking this condition in multiple ML problems, and by analyzing the algorithms, we show that data cleaning is unnecessary when dealing with irrelevant and redundant data. Our preliminary experiments demonstrate that our algorithms can significantly reduce cleaning costs compared to a benchmark method, without incurring much computational overhead in many cases.
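The condition stated in the abstract can be illustrated for ordinary least squares. The sketch below is not the authors' algorithm; it checks one sufficient condition, assuming missing entries are encoded as NaN. A model that puts zero weight on every incomplete column minimizes the squared loss for every possible imputation if (a) its residual is zero on each row that contains a missing value and (b) the observed entries of each incomplete column are orthogonal to the residual. The function name `certain_model_exists` and the tolerance `tol` are hypothetical, chosen for illustration.

```python
import numpy as np

def certain_model_exists(X, y, tol=1e-8):
    """Sufficient check (illustrative, not the paper's method): is there a
    least-squares minimizer that stays optimal for every imputation of the NaNs?
    Returns (True, w) if the check passes, otherwise (False, None)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(X)
    incomplete_cols = missing.any(axis=0)

    # Fit using only the fully observed columns.
    complete = X[:, ~incomplete_cols]
    if complete.shape[1] == 0:
        w_complete = np.zeros(0)
    else:
        w_complete, *_ = np.linalg.lstsq(complete, y, rcond=None)
    residual = complete @ w_complete - y

    # (a) The residual must vanish on every row that has a missing value;
    # otherwise some imputation makes the gradient nonzero.
    rows_with_missing = missing.any(axis=1)
    if np.any(np.abs(residual[rows_with_missing]) > tol):
        return False, None

    # (b) Observed entries of each incomplete column must be orthogonal
    # to the residual, so the gradient for that column is zero.
    X_obs = np.where(missing, 0.0, X)
    grad_incomplete = X_obs[:, incomplete_cols].T @ residual
    if np.any(np.abs(grad_incomplete) > tol):
        return False, None

    # Zero weight on incomplete columns, fitted weights elsewhere.
    w = np.zeros(X.shape[1])
    w[~incomplete_cols] = w_complete
    return True, w

# Column 1 is irrelevant: y is exactly 2 * column 0, so no imputation is needed.
X_ok = [[1.0, np.nan], [2.0, 5.0], [3.0, 1.0]]
y_ok = [2.0, 4.0, 6.0]
print(certain_model_exists(X_ok, y_ok))

# Here the fit on column 0 leaves a nonzero residual on the row with the NaN.
X_bad = [[1.0, 5.0], [2.0, 1.0], [3.0, np.nan]]
y_bad = [2.0, 4.0, 7.0]
print(certain_model_exists(X_bad, y_bad))
```

In the first example the incomplete column carries no information the model needs, which matches the abstract's claim about irrelevant data. A failed check does not prove that cleaning is required, since the condition tested here is only sufficient.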
Pages: 4