data cleaning;
machine learning;
irrelevant and redundant data;
DOI:
10.1145/3595360.3595854
Chinese Library Classification:
TP18 [Artificial Intelligence Theory];
Subject classification codes:
081104; 0812; 0835; 1405
Abstract:
Imputing missing data is typically expensive, and as a result, people seek to avoid it when possible. To address this issue, we introduce a method that determines when data cleaning is unnecessary for machine learning (ML). If a model can minimize the loss function regardless of the missing data's actual values, then data cleaning is not required. We offer efficient algorithms for checking this condition in multiple ML problems, and by analyzing the algorithms, we show that data cleaning is unnecessary when dealing with irrelevant and redundant data. Our preliminary experiments demonstrate that our algorithms can significantly reduce cleaning costs compared to a benchmark method, without incurring much computational overhead in many cases.
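The condition stated in the abstract can be made concrete for one simple case. The following is a minimal sketch, assuming ordinary least-squares linear regression with missing feature values encoded as NaN; it is not taken from the paper, and the function name `cleaning_is_unnecessary` and the tolerance `tol` are hypothetical. The idea: fit the model on the fully observed columns only, then check that the gradient of the squared loss with respect to each incomplete column's weight is zero no matter what the missing entries are. That holds when the residual is zero on every row with a missing entry and the observed part of the column is orthogonal to the residual.

```python
import numpy as np

def cleaning_is_unnecessary(X, y, tol=1e-8):
    """Sketch: True if a single least-squares model is optimal for every
    possible imputation of the NaN entries in X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(X)
    incomplete = missing.any(axis=0)  # columns with at least one NaN

    # Fit on fully observed columns; incomplete columns get weight 0.
    X_complete = X[:, ~incomplete]
    if X_complete.shape[1] > 0:
        w, *_ = np.linalg.lstsq(X_complete, y, rcond=None)
        residual = y - X_complete @ w
    else:
        residual = y.copy()

    for j in np.flatnonzero(incomplete):
        rows_missing = missing[:, j]
        # A nonzero residual on a row with a missing value would make the
        # gradient depend on the unknown entry.
        if np.any(np.abs(residual[rows_missing]) > tol):
            return False
        # The observed part of the column must be orthogonal to the residual.
        observed = ~rows_missing
        if abs(X[observed, j] @ residual[observed]) > tol:
            return False
    return True

# The first column explains y exactly, so the incomplete second column is irrelevant.
X = np.array([[1.0, np.nan], [2.0, 5.0], [3.0, -1.0]])
print(cleaning_is_unnecessary(X, [2.0, 4.0, 6.0]))
print(cleaning_is_unnecessary(X, [2.0, 4.0, 7.0]))
```

In the first call the complete column fits the target perfectly, which matches the abstract's remark that cleaning is unnecessary for irrelevant data; in the second call the fit leaves a nonzero residual on the row with the missing value, so the check fails.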