Data Cleaning: Overview and Emerging Challenges

Cited by: 205
Authors
Chu, Xu [1]
Ilyas, Ihab F. [1]
Krishnan, Sanjay [2]
Wang, Jiannan [3]
Affiliations
[1] Univ Waterloo, Waterloo, ON, Canada
[2] Univ Calif Berkeley, Berkeley, CA, USA
[3] Simon Fraser Univ, Burnaby, BC, Canada
Keywords
VIOLATIONS; QUERY
DOI
10.1145/2882903.2912574
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Detecting and repairing dirty data is one of the perennial challenges in data analytics, and failure to do so can result in inaccurate analytics and unreliable decisions. Over the past few years, there has been a surge of interest from both industry and academia in data cleaning problems, including new abstractions, interfaces, approaches for scalability, and statistical techniques. To better understand the new advances in the field, we will first present a taxonomy of the data cleaning literature in which we highlight the recent interest in techniques that use constraints, rules, or patterns to detect errors, which we call qualitative data cleaning. We will describe the state-of-the-art techniques and also highlight their limitations with a series of illustrative examples. While traditionally such approaches are distinct from quantitative approaches such as outlier detection, we also discuss recent work that casts such approaches into a statistical estimation framework, including using machine learning to improve the efficiency and accuracy of data cleaning and considering the effects of data cleaning on statistical analysis.
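
To make the idea of constraint-based (qualitative) error detection concrete, the following is a minimal sketch, not a method taken from the paper. It assumes a single functional dependency, zip -> city, over a small set of made-up records; the function name `fd_violations` and the sample data are hypothetical. It only detects violations and makes no attempt at repair.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return {lhs_value: sorted rhs values} for every lhs value that maps
    to more than one rhs value, i.e. violates the dependency lhs -> rhs."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: sorted(values) for key, values in seen.items() if len(values) > 1}


# Hypothetical records; the second row carries a mistyped city name.
records = [
    {"zip": "N2L 3G1", "city": "Waterloo"},
    {"zip": "N2L 3G1", "city": "Waterlo"},
    {"zip": "V5A 1S6", "city": "Burnaby"},
]

violations = fd_violations(records, "zip", "city")
print(violations)  # {'N2L 3G1': ['Waterlo', 'Waterloo']}
```

The check flags that the two conflicting values cannot both be correct, but it does not say which one is wrong; deciding that is the repair step, which this sketch leaves out.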
Pages: 2201-2206
Number of pages: 6