Cleaning Data with Constraints and Experts

Cited by: 5
Authors
Assadi, Ahmad [1]
Milo, Tova [1]
Novgorodov, Slava [1]
Affiliations
[1] Tel Aviv Univ, Tel Aviv, Israel
Funding
European Research Council
DOI
10.1145/3201463.3201464
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Popular techniques for data cleaning use integrity constraints to identify errors in the data and to resolve them automatically, e.g., by using predefined priorities among possible updates and finding a minimal repair that resolves the violations. Such automatic solutions, however, cannot ensure the precision of the repairs, since they do not have enough evidence about the actual errors and may in fact lead to wrong results with respect to the ground truth. It has thus been suggested to have domain experts examine the potential updates and choose which should be applied to the database. However, the sheer volume of the databases, and the large number of possible updates that may resolve a given constraint violation, may make such manual examination prohibitively expensive. The goal of the DANCE system presented here is to help optimize the experts' work and reduce as much as possible the number of questions (update verifications) they need to address. Given a constraint violation, our algorithm identifies the suspicious tuples whose update may contribute (directly or indirectly) to the constraint resolution, as well as the possible dependencies among them. Using this information, it builds a graph whose nodes are the suspicious tuples and whose weighted edges capture the likelihood that an error in one tuple occurs and affects the other. A PageRank-style algorithm then allows us to identify the most beneficial tuples to ask about first. Incremental graph maintenance is used to ensure interactive response time. We implemented our solution in the DANCE system and show its effectiveness and efficiency through a comprehensive suite of experiments.
Pages: 6
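The ranking step mentioned in the abstract (a weighted graph of suspicious tuples scored by a PageRank-style algorithm) might look roughly like the sketch below. This is not the DANCE implementation, whose details the abstract does not give. The function name `rank_suspicious_tuples`, the edge direction (an edge `u -> v` passes score from `u` to `v`), the damping factor, the iteration count, and the sample weights are all assumptions made for illustration.

```python
def rank_suspicious_tuples(edges, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration (illustrative sketch only).

    edges: dict mapping (src, dst) tuple-id pairs to a positive weight.
    Assumption: an edge src -> dst passes score from src to dst, in
    proportion to its share of src's total outgoing weight.
    Returns (tuple ids sorted by descending score, score dict).
    """
    nodes = sorted({n for edge in edges for n in edge})
    count = len(nodes)

    # Total outgoing weight per node, used to normalise edge weights.
    out_weight = {n: 0.0 for n in nodes}
    for (src, _dst), weight in edges.items():
        out_weight[src] += weight

    score = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Score held by nodes without outgoing edges is spread evenly.
        dangling = sum(score[n] for n in nodes if out_weight[n] == 0.0)
        base = (1.0 - damping) / count + damping * dangling / count
        updated = {n: base for n in nodes}
        for (src, dst), weight in edges.items():
            updated[dst] += damping * score[src] * weight / out_weight[src]
        score = updated

    order = sorted(nodes, key=lambda n: score[n], reverse=True)
    return order, score


# Hypothetical suspicious tuples t1..t3 with made-up likelihood weights.
example_edges = {
    ("t1", "t2"): 0.9,
    ("t3", "t2"): 0.7,
    ("t2", "t1"): 0.2,
    ("t1", "t3"): 0.1,
}
order, score = rank_suspicious_tuples(example_edges)
print(order)
```

In this toy graph the tuple receiving the most incoming weight comes out first, so under the stated assumptions it would be the first one shown to the expert. The incremental graph maintenance that the abstract mentions is not covered here.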