An Efficient and Robust Approach for Discovering Data Quality Rules

Cited by: 11
Authors
Yeh, Peter Z. [1]
Puri, Colin A. [1]
Affiliations
[1] Accenture Technol Labs, San Jose, CA USA
DOI
10.1109/ICTAI.2010.43
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Poor quality data is a growing problem that affects many enterprises across all aspects of their business ranging from operational efficiency to revenue protection. Moreover, this problem is costly to fix because significant effort and resources are required to identify a comprehensive set of rules that can detect (and correct) data defects along various data quality dimensions such as consistency, conformity, and more. Hence, many organizations employ only basic data quality rules that check for null values, format, etc. in efforts such as data profiling and data cleansing; and ignore rules that are needed to detect deeper problems such as inconsistent values across interdependent attributes. This oversight can lead to numerous problems such as inaccurate reporting of key metrics used to inform critical decisions or derive business insights. In this paper, we present an approach that efficiently and robustly discovers data quality rules - in particular conditional functional dependencies - for detecting inconsistencies in data and hence improves data quality along the critical dimension of consistency. We evaluate our approach empirically on several real-world data sets. We show that our approach performs well on these data sets for metrics such as precision and recall. We also compare our approach to an established solution and show that our approach outperforms this solution for the same metrics. Finally, we show that our approach scales efficiently with the number of records, the number of attributes, and the domain size.
Pages: 8
Related papers
50 in total
  • [21] Framework for Discovering Association Rules in a Fuzzy Data Cube
    Somodevilla, Maria J.
    Torres, Ivo H. Pineda
    Zecua, Jose Tecuapacho
    NINTH MEXICAN INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE, PROCEEDINGS, 2008, : 126 - 131
  • [22] DISCOVERING EXPERT-SYSTEM RULES IN DATA SETS
    KOCH, T
    FEHSENFELD, B
    EXPERT SYSTEMS WITH APPLICATIONS, 1995, 8 (02) : 287 - 294
  • [23] Discovering Data Quality Problems: The Case of Repurposed Data
    Zhang, Ruojing
    Indulska, Marta
    Sadiq, Shazia
    BUSINESS & INFORMATION SYSTEMS ENGINEERING, 2019, 61 (05) : 575 - 593
  • [24] Discovering Data Quality Problems: The Case of Repurposed Data
    Ruojing Zhang
    Marta Indulska
    Shazia Sadiq
    Business & Information Systems Engineering, 2019, 61 : 575 - 593
  • [25] Spark solutions for discovering fuzzy association rules in Big Data
    Fernandez-Basso, Carlos
    Dolores Ruiz, M.
    Martin-Bautista, Maria J.
    INTERNATIONAL JOURNAL OF APPROXIMATE REASONING, 2021, 137 : 94 - 112
  • [26] An Efficient Approach for Discovering and Maintaining Sequential Patterns
    Yen, Show-Jane
    Lee, Yue-Shi
    JOURNAL OF INFORMATION SCIENCE AND ENGINEERING, 2024, 40 (01) : 201 - 213
  • [27] A big-stepped probability approach for discovering default rules
    Benferhat, S
    Dubois, D
    Lagrue, S
    Prade, H
    INTERNATIONAL JOURNAL OF UNCERTAINTY FUZZINESS AND KNOWLEDGE-BASED SYSTEMS, 2003, 11 : 1 - 14
  • [28] DISCOVERING EMOTIONAL LOGIC RULES FROM PHYSIOLOGICAL DATA OF INDIVIDUALS
    Costadopoulos, Nectarios
    Islam, Md Zahidul
    Tien, David
    PROCEEDINGS OF 2019 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS (ICMLC), 2019, : 528 - 534
  • [29] Applying data-mining techniques for discovering association rules
    Mu-Jung Huang
    Hsiu-Shu Sung
    Tsu-Jen Hsieh
    Ming-Cheng Wu
    Shao-Hsi Chung
    Soft Computing, 2020, 24 : 8069 - 8075
  • [30] Discovering temporal association rules for time-lag data
    Chen, GQ
    Ai, J
    Yu, W
    PROCEEDINGS OF INTERNATIONAL CONFERENCE ON E-BUSINESS (ICEB2002), 2002, : 324 - 328