A Statistical Perspective on Discovering Functional Dependencies in Noisy Data

Cited by: 13
Authors
Zhang, Yunjia [1]
Guo, Zhihan [1]
Rekatsinas, Theodoros [1]
Affiliations
[1] UW Madison, Madison, WI 53706 USA
Keywords
Functional Dependencies; Structure Learning; Covariance Estimation; Networks; Models
DOI
10.1145/3318464.3389749
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
We study the problem of discovering functional dependencies (FDs) from a noisy data set. We adopt a statistical perspective and draw connections between FD discovery and structure learning in probabilistic graphical models. We show that discovering FDs from a noisy data set is equivalent to learning the structure of a model over binary random variables, where each random variable corresponds to a functional of the data set attributes. We build upon this observation to introduce FDX, a conceptually simple framework in which learning functional dependencies corresponds to solving a sparse regression problem. We show that FDX can recover true functional dependencies across a diverse array of real-world and synthetic data sets, even in the presence of noisy or missing data. We find that FDX scales to large data instances with millions of tuples and hundreds of attributes while yielding an average F1 improvement of 2x over state-of-the-art FD discovery methods.
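The idea of binary random variables derived from the attributes can be illustrated with a small sketch. It assumes one plausible reading of the abstract: each binary variable records whether a pair of tuples agrees on an attribute. The function names `pair_agreement` and `fd_score` and the toy rows are hypothetical, and the score below is a simple conditional agreement rate, not the sparse regression that FDX solves.

```python
from itertools import combinations


def pair_agreement(rows, attrs):
    """For every pair of tuples, emit one 0/1 indicator per attribute
    (1 if the two tuples agree on that attribute). Assumed encoding."""
    return [
        {a: int(r1[a] == r2[a]) for a in attrs}
        for r1, r2 in combinations(rows, 2)
    ]


def fd_score(indicators, lhs, rhs):
    """Among pairs that agree on all lhs attributes, return the fraction
    that also agree on rhs. Returns None if no pair agrees on lhs."""
    matching = [z for z in indicators if all(z[a] for a in lhs)]
    if not matching:
        return None
    return sum(z[rhs] for z in matching) / len(matching)


# Toy data (hypothetical): zip should determine city, with one noisy cell.
rows = [
    {"zip": "53706", "city": "Madison", "name": "Ann"},
    {"zip": "53706", "city": "Madison", "name": "Bob"},
    {"zip": "53706", "city": "Madisn", "name": "Cy"},  # typo acts as noise
    {"zip": "60601", "city": "Chicago", "name": "Dee"},
    {"zip": "60601", "city": "Chicago", "name": "Eve"},
]
attrs = ["zip", "city", "name"]

indicators = pair_agreement(rows, attrs)
zip_city = fd_score(indicators, ["zip"], "city")
zip_name = fd_score(indicators, ["zip"], "name")
print(zip_city, zip_name)
```

Even with the noisy cell, the score for zip -> city stays well above the score for zip -> name, which is the kind of signal a statistical method can exploit where exact FD checking would reject both. A single typo still lowers the pairwise score noticeably, which suggests why a more robust estimator is preferable to raw agreement counts.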
Pages: 861-876
Page count: 16
Related papers
50 in total
  • [21] Discovering Approximate Functional Dependencies using Smoothed Mutual Information
    Pennerath, Frederic
    Mandros, Panagiotis
    Vreeken, Jilles
    [J]. KDD '20: PROCEEDINGS OF THE 26TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2020: 1254-1264
  • [22] Discovering fuzzy functional dependencies as semantic knowledge in large databases
    Wang, X
    Chen, GQ
    [J]. SHAPING BUSINESS STRATEGY IN A NETWORKED WORLD, VOLS 1 AND 2, PROCEEDINGS, 2004: 1136-1139
  • [23] PANDA - Discovering Part Name in Noisy Text Data
    Kao, Anne
    Niraula, Nobal B.
    Whyatt, Daniel
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON PROGNOSTICS AND HEALTH MANAGEMENT (ICPHM), 2018
  • [24] Discovering Reliable Dependencies from Data: Hardness and Improved Algorithms
    Mandros, Panagiotis
    Boley, Mario
    Vreeken, Jilles
    [J]. PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019: 6206-6210
  • [25] Discovering dependencies among data quality dimensions: A validation of instrument
    Shariat Panahy, Payam Hassany
    Sidi, Fatimah
    Affendey, Lilly Suriani
    Jabar, Marzanah A.
    Ibrahim, Hamidah
    Mustapha, Aida
    [J]. Journal of Applied Sciences, 2013, 13(01): 95-102
  • [26] Discovering Reliable Dependencies from Data: Hardness and Improved Algorithms
    Mandros, Panagiotis
    Boley, Mario
    Vreeken, Jilles
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM), 2018: 317-326
  • [27] FD_Mine: Discovering functional dependencies in a database using equivalences
    Yao, H
    Hamilton, HJ
    Butz, CJ
    [J]. 2002 IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS, 2002: 729-732
  • [28] ROUGH SET BASED ALGORITHM OF DISCOVERING FUNCTIONAL DEPENDENCIES FOR RELATION DATABASE
    Qu, Ying
    Fu, Xiao-Bing
    [J]. 2008 4TH INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS, NETWORKING AND MOBILE COMPUTING, VOLS 1-31, 2008: 10867-+
  • [29] Discovering Relaxed Functional Dependencies Based on Multi-Attribute Dominance
    Caruccio, Loredana
    Deufemia, Vincenzo
    Naumann, Felix
    Polese, Giuseppe
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2021, 33(09): 3212-3228
  • [30] Discovering Relaxed Functional Dependencies based on Multi-attribute Dominance
    Caruccio, Loredana
    Deufemia, Vincenzo
    Naumann, Felix
    Polese, Giuseppe
    [J]. 2021 IEEE 37TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2021), 2021: 2354-2355