Auto-Detect: Data-Driven Error Detection in Tables

Cited by: 19
Authors
Huang, Zhipeng [1, 2]
He, Yeye [2]
Affiliations
[1] Univ Hong Kong, Hong Kong, Peoples R China
[2] Microsoft Res, Redmond, WA USA
Keywords
OUTLIER DETECTION
DOI
10.1145/3183713.3196889
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Given a single column of values, existing approaches typically employ regex-like rules to detect errors by finding anomalous values that are inconsistent with the others. Such techniques make local decisions based only on values in the given input column, without considering a more global notion of compatibility that can be inferred from large corpora of clean tables. We propose AUTO-DETECT, a statistics-based technique that leverages co-occurrence statistics from large corpora for error detection, a significant departure from existing rule-based methods. Our approach automatically detects incompatible values by leveraging an ensemble of judiciously selected generalization languages, each of which uses different generalizations and is sensitive to different types of errors. Errors detected this way are based on global statistics, which are robust and align well with human intuition about errors. We test AUTO-DETECT on a large set of public Wikipedia tables, as well as proprietary enterprise Excel files. Although both test sets are expected to be of high quality, AUTO-DETECT surprisingly discovers tens of thousands of errors in both cases, and these detections are manually verified to be of high precision (over 0.98). Our labeled benchmark set on Wikipedia tables is released for future research.
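The idea of scoring value compatibility with corpus co-occurrence statistics can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single, hand-picked generalization language (digits become `D`, letters become `L`), uses normalized pointwise mutual information (NPMI) as the co-occurrence score, and applies an arbitrary threshold of -0.5. The function names `generalize`, `build_stats`, `npmi`, and `incompatible_pairs` are hypothetical, and the abstract's ensemble of several languages is omitted.

```python
import math
from collections import Counter
from itertools import combinations


def generalize(value):
    """Map a value to a coarse pattern (one assumed generalization language)."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("D")      # any digit -> D
        elif ch.isalpha():
            out.append("L")      # any letter -> L
        else:
            out.append(ch)       # punctuation and spaces are kept as-is
    return "".join(out)


def build_stats(corpus_columns):
    """Count in how many corpus columns each pattern, and each pattern pair, occurs."""
    single, pair = Counter(), Counter()
    for column in corpus_columns:
        patterns = {generalize(v) for v in column}
        single.update(patterns)
        pair.update(frozenset(p) for p in combinations(sorted(patterns), 2))
    return single, pair, len(corpus_columns)


def npmi(a, b, single, pair, n):
    """NPMI of two patterns, in [-1, 1]; -1 means never seen together."""
    if a == b:
        return 1.0
    if single[a] == 0 or single[b] == 0:
        return 0.0               # pattern unseen in the corpus: no evidence
    c_ab = pair[frozenset((a, b))]
    if c_ab == 0:
        return -1.0
    p_ab = c_ab / n
    if p_ab == 1.0:
        return 1.0               # degenerate case: both occur in every column
    pmi = math.log(p_ab / ((single[a] / n) * (single[b] / n)))
    return pmi / -math.log(p_ab)


def incompatible_pairs(column, single, pair, n, threshold=-0.5):
    """Return value pairs whose patterns rarely co-occur in the corpus."""
    flagged = []
    for u, v in combinations(column, 2):
        score = npmi(generalize(u), generalize(v), single, pair, n)
        if score <= threshold:
            flagged.append((u, v, score))
    return flagged


# Toy corpus: each inner list stands for one clean column.
corpus = [
    ["2011-01-01", "2012-05-06"],
    ["1999-12-31", "2000-01-01"],
    ["2011.01.02", "2012.03.04"],
]
single, pair, n = build_stats(corpus)
print(incompatible_pairs(["2011-01-01", "2011-01-02", "2011.01.03"], single, pair, n))
```

In this toy corpus the dash-separated and dot-separated date patterns never share a column, so the dot-separated value is flagged against both of its neighbours. A realistic corpus would need smoothing for rare patterns, which this sketch leaves out.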
Pages: 1377-1392
Page count: 16
Related papers
  • [1] Automatic meter error detection with a data-driven approach
    Chu, Ruimin
    Chik, Li
    Chan, Jeffrey
    Gutzmann, Kurt
    Li, Xiaodong
    [J]. ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2023, 123
  • [2] Data-driven bounded-error fault detection
    Suarez Fabrega, Antonio J.
    Bravo Caro, Jose Manuel
    Abad Herrera, Pedro J.
    Gasca, Rafael M.
    [J]. INTERNATIONAL JOURNAL OF ADAPTIVE CONTROL AND SIGNAL PROCESSING, 2014, 28 (12) : 1299 - 1324
  • [3] Learning to Detect: A Data-driven Approach for Network Intrusion Detection
    Tauscher, Zachary
    Jiang, Yushan
    Zhang, Kai
    Wang, Jian
    Song, Houbing
    [J]. 2021 IEEE INTERNATIONAL PERFORMANCE, COMPUTING, AND COMMUNICATIONS CONFERENCE (IPCCC), 2021
  • [4] Auto-detect of Machine Vision and Its Application in Assembling Inspection
    Wang, Jing
    Yang, Xiaoyi
    [J]. 2011 9TH WORLD CONGRESS ON INTELLIGENT CONTROL AND AUTOMATION (WCICA 2011), 2011, : 18 - 22
  • [5] Adaptive data-driven error detection in swarm robotics with statistical classifiers
    Lau, HuiKeng
    Bate, Iain
    Cairns, Paul
    Timmis, Jon
    [J]. ROBOTICS AND AUTONOMOUS SYSTEMS, 2011, 59 (12) : 1021 - 1035
  • [6] Data-Driven Road Detection
    Alvarez, Jose M.
    Salzmann, Mathieu
    Barnes, Nick
    [J]. 2014 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2014, : 1134 - 1141
  • [7] Data-driven approach to pronunciation error detection for computer assisted language teaching
    Liang, Min-Siong
    Hong, Zien-Yong
    Lyu, Ren-Yuan
    Chiang, Yuang-Chin
    [J]. 7TH IEEE INTERNATIONAL CONFERENCE ON ADVANCED LEARNING TECHNOLOGIES, PROCEEDINGS, 2007, : 359 - +
  • [8] Accuracy Analysis of Polynomial Model and Auto Regressive Model for Data-driven Fault Detection
    Sun, Bowen
    He, Zhangming
    Xu, Shuqing
    Zhou, Haiyin
    Wang, Jiongqi
    [J]. PROCEEDINGS OF 2018 IEEE 7TH DATA DRIVEN CONTROL AND LEARNING SYSTEMS CONFERENCE (DDCLS), 2018, : 446 - 451
  • [9] Uni-Detect: A Unified Approach to Automated Error Detection in Tables
    Wang, Pei
    He, Yeye
    [J]. SIGMOD '19: PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2019, : 811 - 828
  • [10] Design of a fuzzy model based on vibration signal analysis to auto-detect the gear faults
    Hashemi, Mohammad
    Safizadeh, Mir Saeed
    [J]. INDUSTRIAL LUBRICATION AND TRIBOLOGY, 2013, 65 (03) : 194 - 201