Auto-Detect: Data-Driven Error Detection in Tables

Cited by: 19
Authors
Huang, Zhipeng [1, 2]
He, Yeye [2]
Affiliations
[1] Univ Hong Kong, Hong Kong, Peoples R China
[2] Microsoft Res, Redmond, WA USA
Keywords
OUTLIER DETECTION
DOI
10.1145/3183713.3196889
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Given a single column of values, existing approaches typically employ regex-like rules to detect errors by finding anomalous values that are inconsistent with the others. Such techniques make local decisions based only on values in the given input column, without considering a more global notion of compatibility that can be inferred from large corpora of clean tables. We propose AUTO-DETECT, a statistics-based technique that leverages co-occurrence statistics from large corpora for error detection, which is a significant departure from existing rule-based methods. Our approach automatically detects incompatible values by leveraging an ensemble of judiciously selected generalization languages, each of which uses different generalizations and is sensitive to different types of errors. Errors detected this way are based on global statistics, which makes the detection robust and well aligned with human intuition about errors. We test AUTO-DETECT on a large set of public Wikipedia tables, as well as proprietary enterprise Excel files. Although both test sets are expected to be of high quality, AUTO-DETECT surprisingly discovers tens of thousands of errors in each, which are manually verified to be of high precision (over 0.98). Our labeled benchmark set on Wikipedia tables is released for future research.
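The co-occurrence idea in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the record: the single generalization language (digits become `D`, letters become `L`), the use of normalized pointwise mutual information (NPMI) as the compatibility score, the toy corpus, the threshold, and all function names. The abstract describes an ensemble of several generalization languages; only one is shown here.

```python
import math
from collections import Counter
from itertools import combinations


def generalize(value: str) -> str:
    """Hypothetical generalization language: digits -> 'D', letters -> 'L', other characters kept."""
    return "".join("D" if c.isdigit() else "L" if c.isalpha() else c for c in value)


def build_stats(corpus):
    """Count in how many columns each pattern, and each unordered pattern pair, appears."""
    single, pair = Counter(), Counter()
    for column in corpus:
        patterns = sorted({generalize(v) for v in column})
        single.update(patterns)
        pair.update(combinations(patterns, 2))  # pairs are stored in sorted order
    return single, pair, len(corpus)


def npmi(p1, p2, single, pair, n_columns):
    """NPMI in [-1, 1]; -1 means the two patterns never share a column in the corpus."""
    if p1 == p2:
        return 1.0
    joint = pair[tuple(sorted((p1, p2)))]
    if joint == 0:
        # Assumption: no smoothing, so unseen pairs get the minimum score.
        return -1.0
    p_xy = joint / n_columns
    if p_xy == 1.0:
        return 1.0  # avoids dividing by -log(1) = 0
    p_x, p_y = single[p1] / n_columns, single[p2] / n_columns
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)


def incompatible_pairs(column, single, pair, n_columns, threshold=-0.5):
    """Return value pairs whose patterns score at or below the (assumed) threshold."""
    flagged = []
    for u, v in combinations(column, 2):
        score = npmi(generalize(u), generalize(v), single, pair, n_columns)
        if score <= threshold:
            flagged.append((u, v, score))
    return flagged


# Toy corpus of "clean" columns, invented for this example.
corpus = [
    ["2011-01-01", "2012-03-04"],
    ["2015-07-09", "1999-12-31"],
    ["2011.01.01", "2012.03.04"],
    ["July 2011", "May 2012"],
    ["June 2010", "May 2009"],
]
single, pair, n_columns = build_stats(corpus)

mixed = incompatible_pairs(["2011-01-01", "2011.01.02"], single, pair, n_columns)
clean = incompatible_pairs(["July 2011", "May 2012"], single, pair, n_columns)
print(mixed)
print(clean)
```

In this toy corpus the two date formats never appear in the same column, so a column that mixes them is flagged, while the two month-name patterns co-occur and are treated as compatible. A realistic version would need a much larger corpus, smoothing for unseen patterns, and several generalization languages combined.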
Pages: 1377-1392
Page count: 16
Related papers
50 in total
  • [31] Data-Driven Attack Detection for Linear Systems
    Krishnan, Vishaal
    Pasqualetti, Fabio
    [J]. IEEE CONTROL SYSTEMS LETTERS, 2021, 5 (02): : 671 - 676
  • [32] Data-Driven Detection of Recursive Program Schemes
    Hofmann, Martin
    Schmid, Ute
    [J]. ECAI 2010 - 19TH EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2010, 215 : 1063 - +
  • [33] A Survey on Data-driven Network Intrusion Detection
    Chou, Dylan
    Jiang, Meng
    [J]. ACM COMPUTING SURVEYS, 2022, 54 (09)
  • [34] Data-Driven Soiling Detection in PV Modules
    Kalimeris, Alexandros
    Psarros, Ioannis
    Giannopoulos, Giorgos
    Terrovitis, Manolis
    Papastefanatos, George
    Kotsis, Gregory
    [J]. IEEE JOURNAL OF PHOTOVOLTAICS, 2023, 13 (03): : 461 - 466
  • [35] Data-Driven Network Intelligence for Anomaly Detection
    Xu, Shengjie
    Qian, Yi
    Hu, Rose Qingyang
    [J]. IEEE NETWORK, 2019, 33 (03): : 88 - 95
  • [36] Data-driven bottleneck detection on Tehran highways
    Mirzahossein, Hamid
    Nobakht, Pedram
    Gholampour, Iman
    [J]. Transportation Engineering, 2024, 18
  • [37] A Data-Driven Passive Islanding Detection Scheme
    De, Sourav
    Reddy, Motakatla Venkateswara
    Sodhi, Ranjana
    [J]. IEEE TRANSACTIONS ON INDUSTRY APPLICATIONS, 2024, 60 (02) : 3698 - 3709
  • [38] Data-Driven Anomaly Detection in Autonomous Platoon
    Ucar, Seyhan
    Ergen, Sinem Coleri
    Ozkasap, Oznur
    [J]. 2018 26TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE (SIU), 2018
  • [39] An algorithm for data-driven shifting bottleneck detection
    Subramaniyan, Mukund
    Skoogh, Anders
    Gopalakrishnan, Maheshwaran
    Salomonsson, Hans
    Hanna, Atieh
    Lamkull, Dan
    [J]. COGENT ENGINEERING, 2016, 3 (01)
  • [40] Data-driven recombination detection in viral genomes
    Alfonsi, Tommaso
    Bernasconi, Anna
    Chiara, Matteo
    Ceri, Stefano
    [J]. NATURE COMMUNICATIONS, 2024, 15 (01)