Auto-Detect: Data-Driven Error Detection in Tables

Cited by: 19
Authors
Huang, Zhipeng [1 ,2 ]
He, Yeye [2 ]
Affiliations
[1] Univ Hong Kong, Hong Kong, Peoples R China
[2] Microsoft Res, Redmond, WA USA
Keywords
OUTLIER DETECTION
DOI
10.1145/3183713.3196889
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Given a single column of values, existing approaches typically employ regex-like rules to detect errors by finding anomalous values that are inconsistent with the others. Such techniques make local decisions based only on values in the given input column, without considering a more global notion of compatibility that can be inferred from large corpora of clean tables. We propose AUTO-DETECT, a statistics-based technique that leverages co-occurrence statistics from large corpora for error detection, which is a significant departure from existing rule-based methods. Our approach automatically detects incompatible values by leveraging an ensemble of judiciously selected generalization languages, each of which uses different generalizations and is sensitive to different types of errors. Errors detected this way are based on global statistics, which makes the detection robust and well aligned with human intuition about errors. We test AUTO-DETECT on a large set of public Wikipedia tables, as well as on proprietary enterprise Excel files. Although both test sets are expected to be of high quality, AUTO-DETECT surprisingly discovers tens of thousands of errors in both cases, and manual verification shows these detections to be of high precision (over 0.98). Our labeled benchmark set on Wikipedia tables is released for future research.
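The idea of scoring value compatibility from corpus co-occurrence statistics can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the abstract does not specify the generalization languages or the statistic, so the sketch uses a single hypothetical generalization (digit runs become `D`, letter runs become `L`, other characters are kept) and a smoothed normalized PMI as one plausible compatibility score. The names `generalize`, `build_stats`, `npmi`, and `least_compatible_pair` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def generalize(value: str) -> str:
    """Hypothetical generalization: digit runs -> 'D', letter runs -> 'L'."""
    out = []
    for ch in value:
        sym = "D" if ch.isdigit() else "L" if ch.isalpha() else ch
        # Collapse consecutive digits (or letters) into a single symbol.
        if sym in ("D", "L") and out and out[-1] == sym:
            continue
        out.append(sym)
    return "".join(out)


def build_stats(corpus):
    """Count in how many corpus columns each pattern, and each pattern pair, occurs."""
    single, pair = Counter(), Counter()
    for column in corpus:
        patterns = sorted({generalize(v) for v in column})
        single.update(patterns)
        pair.update(frozenset(p) for p in combinations(patterns, 2))
    return single, pair, len(corpus)


def npmi(p1, p2, single, pair, n, smoothing=0.1):
    """Smoothed normalized PMI in [-1, 1]; low values suggest incompatibility.

    The choice of statistic and the smoothing constant are assumptions.
    """
    if p1 == p2:
        return 1.0
    total = n + 2 * smoothing  # keeps every smoothed probability below 1
    p12 = (pair[frozenset((p1, p2))] + smoothing) / total
    pa = (single[p1] + smoothing) / total
    pb = (single[p2] + smoothing) / total
    return math.log(p12 / (pa * pb)) / -math.log(p12)


def least_compatible_pair(column, single, pair, n):
    """Return (score, pattern_a, pattern_b) for the lowest-scoring pattern pair."""
    patterns = sorted({generalize(v) for v in column})
    if len(patterns) < 2:
        return None
    return min(
        (npmi(a, b, single, pair, n), a, b) for a, b in combinations(patterns, 2)
    )
```

In use, `build_stats` would be run once over a corpus of columns assumed to be mostly clean, and `least_compatible_pair` would then be applied to each input column, flagging the column when the returned score falls below a chosen threshold. An ensemble in the spirit of the abstract would repeat this with several different generalization functions and combine their scores; this sketch covers only one.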
Pages: 1377-1392
Page count: 16