Rule Driven Spreadsheet Data Extraction from Statistical Tables: Case Study

Cited by: 3
Authors
Paramonov, Viacheslav [1, 2]
Shigarov, Alexey [1, 2]
Vetrova, Varvara [1, 3]
Affiliations
[1] Russian Acad Sci, Matrosov Inst Syst Dynam & Control Theory, Siberian Branch, Irkutsk, Russia
[2] Irkutsk State Univ, Inst Math & Informat Technol, Irkutsk, Russia
[3] Univ Canterbury, Sch Math & Stat, Christchurch, New Zealand
Funding
Russian Science Foundation;
Keywords
Table understanding; Data transformation; Table extraction; Table analysis; Spreadsheet; Table header; Heuristics; Case study; Rules;
DOI
10.1007/978-3-030-88304-1_7
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
Spreadsheet tables are one of the most commonly used formats for organising and storing statistical, financial, accounting and other types of data. This form of data representation is widely used in science, education, engineering, and business. The key feature of spreadsheet tables is that they are generally created by people to be used by other people rather than by automated programs. During spreadsheet creation, commonly, no consideration is given to the possibility of further automated data processing. This leads to a large variety of possible spreadsheet table structures and complicates automated extraction of table content and table understanding. One of the key factors that influence the quality of table understanding by machines is the correctness of the header structure, for example, the positions of cells and the relations between them. In this paper, we present a case study of a tabular data extraction approach and estimate its performance on a variety of datasets. The rule-driven software platform TabbyXL was used for tabular data extraction and canonicalisation. The experiment was conducted on real-world tables from the SAUS200 (The 2010 Statistical Abstract of the United States) corpus. For the evaluation, we used three variants: spreadsheet tables as they are presented in SAUS; the same tables with an automatically corrected header structure; and tables whose header structure was corrected by experts. The case study results demonstrate the importance of header structure correctness for automated table processing and understanding. The ground-truth preparation procedures, examples of rules describing relationships between table elements, and the results of the evaluation are presented in the paper.
Pages: 84-95
Number of pages: 12
Related papers
50 in total
  • [11] Ontology Driven Information Extraction from Tables Using Connectivity Analysis
    Bahulkar, Ashwin
    Reddy, Sreedhar
    ON THE MOVE TO MEANINGFUL INTERNET SYSTEMS: OTM 2013 CONFERENCES, 2013, 8185 : 642 - 658
  • [12] CEP Rule Extraction from Unlabeled Data in IoT
    Simsek, Mehmet Ulvi
    Ozdemir, Suat
    2018 3RD INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND ENGINEERING (UBMK), 2018, : 429 - 433
  • [13] Automating the extraction of data from HTML tables with unknown structure
    Embley, DW
    Tao, C
    Liddle, SW
    DATA & KNOWLEDGE ENGINEERING, 2005, 54 (01) : 3 - 28
  • [14] Lightweight Process Support with Spreadsheet-Driven Processes: A Case Study in the Finance Domain
    Stach, Michael
    Pryss, Ruediger
    Schnitzlein, Maximilian
    Mohring, Tim
    Jurisch, Martin
    Reichert, Manfred
    BUSINESS PROCESS MANAGEMENT WORKSHOPS (BPM 2017), 2018, 308 : 323 - 334
  • [15] Extraction and Multidimensional Analysis of Data from Unstructured Data Sources: A Case Study
    Lima, Rui
    Cruz, Estrela Ferreira
    PROCEEDINGS OF THE 21ST INTERNATIONAL CONFERENCE ON ENTERPRISE INFORMATION SYSTEMS (ICEIS), VOL 1, 2019, : 190 - 199
  • [16] Big Data-Driven Feature Extraction and Clustering Based on Statistical Methods
    Maddumala, Venkata Rao
    Arunkumar, R.
    TRAITEMENT DU SIGNAL, 2020, 37 (03) : 387 - 394
  • [17] LEM2-Based Rule Induction from Data Tables with Imprecise Evaluations
    Inuiguchi, Masahiro
    Tsuji, Masahiko
    Kusunoki, Yoshifumi
    Tsurumi, Masayo
    ROUGH SETS AND KNOWLEDGE TECHNOLOGY, 2011, 6954 : 201 - +
  • [18] From Big Data to Information: Statistical Issues Through a Case Study
    Signorelli, Serena
    Biffignandi, Silvia
    CLASSIFICATION, (BIG) DATA ANALYSIS AND STATISTICAL LEARNING, 2018, : 3 - 11
  • [19] Extended association rule extraction from process operational data
    Zhang, L
    He, XR
    PROCESS SYSTEMS ENGINEERING 2003, PTS A AND B, 2003, 15 : 1429 - 1434
  • [20] Rule extraction for glaucoma detection with summary data from StratusOCT
    Huang, Mei-Ling
    Chen, Hsin-Yi
    Lin, Jian-Cheng
    INVESTIGATIVE OPHTHALMOLOGY & VISUAL SCIENCE, 2007, 48 (01) : 244 - 250