Rule Driven Spreadsheet Data Extraction from Statistical Tables: Case Study

Cited by: 3
Authors
Paramonov, Viacheslav [1, 2]
Shigarov, Alexey [1, 2]
Vetrova, Varvara [1, 3]
Affiliations
[1] Russian Acad Sci, Matrosov Inst Syst Dynam & Control Theory, Siberian Branch, Irkutsk, Russia
[2] Irkutsk State Univ, Inst Math & Informat Technol, Irkutsk, Russia
[3] Univ Canterbury, Sch Math & Stat, Christchurch, New Zealand
Funding
Russian Science Foundation
Keywords
Table understanding; Data transformation; Table extraction; Table analysis; Spreadsheet; Table header; Heuristics; Case study; Rules
DOI
10.1007/978-3-030-88304-1_7
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Spreadsheet tables are one of the most commonly used formats for organising and storing statistical, financial, accounting, and other types of data. This form of data representation is widely used in science, education, engineering, and business. The key feature of spreadsheet tables is that they are generally created by people for use by other people rather than by automated programs. During spreadsheet creation, the possibility of further automated data processing is commonly not considered. This leads to a large variety of possible spreadsheet table structures and complicates automated extraction of table content and table understanding. One of the key factors that influence the quality of table understanding by machines is the correctness of the header structure, for example, the position of cells and the relations between them. In this paper, we present a case study of a tabular data extraction approach and estimate its performance on a variety of datasets. The rule-driven software platform TabbyXL was used for tabular data extraction and canonicalisation. The experiment was conducted on real-world tables from the SAUS200 (The 2010 Statistical Abstract of the United States) corpus. For the evaluation, we used three variants of the tables: spreadsheet tables as they are presented in SAUS; the same tables with an automatically corrected header structure; and tables whose header structure was corrected by experts. The case study results demonstrate the importance of header structure correctness for automated table processing and understanding. The ground-truth preparation procedures, examples of rules describing relationships between table elements, and the results of the evaluation are presented in the paper.
Pages: 84-95
Number of pages: 12