Rule Driven Spreadsheet Data Extraction from Statistical Tables: Case Study

Cited by: 3

Authors:

Paramonov, Viacheslav ^{[1,2]}

Shigarov, Alexey ^{[1,2]}

Vetrova, Varvara ^{[1,3]}

Affiliations:

[1] Russian Acad Sci, Matrosov Inst Syst Dynam & Control Theory, Siberian Branch, Irkutsk, Russia

[2] Irkutsk State Univ, Inst Math & Informat Technol, Irkutsk, Russia

[3] Univ Canterbury, Sch Math & Stat, Christchurch, New Zealand

Source:

INFORMATION AND SOFTWARE TECHNOLOGIES, ICIST 2021 | 2021 / Vol. 1486

Funding:

Russian Science Foundation;

Keywords:

Table understanding; Data transformation; Table extraction; Table analysis; Spreadsheet; Table header; Heuristics; Case study; Rules;

DOI:

10.1007/978-3-030-88304-1_7

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Spreadsheet tables are one of the most commonly used formats for organising and storing statistical, financial, accounting and other types of data. This form of data representation is widely used in science, education, engineering, and business. The key feature of spreadsheet tables is that they are generally created by people to be used by other people rather than by automated programs. During spreadsheet creation, the possibility of further automated data processing is commonly given no consideration. This leads to a large variety of possible spreadsheet table structures and complicates automated extraction of table content and table understanding. One of the key factors that influence the quality of table understanding by machines is the correctness of the header structure, for example, the positions of cells and the relations between them. In this paper, we present a case study of a tabular data extraction approach and estimate its performance on a variety of datasets. The rule-driven software platform TabbyXL was used for tabular data extraction and canonicalisation. The experiment was conducted on real-world tables from the SAUS200 (The 2010 Statistical Abstract of the United States) corpus. For the evaluation, we used three versions of the spreadsheet tables: the tables as they are presented in SAUS; the same tables with an automatically corrected header structure; and tables whose header structure was corrected by experts. The case study results demonstrate the importance of header structure correctness for automated table processing and understanding. The ground-truth preparation procedures, examples of rules describing relationships between table elements, and the results of the evaluation are presented in the paper.


Pages: 84-95

Page count: 12
