Numerical Tuple Extraction from Tables with Pre-training

Cited by: 1
Authors
Yang, Qingping [1,3]
Cao, Yixuan [1,3]
Luo, Ping [1,2,3]
Affiliations
[1] Univ Chinese Acad Sci, CAS, Inst Comp Technol, Beijing, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Chinese Acad Sci, Key Lab Intelligent Informat Proc, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 28TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2022 | 2022
Funding
National Natural Science Foundation of China;
Keywords
tuple extraction; tabular representation; pre-training;
DOI
10.1145/3534678.3539460
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Tables are omnipresent on the web and in various vertical domains, storing massive amounts of valuable data. However, the great flexibility in table layout hinders machines from understanding this valuable data. In order to unlock and utilize knowledge from tables, extracting data as numerical tuples is the first and critical step. As a form of relational data, numerical tuples have direct and transparent relationships between their elements and are therefore easy for machines to use. Extracting numerical tuples requires a deep understanding of intricate correlations between cells. The correlations are presented implicitly in the texts and visual appearances of tables, which can be roughly classified into Hierarchy and Juxtaposition. Although many studies have made considerable progress in data extraction from tables, most of them only consider hierarchical relationships and neglect juxtapositions. Meanwhile, they only evaluate their methods on relatively small corpora. This paper proposes a new framework to extract numerical tuples from tables and evaluates it on a large test set. Specifically, we convert this task into a relation extraction problem between cells. To represent cells with their intricate correlations in tables, we propose a BERT-based pre-trained language model, TableLM, to encode tables with diverse layouts. To evaluate the framework, we collect a large finance dataset that includes 19,264 tables and 604K tuples. Extensive experiments on the dataset demonstrate the superiority of our framework over a well-designed baseline.
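The cell-pair relation-extraction framing described in the abstract can be sketched roughly as follows. This is an illustrative sketch and not the authors' released code: it assumes a TableLM-like encoder that maps each table cell to one vector, and a pairwise classifier over candidate cell pairs with relation labels such as hierarchy, juxtaposition, or no relation. The class names, toy encoder, shapes, and hyperparameters are all assumptions introduced here for illustration.

# Illustrative sketch only: tuple extraction framed as relation extraction
# between table cells. The toy encoder below stands in for a TableLM-style
# pre-trained tabular encoder (an assumption, not the paper's model).
import torch
import torch.nn as nn


class CellPairRelationExtractor(nn.Module):
    """Classify the relation (e.g. hierarchy / juxtaposition / none) between pairs of cells."""

    def __init__(self, cell_encoder: nn.Module, hidden_size: int = 768, num_relations: int = 3):
        super().__init__()
        self.cell_encoder = cell_encoder  # maps cell inputs -> (num_cells, hidden_size)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_relations),
        )

    def forward(self, cell_inputs, pair_index):
        # pair_index: LongTensor of shape (num_pairs, 2) holding (head, tail) cell ids.
        cell_vecs = self.cell_encoder(cell_inputs)                 # (num_cells, hidden_size)
        head = cell_vecs[pair_index[:, 0]]                          # (num_pairs, hidden_size)
        tail = cell_vecs[pair_index[:, 1]]
        return self.classifier(torch.cat([head, tail], dim=-1))     # (num_pairs, num_relations)


if __name__ == "__main__":
    # Toy stand-in for a pre-trained table encoder: one learned vector per cell id.
    toy_encoder = nn.Embedding(16, 768)
    model = CellPairRelationExtractor(toy_encoder)
    cell_ids = torch.arange(16)                      # 16 cells in a toy table
    pairs = torch.tensor([[0, 4], [1, 5], [2, 3]])   # candidate (head, tail) cell pairs
    print(model(cell_ids, pairs).shape)              # torch.Size([3, 3])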
Pages: 2233-2241
Page count: 9