Numerical Tuple Extraction from Tables with Pre-training

Cited by: 0
Authors
Yang, Qingping [1 ,3 ]
Cao, Yixuan [1 ,3 ]
Luo, Ping [1 ,2 ,3 ]
Affiliations
[1] Univ Chinese Acad Sci, CAS, Inst Comp Technol, Beijing, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Chinese Acad Sci, Key Lab Intelligent Informat Proc, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
tuple extraction; tabular representation; pre-training;
DOI
10.1145/3534678.3539460
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Tables are omnipresent on the web and in various vertical domains, storing massive amounts of valuable data. However, the great flexibility of table layouts hinders machines from understanding this valuable data. To unlock and utilize the knowledge in tables, extracting data as numerical tuples is the first and critical step. As a form of relational data, numerical tuples have direct and transparent relationships between their elements and are therefore easy for machines to use. Extracting numerical tuples requires a deep understanding of the intricate correlations between cells. These correlations are presented implicitly in the text and visual appearance of tables, and can be roughly classified into Hierarchy and Juxtaposition. Although many studies have made considerable progress in extracting data from tables, most of them only consider hierarchical relationships and neglect juxtapositions. Moreover, they evaluate their methods only on relatively small corpora. This paper proposes a new framework for extracting numerical tuples from tables and evaluates it on a large test set. Specifically, we cast the task as a relation extraction problem between cells. To represent cells together with their intricate correlations, we propose a BERT-based pre-trained language model, TableLM, that encodes tables with diverse layouts. To evaluate the framework, we collect a large finance dataset that includes 19,264 tables and 604K tuples. Extensive experiments on this dataset demonstrate the superiority of our framework over a well-designed baseline.
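To illustrate what casting tuple extraction as relation classification between cell pairs can look like in practice, the sketch below shows a minimal, generic formulation in PyTorch. It is not the authors' TableLM: the model name, the learned row/column embeddings, the single-token-per-cell simplification, and the three-way relation label set (hierarchy, juxtaposition, none) are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch (not the paper's TableLM): classify the relation between a
# pair of table cells using a generic Transformer encoder over flattened cells.
import torch
import torch.nn as nn

class CellPairRelationModel(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256, num_layers=4,
                 num_heads=8, num_relations=3):
        super().__init__()
        # Token embeddings plus learned row/column embeddings so the encoder
        # sees both cell text and coarse table layout (an assumption here).
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.row_emb = nn.Embedding(64, hidden)
        self.col_emb = nn.Embedding(64, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Relation head over a concatenated pair of cell representations,
        # e.g. hierarchy / juxtaposition / no relation.
        self.rel_head = nn.Linear(2 * hidden, num_relations)

    def forward(self, token_ids, row_ids, col_ids, pair_index):
        # token_ids, row_ids, col_ids: (batch, num_cells); one token stands in
        # for each cell's text here purely for brevity.
        x = self.tok_emb(token_ids) + self.row_emb(row_ids) + self.col_emb(col_ids)
        h = self.encoder(x)                         # (batch, num_cells, hidden)
        head = h[torch.arange(h.size(0)), pair_index[:, 0]]
        tail = h[torch.arange(h.size(0)), pair_index[:, 1]]
        return self.rel_head(torch.cat([head, tail], dim=-1))

# Toy usage: a 2x3 table flattened to 6 cells; score the relation between
# cell 0 (e.g. a header) and cell 4 (e.g. a numeric cell).
model = CellPairRelationModel()
tokens = torch.randint(0, 30522, (1, 6))
rows = torch.tensor([[0, 0, 0, 1, 1, 1]])
cols = torch.tensor([[0, 1, 2, 0, 1, 2]])
pair = torch.tensor([[0, 4]])
logits = model(tokens, rows, cols, pair)            # shape: (1, num_relations)
```

Under this framing, a numerical tuple is assembled downstream by grouping the cells whose pairwise relations link a value cell to its governing header cells; how the paper performs that grouping and pre-trains the encoder is described in the full text, not in this sketch.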
Pages: 2233 - 2241
Number of pages: 9
Related Papers
50 records in total
  • [41] Ontology Pre-training for Poison Prediction
    Glauer, Martin
    Neuhaus, Fabian
    Mossakowski, Till
    Hastings, Janna
    ADVANCES IN ARTIFICIAL INTELLIGENCE, KI 2023, 2023, 14236 : 31 - 45
  • [42] Realistic Channel Models Pre-training
    Huangfu, Yourui
    Wang, Jian
    Xu, Chen
    Li, Rong
    Ge, Yiqun
    Wang, Xianbin
    Zhang, Huazi
    Wang, Jun
    2019 IEEE GLOBECOM WORKSHOPS (GC WKSHPS), 2019,
  • [43] KEEP: An Industrial Pre-Training Framework for Online Recommendation via Knowledge Extraction and Plugging
    Zhang, Yujing
    Chan, Zhangming
    Xu, Shuhao
    Bian, Weijie
    Han, Shuguang
    Deng, Hongbo
    Zheng, Bo
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2022, 2022, : 3684 - 3693
  • [44] Structure-inducing pre-training
    Matthew B. A. McDermott
    Brendan Yap
    Peter Szolovits
    Marinka Zitnik
    Nature Machine Intelligence, 2023, 5 : 612 - 621
  • [45] Automated Commit Intelligence by Pre-training
    Liu, Shangqing
    Li, Yanzhou
    Xie, Xiaofei
    Ma, Wei
    Meng, Guozhu
    Liu, Yang
    ACM Transactions on Software Engineering and Methodology, 2024, 33 (08)
  • [46] Unsupervised Pre-Training for Voice Activation
    Kolesau, Aliaksei
    Sesok, Dmitrij
APPLIED SCIENCES-BASEL, 2020, 10 (23): 1 - 13
  • [47] Pre-Training Without Natural Images
    Hirokatsu Kataoka
    Kazushige Okayasu
    Asato Matsumoto
    Eisuke Yamagata
    Ryosuke Yamada
    Nakamasa Inoue
    Akio Nakamura
    Yutaka Satoh
    International Journal of Computer Vision, 2022, 130 : 990 - 1007
  • [48] Pre-training Universal Language Representation
    Li, Yian
    Zhao, Hai
    59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (ACL-IJCNLP 2021), VOL 1, 2021, : 5122 - 5133
  • [49] DILBERT: Customized Pre-Training for Domain Adaptation with Category Shift, with an Application to Aspect Extraction
    Lekhtman, Entony
    Ziser, Yftah
    Reichart, Roi
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 219 - 230
  • [50] Improving Relation Extraction through Syntax-induced Pre-training with Dependency Masking
    Tian, Yuanhe
    Song, Yan
    Xia, Fei
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), 2022, : 1875 - 1886