Numerical Tuple Extraction from Tables with Pre-training

Cited by: 1
|
Authors
Yang, Qingping [1 ,3 ]
Cao, Yixuan [1 ,3 ]
Luo, Ping [1 ,2 ,3 ]
Affiliations
[1] Univ Chinese Acad Sci, CAS, Inst Comp Technol, Beijing, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Chinese Acad Sci, Key Lab Intelligent Informat Proc, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 28TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2022 | 2022
Funding
National Natural Science Foundation of China;
Keywords
tuple extraction; tabular representation; pre-training;
DOI
10.1145/3534678.3539460
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Tables are omnipresent on the web and in various vertical domains, storing massive amounts of valuable data. However, the great flexibility of table layouts hinders machines from understanding this data. To unlock and utilize knowledge from tables, extracting data as numerical tuples is the first and critical step. As a form of relational data, numerical tuples have direct and transparent relationships between their elements and are therefore easy for machines to use. Extracting numerical tuples requires a deep understanding of the intricate correlations between cells. These correlations are presented implicitly in the text and visual appearance of tables and can be roughly classified into hierarchy and juxtaposition. Although many studies have made considerable progress in extracting data from tables, most of them consider only hierarchical relationships and neglect juxtapositions. Moreover, they evaluate their methods only on relatively small corpora. This paper proposes a new framework for extracting numerical tuples from tables and evaluates it on a large test set. Specifically, we convert the task into a relation extraction problem between cells. To represent cells together with their intricate correlations, we propose a BERT-based pre-trained language model, TableLM, to encode tables with diverse layouts. To evaluate the framework, we collect a large finance dataset comprising 19,264 tables and 604K tuples. Extensive experiments on this dataset demonstrate the superiority of our framework over a well-designed baseline.
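The abstract frames tuple extraction as relation classification over pairs of table cells encoded by a BERT-style model. The code below is a minimal illustrative sketch of that framing only, not the authors' TableLM: it assumes a generic Hugging Face BERT encoder over a linearized table, mean-pooled cell spans, and a hypothetical relation label set; all class and variable names are invented for illustration.

    # Minimal sketch of cell-pair relation extraction over a linearized table.
    # NOT the authors' TableLM; names like CellPairRelationClassifier,
    # cell_spans, and pair_index are hypothetical.
    import torch
    import torch.nn as nn
    from transformers import BertModel


    class CellPairRelationClassifier(nn.Module):
        def __init__(self, num_relations: int, encoder_name: str = "bert-base-uncased"):
            super().__init__()
            self.encoder = BertModel.from_pretrained(encoder_name)
            hidden = self.encoder.config.hidden_size
            # Score a relation for each (cell_i, cell_j) pair from the two cell vectors.
            self.classifier = nn.Linear(2 * hidden, num_relations)

        def forward(self, input_ids, attention_mask, cell_spans, pair_index):
            # Encode the linearized table text once.
            tokens = self.encoder(
                input_ids=input_ids, attention_mask=attention_mask
            ).last_hidden_state  # shape: (1, seq_len, hidden)
            # Mean-pool each cell's token span into a single vector per cell.
            cell_vecs = torch.stack([tokens[0, s:e].mean(dim=0) for s, e in cell_spans])
            # Concatenate the two cells of every candidate pair and classify the relation.
            pairs = torch.cat(
                (cell_vecs[pair_index[:, 0]], cell_vecs[pair_index[:, 1]]), dim=-1
            )
            return self.classifier(pairs)  # shape: (num_pairs, num_relations)

The actual TableLM described in the paper additionally models table layout through pre-training on tables with diverse structures, which this plain-text sketch omits.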
Pages: 2233-2241
Number of pages: 9