Numerical Tuple Extraction from Tables with Pre-training

Cited by: 0
Authors
Yang, Qingping [1 ,3 ]
Cao, Yixuan [1 ,3 ]
Luo, Ping [1 ,2 ,3 ]
Affiliations
[1] Univ Chinese Acad Sci, CAS, Inst Comp Technol, Beijing, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Chinese Acad Sci, Key Lab Intelligent Informat Proc, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
tuple extraction; tabular representation; pre-training;
DOI
10.1145/3534678.3539460
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Tables are omnipresent on the web and in various vertical domains, storing massive amounts of valuable data. However, the great flexibility of table layouts hinders machines from understanding these valuable data. In order to unlock and utilize knowledge from tables, extracting data as numerical tuples is the first and critical step. As a form of relational data, numerical tuples have direct and transparent relationships between their elements and are therefore easy for machines to use. Extracting numerical tuples requires a deep understanding of the intricate correlations between cells. These correlations are presented implicitly in the text and visual appearance of tables and can be roughly classified into Hierarchy and Juxtaposition. Although many studies have made considerable progress in data extraction from tables, most of them only consider hierarchical relationships and neglect juxtapositions. Meanwhile, they only evaluate their methods on relatively small corpora. This paper proposes a new framework to extract numerical tuples from tables and evaluates it on a large test set. Specifically, we convert this task into a relation extraction problem between cells. To represent cells with their intricate correlations in tables, we propose a BERT-based pre-trained language model, TableLM, to encode tables with diverse layouts. To evaluate the framework, we collect a large finance dataset that includes 19,264 tables and 604K tuples. Extensive experiments on the dataset demonstrate the superiority of our framework over a well-designed baseline.
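The minimal Python sketch below (PyTorch plus Hugging Face Transformers) illustrates the core framing described in the abstract: linearize a table, encode it with a BERT-style model, and classify the relation between pairs of cells. It is an assumption-laden illustration, not the authors' implementation: it substitutes an off-the-shelf bert-base-uncased encoder for TableLM (which adds table-specific pre-training and layout encoding), and the relation label set, the CellPairRelationClassifier class, and the toy table are hypothetical.

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class CellPairRelationClassifier(nn.Module):
    # Encode a linearized table with a BERT encoder and score the relation
    # between two cells (e.g. hierarchy / juxtaposition / none; label set assumed).
    def __init__(self, num_relations: int = 3, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # Concatenate the two cell representations and score each relation type.
        self.classifier = nn.Linear(2 * hidden, num_relations)

    def forward(self, input_ids, attention_mask, head_spans, tail_spans):
        # Token-level representations of the linearized table.
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state

        def pool(spans):
            # Represent each cell by mean-pooling the tokens inside its span.
            return torch.stack([states[b, s:e].mean(dim=0)
                                for b, (s, e) in enumerate(spans)])

        head, tail = pool(head_spans), pool(tail_spans)
        return self.classifier(torch.cat([head, tail], dim=-1))

if __name__ == "__main__":
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    # Toy table flattened row by row; a real system would also encode layout features.
    enc = tokenizer("Revenue 2020 100 ; Revenue 2021 120", return_tensors="pt")
    model = CellPairRelationClassifier()
    # Hypothetical token spans: the header cell "Revenue" and the value cell "100".
    logits = model(enc["input_ids"], enc["attention_mask"],
                   head_spans=[(1, 2)], tail_spans=[(3, 4)])
    print(logits.shape)  # torch.Size([1, 3]): one score per relation type

In a full pipeline, the predicted cell-pair relations would then be assembled into numerical tuples; that assembly step and TableLM's table-specific pre-training objectives are beyond this sketch.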
Pages: 2233-2241
Number of pages: 9
Related Papers
50 records in total
  • [1] Understanding tables with intermediate pre-training
    Eisenschlos, Julian Martin
    Krichene, Syrine
    Mueller, Thomas
    [J]. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020,
  • [2] Zero-shot Key Information Extraction from Mixed-Style Tables: Pre-training on Wikipedia
    Yang, Qingping
    Hu, Yingpeng
    Cao, Rongyu
    Li, Hongwei
    Luo, Ping
    [J]. Proceedings - IEEE International Conference on Data Mining, ICDM, 2021, 2021-December : 1451 - 1456
  • [3] Zero-shot Key Information Extraction from Mixed-Style Tables: Pre-training on Wikipedia
    Yang, Qingping
    Hu, Yingpeng
    Cao, Rongyu
    Li, Hongwei
    Luo, Ping
    [J]. 2021 21ST IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM 2021), 2021, : 1451 - 1456
  • [4] A Method of Relation Extraction Using Pre-training Models
    Wang, Yu
    Sun, Yining
    Ma, Zuchang
    Gao, Lisheng
    Xu, Yang
    Wu, Yichen
    [J]. 2020 13TH INTERNATIONAL SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE AND DESIGN (ISCID 2020), 2020, : 176 - 179
  • [5] GeoLayoutLM: Geometric Pre-training for Visual Information Extraction
    Luo, Chuwei
    Cheng, Changxu
    Zheng, Qi
    Yao, Cong
    [J]. 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 7092 - 7101
  • [6] Relation Extraction with Weighted Contrastive Pre-training on Distant Supervision
    Wan, Zhen
    Cheng, Fei
    Liu, Qianying
    Mao, Zhuoyuan
    Song, Haiyue
    Kurohashi, Sadao
    [J]. 17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023, : 2580 - 2585
  • [7] Pre-training a Neural Model to Overcome Data Scarcity in Relation Extraction from Text
    Jung, Seokwoo
    Myaeng, Sung-Hyon
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP), 2019, : 176 - 180
  • [8] Multilingual Translation from Denoising Pre-Training
    Tang, Yuqing
    Tran, Chau
    Li, Xian
    Chen, Peng-Jen
    Goyal, Naman
    Chaudhary, Vishrav
    Gu, Jiatao
    Fan, Angela
    [J]. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 3450 - 3466
  • [9] Improving Information Extraction on Business Documents with Specific Pre-training Tasks
    Douzon, Thibault
    Duffner, Stefan
    Garcia, Christophe
    Espinas, Jeremy
    [J]. DOCUMENT ANALYSIS SYSTEMS, DAS 2022, 2022, 13237 : 111 - 125
  • [10] Multi-stage Pre-training over Simplified Multimodal Pre-training Models
    Liu, Tongtong
    Feng, Fangxiang
    Wang, Xiaojie
    [J]. 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 1 (ACL-IJCNLP 2021), 2021, : 2556 - 2565