Numerical Tuple Extraction from Tables with Pre-training

Cited by: 0
Authors
Yang, Qingping [1 ,3 ]
Cao, Yixuan [1 ,3 ]
Luo, Ping [1 ,2 ,3 ]
Affiliations
[1] Univ Chinese Acad Sci, CAS, Inst Comp Technol, Beijing, Peoples R China
[2] Peng Cheng Lab, Shenzhen, Peoples R China
[3] Chinese Acad Sci, Key Lab Intelligent Informat Proc, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
tuple extraction; tabular representation; pre-training;
DOI
10.1145/3534678.3539460
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Tables are omnipresent on the web and in various vertical domains, storing massive amounts of valuable data. However, the great flexibility of table layouts hinders machines from understanding this valuable data. To unlock and utilize the knowledge in tables, extracting data as numerical tuples is the first and critical step. As a form of relational data, numerical tuples have direct and transparent relationships between their elements and are therefore easy for machines to use. Extracting numerical tuples requires a deep understanding of the intricate correlations between cells. These correlations are presented implicitly in the text and visual appearance of tables, and can be roughly classified into Hierarchy and Juxtaposition. Although many studies have made considerable progress in extracting data from tables, most of them only consider hierarchical relationships and neglect juxtapositions. Moreover, they evaluate their methods only on relatively small corpora. This paper proposes a new framework for extracting numerical tuples from tables and evaluates it on a large test set. Specifically, we cast the task as a relation extraction problem between cells. To represent cells together with their intricate correlations, we propose a BERT-based pre-trained language model, TableLM, that encodes tables with diverse layouts. To evaluate the framework, we collect a large finance dataset that includes 19,264 tables and 604K tuples. Extensive experiments on this dataset demonstrate the superiority of our framework over a well-designed baseline.
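To illustrate what casting tuple extraction as relation classification between cell pairs can look like in practice, the sketch below shows a minimal, generic formulation in PyTorch. It is not the authors' TableLM: the model name, the learned row/column embeddings, the single-token-per-cell simplification, and the three-way relation label set (hierarchy, juxtaposition, none) are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch (not the paper's TableLM): classify the relation between a
# pair of table cells using a generic Transformer encoder over flattened cells.
import torch
import torch.nn as nn

class CellPairRelationModel(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256, num_layers=4,
                 num_heads=8, num_relations=3):
        super().__init__()
        # Token embeddings plus learned row/column embeddings so the encoder
        # sees both cell text and coarse table layout (an assumption here).
        self.tok_emb = nn.Embedding(vocab_size, hidden)
        self.row_emb = nn.Embedding(64, hidden)
        self.col_emb = nn.Embedding(64, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Relation head over a concatenated pair of cell representations,
        # e.g. hierarchy / juxtaposition / no relation.
        self.rel_head = nn.Linear(2 * hidden, num_relations)

    def forward(self, token_ids, row_ids, col_ids, pair_index):
        # token_ids, row_ids, col_ids: (batch, num_cells); one token stands in
        # for each cell's text here purely for brevity.
        x = self.tok_emb(token_ids) + self.row_emb(row_ids) + self.col_emb(col_ids)
        h = self.encoder(x)                         # (batch, num_cells, hidden)
        head = h[torch.arange(h.size(0)), pair_index[:, 0]]
        tail = h[torch.arange(h.size(0)), pair_index[:, 1]]
        return self.rel_head(torch.cat([head, tail], dim=-1))

# Toy usage: a 2x3 table flattened to 6 cells; score the relation between
# cell 0 (e.g. a header) and cell 4 (e.g. a numeric cell).
model = CellPairRelationModel()
tokens = torch.randint(0, 30522, (1, 6))
rows = torch.tensor([[0, 0, 0, 1, 1, 1]])
cols = torch.tensor([[0, 1, 2, 0, 1, 2]])
pair = torch.tensor([[0, 4]])
logits = model(tokens, rows, cols, pair)            # shape: (1, num_relations)
```

Under this framing, a numerical tuple is assembled downstream by grouping the cells whose pairwise relations link a value cell to its governing header cells; how the paper performs that grouping and pre-trains the encoder is described in the full text, not in this sketch.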
Pages: 2233 - 2241
Number of pages: 9
Related Papers
50 records in total
  • [41] Ontology Pre-training for Poison Prediction
    Glauer, Martin
    Neuhaus, Fabian
    Mossakowski, Till
    Hastings, Janna
    ADVANCES IN ARTIFICIAL INTELLIGENCE, KI 2023, 2023, 14236 : 31 - 45
  • [42] Realistic Channel Models Pre-training
    Huangfu, Yourui
    Wang, Jian
    Xu, Chen
    Li, Rong
    Ge, Yiqun
    Wang, Xianbin
    Zhang, Huazi
    Wang, Jun
    2019 IEEE GLOBECOM WORKSHOPS (GC WKSHPS), 2019,
  • [43] KEEP: An Industrial Pre-Training Framework for Online Recommendation via Knowledge Extraction and Plugging
    Zhang, Yujing
    Chan, Zhangming
    Xu, Shuhao
    Bian, Weijie
    Han, Shuguang
    Deng, Hongbo
    Zheng, Bo
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2022, 2022, : 3684 - 3693
  • [44] Structure-inducing pre-training
    Matthew B. A. McDermott
    Brendan Yap
    Peter Szolovits
    Marinka Zitnik
    Nature Machine Intelligence, 2023, 5 : 612 - 621
  • [45] Automated Commit Intelligence by Pre-training
    Liu, Shangqing
    Li, Yanzhou
    Xie, Xiaofei
    Ma, Wei
    Meng, Guozhu
    Liu, Yang
    ACM Transactions on Software Engineering and Methodology, 2024, 33 (08)
  • [46] Unsupervised Pre-Training for Voice Activation
    Kolesau, Aliaksei
    Sesok, Dmitrij
APPLIED SCIENCES-BASEL, 2020, 10 (23): 1 - 13
  • [47] Pre-Training Without Natural Images
    Hirokatsu Kataoka
    Kazushige Okayasu
    Asato Matsumoto
    Eisuke Yamagata
    Ryosuke Yamada
    Nakamasa Inoue
    Akio Nakamura
    Yutaka Satoh
    International Journal of Computer Vision, 2022, 130 : 990 - 1007
  • [48] Pre-training Universal Language Representation
    Li, Yian
    Zhao, Hai
    59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (ACL-IJCNLP 2021), VOL 1, 2021, : 5122 - 5133
  • [49] DILBERT: Customized Pre-Training for Domain Adaptation with Category Shift, with an Application to Aspect Extraction
    Lekhtman, Entony
    Ziser, Yftah
    Reichart, Roi
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 219 - 230
  • [50] Improving Relation Extraction through Syntax-induced Pre-training with Dependency Masking
    Tian, Yuanhe
    Song, Yan
    Xia, Fei
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), 2022, : 1875 - 1886