Annotating Columns with Pre-trained Language Models

Cited by: 19
Authors
Suhara, Yoshihiko [1 ]
Li, Jinfeng [1 ]
Li, Yuliang [1 ]
Zhang, Dan [1 ]
Demiralp, Cagatay [2 ]
Chen, Chen [1 ]
Tan, Wang-Chiew [3 ]
Affiliations
[1] Megagon Labs, Mountain View, CA 94041 USA
[2] Sigma Comp, San Francisco, CA USA
[3] Meta AI, Menlo Pk, CA USA
Keywords
table understanding; language models; multi-task learning; TABLES;
DOI
10.1145/3514221.3517906
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Inferring meta information about tables, such as column headers or relationships between columns, is an active research topic in data management, as many tables are missing some of this information. In this paper, we study the problem of annotating table columns (i.e., predicting column types and the relationships between columns) using only information from the table itself. We develop a multi-task learning framework (called DODUO) based on pre-trained language models, which takes the entire table as input and predicts column types/relations using a single model. Experimental results show that DODUO establishes new state-of-the-art performance on two benchmarks for the column type prediction and column relation prediction tasks, with up to 4.0% and 11.9% improvements, respectively. We also show that DODUO outperforms the previous state of the art with a minimal number of tokens, only 8 tokens per column. We release a toolbox and confirm the effectiveness of DODUO on a real-world data science problem through a case study.
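To make the abstract's description concrete, below is a minimal Python sketch of the DODUO-style setup: all columns of a table are serialized into one token sequence with a [CLS] token per column, encoded once by a pre-trained language model, and the per-column [CLS] embeddings feed two task heads (column types and column relations). It assumes the HuggingFace transformers and torch packages; the label-set sizes, head design, and serialization details are illustrative stand-ins, not the authors' released toolbox.

# A minimal sketch of DODUO-style column annotation (illustrative only;
# label-set sizes and serialization details are assumptions, not the
# authors' released implementation).
import torch
import torch.nn as nn
from transformers import BertTokenizerFast, BertModel

class ColumnAnnotator(nn.Module):
    """One shared encoder with two task heads: column types and column relations."""
    def __init__(self, num_types=78, num_relations=121):  # placeholder label-set sizes
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.encoder.config.hidden_size
        self.type_head = nn.Linear(hidden, num_types)         # one prediction per column
        self.rel_head = nn.Linear(2 * hidden, num_relations)  # prediction per column pair

    def forward(self, input_ids, attention_mask, cls_positions):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden = out.last_hidden_state[0]        # (seq_len, hidden); batch size 1
        col_embs = hidden[cls_positions]         # one [CLS] embedding per column
        type_logits = self.type_head(col_embs)
        # Relations between the first column and each other column.
        pairs = torch.cat(
            [col_embs[:1].expand(len(col_embs) - 1, -1), col_embs[1:]], dim=-1
        )
        rel_logits = self.rel_head(pairs)
        return type_logits, rel_logits

def serialize_table(columns, tokenizer, max_tokens_per_col=8):
    """Serialize all columns into one sequence: [CLS] v1 v2 ... [CLS] v1 v2 ... [SEP]."""
    ids, cls_positions = [], []
    for values in columns:
        cls_positions.append(len(ids))
        ids.append(tokenizer.cls_token_id)
        col_ids = tokenizer(" ".join(values), add_special_tokens=False)["input_ids"]
        ids.extend(col_ids[:max_tokens_per_col])  # truncate each column to a token budget
    ids.append(tokenizer.sep_token_id)
    return torch.tensor([ids]), torch.tensor(cls_positions)

if __name__ == "__main__":
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    table = [["Tokyo", "Paris", "Berlin"], ["Japan", "France", "Germany"]]
    input_ids, cls_pos = serialize_table(table, tokenizer)
    model = ColumnAnnotator()
    attention_mask = torch.ones_like(input_ids)
    type_logits, rel_logits = model(input_ids, attention_mask, cls_pos)
    print(type_logits.shape, rel_logits.shape)  # (num_cols, num_types), (num_cols-1, num_relations)

The table-as-a-whole serialization is what lets a single forward pass produce predictions for every column and column pair, which is the core idea behind the single-model, multi-task design described above.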
Pages: 1493-1503
Number of pages: 11
Related Papers
50 records in total
  • [21] Deep Entity Matching with Pre-Trained Language Models
    Li, Yuliang
    Li, Jinfeng
    Suhara, Yoshihiko
    Doan, AnHai
    Tan, Wang-Chiew
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2020, 14 (01): : 50 - 60
  • [22] Self-conditioning Pre-Trained Language Models
    Suau, Xavier
    Zappella, Luca
    Apostoloff, Nicholas
    [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
  • [23] A Survey of Knowledge Enhanced Pre-Trained Language Models
    Hu, Linmei
    Liu, Zeyi
    Zhao, Ziwang
    Hou, Lei
    Nie, Liqiang
    Li, Juanzi
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2024, 36 (04) : 1413 - 1430
  • [24] Context Analysis for Pre-trained Masked Language Models
    Lai, Yi-An
    Lalwani, Garima
    Zhang, Yi
    [J]. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020, : 3789 - 3804
  • [25] Exploring Lottery Prompts for Pre-trained Language Models
    Chen, Yulin
    Ding, Ning
    Wang, Xiaobin
    Hu, Shengding
    Zheng, Hai-Tao
    Liu, Zhiyuan
    Xie, Pengjun
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 15428 - 15444
  • [26] Empowering News Recommendation with Pre-trained Language Models
    Wu, Chuhan
    Wu, Fangzhao
    Qi, Tao
    Huang, Yongfeng
    [J]. SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 1652 - 1656
  • [27] Pre-trained language models: What do they know?
    Guimaraes, Nuno
    Campos, Ricardo
    Jorge, Alipio
    [J]. WILEY INTERDISCIPLINARY REVIEWS-DATA MINING AND KNOWLEDGE DISCOVERY, 2024, 14 (01)
  • [28] Capturing Semantics for Imputation with Pre-trained Language Models
    Mei, Yinan
    Song, Shaoxu
    Fang, Chenguang
    Yang, Haifeng
    Fang, Jingyun
    Long, Jiang
    [J]. 2021 IEEE 37TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2021), 2021, : 61 - 72
  • [29] Memorisation versus Generalisation in Pre-trained Language Models
    Tanzer, Michael
    Ruder, Sebastian
    Rei, Marek
    [J]. PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 7564 - 7578
  • [30] Evaluating the Summarization Comprehension of Pre-Trained Language Models
    Chernyshev, D. I.
    Dobrov, B. V.
    [J]. LOBACHEVSKII JOURNAL OF MATHEMATICS, 2023, 44 (08) : 3028 - 3039