Deep Entity Matching with Pre-Trained Language Models

Cited by: 121
Authors
Li, Yuliang [1 ]
Li, Jinfeng [1 ]
Suhara, Yoshihiko [1 ]
Doan, AnHai [2 ]
Tan, Wang-Chiew [1 ]
Affiliations
[1] Megagon Labs, Mountain View, CA 94041 USA
[2] Univ Wisconsin, Madison, WI USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2020, Vol. 14, No. 1
Keywords
Large datasets; Benchmarking; Computational linguistics
DOI
10.14778/3421424.3421431
CLC number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
We present DITTO, a novel entity matching system based on pre-trained Transformer-based language models. We fine-tune and cast EM as a sequence-pair classification problem to leverage such models with a simple architecture. Our experiments show that a straightforward application of language models such as BERT, DistilBERT, or RoBERTa pre-trained on large text corpora already significantly improves the matching quality and outperforms previous state-of-the-art (SOTA), by up to 29% of F1 score on benchmark datasets. We also developed three optimization techniques to further improve DITTO's matching capability. DITTO allows domain knowledge to be injected by highlighting important pieces of input information that may be of interest when making matching decisions. DITTO also summarizes strings that are too long so that only the essential information is retained and used for EM. Finally, DITTO adapts a SOTA technique on data augmentation for text to EM to augment the training data with (difficult) examples. This way, DITTO is forced to learn "harder" to improve the model's matching capability. The optimizations we developed further boost the performance of DITTO by up to 9.8%. Perhaps more surprisingly, we establish that DITTO can achieve the previous SOTA results with at most half the number of labeled data. Finally, we demonstrate DITTO's effectiveness on a real-world large-scale EM task. On matching two company datasets consisting of 789K and 412K records, DITTO achieves a high F1 score of 96.5%.
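The core idea of casting EM as sequence-pair classification can be sketched as follows: each record is serialized into a flat token sequence, and the two serialized records are fed to a pre-trained language model as a sequence pair whose pooled output is fine-tuned for a binary match/no-match decision. This is a minimal illustrative sketch, not DITTO's actual implementation; the `[COL]`/`[VAL]` attribute markers follow the paper's serialization scheme, while the record values and the `serialize` helper name are illustrative.

```python
def serialize(record: dict) -> str:
    """Flatten one entity record into a single string,
    marking each attribute name and value as in DITTO's
    serialization scheme."""
    return " ".join(f"[COL] {col} [VAL] {val}" for col, val in record.items())

# Two candidate records describing (possibly) the same product.
a = {"name": "Apple iPhone 12", "price": "799"}
b = {"name": "iPhone 12 (Apple)", "price": "799.00"}

# The pair (serialize(a), serialize(b)) would then be encoded as a
# sequence pair for a pre-trained model, e.g. with a HuggingFace-style
# tokenizer (assumed API, shown as a comment to keep this self-contained):
#
#   enc = tokenizer(serialize(a), serialize(b), truncation=True)
#
# and the classifier head over the [CLS] output is fine-tuned to
# predict match / no-match.
print(serialize(a))
print(serialize(b))
```

Because the pair is classified jointly, the model's self-attention can align attribute values across the two records (e.g. "Apple iPhone 12" with "iPhone 12 (Apple)"), which is what lets a pre-trained language model outperform hand-engineered similarity features.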
Pages: 50-60
Page count: 11
Related Papers
50 items total
  • [41] A Close Look into the Calibration of Pre-trained Language Models
    Chen, Yangyi
    Yuan, Lifan
    Cui, Ganqu
    Liu, Zhiyuan
    Ji, Heng
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 1343 - 1367
  • [42] Self-conditioning Pre-Trained Language Models
    Suau, Xavier
    Zappella, Luca
    Apostoloff, Nicholas
    [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022,
  • [43] A Survey of Knowledge Enhanced Pre-Trained Language Models
    Hu, Linmei
    Liu, Zeyi
    Zhao, Ziwang
    Hou, Lei
    Nie, Liqiang
    Li, Juanzi
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2024, 36 (04) : 1413 - 1430
  • [44] Context Analysis for Pre-trained Masked Language Models
    Lai, Yi-An
    Lalwani, Garima
    Zhang, Yi
    [J]. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020, : 3789 - 3804
  • [45] Exploring Lottery Prompts for Pre-trained Language Models
    Chen, Yulin
    Ding, Ning
    Wang, Xiaobin
    Hu, Shengding
    Zheng, Hai-Tao
    Liu, Zhiyuan
    Xie, Pengjun
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 15428 - 15444
  • [46] Empowering News Recommendation with Pre-trained Language Models
    Wu, Chuhan
    Wu, Fangzhao
    Qi, Tao
    Huang, Yongfeng
    [J]. SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 1652 - 1656
  • [47] Pre-trained language models: What do they know?
    Guimaraes, Nuno
    Campos, Ricardo
    Jorge, Alipio
    [J]. WILEY INTERDISCIPLINARY REVIEWS-DATA MINING AND KNOWLEDGE DISCOVERY, 2024, 14 (01)
  • [48] Capturing Semantics for Imputation with Pre-trained Language Models
    Mei, Yinan
    Song, Shaoxu
    Fang, Chenguang
    Yang, Haifeng
    Fang, Jingyun
    Long, Jiang
    [J]. 2021 IEEE 37TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2021), 2021, : 61 - 72
  • [49] Memorisation versus Generalisation in Pre-trained Language Models
    Tanzer, Michael
    Ruder, Sebastian
    Rei, Marek
    [J]. PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 7564 - 7578
  • [50] Evaluating the Summarization Comprehension of Pre-Trained Language Models
    Chernyshev, D. I.
    Dobrov, B. V.
    [J]. LOBACHEVSKII JOURNAL OF MATHEMATICS, 2023, 44 (08) : 3028 - 3039