Deep Entity Matching with Pre-Trained Language Models

Cited by: 121
Authors
Li, Yuliang [1]
Li, Jinfeng [1]
Suhara, Yoshihiko [1]
Doan, AnHai [2]
Tan, Wang-Chiew [1]
Affiliations
[1] Megagon Labs, Mountain View, CA 94041 USA
[2] Univ Wisconsin, Madison, WI USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2020, Vol. 14, No. 1
Keywords
Large dataset; Benchmarking; Computational linguistics
DOI
10.14778/3421424.3421431
Chinese Library Classification
TP [Automation technology, computer technology]
Subject Classification Code
0812
Abstract
We present DITTO, a novel entity matching system based on pre-trained Transformer-based language models. We fine-tune and cast EM as a sequence-pair classification problem to leverage such models with a simple architecture. Our experiments show that a straightforward application of language models such as BERT, DistilBERT, or RoBERTa pre-trained on large text corpora already significantly improves the matching quality and outperforms previous state-of-the-art (SOTA), by up to 29% of F1 score on benchmark datasets. We also developed three optimization techniques to further improve DITTO's matching capability. DITTO allows domain knowledge to be injected by highlighting important pieces of input information that may be of interest when making matching decisions. DITTO also summarizes strings that are too long so that only the essential information is retained and used for EM. Finally, DITTO adapts a SOTA technique on data augmentation for text to EM to augment the training data with (difficult) examples. This way, DITTO is forced to learn "harder" to improve the model's matching capability. The optimizations we developed further boost the performance of DITTO by up to 9.8%. Perhaps more surprisingly, we establish that DITTO can achieve the previous SOTA results with at most half the number of labeled data. Finally, we demonstrate DITTO's effectiveness on a real-world large-scale EM task. On matching two company datasets consisting of 789K and 412K records, DITTO achieves a high F1 score of 96.5%.
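To make the sequence-pair formulation in the abstract concrete, below is a minimal, hedged sketch of how a pair of entity records can be serialized and scored with a pre-trained language model using the HuggingFace transformers library. This is illustrative only and not the authors' DITTO code: the "COL ... VAL ..." serialization mirrors the scheme described in the paper, but the model choice, attribute names, and truncation length are assumptions, and the classifier head is meaningless until fine-tuned on labeled match/non-match pairs.

    # Sketch: entity matching as sequence-pair classification (not the official DITTO code).
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    MODEL = "distilbert-base-uncased"  # BERT or RoBERTa checkpoints work the same way
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

    def serialize(record: dict) -> str:
        # Flatten an entity record into one string; DITTO uses a similar
        # "COL <attribute> VAL <value>" scheme (exact tokens may differ).
        return " ".join(f"COL {attr} VAL {val}" for attr, val in record.items())

    left = {"title": "iPhone 12 64GB", "price": "699"}          # hypothetical records
    right = {"title": "Apple iPhone 12 (64 GB)", "price": "699.00"}

    # Encode the two serialized entities as one sequence pair and classify.
    inputs = tokenizer(serialize(left), serialize(right),
                       truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    match_prob = torch.softmax(logits, dim=-1)[0, 1].item()
    print(f"P(match) = {match_prob:.3f}")  # meaningful only after fine-tuning

In the full system described by the abstract, this classifier is fine-tuned on labeled pairs, and the three optimizations (domain-knowledge highlighting, summarization of long strings, and data augmentation) are applied to the serialized inputs before and during training.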
Pages: 50-60
Number of pages: 11