Deep Entity Matching with Pre-Trained Language Models

Cited by: 121
Authors
Li, Yuliang [1]
Li, Jinfeng [1]
Suhara, Yoshihiko [1]
Doan, AnHai [2]
Tan, Wang-Chiew [1]
Affiliations
[1] Megagon Labs, Mountain View, CA 94041 USA
[2] Univ Wisconsin, Madison, WI USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2020, Vol. 14, No. 1
Keywords
Large dataset; Benchmarking; Computational linguistics
DOI
10.14778/3421424.3421431
Chinese Library Classification
TP [Automation technology, computer technology]
Subject Classification Code
0812
Abstract
We present DITTO, a novel entity matching system based on pre-trained Transformer-based language models. We fine-tune and cast EM as a sequence-pair classification problem to leverage such models with a simple architecture. Our experiments show that a straightforward application of language models such as BERT, DistilBERT, or RoBERTa pre-trained on large text corpora already significantly improves the matching quality and outperforms previous state-of-the-art (SOTA), by up to 29% of F1 score on benchmark datasets. We also developed three optimization techniques to further improve DITTO's matching capability. DITTO allows domain knowledge to be injected by highlighting important pieces of input information that may be of interest when making matching decisions. DITTO also summarizes strings that are too long so that only the essential information is retained and used for EM. Finally, DITTO adapts a SOTA technique on data augmentation for text to EM to augment the training data with (difficult) examples. This way, DITTO is forced to learn "harder" to improve the model's matching capability. The optimizations we developed further boost the performance of DITTO by up to 9.8%. Perhaps more surprisingly, we establish that DITTO can achieve the previous SOTA results with at most half the number of labeled data. Finally, we demonstrate DITTO's effectiveness on a real-world large-scale EM task. On matching two company datasets consisting of 789K and 412K records, DITTO achieves a high F1 score of 96.5%.
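To make the sequence-pair formulation in the abstract concrete, below is a minimal, hedged sketch of how a pair of entity records can be serialized and scored with a pre-trained language model using the HuggingFace transformers library. This is illustrative only and not the authors' DITTO code: the "COL ... VAL ..." serialization mirrors the scheme described in the paper, but the model choice, attribute names, and truncation length are assumptions, and the classifier head is meaningless until fine-tuned on labeled match/non-match pairs.

    # Sketch: entity matching as sequence-pair classification (not the official DITTO code).
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    MODEL = "distilbert-base-uncased"  # BERT or RoBERTa checkpoints work the same way
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

    def serialize(record: dict) -> str:
        # Flatten an entity record into one string; DITTO uses a similar
        # "COL <attribute> VAL <value>" scheme (exact tokens may differ).
        return " ".join(f"COL {attr} VAL {val}" for attr, val in record.items())

    left = {"title": "iPhone 12 64GB", "price": "699"}          # hypothetical records
    right = {"title": "Apple iPhone 12 (64 GB)", "price": "699.00"}

    # Encode the two serialized entities as one sequence pair and classify.
    inputs = tokenizer(serialize(left), serialize(right),
                       truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    match_prob = torch.softmax(logits, dim=-1)[0, 1].item()
    print(f"P(match) = {match_prob:.3f}")  # meaningful only after fine-tuning

In the full system described by the abstract, this classifier is fine-tuned on labeled pairs, and the three optimizations (domain-knowledge highlighting, summarization of long strings, and data augmentation) are applied to the serialized inputs before and during training.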
Pages: 50-60
Number of pages: 11