Entity Matching across Heterogeneous Sources

Cited by: 21
Authors
Yang, Yang [1]
Sun, Yizhou [3]
Tang, Jie [1, 2]
Ma, Bo [4]
Li, Juanzi [1]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
[2] Tsinghua Natl Lab Informat Sci & Technol TNList, Beijing, Peoples R China
[3] Northeastern Univ, Dept Comp Sci, Boston, MA 02115 USA
[4] Carnegie Mellon Univ, Dept Comp Sci, Pittsburgh, PA 15213 USA
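Funding
National Science Foundation (USA)
Keywords
Heterogeneous sources; Cross-lingual matching; Topic model
DOI
10.1145/2783258.2783353
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract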
Given an entity in a source domain, finding its matched entities in another (target) domain is an important task in many applications. Traditionally, the problem has been addressed by first extracting major keywords corresponding to the source entity and then querying relevant entities from the target domain using those keywords. However, this method inevitably fails if the two domains have little or no overlap in content. An extreme case is when the source domain is in English and the target domain is in Chinese. In this paper, we formalize the problem as entity matching across heterogeneous sources and propose a probabilistic topic model to solve it. The model integrates topic extraction and entity matching, the two core subtasks of the problem, into a unified model. Specifically, to handle the disjoint-text problem, we use a cross-sampling process in our model to extract topics with terms coming from all the sources, and leverage existing matching relations through latent topic layers instead of at the text layer. Benefiting from the proposed model, we can not only find the matched documents for a query entity, but also explain why these documents are related by showing the common topics they share. Our experiments in two real-world applications show that the proposed model substantially improves matching performance (+19.8% and +7.1% in the two applications, respectively) compared with several alternative methods.
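The contrast the abstract draws between matching at the text layer and matching through a latent topic layer can be illustrated with a small sketch. This is not the paper's model: the documents, the three "shared topics", and every topic distribution below are made-up values chosen for illustration, and the helper names `jaccard` and `cosine` are hypothetical. It only shows why keyword overlap gives no signal when the vocabularies are disjoint, while a comparison in a shared topic space still can.

```python
import math

def jaccard(a, b):
    """Keyword overlap between two term lists (text-layer matching)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cosine(p, q):
    """Cosine similarity between two topic distributions (topic-layer matching)."""
    dot = sum(x * y for x, y in zip(p, q))
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0

# Hypothetical toy data: an English source entity and two Chinese target documents.
source_terms = ["topic", "model", "inference"]
target_terms = {
    "doc_a": ["主题", "模型", "推断"],
    "doc_b": ["图像", "分割", "卷积"],
}

# Assumed topic distributions over three shared latent topics.
# In practice these would be inferred by a topic model; here they are invented.
source_topics = [0.8, 0.1, 0.1]
target_topics = {
    "doc_a": [0.7, 0.2, 0.1],
    "doc_b": [0.1, 0.1, 0.8],
}

# Text-layer scores: the vocabularies are disjoint, so no document stands out.
text_scores = {d: jaccard(source_terms, t) for d, t in target_terms.items()}

# Topic-layer scores: documents are compared in the shared topic space.
topic_scores = {d: cosine(source_topics, t) for d, t in target_topics.items()}
best_match = max(topic_scores, key=topic_scores.get)

print(text_scores)
print(topic_scores)
print(best_match)
```

With these invented numbers, both keyword-overlap scores are zero, whereas the topic-space comparison ranks `doc_a` above `doc_b`. The shared topics also indicate which themes the two documents have in common, which is the kind of explanation the abstract mentions.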
Pages: 1395-1404
Page count: 10