Entity Matching across Heterogeneous Sources

Cited by: 21
Authors
Yang, Yang [1]
Sun, Yizhou [3]
Tang, Jie [1, 2]
Ma, Bo [4]
Li, Juanzi [1]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
[2] Tsinghua Natl Lab Informat Sci & Technol TNList, Beijing, Peoples R China
[3] Northeastern Univ, Dept Comp Sci, Boston, MA 02115 USA
[4] Carnegie Mellon Univ, Dept Comp Sci, Pittsburgh, PA 15213 USA
Funding
National Science Foundation (USA);
Keywords
Heterogeneous sources; Cross-lingual matching; Topic model;
DOI
10.1145/2783258.2783353
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Given an entity in a source domain, finding its matched entities in another (target) domain is an important task in many applications. Traditionally, the problem has been addressed by first extracting major keywords corresponding to the source entity and then querying relevant entities from the target domain using those keywords. However, this method inevitably fails if the two domains have little or no overlap in content. An extreme case is one in which the source domain is in English and the target domain is in Chinese. In this paper, we formalize the problem as entity matching across heterogeneous sources and propose a probabilistic topic model to solve it. The model integrates topic extraction and entity matching, the two core subtasks of the problem, into a unified model. Specifically, to handle the disjoint-text problem, we use a cross-sampling process in our model to extract topics with terms coming from all the sources, and we leverage existing matching relations through latent topic layers instead of at the text layer. With the proposed model, we can not only find the matched documents for a query entity, but also explain why these documents are related by showing the common topics they share. Our experiments in two real-world applications show that the proposed model substantially improves matching performance (+19.8% and +7.1% in the two applications, respectively) compared with several alternative methods.
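The contrast drawn in the abstract, between keyword matching at the text layer and matching through a latent topic layer, can be illustrated with a minimal sketch. This is not the paper's model: the topic distributions below are hand-written, hypothetical values standing in for whatever a cross-lingual topic model would infer, and the names `jaccard`, `cosine`, `doc_a`, and `doc_b` are assumptions made only for illustration.

```python
import math


def jaccard(a, b):
    """Keyword overlap between two term lists (text-layer matching)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(p, q):
    """Cosine similarity between two topic distributions (topic-layer matching)."""
    dot = sum(x * y for x, y in zip(p, q))
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0


# Hypothetical source entity (English) and target documents (Chinese).
source_terms = ["topic", "model", "inference"]
target_terms = {
    "doc_a": ["主题", "模型", "推断"],
    "doc_b": ["数据库", "索引", "查询"],
}

# Hypothetical topic distributions over two shared latent topics.
# In practice these would be inferred by a topic model, not written by hand.
source_topics = [0.8, 0.2]
target_topics = {
    "doc_a": [0.7, 0.3],
    "doc_b": [0.1, 0.9],
}

# Text layer: the vocabularies are disjoint, so every overlap score is zero.
keyword_scores = {d: jaccard(source_terms, t) for d, t in target_terms.items()}

# Topic layer: documents can still be ranked by similarity of topic mixtures.
topic_scores = {d: cosine(source_topics, t) for d, t in target_topics.items()}
best_match = max(topic_scores, key=topic_scores.get)

print(keyword_scores)
print(best_match)
```

With disjoint vocabularies the keyword scores carry no signal, while the topic-layer scores still separate the two candidates; the shared topics with the highest weight are also what one would show to explain a match.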
Pages: 1395 - 1404
Number of pages: 10
Related papers
50 in total
  • [21] A method for topological entity matching in the integration of heterogeneous CAD systems
    Li, Xiaoxia
    He, Fazhi
    Cai, Xiantao
    Zhang, Dejun
    Chen, Yilin
    INTEGRATED COMPUTER-AIDED ENGINEERING, 2013, 20 (01) : 15 - 30
  • [22] Research on Entities Matching across Heterogeneous Databases
    Qiang, Bao-hua
    Zhang, Long
    Xi, Jian-qing
    2008 4TH INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS, NETWORKING AND MOBILE COMPUTING, VOLS 1-31, 2008, : 10988 - +
  • [23] Matching LOFAR sources across radio bands
    Boehme, L.
    Schwarz, D. J.
    de Gasperin, F.
    Roettgering, H. J. A.
    Williams, W. L.
    ASTRONOMY & ASTROPHYSICS, 2023, 674
  • [24] Entity matching in heterogeneous databases: A distance-based decision model
    Dey, D
    Sarkar, S
    De, P
    PROCEEDINGS OF THE THIRTY-FIRST HAWAII INTERNATIONAL CONFERENCE ON SYSTEM SCIENCES, VOL VII: SOFTWARE TECHNOLOGY TRACK, 1998, : 305 - 313
  • [25] Matching semantic Web Services across heterogeneous ontologies
    Guo, RQ
    Chen, DH
    Le, JJ
    FIFTH INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION TECHNOLOGY - PROCEEDINGS, 2005, : 264 - 268
  • [26] Group Identity Matching Across Heterogeneous Social Networks
    Qin, Hongchao
    Yuan, Ye
    Zhu, Feida
    Wang, Guoren
    WEB INFORMATION SYSTEMS ENGINEERING, WISE 2018, PT I, 2018, 11233 : 230 - 246
  • [27] SERIMI: Class-Based Matching for Instance Matching Across Heterogeneous Datasets
    Araujo, Samur
    Duc Thanh Tran
    de Vries, Arjen P.
    Schwabe, Daniel
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2015, 27 (05) : 1397 - 1410
  • [28] Early Integration Testing for Entity Reconciliation in the Context of Heterogeneous Data Sources
    Blanco, Raquel
    Enriquez, Jose G.
    Dominguez-Mayo, Francisco J.
    Escalona, M. J.
    Tuya, Javier
    IEEE TRANSACTIONS ON RELIABILITY, 2018, 67 (02) : 538 - 556
  • [29] Efficient m-closest entity matching over heterogeneous information networks
    Long, Wancheng
    Li, Xiaowen
    Wang, Liping
    Zhang, Fan
    Lin, Zhe
    Lin, Xuemin
    KNOWLEDGE-BASED SYSTEMS, 2023, 263
  • [30] ConnectionLens: Finding Connections Across Heterogeneous Data Sources
    Chanial, Camille
    Dziri, Redouane
    Galhardas, Helena
    Leblay, Julien
    Minh-Huong Le Nguyen
    Manolescu, Ioana
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2018, 11 (12) : 2030 - 2033