SeMBlock: A semantic-aware meta-blocking approach for entity resolution

Authors
Javdani, Delaram [1]
Rahmani, Hossein [1]
Weiss, Gerhard [2]
Affiliations
[1] Iran Univ Sci & Technol, Sch Comp Engn, Tehran, Iran
[2] Maastricht Univ, Dept Data Sci & Knowledge Engn, Maastricht, Netherlands
Keywords
Data matching; entity resolution; meta-blocking; word embedding; locality-sensitive hashing; semantic similarity; big data integration; algorithm
DOI
10.3233/IDT-200207
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Entity resolution refers to the process of identifying, matching, and integrating records belonging to unique entities in a data set. However, a comprehensive comparison across all pairs of records leads to quadratic matching complexity. Therefore, blocking methods are used to group similar entities into small blocks before the matching. Available blocking methods typically do not consider semantic relationships among records. In this paper, we propose a Semantic-aware Meta-Blocking approach called SeMBlock. SeMBlock considers the semantic similarity of records by applying locality-sensitive hashing (LSH) based on word embedding to achieve fast and reliable blocking in a large-scale data environment. To improve the quality of the blocks created, SeMBlock builds a weighted graph of semantically similar records and prunes the graph edges. We extensively compare SeMBlock with 16 existing blocking methods, using three real-world data sets. The experimental results show that SeMBlock significantly outperforms all 16 methods with respect to two relevant measures, F-measure and pair quality. The F-measure and pair quality of SeMBlock are approximately 7% and 27% higher, respectively, than those of recently released blocking methods.
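The pipeline outlined in the abstract (embed records, hash them with LSH, build a weighted graph, prune edges) can be illustrated with a minimal sketch. It rests on several assumptions that this record does not confirm: random-hyperplane LSH over averaged word vectors, block co-occurrence counts as edge weights, and pruning at the mean edge weight. The paper's actual hash family, weighting scheme, and pruning rule may differ. All function names, parameters, and the toy embeddings below are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations

def record_vector(text, embeddings, dim):
    """Average the word vectors of a record's tokens (zero vector if none are known)."""
    vecs = [embeddings[t] for t in text.lower().split() if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def lsh_blocks(vectors, dim, n_tables=4, n_bits=3, seed=0):
    """Random-hyperplane LSH: records with the same bit signature in a table share a block."""
    rng = random.Random(seed)
    blocks = defaultdict(set)
    for table in range(n_tables):
        planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
        for rid, vec in vectors.items():
            sig = tuple(sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in planes)
            blocks[(table, sig)].add(rid)
    return blocks

def blocking_graph(blocks):
    """Assumed weighting: edge weight = number of blocks in which two records co-occur."""
    weights = defaultdict(int)
    for members in blocks.values():
        for a, b in combinations(sorted(members), 2):
            weights[(a, b)] += 1
    return weights

def prune(weights):
    """Assumed pruning rule: keep edges whose weight is at least the mean edge weight."""
    if not weights:
        return set()
    mean = sum(weights.values()) / len(weights)
    return {pair for pair, w in weights.items() if w >= mean}

def pair_quality(candidates, true_matches):
    """Fraction of candidate pairs that are true matches."""
    return len(candidates & true_matches) / len(candidates) if candidates else 0.0

# Toy 2-dimensional embeddings, invented purely for illustration.
toy_embeddings = {
    "apple": [1.0, 0.0], "iphone": [0.9, 0.1],
    "banana": [-1.0, 0.2], "fruit": [-0.8, 0.3],
}
records = {"r1": "Apple iPhone", "r2": "iphone apple", "r3": "banana fruit"}
vectors = {rid: record_vector(txt, toy_embeddings, 2) for rid, txt in records.items()}
kept = prune(blocking_graph(lsh_blocks(vectors, dim=2)))
print(sorted(kept), pair_quality(kept, {("r1", "r2")}))
```

In this sketch, records with identical averaged vectors land in the same block in every hash table, so their edge carries the maximum weight and survives pruning. Raising `n_bits` makes blocks smaller, while raising `n_tables` gives near neighbours more chances to co-occur.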
Pages: 461-468
Page count: 8