Exploiting block co-occurrence to control block sizes for entity resolution

Cited by: 6
Authors
Nascimento, Dimas Cassimiro [1, 2]
Pires, Carlos Eduardo Santos [2]
Mestre, Demetrio Gomes [2]
Affiliations
[1] Univ Fed Rural Pernambuco, Garanhuns, Brazil
[2] Univ Fed Campina Grande, Campina Grande, Paraiba, Brazil
Keywords
Deduplication; Entity resolution; Heuristics; Data quality; Record linkage; Adaptive blocking
DOI
10.1007/s10115-019-01347-0
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The problem of identifying duplicated entities in a dataset has gained increasing importance during the last decades. Because of its intrinsic quadratic complexity, this problem can be very costly to solve on large datasets. Both researchers and practitioners have developed a variety of techniques to speed up its solution. One of these techniques is blocking, an indexing technique that splits the dataset into a set of blocks such that each block contains entities sharing a common property, as evaluated by a blocking key function. To improve the efficacy of blocking, multiple blocking keys may be used, which generates a set of blocking results. In this paper, we investigate how to control the size of the blocks generated by multiple blocking keys while maintaining reasonable result quality, measured by the quality of the produced blocks. By controlling the size of the blocks, we can reduce the overall cost of solving an entity resolution problem and facilitate the execution of a variety of tasks (e.g., real-time and privacy-preserving entity resolution). To do so, we propose many heuristics that exploit the co-occurrence of entities among the generated blocks for pruning, splitting and merging blocks. The experiments we carried out using four datasets confirm the adequacy of the proposed heuristics for generating block sizes within a predefined range threshold while maintaining reasonable blocking quality.
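The blocking idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the toy records, the two blocking key functions, and the helper names `build_blocks` and `co_occurrence` are all assumptions made for illustration. The sketch only shows how multiple blocking keys produce overlapping blocks and how often each pair of entities co-occurs across them, which is the kind of signal a pruning, splitting or merging heuristic could draw on.

```python
from collections import defaultdict
from itertools import combinations


def build_blocks(records, key_funcs):
    """Group record ids by (key index, key value), one block per distinct key value."""
    blocks = defaultdict(set)
    for rid, rec in records.items():
        for i, key in enumerate(key_funcs):
            blocks[(i, key(rec))].add(rid)
    return dict(blocks)


def co_occurrence(blocks):
    """Count, for every pair of record ids, how many blocks contain both."""
    counts = defaultdict(int)
    for members in blocks.values():
        for a, b in combinations(sorted(members), 2):
            counts[(a, b)] += 1
    return dict(counts)


# Hypothetical toy dataset; records 1 and 2 are intended duplicates.
records = {
    1: {"name": "Maria Silva", "city": "Recife"},
    2: {"name": "Maria Sylva", "city": "Recife"},
    3: {"name": "Marcos Souza", "city": "Recife"},
    4: {"name": "Joao Lima", "city": "Natal"},
}

# Two illustrative blocking keys: a name prefix and the city.
key_funcs = [
    lambda r: r["name"][:4].lower(),
    lambda r: r["city"].lower(),
]

blocks = build_blocks(records, key_funcs)
pairs = co_occurrence(blocks)
print(pairs)  # {(1, 2): 2, (1, 3): 1, (2, 3): 1}
```

In this toy run the pair (1, 2) shares both blocks while the other pairs share only one, and only 3 of the 6 possible pairs are ever compared. A size-control heuristic could, for example, use such counts to decide which entities to keep together when an oversized block has to be split; how the paper's heuristics actually do this is not stated in the abstract.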
Pages: 359-400
Page count: 42
Related papers
50 in total
  • [1] Exploiting block co-occurrence to control block sizes for entity resolution
    Dimas Cassimiro Nascimento
    Carlos Eduardo Santos Pires
    Demetrio Gomes Mestre
    [J]. Knowledge and Information Systems, 2020, 62 : 359 - 400
  • [2] A Clustering-Based Framework to Control Block Sizes for Entity Resolution
    Fisher, Jeffrey
    Christen, Peter
    Wang, Qing
    Rahm, Erhard
    [J]. KDD'15: PROCEEDINGS OF THE 21ST ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2015 : 279 - 288
  • [3] Image retrieval based on co-occurrence matrix using block classification characteristics
    Kim, TS
    Kim, SJ
    Lee, KI
    [J]. ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2005, PT 1, 2005, 3767 : 946 - 956
  • [4] Block-based ordinal co-occurrence matrices for texture similarity evaluation
    Partio, M
    Cramariuc, B
    Gabbouj, M
    [J]. 2005 INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), VOLS 1-5, 2005 : 669 - 672
  • [5] Lexical co-occurrence and ambiguity resolution
    Witzel, Jeffrey
    Forster, Kenneth
    [J]. LANGUAGE COGNITION AND NEUROSCIENCE, 2014, 29 (02) : 158 - 185
  • [6] Image Retrieval Based on Weighted Block Color Histogram and Texton Co-occurrence Matrix
    Huang, Wenqing
    Dai, Jiazhe
    Wu, Qiang
    [J]. PROCEEDINGS OF THE 4TH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY AND MANAGEMENT INNOVATION, 2015, 28 : 848 - 853
  • [7] Exploiting Co-Occurrence of Low Frequent Terms in Patents
    Khattak, Akmal Saeed
    Heyer, Gerhard
    [J]. MAN-MACHINE INTERACTIONS 3, 2014, 242 : 459 - 466
  • [8] A Challenging Diagnostic Dilemma: Asymptomatic AV Block in COVID-19 and MRSA Co-Occurrence
    Boadla, Marlon R.
    Naeem, Azka
    Kumari, Sapna
    Uddin, Syed M. Mazhar
    Farooqui, Arafat
    Maheshwari, Sanjay
    Seitllari, Armando
    Haq, Zara
    Khan, Muhammad H.
    Epstein, David J.
    Singh, Sehajpreet
    Hollander, Gerald
    Kumar, Kelash
    [J]. JOURNAL OF COMMUNITY HOSPITAL INTERNAL MEDICINE PERSPECTIVES, 2024, 14 (02)
  • [9] Using Knowledge Graphs to Explain Entity Co-occurrence in Twitter
    Wang, Yiwei
    Carman, Mark James
    Li, Yuan-Fang
    [J]. CIKM'17: PROCEEDINGS OF THE 2017 ACM CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, 2017 : 2351 - 2354
  • [10] Adaptive Multiscale Block Compressed Sensing of Images based on Gray Level Co-Occurrence Matrix
    Li J.
    Guo J.
    Cao S.
    Zhao Y.
    [J]. Journal of Engineering Science and Technology Review, 2020, 13 (05) : 169 - 175