Exploiting block co-occurrence to control block sizes for entity resolution

Cited by: 6
Authors
Nascimento, Dimas Cassimiro [1, 2]
Pires, Carlos Eduardo Santos [2]
Mestre, Demetrio Gomes [2]
Affiliations
[1] Univ Fed Rural Pernambuco, Garanhuns, Brazil
[2] Univ Fed Campina Grande, Campina Grande, Paraiba, Brazil
Keywords
Deduplication; Entity resolution; Heuristics; Data quality; Record linkage; Adaptive blocking
DOI
10.1007/s10115-019-01347-0
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The problem of identifying duplicated entities in a dataset has gained increasing importance over the last decades. Because datasets are large and the problem has intrinsic quadratic complexity, solving it can be very costly. Both researchers and practitioners have developed a variety of techniques to speed up its solution. One of these techniques is blocking, an indexing technique that splits the dataset into a set of blocks, such that each block contains entities that share a common property evaluated by a blocking key function. To improve the efficacy of blocking, multiple blocking keys may be used, which generates a set of blocking results. In this paper, we investigate how to control the size of the blocks generated by multiple blocking keys while maintaining reasonable result quality, measured by the quality of the produced blocks. By controlling the size of the blocks, we can reduce the overall cost of solving an entity resolution problem and facilitate the execution of a variety of tasks (e.g., real-time and privacy-preserving entity resolution). To do so, we propose a number of heuristics that exploit the co-occurrence of entities among the generated blocks for pruning, splitting and merging blocks. The experimental results, obtained using four datasets, confirm the adequacy of the proposed heuristics for generating block sizes within a predefined range threshold while maintaining reasonable blocking quality.
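The ideas named in the abstract (multiple blocking keys, entity co-occurrence across blocks, and pruning an oversized block) can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the function names, the toy records, the two key functions, and the pruning rule (keep the records that co-occur most often with the other members of the block) are all assumptions chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_blocks(records, key_funcs):
    """Group record ids into blocks, one block per (key index, key value)."""
    blocks = defaultdict(set)
    for rid, rec in records.items():
        for i, key_func in enumerate(key_funcs):
            blocks[(i, key_func(rec))].add(rid)
    return dict(blocks)


def cooccurrence(blocks):
    """Count, for every pair of record ids, how many blocks contain both."""
    counts = defaultdict(int)
    for members in blocks.values():
        for a, b in combinations(sorted(members), 2):
            counts[(a, b)] += 1
    return dict(counts)


def prune_block(members, counts, max_size):
    """Hypothetical pruning rule (an assumption, not the authors' heuristic):
    keep the max_size records with the highest total co-occurrence with the
    other members of the block; ties are broken by record id."""
    if len(members) <= max_size:
        return set(members)

    def score(rid):
        return sum(
            counts.get(tuple(sorted((rid, other))), 0)
            for other in members
            if other != rid
        )

    ranked = sorted(members, key=lambda rid: (-score(rid), rid))
    return set(ranked[:max_size])


# Toy data (hypothetical): two blocking keys, name prefix and city.
records = {
    1: {"name": "john smith", "city": "recife"},
    2: {"name": "jon smith", "city": "recife"},
    3: {"name": "jane doe", "city": "recife"},
    4: {"name": "john smyth", "city": "natal"},
}
key_funcs = [lambda r: r["name"][:2], lambda r: r["city"]]

blocks = build_blocks(records, key_funcs)
counts = cooccurrence(blocks)
pruned = {key: prune_block(m, counts, max_size=2) for key, m in blocks.items()}
print(pruned[(1, "recife")])  # records 1 and 2 co-occur in two blocks
```

Capping a block at k records bounds its comparison cost at k(k-1)/2 pairs, which is the motivation for controlling block sizes; the splitting and merging heuristics mentioned in the abstract are not sketched here.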
Pages: 359-400
Number of pages: 42
Related papers
50 in total
  • [21] Combining entity co-occurrence with specialized word embeddings to measure entity relation in Alzheimer's disease
    Heo, Go Eun
    Xie, Qing
    Song, Min
    Lee, Jeong-Hoon
    BMC MEDICAL INFORMATICS AND DECISION MAKING, 2019, 19 (01)
  • [22] A Hybrid Semantic Relatedness Algorithm by Entity Co-Occurrence and Specialized Word Embeddings
    Heo, Go Eun
    Xie, Qing
    2019 IEEE INTERNATIONAL CONFERENCE ON HEALTHCARE INFORMATICS (ICHI), 2019, : 478 - 479
  • [23] Entity Co-occurrence Graph-Based Clustering for Twitter Event Detection
    Manaskasemsak, Bundit
    Netsiwawichian, Natthakit
    Rungsawang, Arnon
    ADVANCED INFORMATION NETWORKING AND APPLICATIONS, VOL 2, AINA 2024, 2024, 200 : 344 - 355
  • [24] Block Sizes Control For an Efficient Real Time Record Linkage
    Benkhaled, Hamid Naceur
    Berrabah, Djamel
    Boufares, Faouzi
    PROCEEDINGS OF 2020 5TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND ARTIFICIAL INTELLIGENCE: TECHNOLOGIES AND APPLICATIONS (CLOUDTECH'20), 2020, : 145 - 150
  • [25] A-OPTIMAL INCOMPLETE BLOCK-DESIGNS WITH UNEQUAL BLOCK SIZES FOR COMPARING TEST TREATMENTS WITH A CONTROL
    ANGELIS, L
    MOYSSIADIS, C
    JOURNAL OF STATISTICAL PLANNING AND INFERENCE, 1991, 28 (03) : 353 - 368
  • [26] Exploiting interaction of fine and coarse features and attribute co-occurrence for person attribute recognition
    Zhiyong Sun
    Junyong Ye
    Tongqing Wang
    Li Jiang
    Yang Li
    Multimedia Tools and Applications, 2021, 80 : 11887 - 11902
  • [27] Exploiting co-occurrence networks for classification of implicit inter-relationships in legal texts
    Sulis, Emilio
    Humphreys, Llio
    Vernero, Fabiana
    Amantea, Ilaria Angela
    Audrito, Davide
    Di Caro, Luigi
    INFORMATION SYSTEMS, 2022, 106
  • [28] Exploiting Co-occurrence Frequency of Emotions in Perceptual Evaluations To Train A Speech Emotion Classifier
    Chou, Huang-Cheng
    Lee, Chi-Chun
    Busso, Carlos
    INTERSPEECH 2022, 2022, : 161 - 165
  • [29] Exploiting interaction of fine and coarse features and attribute co-occurrence for person attribute recognition
    Sun, Zhiyong
    Ye, Junyong
    Wang, Tongqing
    Jiang, Li
    Li, Yang
    MULTIMEDIA TOOLS AND APPLICATIONS, 2021, 80 (08) : 11887 - 11902
  • [30] SC-Block: Supervised Contrastive Blocking Within Entity Resolution Pipelines
    Brinkmann, Alexander
    Shraga, Roee
    Bizer, Christina
    SEMANTIC WEB, PT I, ESWC 2024, 2024, 14664 : 121 - 142