Exploiting block co-occurrence to control block sizes for entity resolution

Cited by: 6
Authors
Nascimento, Dimas Cassimiro [1, 2]
Pires, Carlos Eduardo Santos [2]
Mestre, Demetrio Gomes [2]
Affiliations
[1] Univ Fed Rural Pernambuco, Garanhuns, Brazil
[2] Univ Fed Campina Grande, Campina Grande, Paraiba, Brazil
Keywords
Deduplication; Entity resolution; Heuristics; Data quality; Record linkage; Adaptive blocking
DOI
10.1007/s10115-019-01347-0
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The problem of identifying duplicated entities in a dataset has gained increasing importance over the last decades. Because datasets are large and the problem has intrinsic quadratic complexity, solving it can be very costly. Both researchers and practitioners have developed a variety of techniques to speed up its solution. One of these techniques is blocking, an indexing technique that splits the dataset into a set of blocks such that each block contains entities that share a common property, as evaluated by a blocking key function. To improve the efficacy of blocking, multiple blocking keys may be used, and thus a set of blocking results is generated. In this paper, we investigate how to control the size of the blocks generated by multiple blocking keys while maintaining reasonable result quality, measured by the quality of the produced blocks. By controlling the size of the blocks, we can reduce the overall cost of solving an entity resolution problem and facilitate the execution of a variety of tasks (e.g., real-time and privacy-preserving entity resolution). To do so, we propose a set of heuristics that exploit the co-occurrence of entities among the generated blocks for pruning, splitting and merging blocks. Experiments on four datasets confirm the adequacy of the proposed heuristics for generating block sizes within a predefined range threshold while maintaining reasonable blocking quality.
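The ideas named in the abstract (multiple blocking keys, entity co-occurrence among blocks, and pruning to bound block size) can be illustrated with a minimal Python sketch. This is not the paper's heuristics: the function names, the toy records, the two blocking keys, and the rule "keep the entities with the highest total co-occurrence" are all assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def build_blocks(records, key_funcs):
    """Group record ids by (index of blocking key, key value)."""
    blocks = defaultdict(set)
    for rid, rec in records.items():
        for i, key_func in enumerate(key_funcs):
            blocks[(i, key_func(rec))].add(rid)
    return dict(blocks)


def cooccurrence(blocks):
    """Count, for each pair of record ids, how many blocks contain both."""
    counts = defaultdict(int)
    for members in blocks.values():
        for a, b in combinations(sorted(members), 2):
            counts[(a, b)] += 1
    return counts


def prune_oversized(blocks, max_size):
    """Hypothetical pruning rule: shrink blocks larger than max_size by
    keeping the entities that co-occur most with the other members."""
    counts = cooccurrence(blocks)

    def score(rid, members):
        return sum(counts[tuple(sorted((rid, other)))]
                   for other in members if other != rid)

    pruned = {}
    for block_id, members in blocks.items():
        if len(members) <= max_size:
            pruned[block_id] = set(members)
        else:
            # Highest score first; ties broken by record id for determinism.
            ranked = sorted(members, key=lambda r: (-score(r, members), r))
            pruned[block_id] = set(ranked[:max_size])
    return pruned


# Toy data (assumed): two blocking keys, surname and city.
records = {
    1: {"name": "john smith", "city": "recife"},
    2: {"name": "jon smith", "city": "recife"},
    3: {"name": "maria silva", "city": "recife"},
    4: {"name": "mario souza", "city": "natal"},
}
key_funcs = [lambda r: r["name"].split()[-1], lambda r: r["city"]]

blocks = build_blocks(records, key_funcs)
pair_counts = cooccurrence(blocks)
pruned = prune_oversized(blocks, max_size=2)
```

In this toy run, records 1 and 2 share both a surname block and a city block, so they co-occur twice; when the three-member city block is pruned to size 2, they are the pair that is kept. Splitting and merging, which the abstract also mentions, are not covered by this sketch.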
Pages: 359-400
Number of pages: 42