Exploiting block co-occurrence to control block sizes for entity resolution

Cited by: 6
Authors
Nascimento, Dimas Cassimiro [1, 2]
Pires, Carlos Eduardo Santos [2 ]
Mestre, Demetrio Gomes [2 ]
Affiliations
[1] Univ Fed Rural Pernambuco, Garanhuns, Brazil
[2] Univ Fed Campina Grande, Campina Grande, Paraiba, Brazil
Keywords
Deduplication; Entity resolution; Heuristics; Data quality; Record linkage; Adaptive blocking
DOI
10.1007/s10115-019-01347-0
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The problem of identifying duplicated entities in a dataset has gained increasing importance in recent decades. Because datasets are large and the problem has intrinsic quadratic complexity, solving it can be very costly. Both researchers and practitioners have developed a variety of techniques to speed up its solution. One of these techniques is blocking, an indexing technique that splits the dataset into a set of blocks such that each block contains entities sharing a common property evaluated by a blocking key function. To improve the efficacy of blocking, multiple blocking keys may be used, which generates a set of blocking results. In this paper, we investigate how to control the size of the blocks generated by multiple blocking keys while maintaining reasonable result quality, measured by the quality of the produced blocks. By controlling block sizes, we can reduce the overall cost of solving an entity resolution problem and facilitate the execution of a variety of tasks (e.g., real-time and privacy-preserving entity resolution). To do so, we propose several heuristics that exploit the co-occurrence of entities among the generated blocks for pruning, splitting and merging blocks. Experimental results on four datasets confirm the adequacy of the proposed heuristics for generating block sizes within a predefined range threshold while maintaining reasonable blocking quality.
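The blocking idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the records, the two blocking key functions, the helper names (`build_blocks`, `co_occurrence`, `oversized`) and the `max_size` threshold are all hypothetical, chosen only to show how multiple blocking keys yield overlapping blocks, how often pairs of entities co-occur across those blocks, and which blocks exceed a size limit.

```python
from collections import defaultdict
from itertools import combinations


def build_blocks(records, key_funcs):
    """Group record ids by the value of each blocking key function.

    Returns a dict mapping (key_name, key_value) to the set of record ids.
    """
    blocks = defaultdict(set)
    for rid, rec in records.items():
        for name, fn in key_funcs.items():
            blocks[(name, fn(rec))].add(rid)
    return dict(blocks)


def co_occurrence(blocks):
    """Count how many blocks each pair of record ids shares."""
    counts = defaultdict(int)
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            counts[(a, b)] += 1
    return dict(counts)


def oversized(blocks, max_size):
    """Return the blocks whose size exceeds max_size (hypothetical threshold)."""
    return {k: v for k, v in blocks.items() if len(v) > max_size}


# Toy data, invented for illustration only.
records = {
    1: {"name": "Ana Silva", "city": "Recife"},
    2: {"name": "Ana Sylva", "city": "Recife"},
    3: {"name": "Bruno Costa", "city": "Recife"},
    4: {"name": "Bruna Costa", "city": "Natal"},
}

# Two assumed blocking keys: a two-letter name prefix and the city.
key_funcs = {
    "name_prefix": lambda r: r["name"][:2].lower(),
    "city": lambda r: r["city"].lower(),
}

blocks = build_blocks(records, key_funcs)
pairs = co_occurrence(blocks)
big = oversized(blocks, max_size=2)
```

In this toy setting, a pair that appears together under more than one key is a stronger duplicate candidate than a pair that shares only one block, which is the kind of signal a size-control heuristic could use when deciding what to prune, split or merge.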
Pages: 359-400
Page count: 42