Efficient Record Linkage in Data Streams

Cited by: 2
Authors
Karapiperis, Dimitrios [1]
Gkoulalas-Divanis, Aris [2]
Verykios, Vassilios S. [3]
Affiliations
[1] Int Hellen Univ, Thermi, Greece
[2] IBM Watson Hlth, Cambridge, MA USA
[3] Hellen Open Univ, Patras, Greece
Keywords
Entity resolution; record linkage; data streams
DOI
10.1109/BigData50022.2020.9378127
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Nowadays, a vast amount of information is collected in real time on a daily basis via users' handheld devices, web-based applications, and customer service interactions (among many others). The sheer volume of this data and the unprecedented rate at which it becomes available for processing, potentially combined with other attributes that are commonly met in traditional data sets, call for novel online record linkage techniques that can handle streams of data to discover records that refer to the same real-world entity. This paper introduces UniBlock, an online record linkage approach, supported by a novel data structure, that can adapt to any blocking algorithm to separate the most frequently accessed blocks from the rest, and maintain these blocks in main memory. In UniBlock, this separation is performed in a randomized way, where the probability of eviction of a block is inversely proportional to its frequency of access, empowering our approach with simplicity and effectiveness. Additionally, UniBlock provides accurate estimations of the proportion of matching record pairs in the underlying data sets in sublinear running time. Through experimental evaluation, we show that our approach outperforms the state-of-the-art methods in both accuracy and efficiency, being able to scale well to data streams.
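The randomized eviction idea in the abstract can be illustrated with a minimal Python sketch. This is not the UniBlock data structure, which the abstract does not specify; it is a toy cache, under assumed names (`FrequencyBiasedBlockCache`, `insert`, `evicted`), that keeps blocks in memory and, when over capacity, picks a victim block at random with weight 1 / (access count), so frequently accessed blocks are unlikely to be evicted. The `evicted` dictionary merely stands in for secondary storage.

```python
import random
from collections import defaultdict


class FrequencyBiasedBlockCache:
    """Toy block store with randomized, frequency-biased eviction (illustrative only)."""

    def __init__(self, capacity, seed=None):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.blocks = defaultdict(list)  # blocking key -> records held in memory
        self.hits = defaultdict(int)     # blocking key -> number of accesses
        self.evicted = {}                # hypothetical stand-in for secondary storage
        self.rng = random.Random(seed)

    def _evict_one(self, protect):
        # Every block except the one just accessed is a candidate victim.
        candidates = [k for k in self.blocks if k != protect]
        # Eviction weight is inversely proportional to the access count.
        weights = [1.0 / self.hits[k] for k in candidates]
        victim = self.rng.choices(candidates, weights=weights, k=1)[0]
        self.evicted[victim] = self.blocks.pop(victim)

    def insert(self, key, record):
        """Return the records already in the block for `key`, then store `record`."""
        if key not in self.blocks and key in self.evicted:
            self.blocks[key] = self.evicted.pop(key)  # reload a previously evicted block
        candidates = list(self.blocks[key])  # records to compare against `record`
        self.blocks[key].append(record)
        self.hits[key] += 1
        while len(self.blocks) > self.capacity:
            self._evict_one(protect=key)
        return candidates
```

In use, each incoming record is passed to `insert` together with its blocking key (produced by whichever blocking algorithm is in place), and the returned candidates are the records it would be compared against. A block accessed 50 times competes for eviction with weight 1/50, against weight 1 for a block accessed once.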
Pages: 523-532
Page count: 10