Efficient Record Linkage in Data Streams

Cited by: 2
Authors
Karapiperis, Dimitrios [1]
Gkoulalas-Divanis, Aris [2]
Verykios, Vassilios S. [3]
Affiliations
[1] Int Hellen Univ, Thermi, Greece
[2] IBM Watson Hlth, Cambridge, MA USA
[3] Hellen Open Univ, Patras, Greece
Keywords
Entity resolution; record linkage; data streams
DOI
10.1109/BigData50022.2020.9378127
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Nowadays, a vast amount of information is collected in real time on a daily basis via users' handheld devices, web-based applications, and customer service interactions (among many others). The sheer volume of this data and the unprecedented rate at which it becomes available for processing, potentially combined with other attributes that are commonly met in traditional data sets, call for novel online record linkage techniques that can handle streams of data to discover records that refer to the same real-world entity. This paper introduces UniBlock, an online record linkage approach, supported by a novel data structure, that can adapt to any blocking algorithm to separate the most frequently accessed blocks from the rest and maintain these blocks in main memory. In UniBlock, this separation is performed in a randomized way, where the probability of eviction of a block is inversely proportional to its frequency of access, which makes our approach both simple and effective. Additionally, UniBlock provides accurate estimations of the proportion of matching record pairs in the underlying data sets in sublinear running time. Through experimental evaluation, we show that our approach outperforms the state-of-the-art methods in both accuracy and efficiency, and scales well to data streams.
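The randomized eviction idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' UniBlock data structure, whose details are not given here. The class name `FrequencyBiasedBlockCache`, its methods, and the choice to simply drop evicted blocks (rather than move them to secondary storage) are all assumptions made for illustration. The only property taken from the abstract is that a block's eviction probability is inversely proportional to its access frequency.

```python
import random


class FrequencyBiasedBlockCache:
    """Hypothetical in-memory block store with randomized eviction.

    Illustrative only: not the UniBlock structure from the paper.
    """

    def __init__(self, capacity, rng=None):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.blocks = {}  # blocking key -> list of record ids
        self.hits = {}    # blocking key -> number of accesses
        self.rng = rng or random.Random()

    def _evict_one(self):
        keys = list(self.blocks)
        # Eviction weight is 1 / access count, so rarely accessed
        # blocks are the most likely victims.
        weights = [1.0 / self.hits[k] for k in keys]
        victim = self.rng.choices(keys, weights=weights, k=1)[0]
        # Assumption: the evicted block is dropped; a real system
        # would more likely spill it to secondary storage.
        del self.blocks[victim]
        del self.hits[victim]
        return victim

    def insert(self, key, record_id):
        """Add a record to its block and return the records that were
        already in that block (the candidates for comparison)."""
        if key not in self.blocks:
            if len(self.blocks) >= self.capacity:
                self._evict_one()
            self.blocks[key] = []
            self.hits[key] = 0
        self.hits[key] += 1
        candidates = list(self.blocks[key])
        self.blocks[key].append(record_id)
        return candidates
```

In this sketch the blocking key is supplied by the caller, so any blocking function could be used to produce it. Each arriving record is compared only against the candidates returned by `insert`, and frequently accessed blocks tend to stay in memory because their eviction weight shrinks as their access count grows.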
Pages: 523-532
Number of pages: 10