Efficient Record Linkage in Data Streams

Cited by: 2
Authors
Karapiperis, Dimitrios [1]
Gkoulalas-Divanis, Aris [2]
Verykios, Vassilios S. [3]
Affiliations
[1] Int Hellen Univ, Thermi, Greece
[2] IBM Watson Hlth, Cambridge, MA USA
[3] Hellen Open Univ, Patras, Greece
Keywords
Entity resolution; record linkage; data streams
DOI
10.1109/BigData50022.2020.9378127
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Nowadays, a vast amount of information is collected in real time on a daily basis via users' handheld devices, web-based applications, and customer service interactions (among many others). The sheer volume of this data and the unprecedented rate at which it becomes available for processing, potentially combined with other attributes that are commonly met in traditional data sets, call for novel online record linkage techniques that can handle streams of data to discover records that refer to the same real-world entity. This paper introduces UniBlock, an online record linkage approach, supported by a novel data structure, that can adapt to any blocking algorithm to separate the most frequently accessed blocks from the rest and maintain these blocks in main memory. In UniBlock, this separation is performed in a randomized way, where the probability of eviction of a block is inversely proportional to its frequency of access, which makes our approach both simple and effective. Additionally, UniBlock provides accurate estimations of the proportion of matching record pairs in the underlying data sets in sublinear running time. Through experimental evaluation, we show that our approach outperforms the state-of-the-art methods in both accuracy and efficiency, and scales well to data streams.
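The eviction rule described in the abstract can be illustrated with a minimal Python sketch. This is not UniBlock's actual data structure, which the abstract does not specify. The class name FrequencyBiasedBlockCache, its capacity and seed parameters, and the choice to simply discard an evicted block (a real system would presumably move it to secondary storage) are all assumptions made for illustration. The only property taken from the abstract is that a block is chosen for eviction with probability inversely proportional to its access frequency.

```python
import random


class FrequencyBiasedBlockCache:
    """Hypothetical in-memory block store with randomized eviction.

    Each block is keyed by a blocking key produced by any blocking
    algorithm; the key function itself is left to the caller.
    """

    def __init__(self, capacity, seed=None):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.blocks = {}  # blocking key -> list of records
        self.hits = {}    # blocking key -> number of accesses
        self.rng = random.Random(seed)

    def _evict_one(self):
        # Weight each resident block by 1 / access count, so rarely
        # accessed blocks are the most likely victims.
        keys = list(self.blocks)
        weights = [1.0 / self.hits[k] for k in keys]
        victim = self.rng.choices(keys, weights=weights, k=1)[0]
        # Assumption: the evicted block is dropped; a real system would
        # likely spill it to disk instead.
        del self.blocks[victim]
        del self.hits[victim]
        return victim

    def add(self, key, record):
        """Insert a record and return the earlier records in its block,
        i.e. the candidates it should be compared against."""
        candidates = list(self.blocks.get(key, []))
        if key not in self.blocks:
            if len(self.blocks) >= self.capacity:
                self._evict_one()
            self.blocks[key] = []
            self.hits[key] = 0
        self.hits[key] += 1
        self.blocks[key].append(record)
        return candidates
```

In this sketch, a block accessed 50 times is 50 times less likely to be evicted than a block accessed once, so frequently accessed blocks tend to stay in memory without any explicit ordering or threshold.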
Pages: 523-532
Page count: 10