Efficient Record Linkage in Data Streams

Cited by: 2

Authors:

Karapiperis, Dimitrios [1]

Gkoulalas-Divanis, Aris [2]

Verykios, Vassilios S. [3]

Affiliations:

[1] Int Hellen Univ, Thermi, Greece

[2] IBM Watson Hlth, Cambridge, MA USA

[3] Hellen Open Univ, Patras, Greece

Source:

2020 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA) | 2020

Keywords:

Entity resolution; record linkage; data streams

DOI:

10.1109/BigData50022.2020.9378127

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Nowadays, a vast amount of information is collected in real time on a daily basis via users' handheld devices, web-based applications, and customer service interactions (among many others). The sheer volume of this data and the unprecedented rate at which it becomes available for processing, potentially combined with other attributes that are commonly met in traditional data sets, call for novel online record linkage techniques that can handle streams of data to discover records that refer to the same real-world entity. This paper introduces UniBlock, an online record linkage approach, supported by a novel data structure, that can adapt to any blocking algorithm to separate the most frequently accessed blocks from the rest, and maintain these blocks in main memory. In UniBlock, this separation is performed in a randomized way, where the probability of eviction of a block is inversely proportional to its frequency of access, empowering our approach with simplicity and effectiveness. Additionally, UniBlock provides accurate estimations of the proportion of matching record pairs in the underlying data sets in sublinear running time. Through experimental evaluation, we show that our approach outperforms the state-of-the-art methods in both accuracy and efficiency, being able to scale well to data streams.
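
The randomized eviction rule described in the abstract can be illustrated with a small Python sketch. This is not the UniBlock data structure, whose details the abstract does not give; it is a toy model under stated assumptions. The class name `FrequencyBiasedBlockCache`, the `capacity` parameter, and the choice to simply discard an evicted block (rather than move it elsewhere) are all hypothetical. The only property taken from the abstract is that a block's chance of eviction is proportional to the inverse of its access count.

```python
import random


class FrequencyBiasedBlockCache:
    """Toy in-memory block store (hypothetical; not the UniBlock structure)."""

    def __init__(self, capacity, seed=None):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.blocks = {}  # blocking key -> records currently held in memory
        self.hits = {}    # blocking key -> number of accesses so far
        self.rng = random.Random(seed)

    def _evict_one(self):
        # Weight each block by 1 / access count, so rarely used blocks
        # are the most likely victims. Assumption: the victim is dropped.
        keys = list(self.blocks)
        weights = [1.0 / self.hits[k] for k in keys]
        victim = self.rng.choices(keys, weights=weights, k=1)[0]
        del self.blocks[victim]
        del self.hits[victim]
        return victim

    def insert(self, key, record):
        """Add a record to its block; return the earlier records in that block."""
        if key not in self.blocks:
            if len(self.blocks) >= self.capacity:
                self._evict_one()
            self.blocks[key] = []
            self.hits[key] = 0
        candidates = list(self.blocks[key])  # records to compare against
        self.blocks[key].append(record)
        self.hits[key] += 1
        return candidates
```

For example, with `cache = FrequencyBiasedBlockCache(capacity=2)`, calling `cache.insert("smith", "r1")` returns `[]` and a following `cache.insert("smith", "r2")` returns `["r1"]`, the candidate records a matcher would compare against. The blocking key itself would come from whatever blocking algorithm is in use; the sketch treats it as an opaque string.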


Pages: 523 - 532

Page count: 10

50 items in total

[31] Efficient clustering of uncertain data streams
Jin, Cheqing
Yu, Jeffrey Xu
Zhou, Aoying
Cao, Feng
KNOWLEDGE AND INFORMATION SYSTEMS, 2014, 40 (03) : 509 - 539
[32] Efficient Sequential and Parallel Algorithms for Incremental Record Linkage Using Complete Linkage Clustering
Baihan, Abdullah
Rajasekaran, Sanguthevar
2019 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE (BIBM), 2019, : 926 - 930
[33] Fast Bayesian Record Linkage for Streaming Data Contexts
Taylor, Ian
Kaplan, Andee
Betancourt, Brenda
JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS, 2024, 33 (03) : 833 - 844
[34] A Unified Record Linkage Strategy for Web Service Data
Kan, Qin
Yang, Yujiu
Zhen, Shiqiang
Liu, Wenhuang
THIRD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING: WKDD 2010, PROCEEDINGS, 2010, : 253 - 256
[35] A Bayesian record linkage model incorporating relational data
Sosa, Juan
Rodriguez, Abel
APPLIED STOCHASTIC MODELS IN BUSINESS AND INDUSTRY, 2023, 39 (06) : 755 - 771
[36] Record linkage strategies, outpatient procedures, and administrative data
Roos, LL
Walld, R
Wajda, A
Bond, R
Hartford, K
MEDICAL CARE, 1996, 34 (06) : 570 - 582
[37] A FORMALIZATION OF RECORD LINKAGE AND ITS APPLICATION TO DATA PROTECTION
Torra, Vicenc
Stokes, Klara
INTERNATIONAL JOURNAL OF UNCERTAINTY FUZZINESS AND KNOWLEDGE-BASED SYSTEMS, 2012, 20 (06) : 907 - 919
[39] Improved quality of tuberculosis data using record linkage
Bartholomay, Patricia
de Oliveira, Gisele Pinto
Pinheiro, Rejane Sobrino
Nogales Vasconcelos, Ana Maria
CADERNOS DE SAUDE PUBLICA, 2014, 30 (11) : 2459 - 2469
[40] Effective record linkage for mining campaign contribution data
Giraud-Carrier, C.
Goodliffe, J.
Jones, B. M.
Cueva, S.
KNOWLEDGE AND INFORMATION SYSTEMS, 2015, 45 (02) : 389 - 416
