A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication

Cited by: 344

Authors:

Christen, Peter [1]

Affiliations:

[1] Australian Natl Univ, Res Sch Comp Sci, Coll Engn & Comp Sci, Canberra, ACT 0200, Australia

Source:

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | 2012 / Vol. 24 / Issue 09

Keywords:

Data linkage; data matching; entity resolution; index techniques; blocking; experimental evaluation; scalability

DOI:

10.1109/TKDE.2011.127

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Record linkage is the process of matching records from several databases that refer to the same entities. When applied on a single database, this process is known as deduplication. Increasingly, matched data are becoming important in many application areas, because they can contain information that is not available otherwise, or that is too costly to acquire. Removing duplicate records in a single database is a crucial step in the data cleaning process, because duplicates can severely influence the outcomes of any subsequent data processing or data mining. With the increasing size of today's databases, the complexity of the matching process becomes one of the major challenges for record linkage and deduplication. In recent years, various indexing techniques have been developed for record linkage and deduplication. They are aimed at reducing the number of record pairs to be compared in the matching process by removing obvious nonmatching pairs, while at the same time maintaining high matching quality. This paper presents a survey of 12 variations of 6 indexing techniques. Their complexity is analyzed, and their performance and scalability are evaluated within an experimental framework using both synthetic and real data sets. No such detailed survey has so far been published.
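
The pair-reduction idea described in the abstract can be illustrated with a minimal sketch of key-based blocking, one common form of indexing. This is not taken from the paper: the record fields (`surname`, `postcode`), the `blocking_key` function, and the sample data are all hypothetical and chosen only for illustration. Records are grouped by a blocking key, and only records that share a key are paired for detailed comparison.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first letter of the surname plus the postcode.
    return record["surname"][:1].lower() + record["postcode"]


def candidate_pairs(records):
    # Group record identifiers by blocking key.
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[blocking_key(rec)].append(rec_id)
    # Only records within the same block become candidate pairs.
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Made-up sample records for a single-database deduplication.
records = {
    1: {"surname": "Smith", "postcode": "2600"},
    2: {"surname": "Smyth", "postcode": "2600"},
    3: {"surname": "Jones", "postcode": "2601"},
    4: {"surname": "Jonas", "postcode": "2601"},
    5: {"surname": "Miller", "postcode": "2602"},
}

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

With these five sample records, blocking leaves 2 candidate pairs out of the 10 possible pairs. The trade-off is that a poorly chosen key can place true matches in different blocks, so they are never compared.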


Pages: 1537 - 1555

Page count: 19

Related records: 50 in total

[1] Scalable Record Linkage
Wolcott, Luke
Clements, William
Saripalli, Prasad
2018 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2018, : 4268 - 4275
[2] A Bayesian Approach to Graphical Record Linkage and Deduplication
Steorts, Rebecca C.
Hall, Rob
Fienberg, Stephen E.
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2016, 111 (516) : 1660 - 1672
[3] reclin2: a Toolkit for Record Linkage and Deduplication
van der Laan, D. Jan
R JOURNAL, 2022, 14 (02) : 320 - 328
[4] Clustering-Based Scalable Indexing for Multi-party Privacy-Preserving Record Linkage
Ranbaduge, Thilina
Vatsalan, Dinusha
Christen, Peter
ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PART II, 2015, 9078 : 549 - 561
[5] A Survey on Scalable Image Indexing and Searching
Suchitra, S.
Chitrakala, S.
2013 FOURTH INTERNATIONAL CONFERENCE ON COMPUTING, COMMUNICATIONS AND NETWORKING TECHNOLOGIES (ICCCNT), 2013,
[6] Scalable Similarity Joins for Fast and Accurate Record Deduplication in Big Data
Rozinek, Ondrej
Borkovcova, Monika
Mares, Jan
Lecture Notes in Networks and Systems, 2024, 990 LNNS : 181 - 191
[7] Scalable Similarity Joins for Fast and Accurate Record Deduplication in Big Data
Rozinek, Ondrej
Borkovcova, Monika
Mares, Jan
GOOD PRACTICES AND NEW PERSPECTIVES IN INFORMATION SYSTEMS AND TECHNOLOGIES, VOL 6, WORLDCIST 2024, 2024, 990 : 181 - 191
[8] A Survey and Comparative Study of Data Deduplication Techniques
Malhotra, Jyoti
Bakal, Jagdish
2015 INTERNATIONAL CONFERENCE ON PERVASIVE COMPUTING (ICPC), 2015,
[9] A Record Linkage-Based Data Deduplication Framework with DataCleaner Extension
Azeroual, Otmane
Jha, Meena
Nikiforova, Anastasija
Sha, Kewei
Alsmirat, Mohammad
Jha, Sanjay
MULTIMODAL TECHNOLOGIES AND INTERACTION, 2022, 6 (04)
[10] Scalable Blocking for Privacy Preserving Record Linkage
Karakasidis, Alexandros
Koloniari, Georgia
Verykios, Vassilios S.
KDD'15: PROCEEDINGS OF THE 21ST ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2015, : 527 - 536
