A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication

Cited by: 344
Authors
Christen, Peter [1]
Affiliation
[1] Australian Natl Univ, Res Sch Comp Sci, Coll Engn & Comp Sci, Canberra, ACT 0200, Australia
Keywords
Data linkage; data matching; entity resolution; index techniques; blocking; experimental evaluation; scalability
DOI
10.1109/TKDE.2011.127
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Record linkage is the process of matching records from several databases that refer to the same entities. When applied to a single database, this process is known as deduplication. Increasingly, matched data are becoming important in many application areas, because they can contain information that is not available otherwise, or that is too costly to acquire. Removing duplicate records in a single database is a crucial step in the data cleaning process, because duplicates can severely influence the outcomes of any subsequent data processing or data mining. With the increasing size of today's databases, the complexity of the matching process becomes one of the major challenges for record linkage and deduplication. In recent years, various indexing techniques have been developed for record linkage and deduplication. They are aimed at reducing the number of record pairs to be compared in the matching process by removing obvious nonmatching pairs, while at the same time maintaining high matching quality. This paper presents a survey of 12 variations of 6 indexing techniques. Their complexity is analyzed, and their performance and scalability are evaluated within an experimental framework using both synthetic and real data sets. No such detailed survey has so far been published.
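To make the idea of "removing obvious nonmatching pairs" concrete, the following is a minimal sketch of standard blocking, one widely used indexing approach. It is an illustration only and is not taken from the paper: the abstract does not list the surveyed techniques, and the record fields (`surname`, `postcode`), the `blocking_key` function, and the sample data are all hypothetical. Records are grouped by a blocking key, and only records that share a key are paired for detailed comparison.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first letter of the surname plus the postcode.
    # Real systems often use phonetic encodings or several keys instead.
    return (record["surname"][:1].lower(), record["postcode"])


def candidate_pairs(records):
    # Build an index from blocking key to the record ids that share it.
    blocks = defaultdict(list)
    for rec_id, record in records.items():
        blocks[blocking_key(record)].append(rec_id)

    # Only records within the same block become candidate pairs.
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Made-up sample data for a deduplication run on a single database.
records = {
    1: {"surname": "Smith", "postcode": "2600"},
    2: {"surname": "Smyth", "postcode": "2600"},
    3: {"surname": "Jones", "postcode": "2601"},
    4: {"surname": "Johnes", "postcode": "2601"},
    5: {"surname": "Miller", "postcode": "2602"},
}

pairs = candidate_pairs(records)
total = len(records) * (len(records) - 1) // 2  # all-pairs comparison count

print(sorted(pairs))
print(f"{len(pairs)} candidate pairs instead of {total}")
```

With this toy data the two candidate pairs are (1, 2) and (3, 4), compared with 10 pairs for an exhaustive comparison. The trade-off is that a true match whose records receive different keys (for example, a typo in the first letter of the surname) is never compared, which is why the choice of key affects matching quality as well as cost.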
Pages: 1537 - 1555
Number of pages: 19