A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication

Cited by: 344
Author
Christen, Peter [1]
Affiliation
[1] Australian Natl Univ, Res Sch Comp Sci, Coll Engn & Comp Sci, Canberra, ACT 0200, Australia
Keywords
Data linkage; data matching; entity resolution; index techniques; blocking; experimental evaluation; scalability
DOI
10.1109/TKDE.2011.127
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Record linkage is the process of matching records from several databases that refer to the same entities. When applied to a single database, this process is known as deduplication. Increasingly, matched data are becoming important in many application areas, because they can contain information that is not available otherwise, or that is too costly to acquire. Removing duplicate records in a single database is a crucial step in the data cleaning process, because duplicates can severely influence the outcomes of any subsequent data processing or data mining. With the increasing size of today's databases, the complexity of the matching process becomes one of the major challenges for record linkage and deduplication. In recent years, various indexing techniques have been developed for record linkage and deduplication. They are aimed at reducing the number of record pairs to be compared in the matching process by removing obvious nonmatching pairs, while at the same time maintaining high matching quality. This paper presents a survey of 12 variations of 6 indexing techniques. Their complexity is analyzed, and their performance and scalability are evaluated within an experimental framework using both synthetic and real data sets. No such detailed survey has so far been published.
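The idea of reducing the number of record pairs to compare can be illustrated with a minimal sketch of key-based blocking for deduplication. This is not code from the paper: the record fields (`surname`, `postcode`), the `blocking_key` function, and the sample data are all hypothetical, chosen only to show how grouping records by a key removes obvious nonmatching pairs before any detailed comparison.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first letter of the surname plus the postcode.
    return (record["surname"][:1].lower(), record["postcode"])


def candidate_pairs(records):
    """Return the record-id pairs that share a blocking key."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[blocking_key(rec)].append(rec_id)

    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are compared with each other.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Illustrative data only (assumed, not taken from the paper).
records = {
    1: {"surname": "Smith", "postcode": "2600"},
    2: {"surname": "Smyth", "postcode": "2600"},
    3: {"surname": "Miller", "postcode": "2601"},
    4: {"surname": "Millar", "postcode": "2601"},
    5: {"surname": "Jones", "postcode": "0200"},
}

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

In this toy case the five records would need 10 comparisons without indexing, while blocking leaves 2 candidate pairs. The trade-off is that a poorly chosen key can place true matches in different blocks, which is why matching quality has to be evaluated alongside the reduction in comparisons.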
Pages: 1537-1555
Number of pages: 19