Efficient Record Linkage Algorithms Using Complete Linkage Clustering

Cited by: 15
Authors
Mamun, Abdullah-Al [1]
Aseltine, Robert [2]
Rajasekaran, Sanguthevar [1]
Affiliations
[1] Univ Connecticut, Dept Comp Sci & Engn, Storrs, CT USA
[2] Univ Connecticut, Publ Hlth Res Inst, E Hartford, CT USA
Source
PLOS ONE | 2016 / Volume 11 / Issue 04
Funding
National Science Foundation (USA);
Keywords
PARALLEL ALGORITHMS; INTEGRATION; IDENTIFICATION;
DOI
10.1371/journal.pone.0154446
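Chinese Library Classification number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code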
07 ; 0710 ; 09 ;
Abstract
Datasets held by different agencies often contain records of the same individuals. Linking these datasets to identify all the records belonging to the same individual is a crucial and challenging problem, especially given the large volumes of data. Many available record linkage algorithms are either slow or have low accuracy in finding matches and non-matches among the records. In this paper we propose efficient and reliable sequential and parallel algorithms for the record linkage problem based on complete linkage hierarchical clustering. In addition to hierarchical clustering, we use two other techniques: elimination of duplicate records and blocking. Our algorithms use sorting as a subroutine to identify identical copies of records. We have tested our algorithms on datasets with millions of synthetic records. Experimental results show that our algorithms achieve nearly 100% accuracy, and the parallel implementations achieve almost linear speedups. The time complexities of these algorithms do not exceed those of the previous best-known algorithms. Our proposed algorithms outperform the previous best-known algorithms in accuracy while keeping run times reasonable.
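The three ingredients named in the abstract (duplicate elimination by sorting, blocking, and complete linkage clustering) can be illustrated with a small sketch. This is not the authors' implementation: the record layout (tuples of strings), the summed edit distance, the blocking key (first letter of the second field), the threshold value, and all function names are assumptions made for illustration. The clustering loop is the naive agglomerative version, which recomputes every cluster-pair distance on each merge and so is far too slow for large blocks.

```python
from itertools import groupby

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def record_distance(r, s):
    """Assumed record metric: sum of field-wise edit distances."""
    return sum(edit_distance(x, y) for x, y in zip(r, s))

def deduplicate(records):
    """Sort so identical records become adjacent, then keep one copy of each."""
    return [key for key, _ in groupby(sorted(records))]

def block(records, key):
    """Group records by a blocking key; only records in a block are compared."""
    blocks = {}
    for r in records:
        blocks.setdefault(key(r), []).append(r)
    return list(blocks.values())

def complete_linkage(records, threshold):
    """Naive agglomerative clustering with the complete linkage criterion:
    the distance between two clusters is the MAXIMUM pairwise distance."""
    clusters = [[r] for r in records]

    def dist(c1, c2):
        return max(record_distance(r, s) for r in c1 for s in c2)

    while len(clusters) > 1:
        d, i, j = min((dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > threshold:          # closest pair is too far apart: stop
            break
        clusters[i].extend(clusters.pop(j))   # j > i, so index i is unaffected
    return clusters

def link(records, threshold=2, key=lambda r: r[1][:1]):
    """Hypothetical pipeline: deduplicate, block, then cluster each block."""
    linked = []
    for b in block(deduplicate(records), key):
        linked.extend(complete_linkage(b, threshold))
    return linked

records = [("john", "smith"), ("john", "smith"), ("jon", "smith"),
           ("mary", "jones"), ("john", "smyth")]
clusters = link(records, threshold=2)
```

Because complete linkage uses the maximum pairwise distance, every pair of records inside a returned cluster is within the threshold of each other. Single linkage does not give this guarantee, since it can chain dissimilar records together through intermediate ones.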
Pages: 21
Related papers
50 in total
  • [21] A Greedy Algorithm for Hierarchical Complete Linkage Clustering
    Althaus, Ernst
    Hildebrandt, Andreas
    Hildebrandt, Anna Katharina
    ALGORITHMS FOR COMPUTATIONAL BIOLOGY, 2014, 8542 : 25 - 34
  • [22] A Complete Linkage Algorithm for Clustering Dynamic Datasets
    Banerjee, Payel
    Chakrabarti, Amlan
    Ballabh, Tapas Kumar
    PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES INDIA SECTION A-PHYSICAL SCIENCES, 2024, : 471 - 486
  • [23] Improved Analysis of Complete-Linkage Clustering
    Anna Großwendt
    Heiko Röglin
    Algorithmica, 2017, 78 : 1131 - 1150
  • [24] Poisoning Complete-Linkage Hierarchical Clustering
    Biggio, Battista
    Bulo, Samuel Rota
    Pillai, Ignazio
    Mura, Michele
    Mequanint, Eyasu Zemene
    Pelillo, Marcello
    Roli, Fabio
    STRUCTURAL, SYNTACTIC, AND STATISTICAL PATTERN RECOGNITION, 2014, 8621 : 42 - 52
  • [25] Improved Analysis of Complete-Linkage Clustering
    Grosswendt, Anna
    Roeglin, Heiko
    ALGORITHMS - ESA 2015, 2015, 9294 : 656 - 667
  • [26] A note on using the F-measure for evaluating record linkage algorithms
    David Hand
    Peter Christen
    Statistics and Computing, 2018, 28 : 539 - 547
  • [27] A note on using the F-measure for evaluating record linkage algorithms
    Hand, David
    Christen, Peter
    STATISTICS AND COMPUTING, 2018, 28 (03) : 539 - 547
  • [28] Learnable similarity functions and their applications to clustering and record linkage
    Bilenko, M
    PROCEEDING OF THE NINETEENTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND THE SIXTEENTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2004, : 981 - 982
  • [29] Robust Temporal Graph Clustering for Group Record Linkage
    Nanayakkara, Charini
    Christen, Peter
    Ranbaduge, Thilina
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PAKDD 2019, PT II, 2019, 11440 : 526 - 538
  • [30] Efficient and Practical Approach for Private Record Linkage
    Yakout, Mohamed
    Atallah, Mikhail J.
    Elmagarmid, Ahmed
    ACM JOURNAL OF DATA AND INFORMATION QUALITY, 2012, 3 (03) : 1 - 28