Efficient Record Linkage Algorithms Using Complete Linkage Clustering

Cited by: 15
Authors
Mamun, Abdullah-Al [1]
Aseltine, Robert [2]
Rajasekaran, Sanguthevar [1]
Affiliations
[1] Univ Connecticut, Dept Comp Sci & Engn, Storrs, CT USA
[2] Univ Connecticut, Publ Hlth Res Inst, E Hartford, CT USA
Source
PLOS ONE | 2016 / Vol. 11 / Issue 04
Funding
National Science Foundation (USA)
Keywords
PARALLEL ALGORITHMS; INTEGRATION; IDENTIFICATION;
DOI
10.1371/journal.pone.0154446
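Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline classification codes
07; 0710; 09
Abstract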
Datasets maintained by different agencies often contain records for the same individuals. Linking these datasets to identify all records belonging to the same individual is a crucial and challenging problem, especially given the large volumes of data involved. Many existing record linkage algorithms are either slow or inaccurate in separating matches from non-matches. In this paper we propose efficient and reliable sequential and parallel algorithms for the record linkage problem based on complete linkage hierarchical clustering. In addition to hierarchical clustering, we use two other techniques: elimination of duplicate records and blocking. Our algorithms use sorting as a subroutine to identify identical copies of records. We have tested our algorithms on datasets with millions of synthetic records. Experimental results show that our algorithms achieve nearly 100% accuracy, and the parallel implementations achieve almost linear speedups. The time complexities of these algorithms do not exceed those of the previous best-known algorithms. Our proposed algorithms outperform the previous best-known algorithms in accuracy while keeping run times reasonable.
Pages: 21
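The abstract names three ingredients: sorting to collapse identical records, blocking, and complete linkage hierarchical clustering. The following is a minimal Python sketch of how those pieces could fit together; it is not the authors' implementation. The record layout (first name, last name, birth year), the blocking key (a last-name prefix), the summed edit distance, the threshold value, and every function name are assumptions made only for illustration, and the clustering loop is the naive cubic version rather than an efficient one.

```python
from collections import defaultdict
from itertools import combinations, groupby


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def record_distance(r1, r2):
    """Sum of per-field edit distances (an assumed, simple record metric)."""
    return sum(edit_distance(x, y) for x, y in zip(r1, r2))


def deduplicate(records):
    """Sort so identical copies become adjacent, then keep one per group.

    Returns (record, [original indices]) pairs.
    """
    order = sorted(range(len(records)), key=lambda i: records[i])
    return [(rec, list(idxs)) for rec, idxs in groupby(order, key=lambda i: records[i])]


def block(unique, prefix_len=1):
    """Group unique records by an assumed blocking key: last-name prefix (field 1)."""
    blocks = defaultdict(list)
    for pos, (rec, _) in enumerate(unique):
        blocks[rec[1][:prefix_len].lower()].append(pos)
    return blocks.values()


def complete_linkage(items, dist, threshold):
    """Naive agglomerative clustering; cluster distance = max pairwise distance."""
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = max(dist(x, y) for x in clusters[a] for y in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        if best[0] > threshold:  # closest pair is already too far apart
            break
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # a < b, so index a stays valid
    return clusters


def link_records(records, threshold=2, prefix_len=1):
    """Deduplicate, block, cluster each block; return groups of original indices."""
    unique = deduplicate(records)
    result = []
    for positions in block(unique, prefix_len):
        dist = lambda p, q: record_distance(unique[p][0], unique[q][0])
        for cluster in complete_linkage(positions, dist, threshold):
            result.append(sorted(i for p in cluster for i in unique[p][1]))
    return sorted(result)


records = [
    ("John", "Smith", "1980"),
    ("Jon", "Smith", "1980"),
    ("John", "Smith", "1980"),  # exact copy of record 0
    ("Mary", "Jones", "1975"),
    ("Marie", "Jones", "1975"),
    ("Alan", "Stone", "1962"),
]
print(link_records(records))  # [[0, 1, 2], [3, 4], [5]]
```

Complete linkage merges two clusters only when every cross-cluster pair is within the threshold, so a chain of small differences cannot pull unrelated records together the way single linkage can. Deduplicating first means the quadratic distance computations run only over distinct records inside each block.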