Exploring hybrid parallel systems for probabilistic record linkage

Cited by: 2
Authors
Boratto, Murilo [1]
Alonso, Pedro [2]
Pinto, Clicia [3]
Melo, Pedro [3]
Barreto, Marcos [3]
Denaxas, Spiros [4]
Affiliations
[1] Univ Estado Bahia, Nucleo Arquitetura Comp & Sistemas Operacionais, Salvador, BA, Brazil
[2] Univ Politecn Valencia, Dept Informat Syst & Computat, Valencia, Spain
[3] Univ Fed Bahia, Lab Sistemas Distribuidos, Salvador, BA, Brazil
[4] UCL, Sch Comp Sci & Informat, Inst Hlth Informat Res, London, England
Source
JOURNAL OF SUPERCOMPUTING | 2019 / Vol. 75 / Issue 03
Funding
UK Medical Research Council; Bill & Melinda Gates Foundation;
Keywords
Probabilistic linkage; Public health; Performance evaluation; Multicore; Multi-GPU;
DOI
10.1007/s11227-018-2328-3
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Record linkage is a technique widely used to gather data stored in disparate data sources that presumably pertain to the same real-world entity. This integration can be done deterministically or probabilistically, depending on the existence of common key attributes among all data sources involved. The probabilistic approach is very time-consuming due to the number of records that must be compared, specifically in big data scenarios. In this paper, we propose and evaluate a methodology that simultaneously exploits multicore and multi-GPU architectures in order to perform the probabilistic linkage of large-scale Brazilian governmental databases. We present some algorithmic optimizations that provide high accuracy and improve performance by defining the best algorithm-architecture combination for a problem given its input size. We also discuss performance results obtained with different data samples, showing that a hybrid approach outperforms other configurations, providing an average speedup of 7.9 when linking up to 20.000 million records.
Pages: 1137 - 1149
Number of pages: 13
Related papers
50 items in total
  • [1] Exploring hybrid parallel systems for probabilistic record linkage
    Murilo Boratto
    Pedro Alonso
    Clicia Pinto
    Pedro Melo
    Marcos Barreto
    Spiros Denaxas
    [J]. The Journal of Supercomputing, 2019, 75 : 1137 - 1149
  • [2] Probabilistic record linkage
    Sayers, Adrian
    Ben-Shlomo, Yoav
    Blom, Ashley W.
    Steele, Fiona
    [J]. INTERNATIONAL JOURNAL OF EPIDEMIOLOGY, 2016, 45 (03) : 954 - 964
  • [3] A hybrid approach to record linkage using a combination of deterministic and probabilistic methodology
    Ong, Toan C.
    Duca, Lindsey M.
    Kahn, Michael G.
    Crume, Tessa L.
    [J]. JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2020, 27 (04) : 505 - 513
  • [4] Validating distance-based record linkage with probabilistic record linkage
    Domingo-Ferrer, J
    Torra, V
    [J]. TOPICS IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2002, 2504 : 207 - 215
  • [5] RECORD LINKAGE SYSTEMS
    HAUSER, WA
    [J]. AMERICAN JOURNAL OF EPIDEMIOLOGY, 1979, 110 (03) : 371 - 371
  • [6] A study on the probabilistic record linkage and its application
    Choi, Yeonok
    Lee, Sangin
    [J]. KOREAN JOURNAL OF APPLIED STATISTICS, 2021, 34 (05) : 849 - 861
  • [7] A Probabilistic Record Linkage Model for Survival Data
    Hof, Michel H.
    Ravelli, Anita C.
    Zwinderman, Aeilko H.
    [J]. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2017, 112 (520) : 1504 - 1515
  • [8] Probabilistic Record Linkage for Disclosure Risk Assessment
    Shlomo, Natalie
    [J]. PRIVACY IN STATISTICAL DATABASES, PSD 2014, 2014, 8744 : 269 - 282
  • [9] Results from simulated data sets: probabilistic record linkage outperforms deterministic record linkage
    Tromp, Miranda
    Ravelli, Anita C.
    Bonsel, Gouke J.
    Hasman, Arie
    Reitsma, Johannes B.
    [J]. JOURNAL OF CLINICAL EPIDEMIOLOGY, 2011, 64 (05) : 565 - 572
  • [10] An Introduction to Probabilistic Record Linkage with a Focus on Linkage Processing for WTC Registries
    Asher, Jana
    Resnick, Dean
    Brite, Jennifer
    Brackbill, Robert
    Cone, James
    [J]. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH, 2020, 17 (18) : 1 - 16