Efficient algorithms for fast integration on large data sets from multiple sources

Cited by: 9

Authors:

Mi, Tian [1]

Rajasekaran, Sanguthevar [1]

Aseltine, Robert [2]

Affiliations:

[1] Univ Connecticut Storrs, Dept Comp Sci & Engn, Storrs, CT 06269 USA

[2] Univ Connecticut, Inst Publ Hlth Res, E Hartford, CT USA

Source:

BMC MEDICAL INFORMATICS AND DECISION MAKING | 2012 / Volume 12

Funding:

U.S. National Science Foundation;

Keywords:

RECORD-LINKAGE; PARALLEL ALGORITHMS; IDENTIFICATION;

DOI:

10.1186/1472-6947-12-59

Chinese Library Classification (CLC) number:

R-058;

Subject classification code:

Abstract:

Background: Recent large-scale deployments of health information technology have created opportunities to integrate patient medical records with disparate public health, human service, and educational databases, providing comprehensive information related to health and development. Data integration techniques, which identify records belonging to the same individual that reside in multiple data sets, are essential to these efforts. Several algorithms proposed in the literature are adept at integrating records from two different datasets. Our algorithms are aimed at integrating multiple (in particular, more than two) datasets efficiently.

Methods: Hierarchical clustering based solutions are used to integrate the datasets. Edit distance is used as the basic distance calculation, while distance calculation of common input errors is also studied. Several techniques have been applied to improve the algorithms in terms of both time and space: 1) Partial Construction of the Dendrogram (PCD), which ignores the levels above the threshold; 2) Ignoring the Dendrogram Structure (IDS); 3) Faster Computation of the Edit Distance (FCED), which predicts how the distance compares with the threshold by using upper bounds on the edit distance; and 4) a pre-processing blocking phase that limits dynamic computation to within each block.

Results: We have experimentally validated our algorithms on large simulated as well as real data. Accuracy and completeness are defined stringently to show the performance of our algorithms. In addition, we employ a four-category analysis. Comparison with FEBRL shows the robustness of our approach.

Conclusions: In the experiments we conducted, the accuracy we observed exceeded 90% for the simulated data in most cases. Accuracies of 97.7% and 98.1% were achieved for the constant and proportional thresholds, respectively, on a real dataset of 1,083,878 records.
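The general idea in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `bounded_edit_distance` and `link_records`, the first-character blocking key, and the sample records are all hypothetical. The sketch assumes that cutting a single-linkage dendrogram at a threshold gives the same clusters as the connected components of the "distance <= threshold" graph, so no dendrogram is built, and it uses a standard early-exit Levenshtein computation as a stand-in for the upper-bound technique named in the abstract.

```python
from typing import Callable, Dict, List


def bounded_edit_distance(a: str, b: str, threshold: int) -> int:
    """Return the Levenshtein distance if it is <= threshold,
    otherwise threshold + 1 (the exact value is not needed)."""
    # The length difference is a lower bound on the edit distance.
    if abs(len(a) - len(b)) > threshold:
        return threshold + 1
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        # The final distance is at least the minimum of any row,
        # so stop once a whole row exceeds the threshold.
        if min(curr) > threshold:
            return threshold + 1
        prev = curr
    return min(prev[-1], threshold + 1)


def link_records(records: List[str], threshold: int,
                 block_key: Callable[[str], str] = lambda r: r[:1]
                 ) -> List[List[int]]:
    """Group record indices whose strings are chained together by
    pairs within `threshold` edits, comparing only inside each block."""
    parent = list(range(len(records)))

    def find(x: int) -> int:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Hypothetical blocking step: records with different keys are never compared.
    blocks: Dict[str, List[int]] = {}
    for idx, rec in enumerate(records):
        blocks.setdefault(block_key(rec), []).append(idx)

    for members in blocks.values():
        for p in range(len(members)):
            for q in range(p + 1, len(members)):
                i, j = members[p], members[q]
                if find(i) == find(j):
                    continue  # already in the same cluster, skip the distance
                if bounded_edit_distance(records[i], records[j], threshold) <= threshold:
                    parent[find(j)] = find(i)

    clusters: Dict[int, List[int]] = {}
    for idx in range(len(records)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())


# Made-up sample records, for illustration only.
sample = ["john smith", "jon smith", "john smyth", "mary jones", "mary jonas"]
print(link_records(sample, threshold=1))
```

In this sketch "jon smith" and "john smyth" are two edits apart but still end up in one cluster, because single linkage chains them through "john smith". Blocking trades completeness for speed: two records of the same person that fall into different blocks are never compared.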


Pages: 12

