Efficient algorithms for fast integration on large data sets from multiple sources

Cited by: 9
Authors
Mi, Tian [1]
Rajasekaran, Sanguthevar [1]
Aseltine, Robert [2]
Affiliations
[1] Univ Connecticut Storrs, Dept Comp Sci & Engn, Storrs, CT 06269 USA
[2] Univ Connecticut, Inst Publ Hlth Res, E Hartford, CT USA
Funding
US National Science Foundation
Keywords
RECORD-LINKAGE; PARALLEL ALGORITHMS; IDENTIFICATION;
DOI
10.1186/1472-6947-12-59
Chinese Library Classification number
R-058
Abstract
Background: Recent large-scale deployments of health information technology have created opportunities to integrate patient medical records with disparate public health, human service, and educational databases, providing comprehensive information related to health and development. Data integration techniques, which identify records in multiple data sets that belong to the same individual, are essential to these efforts. Several algorithms proposed in the literature are adept at integrating records from two different datasets. Our algorithms aim to integrate multiple (in particular, more than two) datasets efficiently.
Methods: Hierarchical clustering based solutions are used to integrate multiple (in particular, more than two) datasets. Edit distance serves as the basic distance measure, and distance calculations for common input errors are also studied. Several techniques are applied to improve the algorithms in both time and space: 1) Partial Construction of the Dendrogram (PCD), which ignores the levels above the threshold; 2) Ignoring the Dendrogram Structure (IDS); 3) Faster Computation of the Edit Distance (FCED), which uses upper bounds on the edit distance to predict how the distance compares with the threshold; and 4) a pre-processing blocking phase that limits dynamic computation to within each block.
Results: We have experimentally validated our algorithms on large simulated data as well as real data. Accuracy and completeness are defined stringently to show the performance of our algorithms. In addition, we employ a four-category analysis. Comparison with FEBRL shows the robustness of our approach.
Conclusions: In the experiments we conducted, the accuracy we observed exceeded 90% for the simulated data in most cases. On a real dataset of 1,083,878 records, accuracy reached 97.7% with the constant threshold and 98.1% with the proportional threshold.
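The pipeline outlined in the Methods paragraph (blocking, thresholded edit distance, and clustering without a full dendrogram) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: records are plain strings, the blocking key is the first character, the threshold is a constant, and clusters are formed by single linkage through a union-find structure. The early-exit test in `bounded_edit_distance` is a standard lower-bound cutoff swapped in for illustration; it is not the upper-bound prediction that FCED describes. All function and parameter names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def bounded_edit_distance(a, b, threshold):
    """Return the Levenshtein distance if it is <= threshold, else None."""
    # The length difference is a lower bound on the edit distance.
    if abs(len(a) - len(b)) > threshold:
        return None
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        # Row minima never decrease, so the final distance is >= min(curr).
        if min(curr) > threshold:
            return None
        prev = curr
    return prev[-1] if prev[-1] <= threshold else None


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def link_records(records, threshold, block_key=lambda r: r[:1]):
    """Group records whose edit distance chains stay within the threshold.

    Assumption: single linkage inside each block; no dendrogram is stored.
    """
    # Blocking phase: only records sharing a key are ever compared.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)

    parent = list(range(len(records)))
    for members in blocks.values():
        for i, j in combinations(members, 2):
            ri, rj = find(parent, i), find(parent, j)
            if ri == rj:
                continue  # already in the same cluster, skip the distance
            if bounded_edit_distance(records[i], records[j], threshold) is not None:
                parent[rj] = ri

    clusters = defaultdict(list)
    for idx, rec in enumerate(records):
        clusters[find(parent, idx)].append(rec)
    return sorted(sorted(group) for group in clusters.values())


if __name__ == "__main__":
    sample = ["smith john", "smyth john", "smith jon",
              "brown mary", "browne mary", "adams lee"]
    for group in link_records(sample, threshold=1):
        print(group)
```

Because merging stops at the threshold and only cluster membership is tracked, the sketch never materialises the upper levels of a dendrogram, which is the general idea behind skipping work above the cutoff. A real deployment would likely compare several fields per record and use a more selective blocking key.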
Pages: 12