Efficient algorithms for fast integration on large data sets from multiple sources

Cited by: 9
Authors
Mi, Tian [1]
Rajasekaran, Sanguthevar [1]
Aseltine, Robert [2]
Affiliations
[1] Univ Connecticut Storrs, Dept Comp Sci & Engn, Storrs, CT 06269 USA
[2] Univ Connecticut, Inst Publ Hlth Res, E Hartford, CT USA
Funding
National Science Foundation (USA)
Keywords
RECORD-LINKAGE; PARALLEL ALGORITHMS; IDENTIFICATION;
DOI
10.1186/1472-6947-12-59
Chinese Library Classification number
R-058
Abstract
Background: Recent large-scale deployments of health information technology have created opportunities to integrate patient medical records with disparate public health, human service, and educational databases, providing comprehensive information related to health and development. Data integration techniques, which identify records belonging to the same individual that reside in multiple data sets, are essential to these efforts. Several algorithms proposed in the literature are adept at integrating records from two different datasets. Our algorithms are aimed at efficiently integrating multiple (in particular, more than two) datasets.
Methods: Hierarchical-clustering-based solutions are used to integrate the datasets. Edit distance is the basic distance measure; distance calculations that account for common input errors are also studied. Several techniques have been applied to improve the algorithms in terms of both time and space: 1) Partial Construction of the Dendrogram (PCD), which ignores the levels above the threshold; 2) Ignoring the Dendrogram Structure (IDS); 3) Faster Computation of the Edit Distance (FCED), which predicts the distance relative to the threshold using upper bounds on the edit distance; and 4) a pre-processing blocking phase that limits dynamic computation to within each block.
Results: We have experimentally validated our algorithms on large simulated data as well as real data. Accuracy and completeness are defined stringently to show the performance of our algorithms. In addition, we employ a four-category analysis. Comparison with FEBRL shows the robustness of our approach.
Conclusions: In the experiments we conducted, the accuracy we observed exceeded 90% for the simulated data in most cases. Accuracies of 97.7% and 98.1% were achieved for the constant and proportional thresholds, respectively, on a real dataset of 1,083,878 records.
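The abstract's combination of blocking, a distance threshold, and clustering can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: records are plain strings, the blocking key is the first character (a hypothetical choice), and linkage is single linkage, which the abstract does not specify. Under single linkage, cutting the dendrogram at threshold k gives the connected components of the graph that links records at distance at most k, so the sketch uses union-find rather than building a dendrogram. The distance routine is a standard threshold-bounded dynamic program with early exit; it stands in for FCED, whose upper-bound technique is not described in the abstract. The names `bounded_edit_distance`, `link_records`, and `block_key` are illustrative.

```python
from collections import defaultdict


def bounded_edit_distance(a, b, k):
    """Return the Levenshtein distance of a and b if it is <= k, else k + 1."""
    if abs(len(a) - len(b)) > k:
        return k + 1  # the length gap alone is a lower bound on the distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        if min(curr) > k:
            return k + 1  # the final distance cannot be below this row's minimum
        prev = curr
    return min(prev[-1], k + 1)


def link_records(records, k, block_key=lambda r: r[:1]):
    """Group record indices whose strings chain together within distance k.

    block_key is a hypothetical blocking function: only records sharing
    a key are compared, which limits the number of distance computations.
    """
    parent = list(range(len(records)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)

    for members in blocks.values():
        for pos, i in enumerate(members):
            for j in members[pos + 1:]:
                if bounded_edit_distance(records[i], records[j], k) <= k:
                    parent[find(i)] = find(j)  # merge the two clusters

    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


names = ["john smith", "jon smith", "john smyth",
         "mary jones", "mary jonse", "jane doe"]
print(link_records(names, k=2))  # [[0, 1, 2], [3, 4], [5]]
```

Blocking trades completeness for speed: two records of the same person that fall into different blocks (for example, a typo in the first character under this key) are never compared, so the choice of key matters in practice.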
Pages: 12
Related papers
50 in total
  • [1] Efficient algorithms for fast integration on large data sets from multiple sources
    Tian Mi
    Sanguthevar Rajasekaran
    Robert Aseltine
    BMC Medical Informatics and Decision Making, 12
  • [2] Efficient algorithms for mining outliers from large data sets
    Ramaswamy, S
    Rastogi, R
    Shim, K
    SIGMOD RECORD, 2000, 29 (02) : 427 - 438
  • [3] A design pattern for efficient retrieval of large data sets from remote data sources
    Long, B
    ON THE MOVE TO MEANINGFUL INTERNET SYSTEMS 2002: COOPLS, DOA, AND ODBASE, 2002, 2519 : 650 - 660
  • [4] Fast algorithms for nonparametric population modeling of large data sets
    Pillonetto, Gianluigi
    De Nicolao, Giuseppe
    Chierici, Marco
    Cobelli, Claudio
    AUTOMATICA, 2009, 45 (01) : 173 - 179
  • [5] Fast Fitch-parsimony algorithms for large data sets
    Ronquist, F
    CLADISTICS-THE INTERNATIONAL JOURNAL OF THE WILLI HENNIG SOCIETY, 1998, 14 (04) : 387 - 400
  • [6] MapReduce algorithms for efficient generation of CPS models from large historical data sets
    Windmann, Stefan
    Niggemann, Oliver
    PROCEEDINGS OF 2015 IEEE 20TH CONFERENCE ON EMERGING TECHNOLOGIES & FACTORY AUTOMATION (ETFA), 2015,
  • [7] Fast Dual Selection using Genetic Algorithms for Large Data Sets
    Ros, Frederic
    Harba, Rachid
    Pintore, Marco
    2012 12TH INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS DESIGN AND APPLICATIONS (ISDA), 2012, : 815 - 820
  • [8] Data Integration on Multiple Data Sets
    Mi, Tian
    Aseltine, Robert
    Rajasekaran, Sanguthevar
    2008 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE, PROCEEDINGS, 2008, : 443 - +
  • [9] Engineering Algorithms for Large Data Sets
    Sanders, Peter
    SOFSEM 2013: Theory and Practice of Computer Science, 2013, 7741 : 29 - 32
  • [10] IoT streaming data integration from multiple sources
    Doan Quang Tu
    A. S. M. Kayes
    Wenny Rahayu
    Kinh Nguyen
    Computing, 2020, 102 : 2299 - 2329