Damming the genomic data flood using a comprehensive analysis and storage data structure

Cited by: 3
Authors
Bouffard, Marc [1]
Phillips, Michael S. [1, 2, 3]
Brown, Andrew M. K. [1, 2, 3]
Marsh, Sharon [1]
Tardif, Jean-Claude [1, 2, 3]
van Rooij, Tibor [1]
Affiliations
[1] Univ Montreal, Beaulieu Saucier Univ Montreal Pharmacogen Ctr, Montreal, PQ, Canada
[2] Univ Montreal, Montreal Heart Inst, Montreal, PQ, Canada
[3] Univ Montreal, Fac Med, Montreal, PQ H3C 3J7, Canada
DOI
10.1093/database/baq029
Chinese Library Classification (CLC) number
Q [Biological Sciences]
Discipline classification code
07; 0710; 09
Abstract
Data generation, driven by rapid advances in genomic technologies, is fast outpacing our analysis capabilities. Faced with this flood of data, more hardware and software resources are added to accommodate data sets whose structure has not specifically been designed for analysis. This leads to unnecessarily lengthy processing times and excessive data handling and storage costs. Current efforts to address this have centered on developing new indexing schemas and analysis algorithms, whereas the root of the problem lies in the format of the data itself. We have developed a new data structure for storing and analyzing genotype and phenotype data. By leveraging data normalization techniques, database management system capabilities and the use of a novel multi-table, multidimensional database structure we have eliminated the following: (i) unnecessarily large data set size due to high levels of redundancy, (ii) sequential access to these data sets and (iii) common bottlenecks in analysis times. The resulting novel data structure horizontally divides the data to circumvent traditional problems associated with the use of databases for very large genomic data sets. The resulting data set required 86% less disk space and performed analytical calculations 6248 times faster compared to a standard approach without any loss of information.
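The abstract names two general techniques: normalization to remove redundancy, and horizontal division of the data across several tables. The sketch below illustrates those ideas with Python's standard `sqlite3` module; it is not the authors' schema. The table names (`sample`, `marker`, `genotype_<chrom>`), the column names, the sample rows, and the choice to partition by chromosome are all assumptions made for illustration, since the abstract does not specify them.

```python
import sqlite3

# Hypothetical flat input: one row per (sample, marker), with the marker's
# chromosome and position repeated on every row (the redundancy to remove).
FLAT_ROWS = [
    ("S1", "rs0001", "chr1", 1000, "AA"),
    ("S1", "rs0002", "chr2", 2000, "AG"),
    ("S2", "rs0001", "chr1", 1000, "AG"),
    ("S2", "rs0002", "chr2", 2000, "GG"),
]


def partition_table(chrom):
    """Return the name of the genotype table for one chromosome.

    Partitioning by chromosome is an assumption of this sketch. The name is
    validated because it is interpolated into SQL text.
    """
    if not chrom.isalnum():
        raise ValueError(f"unexpected chromosome label: {chrom!r}")
    return f"genotype_{chrom}"


def build_normalized(conn, rows):
    """Load flat rows into lookup tables plus per-chromosome genotype tables."""
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE sample (sample_id INTEGER PRIMARY KEY, name TEXT UNIQUE)"
    )
    cur.execute(
        "CREATE TABLE marker (marker_id INTEGER PRIMARY KEY, "
        "name TEXT UNIQUE, chrom TEXT, pos INTEGER)"
    )
    for sample, marker, chrom, pos, genotype in rows:
        # Each sample and marker description is stored once.
        cur.execute("INSERT OR IGNORE INTO sample (name) VALUES (?)", (sample,))
        cur.execute(
            "INSERT OR IGNORE INTO marker (name, chrom, pos) VALUES (?, ?, ?)",
            (marker, chrom, pos),
        )
        table = partition_table(chrom)
        cur.execute(
            f"CREATE TABLE IF NOT EXISTS {table} ("
            "sample_id INTEGER, marker_id INTEGER, genotype TEXT, "
            "PRIMARY KEY (sample_id, marker_id))"
        )
        # Genotype rows carry only integer keys and the genotype value.
        cur.execute(
            f"INSERT INTO {table} "
            "SELECT s.sample_id, m.marker_id, ? FROM sample s, marker m "
            "WHERE s.name = ? AND m.name = ?",
            (genotype, sample, marker),
        )
    conn.commit()


def genotype_counts(conn, marker_name):
    """Count genotypes for one marker, reading only that marker's partition."""
    cur = conn.cursor()
    row = cur.execute(
        "SELECT marker_id, chrom FROM marker WHERE name = ?", (marker_name,)
    ).fetchone()
    if row is None:
        return {}
    marker_id, chrom = row
    table = partition_table(chrom)
    return dict(
        cur.execute(
            f"SELECT genotype, COUNT(*) FROM {table} "
            "WHERE marker_id = ? GROUP BY genotype",
            (marker_id,),
        ).fetchall()
    )
```

With an in-memory database (`sqlite3.connect(":memory:")`), calling `build_normalized(conn, FLAT_ROWS)` followed by `genotype_counts(conn, "rs0001")` touches only the `genotype_chr1` table, which is the kind of targeted access that a horizontally divided layout is meant to allow in place of a sequential scan of one flat file.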
Pages: 7