A Parallel Algorithm for Compression of Big Next-Generation Sequencing Datasets

Cited by: 1
Authors
Perez, Sandino Vargas [1]
Saeed, Fahad [2]
Affiliations
[1] Western Michigan Univ, Dept Comp Sci, Kalamazoo, MI 49008 USA
[2] Western Michigan Univ, Dept Elect & Comp Engn, Kalamazoo, MI 49008 USA
Keywords
Next-Generation Sequencing; parallel implementation; DSRC; MPI; big data; FASTQ; FORMAT
DOI
10.1109/Trustcom.2015.632
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The volume of data produced by high-throughput Next-Generation Sequencing (NGS) techniques presents challenges for the storage, analysis, and transmission of massive datasets. One solution for storage and transmission is compression with specialized algorithms. Existing specialized algorithms scale poorly as dataset size increases, and the best available solutions can take hours to compress gigabytes of data. For peta-scale datasets, compression and decompression with these techniques is prohibitively expensive in both time and energy. In this paper we introduce paraDSRC, a parallel implementation of the DNA Sequence Reads Compression (DSRC) application based on a message-passing model, which reduces compression time complexity by a factor of O(1/p), where p is the number of processing units. Our experimental results show that paraDSRC achieves compression times that are 43% to 99% faster than DSRC and compression throughputs of up to 8.4 GB/s on a moderate-size cluster. For many of the datasets used in our experiments we observed super-linear speedups, indicating that the implementation is strongly scalable. We also show that paraDSRC is more than 25.6x faster than comparable parallel compression algorithms.
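The O(1/p) claim in the abstract rests on dividing the input among p processing units that compress their shares independently. The sketch below illustrates that general idea only and is not the paraDSRC implementation: it assumes a FASTQ input with exactly four lines per record, uses `zlib` as a stand-in for DSRC, and uses a thread pool in place of MPI processes. The function names `split_fastq`, `compress_parallel`, and `decompress_all` are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def split_fastq(data: bytes, p: int) -> list[bytes]:
    """Split FASTQ bytes into at most p chunks of whole 4-line records."""
    if p < 1:
        raise ValueError("p must be at least 1")
    lines = data.splitlines(keepends=True)
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    n_records = len(lines) // 4
    if n_records == 0:
        return []
    per_chunk = -(-n_records // p)  # ceiling division
    return [
        b"".join(lines[4 * start : 4 * (start + per_chunk)])
        for start in range(0, n_records, per_chunk)
    ]


def compress_parallel(data: bytes, p: int) -> list[bytes]:
    """Compress each chunk independently (zlib stands in for DSRC here)."""
    chunks = split_fastq(data, p)
    with ThreadPoolExecutor(max_workers=p) as pool:
        return list(pool.map(zlib.compress, chunks))


def decompress_all(blocks: list[bytes]) -> bytes:
    """Decompress the blocks in order and concatenate them."""
    return b"".join(zlib.decompress(block) for block in blocks)


if __name__ == "__main__":
    sample = b"@read1\nACGTACGT\n+\nIIIIIIII\n" * 10
    blocks = compress_parallel(sample, p=4)
    assert decompress_all(blocks) == sample
```

Because each chunk holds only complete records and is compressed without reference to the others, the work per unit shrinks roughly in proportion to 1/p. A message-passing version would assign one chunk per rank instead of one per thread.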
Pages: 196 - 201
Page count: 6