A Parallel Algorithm for Compression of Big Next-Generation Sequencing Datasets

Authors
Perez, Sandino Vargas [1]
Saeed, Fahad [2]
Affiliations
[1] Western Michigan Univ, Dept Comp Sci, Kalamazoo, MI 49008 USA
[2] Western Michigan Univ, Dept Elect & Comp Engn, Kalamazoo, MI 49008 USA
Keywords
Next-Generation Sequencing; parallel implementation; DSRC; MPI; big data; FASTQ; FORMAT
DOI
10.1109/Trustcom.2015.632
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The volume of data produced by high-throughput Next-Generation Sequencing (NGS) techniques poses challenges for the storage, analysis, and transmission of massive datasets. One solution for storage and transmission is compression with specialized algorithms. Existing specialized algorithms scale poorly as dataset size grows, and the best available solutions can take hours to compress gigabytes of data. Compressing and decompressing peta-scale datasets with these techniques is prohibitively expensive in time and energy. In this paper we introduce paraDSRC, a parallel implementation of the DNA Sequence Reads Compression (DSRC) application based on a message-passing model, which reduces compression time complexity by a factor of O(1/p), where p is the number of processing units. Our experimental results show that paraDSRC achieves compression times that are 43% to 99% faster than DSRC and compression throughputs of up to 8.4 GB/s on a moderate-size cluster. For many of the datasets used in our experiments, super-linear speedups were registered, making the implementation strongly scalable. We also show that paraDSRC is more than 25.6x faster than comparable parallel compression algorithms.
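The O(1/p) claim in the abstract rests on dividing the input among p processing units. The following is a minimal sketch of that general idea only, not the paraDSRC implementation: the function names `partition_ranges` and `speedup` are hypothetical, and the even byte-range split is an assumption made for illustration.

```python
def partition_ranges(total_bytes: int, p: int) -> list[tuple[int, int]]:
    """Split [0, total_bytes) into p contiguous, near-equal half-open ranges.

    Hypothetical helper: each range would be handed to one processing unit.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    base, extra = divmod(total_bytes, p)
    ranges = []
    start = 0
    for rank in range(p):
        # The first `extra` ranks take one additional byte each.
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def speedup(t_serial: float, t_parallel: float) -> float:
    """Classic speedup ratio; a value above p is called super-linear."""
    return t_serial / t_parallel
```

A practical splitter would also have to move each range boundary to the start of a FASTQ record, since a raw byte offset can fall in the middle of one; that step is omitted here.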
Pages: 196-201
Page count: 6