MZPAQ: a FASTQ data compression tool

Cited by: 3
Authors
El Allali, Achraf [1]
Arshad, Mariam [1]
Affiliations
[1] King Saud Univ, Coll Comp & Informat Sci, Dept Comp Sci, Riyadh, Saudi Arabia
Keywords
DNA compression; Next generation sequences; FASTA files; FASTQ files; Algorithm
DOI
10.1186/s13029-019-0073-5
Chinese Library Classification number
Q [Biological Sciences]
Discipline classification code
07; 0710; 09
Abstract
Background: Due to technological progress in Next Generation Sequencing (NGS), the amount of genomic data produced daily has increased tremendously. This increase has shifted the bottleneck of genomic projects from sequencing to computation, specifically to storing, managing and analyzing large amounts of NGS data. Compression tools can reduce both the physical storage needed to save large amounts of genomic data and the bandwidth used to transfer it. Recently, DNA sequence compression has gained much attention among researchers.
Results: In this paper, we study different techniques and algorithms used to compress genomic data. Most of these techniques exploit properties that are unique to DNA sequences in order to improve the compression rate, and they usually perform better than general-purpose compressors. By exploring the performance of available algorithms, we produce a powerful compression tool for NGS data called MZPAQ. In terms of compression ratio, MZPAQ outperforms state-of-the-art tools on all benchmark datasets obtained from a recent survey. MZPAQ offers the best compression ratios regardless of the sequencing platform or the size of the data.
Conclusions: Currently, MZPAQ's strengths are its higher compression ratio and its compatibility with all major sequencing platforms. MZPAQ is most suitable when the size of the compressed data is crucial, as in long-term storage and data transfer. More effort will be made in the future to target other aspects such as compression speed and memory utilization.
Pages: 13