FCompress: An Algorithm for FASTQ Sequence Data Compression

Cited by: 2
Authors
Sardaraz, Muhammad [1]
Tahir, Muhammad [1]
Affiliation
[1] COMSATS Inst Informat Technol, Dept Comp Sci, Attock, Pakistan
Keywords
High throughput sequencing; NGS technologies; NGS sequence compression; Huffman Coding; FCompress; Algorithm; GENOMIC SEQUENCE
DOI
10.2174/1574893613666180322125337
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Background: Biological sequence data have grown rapidly owing to advances in sequencing technologies and the falling cost of sequencing. This growth presents significant research challenges. Beyond meaningful analysis, data storage is also a challenge, because data production is outpacing storage capacity. Data compression reduces the size of data and thus lowers storage requirements as well as the cost of transmission over the internet. Objective: This article presents a novel compression algorithm (FCompress) for Next Generation Sequencing (NGS) data in FASTQ format. Method: The proposed algorithm uses bit manipulation and dictionary-based compression for the bases. Headers are compressed with reference-based compression, whereas quality scores are compressed with Huffman coding. Results: The proposed algorithm is validated experimentally on real datasets, and the results are compared with both general-purpose and specialized compression programs. Conclusion: The proposed algorithm achieves a better compression ratio in a time comparable to that of other algorithms.
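Illustrative sketch: the abstract names two generic techniques, bit manipulation for bases and Huffman coding for quality scores. The Python below shows what each can look like in its simplest textbook form. It is not the FCompress implementation. The function names, the 2-bits-per-base mapping, and the restriction to the four symbols A, C, G, T are assumptions made for illustration. The dictionary-based step and the reference-based header compression are not shown.

```python
import heapq
from collections import Counter

# Assumed mapping for illustration: 2 bits per base, A/C/G/T only (no N).
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def pack_bases(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack_bases(data, n):
    """Recover the first n bases from bytes produced by pack_bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 3])
    return "".join(bases[:n])


def huffman_codes(symbols):
    """Build a prefix-free code table (symbol -> bit string) from frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (count, unique tie-breaker, partial code table).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, codes1 = heapq.heappop(heap)
        c2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


def huffman_encode(symbols, codes):
    """Concatenate the code of each symbol into one bit string."""
    return "".join(codes[s] for s in symbols)


def huffman_decode(bits, codes):
    """Decode a bit string by matching prefixes against the code table."""
    inverse = {c: s for s, c in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)
```

In this sketch a read of n bases occupies ceil(n / 4) bytes, and a quality string dominated by one score gives that score the shortest code. Real FASTQ data also contain symbols such as N, which a production encoder must handle separately.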
Pages: 123-129
Number of pages: 7