FCompress: An Algorithm for FASTQ Sequence Data Compression

Cited by: 2
Authors
Sardaraz, Muhammad [1]
Tahir, Muhammad [1]
Affiliations
[1] COMSATS Inst Informat Technol, Dept Comp Sci, Attock, Pakistan
Keywords
High throughput sequencing; NGS technologies; NGS sequence compression; Huffman Coding; Fcompress; Algorithm; GENOMIC SEQUENCE;
DOI
10.2174/1574893613666180322125337
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Background: Biological sequence data have grown rapidly owing to advances in sequencing technologies and the falling cost of sequencing. This growth presents significant research challenges. Beyond meaningful analysis, storage is itself a challenge, because data production is outpacing storage capacity. Data compression reduces the size of data and thus lowers both storage requirements and the cost of transmission over the internet. Objective: This article presents a novel compression algorithm (FCompress) for Next Generation Sequencing (NGS) data in FASTQ format. Method: The proposed algorithm compresses bases using bit manipulation and dictionary-based compression. Headers are compressed with reference-based compression, whereas quality scores are compressed with Huffman coding. Results: The proposed algorithm is validated experimentally on real datasets, and the results are compared with both general-purpose and specialized compression programs. Conclusion: The proposed algorithm achieves a better compression ratio in a time comparable to that of other algorithms.
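The abstract states that quality scores are compressed with Huffman coding. As a rough illustration of that general technique only, the following is a minimal Python sketch of textbook Huffman coding applied to a single FASTQ quality string. It is not the authors' implementation: the function names (build_huffman_codes, encode, decode) and the sample quality string are hypothetical, and the record gives no details of how FCompress builds or stores its code table.

```python
import heapq
from collections import Counter


def build_huffman_codes(text):
    """Return a dict mapping each symbol in `text` to a Huffman bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Prefix the two lightest subtrees with 0 and 1 and merge them.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(text, codes):
    """Concatenate the code of every symbol (kept as a '0'/'1' string for clarity)."""
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    """Invert `encode`; works because Huffman codes are prefix-free."""
    inverse = {c: s for s, c in codes.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


# Hypothetical quality line: skewed symbol frequencies are what Huffman coding exploits.
quality = "IIIIIIIIHHHHGGF#"
codes = build_huffman_codes(quality)
bits = encode(quality, codes)
assert decode(bits, codes) == quality
```

In this sketch the most frequent score receives the shortest code, so a skewed quality distribution needs far fewer than 8 bits per symbol. A practical compressor would also have to pack the bits into bytes and store the code table alongside the data.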
Pages: 123-129
Page count: 7
Related papers
50 in total
  • [21] Transformations for the compression of FASTQ quality scores of next-generation sequencing data
    Wan, Raymond
    Vo Ngoc Anh
    Asai, Kiyoshi
    BIOINFORMATICS, 2012, 28 (05) : 628 - 635
  • [22] Reference-free compression of next-generation sequencing data in FASTQ format
    Tan, Li
    Sun, Jifeng
    2017 IEEE CONFERENCE ON COMPUTATIONAL INTELLIGENCE IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY (CIBCB), 2017, : 10 - 13
  • [23] BIND - An algorithm for loss-less compression of nucleotide sequence data
    Bose, Tungadri
    Mohammed, Monzoorul Haque
    Dutta, Anirban
    Mande, Sharmila S.
    JOURNAL OF BIOSCIENCES, 2012, 37 (04) : 785 - 789
  • [24] BIND – An algorithm for loss-less compression of nucleotide sequence data
    Tungadri Bose
    Monzoorul Haque Mohammed
    Anirban Dutta
    Sharmila S Mande
    Journal of Biosciences, 2012, 37 : 785 - 789
  • [25] Manipulation of FASTQ data with Galaxy
    Blankenberg, Daniel
    Gordon, Assaf
    Von Kuster, Gregory
    Coraor, Nathan
    Taylor, James
    Nekrutenko, Anton
    BIOINFORMATICS, 2010, 26 (14) : 1783 - 1785
  • [26] FQZip: Lossless Reference-Based Compression of Next Generation Sequencing Data in FASTQ Format
    Zhang, Yongpeng
    Li, Linsen
    Xiao, Jun
    Yang, Yanli
    Zhu, Zexuan
    PROCEEDINGS OF THE 18TH ASIA PACIFIC SYMPOSIUM ON INTELLIGENT AND EVOLUTIONARY SYSTEMS, VOL 2, 2015, : 127 - 135
  • [27] Sequence Statistical Code Based Data Compression Algorithm for Wireless Sensor Network
    S. Jancy
    C. Jayakumar
    Wireless Personal Communications, 2019, 106 : 971 - 985
  • [28] Design and development of bioinformatics feature based DNA sequence data compression algorithm
    Banerjee K.
    Bali V.
    EAI Endorsed Transactions on Pervasive Health and Technology, 2020, 5 (20):
  • [29] Sequence Statistical Code Based Data Compression Algorithm for Wireless Sensor Network
    Jancy, S.
    Jayakumar, C.
    WIRELESS PERSONAL COMMUNICATIONS, 2019, 106 (03) : 971 - 985
  • [30] LFQC: A lossless compression algorithm for FASTQ files (vol 35, pg e1, 2019)
    Pathak, Sudipta
    Rajasekaran, Sanguthevar
    BIOINFORMATICS, 2020, 36 (22-23) : 5566 - 5566