FCompress: An Algorithm for FASTQ Sequence Data Compression

Cited by: 2
Authors
Sardaraz, Muhammad [1]
Tahir, Muhammad [1]
Affiliations
[1] COMSATS Inst Informat Technol, Dept Comp Sci, Attock, Pakistan
Keywords
High throughput sequencing; NGS technologies; NGS sequence compression; Huffman Coding; Fcompress; Algorithm; GENOMIC SEQUENCE
DOI
10.2174/1574893613666180322125337
Chinese Library Classification
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Background: Biological sequence data have increased at a rapid rate due to advances in sequencing technologies and reductions in the cost of sequencing. This growth presents significant research challenges. In addition to meaningful analysis, data storage is also a challenge, because data production is outpacing storage capacity. Data compression reduces the size of the data and thus lowers both storage requirements and the cost of transmission over the internet. Objective: This article presents a novel compression algorithm (FCompress) for Next Generation Sequencing (NGS) data in FASTQ format. Method: The proposed algorithm uses bit manipulation and dictionary-based compression to compress bases. Headers are compressed with reference-based compression, whereas quality scores are compressed with Huffman coding. Results: The proposed algorithm is validated with experimental results on real datasets. The results are compared with both general-purpose and specialized compression programs. Conclusion: The proposed algorithm produces a better compression ratio than other algorithms in comparable time.
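The abstract names two generic techniques, bit manipulation for bases and Huffman coding for quality scores, without giving details. The sketch below illustrates those two textbook techniques only; it is not the authors' implementation. The function names (`pack_bases`, `unpack_bases`, `huffman_codes`) are hypothetical, the 2-bit packing assumes reads containing only A, C, G and T (no N), and the dictionary-based and reference-based stages mentioned in the abstract are not shown.

```python
import heapq
from collections import Counter

# Assumed 2-bit alphabet; real FASTQ reads may also contain N, not handled here.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def pack_bases(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack_bases(data, n):
    """Recover the first n bases from bytes produced by pack_bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 3])
    return "".join(bases[:n])


def huffman_codes(text):
    """Build a prefix-free code table (symbol -> bit string) for text."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap items: (count, unique tie-breaker, partial code table).
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


if __name__ == "__main__":
    packed = pack_bases("ACGTAC")
    print(len(packed), unpack_bases(packed, 6))
    quality = "IIIIIIFF#"
    codes = huffman_codes(quality)
    print("".join(codes[q] for q in quality))
```

Packing stores four bases per byte instead of one, and the Huffman table gives the most frequent quality character the shortest code, which is why skewed quality-score distributions compress well under this scheme.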
Pages: 123-129
Page count: 7
Related papers
50 in total
  • [31] SeqCompress: An algorithm for biological sequence compression
    Sardaraz, Muhammad
    Tahir, Muhammad
    Ikram, Ataul Aziz
    Bajwa, Hassan
    GENOMICS, 2014, 104 (04) : 225 - 228
  • [32] DSRC 2-Industry-oriented compression of FASTQ files
    Roguski, Lukasz
    Deorowicz, Sebastian
    BIOINFORMATICS, 2014, 30 (15) : 2213 - 2215
  • [33] Optimal data compression algorithm
    Sadeh, I
    COMPUTERS & MATHEMATICS WITH APPLICATIONS, 1996, 32 (05) : 57 - 72
  • [34] SEQUENCE TIME CODING FOR DATA COMPRESSION
    LYNCH, TJ
    PROCEEDINGS OF THE INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS, 1966, 54 (10): : 1490 - &
  • [35] Data compression approach to sequence analysis
    Loreto, V
    Puglisi, A
    MODELING OF COMPLEX SYSTEMS, 2003, 661 : 184 - 187
  • [36] A simple statistical algorithm for biological sequence compression
    Cao, Minh Duc
    Dix, Trevor I.
    Allison, Lloyd
    Mears, Chris
    DCC 2007: DATA COMPRESSION CONFERENCE, PROCEEDINGS, 2007, : 43 - +
  • [37] A new efficient referential genome compression technique for FastQ files
    Sanjeev Kumar
    Mukund Pratap Singh
    Soumya Ranjan Nayak
    Asif Uddin Khan
    Anuj Kumar Jain
    Prabhishek Singh
    Manoj Diwakar
    Thota Soujanya
    Functional & Integrative Genomics, 2023, 23
  • [38] A new efficient referential genome compression technique for FastQ files
    Kumar, Sanjeev
    Singh, Mukund Pratap
    Nayak, Soumya Ranjan
    Khan, Asif Uddin
    Jain, Anuj Kumar
    Singh, Prabhishek
    Diwakar, Manoj
    Soujanya, Thota
    FUNCTIONAL & INTEGRATIVE GENOMICS, 2023, 23 (04)
  • [39] Data structures and compression algorithms for genomic sequence data
    Brandon, Marty C.
    Wallace, Douglas C.
    Baldi, Pierre
    BIOINFORMATICS, 2009, 25 (14) : 1731 - 1738
  • [40] Novel Data Compression Algorithm for Process Data
    Purohit, Amit
    2014 IEEE CONFERENCE ON CONTROL APPLICATIONS (CCA), 2014, : 784 - 789