No-Reference Compression of Genomic Data Stored In FASTQ Format

Cited by: 19
Authors
Bhola, Vishal [1]
Bopardikar, Ajit S. [1]
Narayanan, Rangavittal [1]
Lee, Kyusang [2]
Ahn, TaeJin [2]
Affiliations
[1] Samsung India Software Operat, SAIT India, Bangalore, Karnataka, India
[2] Samsung Elect Co Ltd Suwon, SAIT, Suwon, South Korea
Keywords
FASTQ; Next generation sequencing; Genomic Data Compression; SEQUENCE
DOI
10.1109/BIBM.2011.110
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
In this paper, we propose a system to compress Next Generation Sequencing (NGS) information stored in a FASTQ file. A FASTQ file contains text, DNA read, and quality information for millions or billions of reads. The proposed system first parses the FASTQ file into its component fields. In a partial first pass, it gathers statistics, which are then used to choose the representation for each field that gives the best compression. Text data is further parsed into repeating and variable components, and entropy coding is used to compress the latter. Similarly, Markov encoding and repeat-finding methods are used for DNA read compression. Finally, we propose several run-length-based methods to encode quality data, choosing the method that gives the best performance for a given set of quality values. The compression system supports lossless and nearly lossless compression, as well as compressing only the reads or only the reads plus quality data. We compare its performance to the bzip2 text compression utility and an existing benchmark algorithm, and observe that the proposed system outperforms both.
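The field separation and run-length steps mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`split_fastq`, `run_length_encode`, `run_length_decode`) are hypothetical, every record is assumed to span exactly four lines, and the statistics pass, Markov encoding, repeat finding, and entropy coding stages are omitted.

```python
from itertools import groupby


def split_fastq(text):
    """Split FASTQ text into separate header, read, and quality streams.

    Assumption: each record is exactly four lines
    (@header, bases, '+' separator, quality string).
    """
    lines = text.strip().splitlines()
    headers, reads, quals = [], [], []
    for i in range(0, len(lines), 4):
        headers.append(lines[i][1:])   # drop the leading '@'
        reads.append(lines[i + 1])     # DNA bases
        quals.append(lines[i + 3])     # quality characters
    return headers, reads, quals


def run_length_encode(s):
    """Encode a string as a list of (symbol, run length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def run_length_decode(pairs):
    """Invert run_length_encode."""
    return "".join(ch * n for ch, n in pairs)


# Two made-up records, for illustration only.
sample = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nIIH#\n"
headers, reads, quals = split_fastq(sample)
encoded = [run_length_encode(q) for q in quals]
```

Keeping the three streams separate lets each one be coded with a method suited to its statistics, which is the design idea the abstract describes. Run-length coding pays off only when quality strings contain long runs of identical values, which is presumably why the system selects among several variants per data set.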
Pages: 147-150
Number of pages: 4