No-Reference Compression of Genomic Data Stored In FASTQ Format

Cited by: 19
Authors
Bhola, Vishal [1]
Bopardikar, Ajit S. [1]
Narayanan, Rangavittal [1]
Lee, Kyusang [2]
Ahn, TaeJin [2]
Affiliations
[1] Samsung India Software Operat, SAIT India, Bangalore, Karnataka, India
[2] Samsung Elect Co Ltd Suwon, SAIT, Suwon, South Korea
Keywords
FASTQ; Next generation sequencing; Genomic Data Compression; SEQUENCE
DOI
10.1109/BIBM.2011.110
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
In this paper, we propose a system to compress Next Generation Sequencing (NGS) information stored in a FASTQ file. A FASTQ file contains text, DNA read, and quality information for millions or billions of reads. The proposed system first parses the FASTQ file into its component fields. In a partial first pass, it gathers statistics, which are then used to choose the representation for each field that gives the best compression. Text data is further parsed into repeating and variable components, and entropy coding is used to compress the latter. Similarly, Markov encoding and repeat-finding methods are used for DNA read compression. Finally, we propose several run-length-based methods to encode quality data, choosing the method that gives the best performance for a given set of quality values. The compression system provides features for lossless and nearly lossless compression, as well as for compressing only read data or read + quality data. We compare its performance to that of the bzip2 text compression utility and an existing benchmark algorithm, and observe that the proposed system outperforms both.
Pages: 147-150
Number of pages: 4
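
The first step named in the abstract (parsing a FASTQ file into its component fields) and the general idea of run-length coding for quality values can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify their run-length variants, so plain run-length encoding stands in for them here. The function names are hypothetical, and the parser assumes the common four-line FASTQ record layout with no line wrapping.

```python
from itertools import groupby
from typing import Iterable, List, Tuple


def split_fastq_fields(lines: Iterable[str]) -> Tuple[List[str], List[str], List[str]]:
    """Split FASTQ lines into separate header, read, and quality streams.

    Assumption: every record is exactly four lines
    (@header, sequence, +, quality), with no wrapped lines.
    """
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("expected a multiple of four non-empty lines")

    headers, reads, quals = [], [], []
    for i in range(0, len(rows), 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"read/quality length mismatch at line {i + 1}")
        headers.append(header[1:])  # drop the leading '@'
        reads.append(seq)
        quals.append(qual)
    return headers, reads, quals


def run_length_encode(quality: str) -> List[Tuple[str, int]]:
    """Plain run-length encoding: (symbol, run length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(quality)]


def run_length_decode(pairs: List[Tuple[str, int]]) -> str:
    """Inverse of run_length_encode."""
    return "".join(symbol * count for symbol, count in pairs)


if __name__ == "__main__":
    sample = [
        "@read1 lane=1", "ACGTAC", "+", "IIIIHH",
        "@read2 lane=1", "GGGTTA", "+", "IIIIII",
    ]
    headers, reads, quals = split_fastq_fields(sample)
    for q in quals:
        encoded = run_length_encode(q)
        assert run_length_decode(encoded) == q  # lossless round trip
        print(q, "->", encoded)
```

Separating the three streams lets each one be coded with a method suited to its statistics, which is the motivation the abstract gives for the parsing step. Quality strings with long runs of the same score shrink well under run-length coding, while highly variable strings may not.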