No-Reference Compression of Genomic Data Stored In FASTQ Format

Cited by: 19
Authors
Bhola, Vishal [1]
Bopardikar, Ajit S. [1]
Narayanan, Rangavittal [1]
Lee, Kyusang [2]
Ahn, TaeJin [2]
Affiliations
[1] Samsung India Software Operat, SAIT India, Bangalore, Karnataka, India
[2] Samsung Elect Co Ltd Suwon, SAIT, Suwon, South Korea
Keywords
FASTQ; Next generation sequencing; Genomic Data Compression; SEQUENCE;
DOI
10.1109/BIBM.2011.110
Chinese Library Classification number
TP39 [Computer applications];
Subject classification code
081203 ; 0835 ;
Abstract
In this paper, we propose a system to compress Next Generation Sequencing (NGS) information stored in a FASTQ file. A FASTQ file contains text, DNA read, and quality information for millions or billions of reads. The proposed system first parses the FASTQ file into its component fields. In a partial first pass, it gathers statistics that are then used to choose, for each field, the representation that gives the best compression. Text data is further parsed into repeating and variable components, and entropy coding is used to compress the latter. Similarly, Markov encoding and repeat-finding methods are used for DNA read compression. Finally, we propose several run-length-based methods to encode quality data, choosing the method that gives the best performance for a given set of quality values. The compression system supports lossless and nearly lossless compression, as well as compressing only the reads or only the reads plus quality data. We compare its performance to the bzip2 text compression utility and an existing benchmark algorithm, and observe that the proposed system outperforms both.
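As a rough illustration of two steps named in the abstract (parsing a FASTQ file into its component fields, and run-length coding of quality values), a minimal Python sketch follows. It is not the authors' implementation: the function names `split_fastq`, `run_length_encode`, and `run_length_decode` are hypothetical, and the sketch assumes the common four-line-per-record FASTQ layout with Unix line endings and no line-wrapped reads.

```python
from itertools import groupby


def split_fastq(text):
    """Split FASTQ text into three streams: headers, reads, qualities.

    Assumes exactly four lines per record (header, read, '+', quality).
    Records are located by position, not by searching for '@', because
    '@' is also a valid quality character.
    """
    lines = text.strip().split("\n")
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    headers, reads, quals = [], [], []
    for i in range(0, len(lines), 4):
        header, read, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(read) != len(qual):
            raise ValueError(f"read/quality length mismatch at line {i + 1}")
        headers.append(header[1:])  # drop the leading '@'
        reads.append(read)
        quals.append(qual)
    return headers, reads, quals


def run_length_encode(s):
    """Encode a string as a list of (symbol, run_length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]


def run_length_decode(pairs):
    """Invert run_length_encode."""
    return "".join(ch * n for ch, n in pairs)


sample = "@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nII##\n"
headers, reads, quals = split_fastq(sample)
encoded = [run_length_encode(q) for q in quals]
```

Separating the fields in this way lets each stream be coded with a method suited to its statistics, which is the general idea the abstract describes; the entropy coding, Markov encoding, and method selection steps are not shown here.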
Pages: 147-150
Page count: 4