No-Reference Compression of Genomic Data Stored In FASTQ Format

Cited by: 19
Authors
Bhola, Vishal [1]
Bopardikar, Ajit S. [1]
Narayanan, Rangavittal [1]
Lee, Kyusang [2]
Ahn, TaeJin [2]
Affiliations
[1] Samsung India Software Operat, SAIT India, Bangalore, Karnataka, India
[2] Samsung Elect Co Ltd Suwon, SAIT, Suwon, South Korea
Keywords
FASTQ; Next generation sequencing; Genomic Data Compression; SEQUENCE;
DOI
10.1109/BIBM.2011.110
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification codes
081203; 0835;
Abstract
In this paper, we propose a system to compress Next Generation Sequencing (NGS) information stored in a FASTQ file. A FASTQ file contains text, DNA read, and quality information for millions or billions of reads. The proposed system first parses the FASTQ file into its component fields. In a partial first pass, it gathers statistics that are then used to choose, for each field, the representation that gives the best compression. Text data is further parsed into repeating and variable components, and entropy coding is used to compress the latter. Similarly, methods based on Markov encoding and repeat finding are used for DNA read compression. Finally, we propose several run-length-based methods to encode quality data, choosing the method that gives the best performance for a given set of quality values. The compression system supports lossless and nearly lossless compression, as well as compressing only the reads or only the reads plus quality data. We compare its performance to that of the bzip2 text compression utility and an existing benchmark algorithm, and observe that the proposed system outperforms both.
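As a rough illustration of two steps the abstract mentions (splitting a FASTQ file into its component fields, and run-length encoding of quality values), the following Python sketch may help. It is not the authors' implementation: the function names `parse_fastq`, `rle_encode`, and `rle_decode` are hypothetical, the parser assumes simple four-line records, and the run-length scheme is a plain textbook variant rather than any of the specific methods proposed in the paper.

```python
from itertools import groupby


def parse_fastq(text):
    """Split FASTQ text into (header, read, quality) tuples.

    Assumes each record occupies exactly four lines:
    '@' header, read, '+' separator, quality string.
    """
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of 4 lines")
    records = []
    for i in range(0, len(lines), 4):
        header, read, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        records.append((header[1:], read, quality))
    return records


def rle_encode(quality):
    """Collapse consecutive identical quality symbols into (symbol, count) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(quality)]


def rle_decode(pairs):
    """Invert rle_encode, recovering the original quality string."""
    return "".join(symbol * count for symbol, count in pairs)
```

Because `rle_decode(rle_encode(q)) == q` for any string `q`, this variant is lossless. A nearly lossless mode of the kind the abstract mentions would need some additional quantization of quality values before encoding, which is not shown here.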
Pages: 147-150
Number of pages: 4