NGC: lossless and lossy compression of aligned high-throughput sequencing data

Cited by: 41
Authors
Popitsch, Niko [1]
von Haeseler, Arndt [1]
Affiliations
[1] Med Univ Vienna, Univ Vienna, Max F Perutz Labs, Ctr Integrat Bioinformat Vienna, A-1030 Vienna, Austria
Funding
Austrian Science Fund
Keywords
GENOMIC SEQUENCE; QUALITY SCORES; ALGORITHMS; FORMAT;
DOI
10.1093/nar/gks939
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
A major challenge of current high-throughput sequencing experiments is not only the generation of the sequencing data itself but also their processing, storage and transmission. The enormous size of these data motivates the development of data compression algorithms usable for the implementation of the various storage policies that are applied to the produced intermediate and final result files. In this article, we present NGC, a tool for the compression of mapped short read data stored in the wide-spread SAM format. NGC enables lossless and lossy compression and introduces the following two novel ideas: first, we present a way to reduce the number of required code words by exploiting common features of reads mapped to the same genomic positions; second, we present a highly configurable way for the quantization of per-base quality values, which takes their influence on downstream analyses into account. NGC, evaluated with several real-world data sets, saves 33-66% of disc space using lossless and up to 98% disc space using lossy compression. By applying two popular variant and genotype prediction tools to the decompressed data, we could show that the lossy compression modes preserve >99% of all called variants while outperforming comparable methods in some configurations.
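The abstract mentions a configurable quantization of per-base quality values without giving details. As a rough illustration of the general idea only, the sketch below bins Phred scores so that fewer distinct symbols remain to be encoded. It is not NGC's actual scheme: the bin boundaries, the function names, and the Phred+33 offset are all assumptions chosen for the example.

```python
from bisect import bisect_right

# Hypothetical bin lower bounds (NOT taken from the NGC paper).
# Every score is replaced by the lower bound of the bin it falls into.
BIN_LOWER_BOUNDS = [0, 10, 20, 30, 40]


def decode_quality_string(qual, offset=33):
    """Convert an ASCII quality string to Phred scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in qual]


def encode_quality_string(scores, offset=33):
    """Convert Phred scores back to an ASCII quality string."""
    return "".join(chr(q + offset) for q in scores)


def quantize_phred(scores, lower_bounds=BIN_LOWER_BOUNDS):
    """Lossy step: map each score to the lower bound of its bin."""
    quantized = []
    for q in scores:
        if q < lower_bounds[0]:
            raise ValueError(f"score {q} is below the lowest bin")
        # Index of the last lower bound that is <= q.
        idx = bisect_right(lower_bounds, q) - 1
        quantized.append(lower_bounds[idx])
    return quantized


if __name__ == "__main__":
    original = "IH5,!"
    scores = decode_quality_string(original)
    print(encode_quality_string(quantize_phred(scores)))
```

Coarser bins leave fewer distinct quality symbols, which generally compresses better but discards more information; the abstract's point is that such a trade-off should be made with the effect on downstream variant calling in mind.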
Pages: 12
Related papers
  • [1] High-Throughput, Lossless Data Compression on FPGAs
    Sukhwani, Bharat
    Abali, Bulent
    Brezzo, Bernard
    Asaad, Sameh
    [J]. 2011 IEEE 19TH ANNUAL INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES (FCCM), 2011, : 113 - 116
  • [2] High-Throughput Lossy-to-Lossless 3D Image Compression
    Rossinelli, Diego
    Fourestey, Gilles
    Schmidt, Felix
    Busse, Bjoern
    Kurtcuoglu, Vartan
    [J]. IEEE TRANSACTIONS ON MEDICAL IMAGING, 2021, 40 (02) : 607 - 620
  • [3] Compression of Structured High-Throughput Sequencing Data
    Campagne, Fabien
    Dorff, Kevin C.
    Chambwe, Nyasha
    Robinson, James T.
    Mesirov, Jill P.
    [J]. PLOS ONE, 2013, 8 (11):
  • [4] ISOBAR Preconditioner for Effective and High-throughput Lossless Data Compression
    Schendel, Eric R.
    Jin, Ye
    Shah, Neil
    Chen, Jackie
    Chang, C. S.
    Ku, Seung-Hoe
    Ethier, Stephane
    Klasky, Scott
    Latham, Robert
    Ross, Robert
    Samatova, Nagiza F.
    [J]. 2012 IEEE 28TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2012, : 138 - 149
  • [5] Comparison of high-throughput sequencing data compression tools
    Numanagic, Ibrahim
    Bonfield, James K.
    Hach, Faraz
    Voges, Jan
    Ostermann, Joern
    Alberti, Claudio
    Mattavelli, Marco
    Sahinalp, S. Cenk
    [J]. NATURE METHODS, 2016, 13 (12) : 1005 - +
  • [6] Comparison of high-throughput sequencing data compression tools
    Ibrahim Numanagić
    James K Bonfield
    Faraz Hach
    Jan Voges
    Jörn Ostermann
    Claudio Alberti
    Marco Mattavelli
    S Cenk Sahinalp
    [J]. Nature Methods, 2016, 13 : 1005 - 1008
  • [7] Data structures and compression algorithms for high-throughput sequencing technologies
    Daily, Kenny
    Rigor, Paul
    Christley, Scott
    Xie, Xiaohui
    Baldi, Pierre
    [J]. BMC BIOINFORMATICS, 2010, 11
  • [8] Data structures and compression algorithms for high-throughput sequencing technologies
    Kenny Daily
    Paul Rigor
    Scott Christley
    Xiaohui Xie
    Pierre Baldi
    [J]. BMC Bioinformatics, 11
  • [9] A High-Throughput Lossless Image Compression Engine Optimized for Compression Ratio
    Cai, Siqi
    Chen, Yuzhou
    Zhang, Wenhui
    Jin, Zeyuan
    Wang, Gang
    Chen, Hao
    He, Guanghui
    [J]. 2024 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, ISCAS 2024, 2024,
  • [10] Low-Latency Lossless Compression Codec Design for High-Throughput Data-Buses
    Katsu, Yuki
    Kaneko, Haruhiko
    [J]. 2016 IEEE INTERNATIONAL CONFERENCE ON CONSUMER ELECTRONICS-TAIWAN (ICCE-TW), 2016, : 269 - 270