Data structures and compression algorithms for high-throughput sequencing technologies

Cited by: 41
Authors
Daily, Kenny [1, 2]
Rigor, Paul [1, 2]
Christley, Scott [1, 3, 4]
Xie, Xiaohui [1, 2, 4]
Baldi, Pierre [1, 2, 4, 5]
Affiliations
[1] Univ Calif Irvine, Dept Comp Sci, Irvine, CA 92697 USA
[2] Univ Calif Irvine, Inst Genom & Bioinformat, Irvine, CA 92697 USA
[3] Univ Calif Irvine, Dept Math, Irvine, CA 92697 USA
[4] Univ Calif Irvine, Ctr Complex Biol Syst, Irvine, CA 92697 USA
[5] Univ Calif Irvine, Dept Biol Chem, Irvine, CA 92697 USA
Source
BMC BIOINFORMATICS | 2010 / Vol. 11
Funding
National Science Foundation (USA);
Keywords
HUMAN GENOME; DNA; CODES;
DOI
10.1186/1471-2105-11-514
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Background: High-throughput sequencing (HTS) technologies play important roles in the life sciences by allowing the rapid parallel sequencing of very large numbers of relatively short nucleotide sequences, in applications ranging from genome sequencing and resequencing to digital microarrays and ChIP-Seq experiments. As experiments scale up, HTS technologies create new bioinformatics challenges for the storage and sharing of HTS data. Results: We develop data structures and compression algorithms for HTS data. A processing stage maps short sequences to a reference genome or a large table of sequences. Then the integers representing the short sequences' absolute or relative addresses, their lengths, and the substitutions they may contain are compressed and stored using various entropy coding algorithms, including both old and new fixed codes (e.g., Golomb, Elias Gamma, MOV) and variable codes (e.g., Huffman). The general methodology is illustrated and applied to several HTS data sets. Results show that the information contained in HTS files can be compressed by a factor of 10 or more, depending on the statistical properties of the data sets and various other choices and constraints. Our algorithms fare well against general-purpose compression programs such as gzip, bzip2 and 7zip; timing results show that our algorithms are consistently faster than the best general-purpose compression programs. Conclusions: It is unlikely that a single encoding strategy will be optimal for all types of HTS data. Different experimental conditions will generate different data distributions, for which one encoding strategy can be more effective than another. We have implemented some of our encoding algorithms in the software package GenCompress, which is available upon request from the authors. With the advent of HTS technology and an increasing number of new experimental protocols that use it, sequence databases are expected to continue growing in size. The methodology we have proposed is general, and these advanced compression techniques should allow researchers to manage and share their HTS data in a more timely fashion.
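The idea of storing relative addresses with a fixed code can be illustrated with a small sketch. The Python below is not the GenCompress implementation; it is a minimal example that assumes reads have already been mapped to non-negative integer positions on a reference, sorts them, and writes each gap with the standard Elias gamma code. The function names and the "+1" shift (used so that a gap of zero stays encodable) are choices made for this illustration only.

```python
def elias_gamma_encode(n):
    """Return the Elias gamma codeword of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma is defined for positive integers only")
    binary = bin(n)[2:]                      # binary digits, leading bit is 1
    return "0" * (len(binary) - 1) + binary  # unary length prefix + binary value


def elias_gamma_decode(bits):
    """Decode a concatenation of Elias gamma codewords into a list of integers."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


def encode_positions(positions):
    """Encode mapped positions as gamma-coded gaps between sorted addresses."""
    out = []
    previous = 0
    for p in sorted(positions):
        # Shift by 1 (an assumption of this sketch) so duplicate
        # positions, whose gap is 0, can still be gamma-coded.
        out.append(elias_gamma_encode(p - previous + 1))
        previous = p
    return "".join(out)


def decode_positions(bits):
    """Invert encode_positions, returning the sorted positions."""
    positions = []
    previous = 0
    for code in elias_gamma_decode(bits):
        previous += code - 1
        positions.append(previous)
    return positions


if __name__ == "__main__":
    reads = [1050, 1000, 1003, 1003, 2000]   # hypothetical mapped positions
    packed = encode_positions(reads)
    assert decode_positions(packed) == sorted(reads)
    print(len(packed), "bits for", len(reads), "positions")
```

Small gaps get short codewords, so densely mapped reads cost only a few bits each; whether gamma, Golomb, or a Huffman code does best depends on the gap distribution of the data set, which is consistent with the abstract's conclusion that no single encoding is expected to be optimal.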
Pages: 12