Meta-colored Compacted de Bruijn Graphs

Cited by: 0
Authors
Pibiri, Giulio Ermanno [1, 2]
Fan, Jason [3 ]
Patro, Rob [3 ]
Affiliations
[1] Ca Foscari Univ Venice, DAIS, Venice, Italy
[2] ISTI CNR, Pisa, Italy
[3] Univ Maryland, Dept Comp Sci, College Pk, MD 20440 USA
Keywords
PAN-GENOME ANALYSIS;
DOI
10.1007/978-1-0716-3989-4_9
Chinese Library Classification number
TP301 [Theory and methods];
Discipline classification code
081202;
Abstract
The colored compacted de Bruijn graph (c-dBG) has become a fundamental tool used across several areas of genomics and pangenomics. For example, it has been widely adopted by methods that perform read mapping or alignment, abundance estimation, and subsequent downstream analyses. These applications essentially regard the c-dBG as a map from k-mers to the set of references in which they appear. The c-dBG data structure should retrieve this set (the color of the k-mer) efficiently for any given k-mer, while using little memory. To aid retrieval, the colors are stored explicitly in the data structure and take considerable space for large reference collections, even when compressed. Reducing the space of the colors is therefore of utmost importance for large-scale sequence indexing. We describe the meta-colored compacted de Bruijn graph (Mac-dBG), a new colored de Bruijn graph data structure where colors are represented holistically, i.e., taking into account their redundancy across the whole collection being indexed, rather than individually as atomic integer lists. This allows the factorization and compression of common sub-patterns across colors. While optimizing the space of our data structure is NP-hard, we propose a simple heuristic algorithm that yields practically good solutions. Results show that the Mac-dBG data structure improves substantially over the best previous space/time trade-off, by providing remarkably better compression effectiveness for the same (or better) query efficiency. This improved space/time trade-off is robust across different datasets and query workloads. Code availability. A C++17 implementation of the Mac-dBG is publicly available on GitHub at: https://github.com/jermp/fulgor.
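The two ideas in the abstract, a map from k-mers to colors and the factorization of sub-patterns shared across colors, can be illustrated with a toy Python sketch. This is not the Fulgor implementation or its API: the function names (`build_kmer_colors`, `factorize_colors`, `expand`) are hypothetical, colors are stored as plain uncompressed tuples, and the partition of reference ids is supplied by hand, whereas the paper chooses it with a heuristic.

```python
from collections import defaultdict


def build_kmer_colors(references, k):
    """Map each k-mer to its color: the sorted tuple of reference ids containing it."""
    kmer_to_refs = defaultdict(set)
    for ref_id, seq in enumerate(references):
        for i in range(len(seq) - k + 1):
            kmer_to_refs[seq[i:i + k]].add(ref_id)
    return {kmer: tuple(sorted(refs)) for kmer, refs in kmer_to_refs.items()}


def factorize_colors(colors, partitions):
    """Split every color along a fixed partition of the reference ids.

    Assumption: `partitions` is a list of disjoint sets covering all ids.
    Returns (partial_colors, meta_colors), where partial_colors[p] holds the
    distinct sub-patterns seen inside partition p (each stored once), and
    each meta color is a list of (partition index, partial color index) pairs.
    """
    partial_index = [dict() for _ in partitions]
    meta_colors = []
    for color in colors:
        meta = []
        for p, part in enumerate(partitions):
            sub = tuple(r for r in color if r in part)
            if sub:
                # Reuse the index of an already-seen sub-pattern, else assign the next one.
                idx = partial_index[p].setdefault(sub, len(partial_index[p]))
                meta.append((p, idx))
        meta_colors.append(meta)
    # Dicts preserve insertion order, so list position equals the assigned index.
    partial_colors = [list(d) for d in partial_index]
    return partial_colors, meta_colors


def expand(meta, partial_colors):
    """Rebuild the original color from its meta color."""
    refs = []
    for p, idx in meta:
        refs.extend(partial_colors[p][idx])
    return tuple(sorted(refs))
```

In this sketch, three colors `(0, 1, 2)`, `(0, 1, 3)`, and `(0, 1)` under the partition `[{0, 1}, {2, 3}]` all share the single partial color `(0, 1)`, so that sub-pattern is stored once and referenced three times. Actual space savings depend on the dataset and on how the partition is chosen.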
Pages: 131-146
Page count: 16