Space-efficient computation of k-mer dictionaries for large values of k

Citations: 0
Authors
Diaz-Dominguez, Diego [1]
Leinonen, Miika [1]
Salmela, Leena [1]
Affiliations
[1] Univ Helsinki, Dept Comp Sci, Pietari Kalmin Katu 5, Helsinki 00014, Finland
Funding
Academy of Finland;
Keywords
Genomics; String hashing; k-mers; PARALLEL;
DOI
10.1186/s13015-024-00259-1
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Computing k-mer frequencies in a collection of reads is a common procedure in many genomic applications. Several state-of-the-art k-mer counters rely on hash tables to carry out this task, but they are often optimised for small k, as a hash table keeping keys explicitly (i.e., k-mer sequences) takes O(Nk/w) computer words, N being the number of distinct k-mers and w the computer word size, which is impractical for large values of k. This space usage is an important limitation, as the analysis of long and accurate HiFi sequencing reads can require larger values of k. We propose Kaarme, a space-efficient hash table for k-mers using O(N+uk/w) words of space, where u is the number of reads. Our framework exploits the fact that consecutive k-mers overlap by k-1 symbols. Thus, we only store the last symbol of a k-mer and a pointer within the hash table to a previous one, which we can use to recover the remaining k-1 symbols. We adapt Kaarme to compute canonical k-mers as well. This variant also uses pointers within the hash table to save space but requires more work to decode the k-mers. Specifically, it takes O(sigma^k) time in the worst case, sigma being the size of the DNA alphabet, but our experiments show this is hardly ever the case. The canonical variant does not improve our theoretical results but greatly reduces space usage in practice while remaining competitive at retrieving the k-mers and their frequencies. We compare canonical Kaarme to a regular hash table storing canonical k-mers explicitly as keys and show that our method uses up to five times less space while being less than 1.5 times slower. We also show that canonical Kaarme uses significantly less memory than state-of-the-art k-mer counters when they do not resort to disk to keep intermediate results.
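The core idea in the abstract (keep only the last symbol of a k-mer plus a pointer to a preceding k-mer, and recover the other k-1 symbols by following pointers) can be illustrated with a small Python sketch. This is not the Kaarme implementation: the class name `ChainedKmerTable`, its fields, the use of `zlib.crc32` as the hash function, and the bucket layout are all assumptions made for illustration, and the sketch ignores canonical k-mers and bit-packing. Only the first k-mer of a read is stored explicitly when it is new, which loosely mirrors the uk/w term of the space bound.

```python
import zlib


class ChainedKmerTable:
    """Toy k-mer counter (illustrative only, not the authors' code).

    Most entries keep one symbol and a back-pointer; only k-mers that
    start a chain are stored explicitly.
    """

    def __init__(self, k):
        self.k = k
        self.last = []     # last symbol of each stored k-mer
        self.prev = []     # entry index of the preceding k-mer, or -1
        self.count = []    # frequency of each stored k-mer
        self.full = {}     # entry index -> explicit k-mer (chain starts only)
        self.buckets = {}  # hash value -> list of entry indices

    def decode(self, idx):
        """Rebuild the k-mer of entry idx by walking back-pointers."""
        symbols = []
        while len(symbols) < self.k and idx not in self.full:
            symbols.append(self.last[idx])
            idx = self.prev[idx]
        tail = "".join(reversed(symbols))
        if len(tail) == self.k:
            return tail
        # Reached an explicit k-mer: append the collected symbols, keep last k.
        return (self.full[idx] + tail)[-self.k:]

    def _find(self, kmer):
        """Return the entry index of kmer, or -1 if it is not stored."""
        for idx in self.buckets.get(zlib.crc32(kmer.encode()), []):
            if self.decode(idx) == kmer:
                return idx
        return -1

    def add_read(self, read):
        prev_idx = -1
        for i in range(len(read) - self.k + 1):
            kmer = read[i:i + self.k]
            idx = self._find(kmer)
            if idx < 0:
                idx = len(self.last)
                self.last.append(kmer[-1])
                self.prev.append(prev_idx)
                self.count.append(0)
                if prev_idx < 0:
                    # No predecessor in this read: store the k-mer explicitly.
                    self.full[idx] = kmer
                key = zlib.crc32(kmer.encode())
                self.buckets.setdefault(key, []).append(idx)
            self.count[idx] += 1
            prev_idx = idx

    def items(self):
        """Decode every entry and return a k-mer -> frequency dict."""
        return {self.decode(i): c for i, c in enumerate(self.count)}


table = ChainedKmerTable(k=3)
table.add_read("ACGTACG")
table.add_read("CGTT")
counts = table.items()
```

In this toy run, five distinct 3-mers are stored but only one of them (the first k-mer of the first read) is kept as an explicit string; the rest are recovered from single symbols and back-pointers, at the cost of walking up to k pointers per decode.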
Pages: 23
Related papers
50 in total
  • [1] Space-efficient computation of k-mer dictionaries for large values of k
    Diego Díaz-Domínguez
    Miika Leinonen
    Leena Salmela
    Algorithms for Molecular Biology, 19
  • [2] Space-efficient representation of genomic k-mer count tables
    Yoshihiro Shibuya
    Djamal Belazzougui
    Gregory Kucherov
    Algorithms for Molecular Biology, 17
  • [3] Space-efficient representation of genomic k-mer count tables
    Shibuya, Yoshihiro
    Belazzougui, Djamal
    Kucherov, Gregory
    ALGORITHMS FOR MOLECULAR BIOLOGY, 2022, 17 (01)
  • [4] On weighted k-mer dictionaries
    Giulio Ermanno Pibiri
    Algorithms for Molecular Biology, 18
  • [5] On weighted k-mer dictionaries
    Pibiri, Giulio Ermanno
    ALGORITHMS FOR MOLECULAR BIOLOGY, 2023, 18 (01)
  • [6] Efficient Techniques for k-mer Counting
    Mamun, Abdullah-Al
    Pal, Soumitra
    Rajasekaran, Sanguthevar
    2015 IEEE 5TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL ADVANCES IN BIO AND MEDICAL SCIENCES (ICCABS), 2015,
  • [7] Efficient dynamic associative dictionary for large k-mer sets
    Dufresne, Yoann
    Marchet, Camille
    Chikhi, Rayan
    Limasset, Antoine
    BMC BIOINFORMATICS, 2020, 21 (SUPPL 20):
  • [8] k-mer Profiling for Bacterial Identification
    Bhange, Snehal V.
    Tikariha, Hitesh
    Dongre, S. S.
    Purohit, H. J.
    HELIX, 2018, 8 (05): : 4007 - 4009
  • [9] Disk compression of k-mer sets
    Rahman, Amatur
    Chikhi, Rayan
    Medvedev, Paul
    ALGORITHMS FOR MOLECULAR BIOLOGY, 2021, 16 (01)
  • [10] k-mer approaches for biodiversity genomics
    Jenike, Katharine M.
    Campos-Dominguez, Lucia
    Bodde, Marilou
    Cerca, Jose
    Hodson, Christina N.
    Schatz, Michael C.
    Jaron, Kamil S.
    GENOME RESEARCH, 2025, 35 (02) : 219 - 230