Space-efficient computation of k-mer dictionaries for large values of k

Citations: 0

Authors:

Diaz-Dominguez, Diego [1]

Leinonen, Miika [1]

Salmela, Leena [1]

Affiliations:

[1] Univ Helsinki, Dept Comp Sci, Pietari Kalmin Katu 5, Helsinki 00014, Finland

Source:

ALGORITHMS FOR MOLECULAR BIOLOGY | 2024 / Vol. 19 / Issue 01

Funding:

Academy of Finland;

Keywords:

Genomics; String hashing; k-mers; PARALLEL;

DOI:

10.1186/s13015-024-00259-1

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Subject classification codes:

071010 ; 081704 ;

Abstract:

Computing k-mer frequencies in a collection of reads is a common procedure in many genomic applications. Several state-of-the-art k-mer counters rely on hash tables to carry out this task, but they are often optimised for small k, as a hash table keeping keys explicitly (i.e., k-mer sequences) takes O(Nk/w) computer words, N being the number of distinct k-mers and w the computer word size, which is impractical for large values of k. This space usage is an important limitation, as the analysis of long and accurate HiFi sequencing reads can require larger values of k. We propose Kaarme, a space-efficient hash table for k-mers using O(N+uk/w) words of space, where u is the number of reads. Our framework exploits the fact that consecutive k-mers overlap by k-1 symbols. Thus, we only store the last symbol of a k-mer and a pointer within the hash table to a previous one, which we can use to recover the remaining k-1 symbols. We adapt Kaarme to compute canonical k-mers as well. This variant also uses pointers within the hash table to save space but requires more work to decode the k-mers. Specifically, it takes O(sigma^k) time in the worst case, sigma being the DNA alphabet, but our experiments show this is hardly ever the case. The canonical variant does not improve our theoretical results but greatly reduces space usage in practice while maintaining competitive performance when retrieving the k-mers and their frequencies. We compare canonical Kaarme to a regular hash table storing canonical k-mers explicitly as keys and show that our method uses up to five times less space while being less than 1.5 times slower. We also show that canonical Kaarme uses significantly less memory than state-of-the-art k-mer counters when they do not resort to disk to keep intermediate results.
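
The overlap idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class name `OverlapKmerTable`, its fields, the fixed capacity, linear probing, and the use of Python's built-in `hash` are all assumptions made for illustration. Only the first k-mer of a read is stored explicitly (and only if it is new); every other new k-mer stores its last symbol and the slot of the k-mer that preceded it, and a lookup decodes candidates by walking those pointers. It covers the non-canonical case only, and Python lists do not reproduce the space bounds stated above.

```python
from typing import Iterator, List, Optional, Tuple


class OverlapKmerTable:
    """Toy k-mer counter (hypothetical): most slots hold only (predecessor slot, last symbol)."""

    def __init__(self, k: int, capacity: int = 1 << 12) -> None:
        self.k = k
        self.capacity = capacity  # fixed size, so slot numbers stay valid as pointers
        self.prev: List[int] = [-1] * capacity   # slot of the preceding k-mer
        self.last: List[str] = [""] * capacity   # last symbol of the k-mer
        self.head: List[Optional[str]] = [None] * capacity  # explicit k-mer, read starts only
        self.count: List[int] = [0] * capacity   # 0 marks an empty slot
        self.size = 0

    def decode(self, slot: int) -> str:
        """Rebuild the k-mer at `slot` by walking back through predecessor pointers."""
        suffix: List[str] = []
        while len(suffix) < self.k and self.head[slot] is None:
            suffix.append(self.last[slot])
            slot = self.prev[slot]
        suffix.reverse()
        need = self.k - len(suffix)
        if need == 0:
            return "".join(suffix)
        # An explicit k-mer was reached: its last `need` symbols are our prefix.
        return self.head[slot][self.k - need:] + "".join(suffix)

    def _find_slot(self, kmer: str) -> Tuple[int, bool]:
        """Linear probing; occupied candidates are decoded and compared with `kmer`."""
        slot = hash(kmer) % self.capacity
        while self.count[slot] > 0:
            if self.decode(slot) == kmer:
                return slot, True
            slot = (slot + 1) % self.capacity
        return slot, False

    def add_read(self, read: str) -> None:
        prev_slot = -1
        for i in range(len(read) - self.k + 1):
            kmer = read[i:i + self.k]
            slot, found = self._find_slot(kmer)
            if not found:
                if self.size + 1 >= self.capacity:
                    raise MemoryError("toy table is full (no resizing in this sketch)")
                if prev_slot < 0:
                    self.head[slot] = kmer        # new k-mer at the start of a read
                else:
                    self.prev[slot] = prev_slot   # pointer to the overlapping predecessor
                    self.last[slot] = kmer[-1]
                self.size += 1
            self.count[slot] += 1
            prev_slot = slot

    def items(self) -> Iterator[Tuple[str, int]]:
        for slot in range(self.capacity):
            if self.count[slot] > 0:
                yield self.decode(slot), self.count[slot]


table = OverlapKmerTable(k=4)
for read in ["ACGTACGT", "GTACGTTT"]:
    table.add_read(read)
counts = dict(table.items())
```

In this toy run only the first k-mer of the first read is kept explicitly, because the second read starts with a k-mer that is already in the table. Decoding never follows more than k pointers, since each step back contributes one symbol.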


Pages: 23
