Space-efficient computation of k-mer dictionaries for large values of k

Cited by: 0

Authors:

Diaz-Dominguez, Diego ^{[1]}

Leinonen, Miika ^{[1]}

Salmela, Leena ^{[1]}

Affiliations:

[1] Univ Helsinki, Dept Comp Sci, Pietari Kalmin Katu 5, Helsinki 00014, Finland

Source:

ALGORITHMS FOR MOLECULAR BIOLOGY | 2024 / Vol. 19 / Issue 01

Funding:

Academy of Finland;

Keywords:

Genomics; String hashing; k-mers; PARALLEL;

DOI:

10.1186/s13015-024-00259-1

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Subject classification codes:

071010 ; 081704 ;

Abstract:

Computing k-mer frequencies in a collection of reads is a common procedure in many genomic applications. Several state-of-the-art k-mer counters rely on hash tables to carry out this task, but they are often optimised for small k, as a hash table keeping keys explicitly (i.e., k-mer sequences) takes O(Nk/w) computer words, N being the number of distinct k-mers and w the computer word size, which is impractical for large values of k. This space usage is an important limitation, as the analysis of long and accurate HiFi sequencing reads can require larger values of k. We propose Kaarme, a space-efficient hash table for k-mers using O(N+uk/w) words of space, where u is the number of reads. Our framework exploits the fact that consecutive k-mers overlap by k-1 symbols. Thus, we only store the last symbol of a k-mer and a pointer within the hash table to a previous one, which we can use to recover the remaining k-1 symbols. We adapt Kaarme to compute canonical k-mers as well. This variant also uses pointers within the hash table to save space but requires more work to decode the k-mers. Specifically, it takes O(σ^k) time in the worst case, σ being the DNA alphabet, but our experiments show this is hardly ever the case. The canonical variant does not improve our theoretical results but greatly reduces space usage in practice while remaining competitive in the time needed to retrieve the k-mers and their frequencies. We compare canonical Kaarme to a regular hash table storing canonical k-mers explicitly as keys and show that our method uses up to five times less space while being less than 1.5 times slower. We also show that canonical Kaarme uses significantly less memory than state-of-the-art k-mer counters when they do not resort to disk to keep intermediate results.
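The pointer idea in the abstract can be illustrated with a small Python toy. This is only a sketch under assumptions and not the authors' implementation: the class `PointerKmerTable`, its fields, and the use of Python's built-in `hash` on a materialised k-mer string are all hypothetical choices made for readability (a real implementation would avoid building the string and would pack symbols and pointers into machine words). Each entry keeps one symbol and a link to the entry of the preceding k-mer; only a k-mer that starts a read and is not yet in the table is stored in full, which is where the uk/w term comes from.

```python
class PointerKmerTable:
    """Toy k-mer counter: each entry stores one symbol plus a predecessor link."""

    def __init__(self, k):
        self.k = k
        self.sym = []       # last symbol of each entry's k-mer
        self.prev = []      # entry id of the preceding k-mer (None for read starts)
        self.count = []     # frequency of each entry's k-mer
        self.explicit = {}  # entry id -> full k-mer, only for new read-start k-mers
        self.buckets = {}   # hash value -> entry ids (stand-in for a real hash table)

    def decode(self, idx):
        """Rebuild the k-mer of entry idx by walking at most k predecessor links."""
        suffix = []
        while len(suffix) < self.k and idx not in self.explicit:
            suffix.append(self.sym[idx])
            idx = self.prev[idx]
        # If k symbols were collected, no explicit k-mer is needed.
        head = self.explicit[idx] if len(suffix) < self.k else ""
        return (head + "".join(reversed(suffix)))[-self.k:]

    def _find(self, kmer):
        """Return the entry id of kmer, or None; candidates are checked by decoding."""
        for idx in self.buckets.get(hash(kmer), []):
            if self.decode(idx) == kmer:
                return idx
        return None

    def add_read(self, read):
        prev = None
        for i in range(len(read) - self.k + 1):
            kmer = read[i:i + self.k]
            idx = self._find(kmer)
            if idx is None:
                idx = len(self.sym)
                self.sym.append(kmer[-1])
                self.prev.append(prev)
                self.count.append(0)
                if prev is None:
                    # First k-mer of a read and not seen before: keep it in full.
                    self.explicit[idx] = kmer
                self.buckets.setdefault(hash(kmer), []).append(idx)
            self.count[idx] += 1
            prev = idx

    def items(self):
        """Decode every entry and return a k-mer -> frequency dictionary."""
        return {self.decode(i): c for i, c in enumerate(self.count)}


table = PointerKmerTable(3)
table.add_read("ACGTACG")
table.add_read("CGTAC")
print(table.items())
print(len(table.explicit), "k-mer(s) stored explicitly")
```

In this toy, the second read adds no explicit k-mer because its first k-mer is already in the table, so the explicit storage is bounded by the number of reads rather than by the number of distinct k-mers.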


Pages: 23
