Space-efficient computation of k-mer dictionaries for large values of k

Authors
Diaz-Dominguez, Diego [1]
Leinonen, Miika [1]
Salmela, Leena [1]
Affiliations
[1] Univ Helsinki, Dept Comp Sci, Pietari Kalmin Katu 5, Helsinki 00014, Finland
Funding
Academy of Finland
Keywords
Genomics; String hashing; k-mers; PARALLEL;
DOI
10.1186/s13015-024-00259-1
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Computing k-mer frequencies in a collection of reads is a common procedure in many genomic applications. Several state-of-the-art k-mer counters rely on hash tables to carry out this task, but they are often optimised for small k, as a hash table keeping keys explicitly (i.e., k-mer sequences) takes O(Nk/w) computer words, N being the number of distinct k-mers and w the computer word size, which is impractical for large values of k. This space usage is an important limitation, as the analysis of long and accurate HiFi sequencing reads can require larger values of k. We propose Kaarme, a space-efficient hash table for k-mers using O(N+uk/w) words of space, where u is the number of reads. Our framework exploits the fact that consecutive k-mers overlap by k-1 symbols. Thus, we only store the last symbol of a k-mer and a pointer within the hash table to a previous one, which we can use to recover the remaining k-1 symbols. We adapt Kaarme to compute canonical k-mers as well. This variant also uses pointers within the hash table to save space but requires more work to decode the k-mers. Specifically, it takes O(sigma^k) time in the worst case, sigma being the size of the DNA alphabet, but our experiments show this is hardly ever the case. The canonical variant does not improve our theoretical results but greatly reduces space usage in practice while remaining competitive in the time needed to retrieve the k-mers and their frequencies. We compare canonical Kaarme to a regular hash table storing canonical k-mers explicitly as keys and show that our method uses up to five times less space while being less than 1.5 times slower. We also show that canonical Kaarme uses significantly less memory than state-of-the-art k-mer counters when they do not resort to disk to keep intermediate results.
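The pointer idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class name PointerKmerTable, its methods, and the bucket layout are hypothetical, it handles only non-canonical k-mers, and it uses Python's built-in hash on the k-mer window where a real counter would likely use a rolling hash. It assumes each read contributes at most one explicitly stored k-mer (its first, if new), which mirrors the uk/w term, while every other new k-mer stores only its last symbol and the index of the preceding k-mer.

```python
from collections import defaultdict


class PointerKmerTable:
    """Toy k-mer counter: most entries keep one symbol plus a back pointer."""

    def __init__(self, k):
        self.k = k
        # Each entry is [prev_index_or_None, payload, count].
        # prev is None  -> payload is the full k-mer string (explicit entry).
        # prev is an int -> payload is only the last symbol of the k-mer.
        self.entries = []
        # Hash value -> indices of entries whose k-mer has that hash.
        self.buckets = defaultdict(list)

    def decode(self, idx):
        """Recover the k-mer of entry idx by walking back pointers."""
        tail = []  # symbols collected from last to first
        while len(tail) < self.k:
            prev, payload, _ = self.entries[idx]
            if prev is None:
                # Explicit k-mer reached: append the collected suffix, keep k symbols.
                return (payload + "".join(reversed(tail)))[-self.k:]
            tail.append(payload)
            idx = prev
        return "".join(reversed(tail))

    def _find(self, kmer):
        # Keys are not stored, so candidates are compared by decoding them.
        for idx in self.buckets[hash(kmer)]:
            if self.decode(idx) == kmer:
                return idx
        return None

    def add_read(self, read):
        prev = None  # entry index of the previous k-mer in this read
        for i in range(len(read) - self.k + 1):
            kmer = read[i:i + self.k]
            idx = self._find(kmer)
            if idx is None:
                idx = len(self.entries)
                if prev is None:
                    self.entries.append([None, kmer, 0])      # explicit k-mer
                else:
                    self.entries.append([prev, kmer[-1], 0])  # symbol + pointer
                self.buckets[hash(kmer)].append(idx)
            self.entries[idx][2] += 1
            prev = idx

    def counts(self):
        return {self.decode(i): e[2] for i, e in enumerate(self.entries)}


table = PointerKmerTable(3)
table.add_read("ACGTACG")
table.add_read("TACGT")
print(table.counts())
```

In this sketch a pointer stays valid because entries are never moved or deleted, and because the k-mer an entry points to always ends with the k-1 symbols that the entry's own k-mer starts with. Decoding therefore yields the same string no matter which read first inserted the entry.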
Pages: 23