Engineering a compressed suffix tree implementation

Cited by: 0
Authors
Välimäki, Niko [1]
Gerlach, Wolfgang [2]
Dixit, Kashyap [3]
Mäkinen, Veli [1]
Affiliations
[1] Univ Helsinki, Dept Comp Sci, Teollisuuskatu 23, SF-00510 Helsinki, Finland
[2] Univ Bielefeld, Technische Fakultat, Bielefeld, Germany
[3] Indian Inst Technol, Dept Comp Engn & Sci, Kanpur 208016, Uttar Pradesh, India
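Source
Funding
Academy of Finland;
Keywords
DOI
Not available
Chinese Library Classification
TP301 [Theory, methods];
Discipline classification code
081202;
Abstract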
The suffix tree is one of the most important data structures in string algorithms and biological sequence analysis. Unfortunately, when it comes to implementing those algorithms and applying them to real genomic sequences, the main memory size often becomes the bottleneck. This is easily explained by the fact that while a DNA sequence of length n from the alphabet Σ = {A, C, G, T} can be stored in n log |Σ| = 2n bits, its suffix tree occupies O(n log n) bits. In practice, the size difference easily reaches a factor of 50. We report on an implementation of the compressed suffix tree very recently proposed by Sadakane (Theory of Computing Systems, in press). The compressed suffix tree occupies space proportional to the text size, i.e. O(n log |Σ|) bits, and supports all typical suffix tree operations with at most a log n factor slowdown. Our experiments show that, e.g. on a 10 MB DNA sequence, the compressed suffix tree takes 10% of the space of a normal suffix tree. At the same time, a representative algorithm is slowed down by a factor of 30. Our implementation follows the original proposal in spirit, but some internal parts are tailored towards practical implementation. Our construction algorithm has time requirement O(n log n log |Σ|) and uses nearly the same space as the final structure while constructing it: on the 10 MB DNA sequence, the maximum space usage during construction is only 1.4 times the final product size.
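The space argument in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the 2-bit code table and the function names `pack_dna`, `unpack_dna`, `text_bits` and `index_array_bits` are hypothetical, chosen only for illustration. The sketch packs a DNA string into 2 bits per base and compares that with the cost of a single array of n text positions at ceil(log2 n) bits each, which is the kind of term behind the O(n log n) bound. A real suffix tree keeps several such arrays plus pointers, so the sketch says nothing about the factor of 50 reported in practice.

```python
# Hypothetical 2-bit codes for the DNA alphabet {A, C, G, T}.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
LETTERS = "ACGT"


def pack_dna(seq: str) -> bytes:
    """Pack a DNA string into 2 bits per base (4 bases per byte)."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack_dna(packed: bytes, n: int) -> str:
    """Recover the first n bases from a packed buffer."""
    return "".join(
        LETTERS[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(n)
    )


def ceil_log2(x: int) -> int:
    """Integer ceil(log2 x) for x >= 1, avoiding floating point."""
    return (x - 1).bit_length()


def text_bits(n: int, sigma: int = 4) -> int:
    """Bits for the plain text: n * ceil(log2 sigma)."""
    return n * ceil_log2(sigma)


def index_array_bits(n: int) -> int:
    """Bits for ONE array of n text positions, ceil(log2 n) bits each."""
    return n * ceil_log2(n)


if __name__ == "__main__":
    n = 10 * 2**20  # a sequence of about 10 MB, one base per character
    print("text bits:       ", text_bits(n))
    print("one index array: ", index_array_bits(n))
    print("ratio:           ", index_array_bits(n) / text_bits(n))
```

Under these assumptions, each position in a text of about ten million bases needs 24 bits, so a single index array is already 12 times larger than the packed text.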
Pages: 217 / +
Page count: 3
Related papers
50 in total
  • [11] A Practical Implementation of Compressed Suffix Arrays with Applications to Self-Indexing
    Huo, Hongwei
    Chen, Longgang
    Vitter, Jeffrey Scott
    Nekrich, Yakov
    2014 DATA COMPRESSION CONFERENCE (DCC 2014), 2014, : 292 - 301
  • [12] A quick tour on suffix arrays and compressed suffix arrays
    Grossi, Roberto
    THEORETICAL COMPUTER SCIENCE, 2011, 412 (27) : 2964 - 2973
  • [13] Smaller Compressed Suffix Arrays
    Benza, Ekaterina
    Klein, Shmuel T.
    Shapira, Dana
    COMPUTER JOURNAL, 2021, 64 (05) : 721 - 730
  • [14] Suffix cactus: A cross between suffix tree and suffix array
    Karkkainen, J
    COMBINATORIAL PATTERN MATCHING, 1995, 937 : 191 - 204
  • [15] Fully Compressed Suffix Trees
    Russo, Luis M. S.
    Navarro, Gonzalo
    Oliveira, Arlindo L.
    ACM TRANSACTIONS ON ALGORITHMS, 2011, 7 (04)
  • [16] Compressed Property Suffix Trees
    Hon, Wing-Kai
    Patil, Manish
    Shah, Rahul
    Thankachan, Sharma V.
    2011 DATA COMPRESSION CONFERENCE (DCC), 2011, : 123 - 132
  • [17] Compressed compact suffix arrays
    Mäkinen, V
    Navarro, G
    COMBINATORIAL PATTERN MATCHING, PROCEEDINGS, 2004, 3109 : 420 - 433
  • [18] Compressed property suffix trees
    Hon, Wing-Kai
    Patil, Manish
    Shah, Rahul
    Thankachan, Sharma V.
    INFORMATION AND COMPUTATION, 2013, 232 : 10 - 18
  • [19] Practical Compressed Suffix Trees
    Canovas, Rodrigo
    Navarro, Gonzalo
    EXPERIMENTAL ALGORITHMS, PROCEEDINGS, 2010, 6049 : 94 - 105
  • [20] PFP Compressed Suffix Trees
    Boucher, Christina
    Cvacho, Ondřej
    Gagie, Travis
    Holub, Jan
    Manzini, Giovanni
    Navarro, Gonzalo
    Rossi, Massimiliano
    2021 PROCEEDINGS OF THE SYMPOSIUM ON ALGORITHM ENGINEERING AND EXPERIMENTS, ALENEX, 2021, : 60 - 72