A Practical Implementation of Compressed Suffix Arrays with Applications to Self-Indexing

Cited by: 13
Authors
Huo, Hongwei [1]
Chen, Longgang [1]
Vitter, Jeffrey Scott [2]
Nekrich, Yakov [2]
Affiliations
[1] Xidian Univ, 2 Taibai South Rd, Xian 710071, Shaanxi, Peoples R China
[2] Univ Kansas, 1450 Jayhawk Blvd, Lawrence, KS 66045 USA
Funding
National Natural Science Foundation of China
Keywords
CONSTRUCTION; TREES;
DOI
10.1109/DCC.2014.49
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
In this paper we develop a simple and practical text indexing scheme for compressed suffix arrays (CSA). For a text of $n$ characters, our CSA can be constructed in linear time and needs $2nH_k + n + o(n)$ bits of space for any $k \le c \log_\sigma n - 1$ and any constant $c < 1$, where $H_k$ denotes the $k$th-order entropy. We compare the performance of our method with two established compressed indexing methods, the FM-index and the Sad-CSA. Experiments on the Canterbury Corpus and the Pizza&Chili Corpus show significant advantages of our algorithm over the other two indexes in terms of compression and query time. Our storage scheme achieves better performance on all types of data present in these two corpora, except for evenly distributed data, such as DNA. The source code for our CSA is available online.
Pages: 292-301
Page count: 10
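The space bound quoted in the abstract depends on the $k$th-order entropy $H_k$ of the text. As a rough illustration only, the sketch below computes $H_k$ using the standard empirical definition (the average, over all length-$k$ contexts, of the zeroth-order entropy of the characters that follow each context) and evaluates the leading terms $2nH_k + n$ of the bound. The function names `h0`, `hk`, and `space_bound_bits` are hypothetical, the $o(n)$ term is ignored, and none of this is taken from the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def h0(counts):
    """Zeroth-order empirical entropy (bits per symbol) of a symbol histogram."""
    total = sum(counts.values())
    return sum(c / total * math.log2(total / c) for c in counts.values())


def hk(text, k):
    """k-th order empirical entropy of text, in bits per symbol.

    Assumes the standard definition: group each character by the k
    characters preceding it, then average the per-context H0 values,
    weighted by how many characters fall in each context.
    """
    n = len(text)
    if n == 0:
        return 0.0
    followers = defaultdict(Counter)
    for i in range(k, n):
        followers[text[i - k:i]][text[i]] += 1
    return sum(sum(c.values()) * h0(c) for c in followers.values()) / n


def space_bound_bits(text, k):
    """Leading terms 2*n*H_k + n of the abstract's bound (o(n) term omitted)."""
    n = len(text)
    return 2 * n * hk(text, k) + n


if __name__ == "__main__":
    sample = "abababab"
    for k in (0, 1):
        print(k, hk(sample, k), space_bound_bits(sample, k))
```

On a highly repetitive string such as `"abababab"`, $H_1$ drops to zero because each character is fully determined by the one before it, so only the linear $n$ term of the bound remains. The bound is stated only for $k \le c \log_\sigma n - 1$, so larger values of $k$ in this sketch fall outside the claimed range.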