Compressed suffix arrays and suffix trees with applications to text indexing and string matching

Cited by: 222
Authors
Grossi, R
Vitter, JS
Affiliations
[1] Univ Pisa, Dipartimento Informat, I-56127 Pisa, Italy
[2] Purdue Univ, Dept Comp Sci, W Lafayette, IN 47907 USA
[3] Duke Univ, Durham, NC 27706 USA
Keywords
compression; text indexing; text retrieval; compressed data structures; suffix arrays; suffix trees; string searching; pattern matching;
DOI
10.1137/S0097539702402354
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202 ;
Abstract
The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text $T$ consisting of $n$ symbols drawn from a fixed alphabet $\Sigma$. The text $T$ can be represented in $n \lg |\Sigma|$ bits by encoding each symbol with $\lg |\Sigma|$ bits. The goal is to support fast online queries for searching any string pattern $P$ of $m$ symbols, with $T$ being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require $O(n \lg n)$ additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need $\Omega(n)$ memory words, each of $\Omega(\lg n)$ bits. These indexes are larger than the text itself by a multiplicative factor of $\Omega(\lg_{|\Sigma|} n)$, which is significant when $\Sigma$ is of constant size, such as in ASCII or Unicode. On the other hand, these indexes support fast searching, either in $O(m \lg |\Sigma|)$ time or in $O(m + \lg n)$ time, plus an output-sensitive cost $O(occ)$ for listing the $occ$ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast $O(m / \lg_{|\Sigma|} n + \lg^{\epsilon}_{|\Sigma|} n)$ search time in the worst case, for any constant $0 < \epsilon \le 1$, using at most $(\epsilon^{-1} + O(1))\, n \lg |\Sigma|$ bits of storage. Our result thus presents for the first time an efficient index whose size is provably linear in the size of the text in the worst case, and for many scenarios, the space is actually sublinear in practice. As a concrete example, the compressed suffix array for a typical 100 MB ASCII file can require 30-40 MB or less, while the raw suffix array requires 500 MB. Our theoretical bounds improve both time and space of previous indexing schemes. Listing the pattern occurrences introduces a sublogarithmic slowdown factor in the output-sensitive cost, giving $O(occ \lg^{\epsilon}_{|\Sigma|} n)$ time as a result. When the patterns are sufficiently long, we can use auxiliary data structures in $O(n \lg |\Sigma|)$ bits to obtain a total search bound of $O(m / \lg_{|\Sigma|} n + occ)$ time, which is optimal.
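For orientation, the uncompressed baseline that the abstract compares against can be sketched in a few lines of Python. This is not the compressed index of the paper: it is a plain suffix array with naive construction and the simple binary search that costs roughly $O(m \lg n)$ character comparisons plus the cost of reporting occurrences. The function names `build_suffix_array` and `find_occurrences` are hypothetical and chosen only for this illustration.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text` in sorted order.

    Naive construction for illustration only: sorting materialises the
    suffixes, so it is far slower than dedicated construction algorithms.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return every position where `pattern` occurs in `text`, ascending.

    Two binary searches over the suffix array locate the contiguous range
    of suffixes that start with `pattern`. Each step compares at most
    len(pattern) characters.
    """
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in sa[start:lo] begins with the pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(sa)
    print(find_occurrences(text, sa, "ana"))
```

The array stores one integer per text position, which is the $\Omega(n)$ words of $\Omega(\lg n)$ bits each that the abstract identifies as the space bottleneck of uncompressed indexes.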
Pages: 378-407
Page count: 30