Constructing compressed suffix arrays with large alphabets

Authors
Hon, WK [1]
Lam, TW
Sadakane, K
Sung, WK
Affiliations
[1] Univ Hong Kong, Dept Comp Sci & Informat Syst, Hong Kong, Hong Kong, Peoples R China
[2] Kyushu Univ, Dept Comp Sci & Commun Engn, Fukuoka 812, Japan
[3] Natl Univ Singapore, Sch Comp, Singapore 117548, Singapore
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Recent research in compressing suffix arrays has resulted in two breakthrough indexing data structures, namely, compressed suffix arrays (CSA) [7] and the FM-index [5]. Either one makes it feasible to store a full-text index in main memory even for text data with a few billion characters (such as human DNA). However, constructing such indexing data structures with limited working memory (i.e., without constructing suffix arrays) is not a trivial task. This paper addresses this problem. Currently, only CSA admits a space-efficient construction algorithm [15]. For a text $T$ of length $n$ over an alphabet $\Sigma$, this algorithm requires $O(|\Sigma| n \log n)$ time and $(2H_0 + 1 + \epsilon)n$ bits of working space, where $H_0$ is the 0-th order empirical entropy of $T$ and $\epsilon$ is any non-zero constant. This algorithm is good enough when the alphabet size $|\Sigma|$ is small. It is not practical for text data containing protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters. The main contribution of this paper is a new algorithm which can construct CSA in $O(n \log n)$ time using $(H_0 + 2 + \epsilon)n$ bits of working space. Note that the running time of our algorithm is independent of the alphabet size, and the space requirement is smaller, as it is likely that $H_0 > 1$. This paper also contributes to the space-efficient construction of the FM-index. We show that the FM-index can indeed be constructed from CSA directly in $O(n)$ time.
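
The space comparison in the abstract can be checked with a little arithmetic: the earlier bound $(2H_0 + 1 + \epsilon)n$ exceeds the new bound $(H_0 + 2 + \epsilon)n$ by $(H_0 - 1)n$ bits, so the new algorithm uses less working space exactly when $H_0 > 1$. The following is a minimal Python sketch of that comparison, not the construction algorithm itself. The function names and the default `eps = 0.1` are illustrative assumptions, and the bounds are evaluated without any lower-order terms.

```python
from collections import Counter
from math import log2


def empirical_entropy_h0(text):
    """Zeroth-order empirical entropy of `text`, in bits per character."""
    n = len(text)
    if n == 0:
        return 0.0
    counts = Counter(text)
    # H_0 = sum over characters c of (n_c / n) * log2(n / n_c)
    return sum((c / n) * log2(n / c) for c in counts.values())


def working_space_bits(text, eps=0.1):
    """Return (earlier_bound, new_bound) in bits for the two bounds
    quoted in the abstract. `eps` is an arbitrary illustrative value."""
    n = len(text)
    h0 = empirical_entropy_h0(text)
    earlier_bound = (2 * h0 + 1 + eps) * n  # bound attributed to [15]
    new_bound = (h0 + 2 + eps) * n          # bound claimed for the new algorithm
    return earlier_bound, new_bound


if __name__ == "__main__":
    for sample in ("ACGT" * 4, "aaaa"):
        earlier, new = working_space_bits(sample)
        print(sample, empirical_entropy_h0(sample), earlier, new)
```

For a text with four equally frequent characters, $H_0 = 2$ and the new bound is the smaller one. For a single repeated character, $H_0 = 0$ and the earlier bound is smaller, which is why the abstract qualifies its claim with "it is likely that $H_0 > 1$".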
Pages: 240-249
Page count: 10