Constructing compressed suffix arrays with large alphabets

Cited by: 0

Authors:

Hon, WK [1]

Lam, TW

Sadakane, K

Sung, WK

Affiliations:

[1] Univ Hong Kong, Dept Comp Sci & Informat Syst, Hong Kong, Hong Kong, Peoples R China

[2] Kyushu Univ, Dept Comp Sci & Commun Engn, Fukuoka 812, Japan

[3] Natl Univ Singapore, Sch Comp, Singapore 117548, Singapore

Source:

ALGORITHMS AND COMPUTATION, PROCEEDINGS | 2003 / Vol. 2906

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

Recent research in compressing suffix arrays has resulted in two breakthrough indexing data structures, namely, compressed suffix arrays (CSA) [7] and FM-index [5]. Either of them makes it feasible to store a full-text index in the main memory even for a piece of text data with a few billion characters (such as human DNA). However, constructing such indexing data structures with limited working memory (i.e., without constructing suffix arrays) is not a trivial task. This paper addresses this problem. Currently, only CSA admits a space-efficient construction algorithm [15]. For a text $T$ of length $n$ over an alphabet $\Sigma$, this algorithm requires $O(|\Sigma| n \log n)$ time and $(2H_0 + 1 + \epsilon)n$ bits of working space, where $H_0$ is the 0-th order empirical entropy of $T$ and $\epsilon$ is any non-zero constant. This algorithm is good enough when the alphabet size $|\Sigma|$ is small. It is not practical for text data containing protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters. The main contribution of this paper is a new algorithm which can construct CSA in $O(n \log n)$ time using $(H_0 + 2 + \epsilon)n$ bits of working space. Note that the running time of our algorithm is independent of the alphabet size and the space requirement is smaller as it is likely that $H_0 > 1$. This paper also contributes to the space-efficient construction of FM-index. We show that FM-index can indeed be constructed from CSA directly in $O(n)$ time.
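The space comparison in the abstract rests on the 0-th order empirical entropy $H_0$. The following is a minimal sketch, not taken from the paper, that computes $H_0$ for a text and evaluates the two working-space expressions quoted above. The function names and the default value of `eps` are illustrative assumptions; the code only evaluates the formulas and does not construct a CSA.

```python
import math
from collections import Counter
from typing import Tuple


def empirical_entropy_h0(text: str) -> float:
    """0-th order empirical entropy of `text`, in bits per character."""
    n = len(text)
    if n == 0:
        return 0.0
    counts = Counter(text)
    # H_0 = sum over characters c of (n_c / n) * log2(n / n_c)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def working_space_bits(text: str, eps: float = 0.1) -> Tuple[float, float]:
    """Evaluate the two working-space expressions from the abstract.

    Returns (earlier, new), where
      earlier = (2*H_0 + 1 + eps) * n   # algorithm of [15]
      new     = (H_0 + 2 + eps) * n     # algorithm of this paper
    `eps` is a hypothetical choice of the constant epsilon.
    """
    n = len(text)
    h0 = empirical_entropy_h0(text)
    earlier = (2 * h0 + 1 + eps) * n
    new = (h0 + 2 + eps) * n
    return earlier, new


if __name__ == "__main__":
    sample = "ACGT" * 4  # uniform over 4 symbols, so H_0 = 2 bits
    earlier, new = working_space_bits(sample)
    print(empirical_entropy_h0(sample), earlier, new)
```

The two expressions differ by $(H_0 - 1)n$ bits, so the new bound is the smaller one exactly when $H_0 > 1$, which matches the remark in the abstract. For a text with a single repeated character, $H_0 = 0$ and the earlier expression is the smaller of the two.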


Pages: 240-249

Number of pages: 10
