Constructing compressed suffix arrays with large alphabets

Authors
Hon, WK [1]
Lam, TW
Sadakane, K
Sung, WK
Affiliations
[1] Univ Hong Kong, Dept Comp Sci & Informat Syst, Hong Kong, Hong Kong, Peoples R China
[2] Kyushu Univ, Dept Comp Sci & Commun Engn, Fukuoka 812, Japan
[3] Natl Univ Singapore, Sch Comp, Singapore 117548, Singapore
Chinese Library Classification
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Recent research in compressing suffix arrays has resulted in two breakthrough indexing data structures, namely, compressed suffix arrays (CSA) [7] and the FM-index [5]. Either of them makes it feasible to store a full-text index in main memory even for text data with a few billion characters (such as human DNA). However, constructing such indexing data structures with limited working memory (i.e., without first constructing the suffix array) is not a trivial task. This paper addresses this problem. Currently, only CSA admits a space-efficient construction algorithm [15]. For a text T of length n over an alphabet Σ, this algorithm requires O(|Σ| n log n) time and (2H_0 + 1 + ε)n bits of working space, where H_0 is the 0-th order empirical entropy of T and ε is any non-zero constant. This algorithm is good enough when the alphabet size |Σ| is small. It is not practical for text data containing protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters. The main contribution of this paper is a new algorithm that can construct CSA in O(n log n) time using (H_0 + 2 + ε)n bits of working space. Note that the running time of our algorithm is independent of the alphabet size, and the space requirement is smaller, as it is likely that H_0 > 1. This paper also contributes to the space-efficient construction of the FM-index. We show that the FM-index can indeed be constructed from CSA directly in O(n) time.
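To make the two working-space bounds concrete, the following minimal sketch computes the 0-th order empirical entropy H_0 of a string and evaluates (2H_0 + 1 + ε)n against (H_0 + 2 + ε)n. It does not construct a CSA or an FM-index. The function names and the default ε = 0.1 are assumptions chosen for illustration and do not come from the paper.

```python
import math
from collections import Counter


def empirical_entropy_h0(text: str) -> float:
    """0-th order empirical entropy of `text`, in bits per character."""
    n = len(text)
    if n == 0:
        return 0.0
    counts = Counter(text)
    # H_0 = sum over symbols c of (n_c / n) * log2(n / n_c)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def working_space_bits(text: str, eps: float = 0.1):
    """Return (H_0, earlier bound, new bound), with both bounds in bits.

    earlier bound: (2*H_0 + 1 + eps) * n   (the algorithm cited as [15])
    new bound:     (H_0 + 2 + eps) * n     (the algorithm of this paper)
    eps = 0.1 is an arbitrary illustrative value, not one from the paper.
    """
    n = len(text)
    h0 = empirical_entropy_h0(text)
    earlier = (2 * h0 + 1 + eps) * n
    new = (h0 + 2 + eps) * n
    return h0, earlier, new


if __name__ == "__main__":
    h0, earlier, new = working_space_bits("ACGT" * 4)
    print(f"H_0 = {h0:.3f} bits/char, earlier = {earlier:.1f} bits, new = {new:.1f} bits")
```

The difference between the two bounds is (H_0 - 1)n bits, so the new bound is the smaller one exactly when H_0 > 1, which is the condition the abstract mentions. A uniform four-letter text such as the repeated "ACGT" above has H_0 = 2 and falls in that regime.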
Pages: 240-249
Page count: 10