Scalable Detection of Frequent Substrings by Grammar-Based Compression

Cited by: 3
Authors
Nakahara, Masaya [1]
Maruyama, Shirou [2]
Kuboyama, Tetsuji [3]
Sakamoto, Hiroshi [1, 4]
Affiliations
[1] Kyushu Inst Technol, Iizuka, Fukuoka 8208502, Japan
[2] Kyushu Univ, Fukuoka 8190395, Japan
[3] Gakushuin Univ, Tokyo 1718588, Japan
[4] JST PRESTO, Kawaguchi, Saitama 3320012, Japan
Source
Keywords
pattern discovery; grammar-based compression; edit-sensitive parsing; approximation algorithm
DOI
10.1587/transinf.E96.D.457
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
A scalable method for pattern discovery by compression is proposed. A string is representable by a context-free grammar that derives the string deterministically. In this framework of grammar-based compression, the aim of the algorithm is to output as small a grammar as possible. This optimization problem is approximately solvable. Among such approximation algorithms, the compressor based on edit-sensitive parsing (ESP) is especially suitable for detecting maximal common substrings as well as long frequent substrings. Based on ESP, we design a linear-time algorithm that approximately finds all frequent patterns in a string, and we prove several lower bounds that guarantee the length of the extracted patterns. We also examine the performance of our algorithm through experiments on biological sequences and other compressible real-world texts. Compared to other practical algorithms, our algorithm is faster and more scalable on large and repetitive strings.
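
To make the idea of a grammar that derives exactly one string concrete, here is a minimal Python sketch. It is not the ESP-based compressor described in the abstract: it swaps in a simple greedy pair-replacement scheme (in the style of Re-Pair) purely for illustration. The function names `build_grammar` and `expand` are hypothetical and do not come from the paper.

```python
from collections import Counter


def build_grammar(text):
    """Greedy pair replacement (Re-Pair style), NOT the paper's ESP method.

    Returns (start_sequence, rules), where each rule maps a nonterminal
    name such as "X0" to a pair of symbols. Terminals are single
    characters, so multi-character names cannot collide with them.
    """
    seq = list(text)
    rules = {}
    while len(seq) >= 2:
        # Count adjacent pairs (overlapping occurrences are counted too).
        counts = Counter(zip(seq, seq[1:]))
        pair, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq < 2:
            break  # no pair repeats, so nothing more to gain
        name = f"X{len(rules)}"
        rules[name] = pair
        # Replace non-overlapping occurrences of the pair, left to right.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Derive the unique string generated by a symbol."""
    if symbol not in rules:
        return symbol  # terminal character
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


if __name__ == "__main__":
    text = "ab" * 16
    start, rules = build_grammar(text)
    # The grammar must reproduce the input exactly.
    assert "".join(expand(s, rules) for s in start) == text
    for name, (left, right) in rules.items():
        print(name, "->", left, right, "derives", expand(name, rules))
```

In this toy scheme, a nonterminal that is reused derives a substring that repeats in the input, which hints at why a compressed grammar can serve as an index of frequent substrings. The guarantees on pattern length stated in the abstract depend on ESP and do not carry over to this sketch.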
Pages: 457-464
Number of pages: 8
    Gawrychowski, Pawel
    [J]. LANGUAGE AND AUTOMATA THEORY AND APPLICATIONS, 2010, 6031 : 273 - +