Pattern matching in LZW compressed files

Cited by: 17
Authors
Tao, T [1]
Mukherjee, A [1]
Affiliations
[1] Univ Cent Florida, Sch Comp Sci, Orlando, FL 32826 USA
Funding
National Science Foundation (USA);
Keywords
data compaction and compression; information search and retrieval; pattern matching;
DOI
10.1109/TC.2005.133
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Subject classification code
0812;
Abstract
Compressed pattern matching is an emerging research area that addresses the following problem: given a text file in compressed format and a pattern, report the occurrence(s) of the pattern in the file with minimal (or no) decompression. In this paper, we report our work on compressed pattern matching in LZW compressed files. The work includes an extension of the well-known "almost-optimal" algorithm of Amir et al. The original algorithm has been improved to find not only the first occurrence of the pattern but also all other occurrences. A faster implementation for so-called "simple patterns" is also proposed. The work also includes a novel multiple-pattern matching algorithm using the Aho-Corasick algorithm. The algorithm takes O(mt+n+r) time with O(mt) extra space, where n is the size of the compressed file, m is the total length of all patterns, t is the size of the LZW trie, and r is the number of occurrences of the patterns. Extensive experiments have been conducted to test the performance of our algorithms and to compare them with other well-known compressed pattern matching algorithms, particularly the BWT-based algorithms and a similar multiple-pattern matching algorithm by Kida et al. that also uses the Aho-Corasick algorithm on LZW compressed data. The results show that our multiple-pattern matching algorithm is competitive with the best compressed pattern matching algorithms and is in practice the fastest among all approaches when the number of patterns is not very large. Therefore, our algorithm is preferable for general string matching applications. The proposed algorithm is efficient for large files, and it is particularly efficient when applied to archive search if the archives are compressed with a common LZW trie. LZW is one of the most efficient and widely used compression algorithms, and our method requires no modification to the compression algorithm. The work reported in this paper, therefore, has great economic and market potential.
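To make the roles of m, t, and n in the stated bounds concrete, the following is a minimal sketch of multiple-pattern matching over an LZW code stream. It is not the authors' algorithm and not that of Kida et al.: it is a simplified illustration that only counts occurrences instead of reporting positions. The function names (`lzw_compress`, `build_ac`, `count_in_lzw`) are hypothetical, and the sketch assumes an unbounded LZW dictionary, a known alphabet, and non-empty patterns over that alphabet. For every LZW trie node it stores one entry per Aho-Corasick state (a table with on the order of m·t entries), after which each code is processed with a single lookup.

```python
from collections import deque

def lzw_compress(text, alphabet):
    """Plain LZW with an unbounded dictionary (assumption of this sketch)."""
    table = {ch: i for i, ch in enumerate(alphabet)}
    codes, cur = [], ""
    for ch in text:
        if cur + ch in table:
            cur += ch
        else:
            codes.append(table[cur])
            table[cur + ch] = len(table)
            cur = ch
    if cur:
        codes.append(table[cur])
    return codes

def build_ac(patterns, alphabet):
    """Aho-Corasick automaton as a full DFA; out[s] = patterns ending at s."""
    goto, out = [{}], [0]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                out.append(0)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s] += 1
    fail = [0] * len(goto)
    queue = deque()
    for ch in alphabet:
        if ch in goto[0]:
            queue.append(goto[0][ch])
        else:
            goto[0][ch] = 0
    while queue:  # breadth-first, so failure targets are already complete
        s = queue.popleft()
        for ch in alphabet:
            if ch in goto[s]:
                t = goto[s][ch]
                fail[t] = goto[fail[s]][ch]
                out[t] += out[fail[t]]
                queue.append(t)
            else:
                goto[s][ch] = goto[fail[s]][ch]
    return goto, out

def count_in_lzw(codes, patterns, alphabet):
    """Count pattern occurrences by scanning codes, never rebuilding the text."""
    goto, out = build_ac(patterns, alphabet)
    m = len(goto)
    first, trans, hits = [], [], []  # one row per LZW trie node

    def add_node(parent, ch):
        if parent is None:  # single-character phrase
            tr = [goto[s][ch] for s in range(m)]
            ht = [out[x] for x in tr]
            first.append(ch)
        else:               # phrase = phrase(parent) + ch
            tr = [goto[x][ch] for x in trans[parent]]
            ht = [hits[parent][s] + out[tr[s]] for s in range(m)]
            first.append(first[parent])
        trans.append(tr)
        hits.append(ht)

    for ch in alphabet:
        add_node(None, ch)
    state, total, prev = 0, 0, None
    for code in codes:
        if prev is not None:
            # Standard LZW decoding rule, including the "code not yet defined" case.
            ch = first[code] if code < len(first) else first[prev]
            add_node(prev, ch)
        total += hits[code][state]  # matches ending inside this phrase
        state = trans[code][state]  # automaton state after the whole phrase
        prev = code
    return total
```

For example, `count_in_lzw(lzw_compress("abababab", "ab"), ["aba", "bab"], "ab")` counts overlapping occurrences of both patterns directly from the code list. Reporting the positions of the r occurrences, as the abstract describes, would need additional bookkeeping that this sketch omits.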
Pages: 929 - 938
Number of pages: 10
Related papers
50 in total
  • [1] Multiple-pattern matching for LZW compressed files
    Tao, T
    Mukherjee, A
    [J]. ITCC 2005: International Conference on Information Technology: Coding and Computing, Vol 1, 2005, : 91 - 96
  • [2] LZW based compressed pattern matching
    Tao, T
    Mukherjee, A
    [J]. DCC 2004: DATA COMPRESSION CONFERENCE, PROCEEDINGS, 2004, : 568 - 568
  • [3] Optimal pattern matching in LZW compressed strings
    Gawrychowski, Pawel
    [J]. PROCEEDINGS OF THE TWENTY-SECOND ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, 2011, : 362 - 372
  • [4] Optimal Pattern Matching in LZW Compressed Strings
    Gawrychowski, Pawel
    [J]. ACM TRANSACTIONS ON ALGORITHMS, 2013, 9 (03)
  • [5] Multiple pattern matching in LZW compressed text
    Kida, T
    Takeda, M
    Shinohara, A
    Miyazaki, M
    Arikawa, S
    [J]. DCC '98 - DATA COMPRESSION CONFERENCE, 1998, : 103 - 112
  • [6] Multiple-pattern matching in LZW compressed files using Aho-Corasick algorithm
    Tao, T
    Mukherjee, A
    [J]. DCC 2005: Data Compression Conference, Proceedings, 2005, : 482 - 482
  • [7] An efficient pattern matching scheme in LZW compressed sequences
    Lee, Tsern-Huei
    Huang, Nai-Lun
    [J]. SECURITY AND COMMUNICATION NETWORKS, 2008, 1 (04) : 325 - 335
  • [8] Almost optimal fully LZW-compressed pattern matching
    Gasieniec, L
    Rytter, W
    [J]. DCC '99 - DATA COMPRESSION CONFERENCE, PROCEEDINGS, 1999, : 316 - 325
  • [9] Simple and efficient LZW-compressed multiple pattern matching
    Gawrychowski, Pawel
    [J]. JOURNAL OF DISCRETE ALGORITHMS, 2014, 25 : 34 - 41
  • [10] Empirical evaluation of LZW-Compressed Multiple Pattern Matching Algorithms
    Reja, Mario
    [J]. 2022 24TH INTERNATIONAL SYMPOSIUM ON SYMBOLIC AND NUMERIC ALGORITHMS FOR SCIENTIFIC COMPUTING, SYNASC, 2022, : 125 - 132