Pattern matching in LZW compressed files

Cited by: 17
Authors
Tao, T [1 ]
Mukherjee, A [1 ]
Affiliations
[1] Univ Cent Florida, Sch Comp Sci, Orlando, FL 32826 USA
Funding
US National Science Foundation;
Keywords
data compaction and compression; information search and retrieval; pattern matching;
DOI
10.1109/TC.2005.133
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Compressed pattern matching is an emerging research area that addresses the following problem: given a text file in compressed format and a pattern, report the occurrence(s) of the pattern in the file with minimal (or no) decompression. In this paper, we report our work on compressed pattern matching in LZW compressed files. The work includes an extension of Amir et al.'s well-known "almost-optimal" algorithm: the original algorithm has been improved to find not only the first occurrence of the pattern but also all other occurrences. A faster implementation for so-called "simple patterns" is also proposed. The work also includes a novel multiple-pattern matching algorithm using the Aho-Corasick algorithm. The algorithm takes O(mt+n+r) time with O(mt) extra space, where n is the size of the compressed file, m is the total length of all patterns, t is the size of the LZW trie, and r is the number of occurrences of the patterns. Extensive experiments have been conducted to test the performance of our algorithms and to compare them with other well-known compressed pattern matching algorithms, particularly the BWT-based algorithms and a similar multiple-pattern matching algorithm by Kida et al. that also uses the Aho-Corasick algorithm on LZW compressed data. The results show that our multiple-pattern matching algorithm is competitive with the best compressed pattern matching algorithms and is in practice the fastest among all approaches when the number of patterns is not very large. Therefore, our algorithm is preferable for general string matching applications. The proposed algorithm is efficient for large files, and it is particularly efficient when applied to archive search if the archives are compressed with a common LZW trie. LZW is one of the most efficient and popular compression algorithms and is used extensively, and our method requires no modification to the compression algorithm. The work reported in this paper, therefore, has great economic and market potential.
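To make the problem statement in the abstract concrete, the following is a minimal baseline sketch, not the algorithm proposed in the paper. It assumes a textbook LZW coder over a 256-symbol initial alphabet (input characters must have code points below 256) and a standard Aho-Corasick automaton, and it simply feeds each decoded phrase through the automaton. All function names (`lzw_compress`, `lzw_decode_stream`, `build_automaton`, `search_compressed`) are hypothetical and chosen for illustration. Because this baseline touches every decoded character, its running time grows with the uncompressed length, whereas the bound quoted in the abstract is stated in terms of the compressed size n and the trie size t.

```python
from collections import deque

def lzw_compress(text):
    """Textbook LZW: return a list of integer codes (assumes ord(ch) < 256)."""
    table = {chr(i): i for i in range(256)}
    codes, cur = [], ""
    for ch in text:
        if cur + ch in table:
            cur += ch
        else:
            codes.append(table[cur])
            table[cur + ch] = len(table)   # new dictionary (trie) entry
            cur = ch
    if cur:
        codes.append(table[cur])
    return codes

def lzw_decode_stream(codes):
    """Yield the phrase for each code while rebuilding the dictionary."""
    table = {i: chr(i) for i in range(256)}
    prev = None
    for code in codes:
        # A code not yet in the table can only be the entry being defined now.
        entry = table[code] if code in table else prev + prev[0]
        if prev is not None:
            table[len(table)] = prev + entry[0]
        yield entry
        prev = entry

def build_automaton(patterns):
    """Standard Aho-Corasick: goto edges, failure links, output lists."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(p)
    queue = deque(goto[0].values())        # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, nxt in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
            queue.append(nxt)
    return goto, fail, out

def search_compressed(codes, patterns):
    """Baseline: decode phrase by phrase and report (start_index, pattern)."""
    goto, fail, out = build_automaton(patterns)
    state, pos, hits = 0, 0, []
    for phrase in lzw_decode_stream(codes):
        for ch in phrase:
            while state and ch not in goto[state]:
                state = fail[state]
            state = goto[state].get(ch, 0)
            pos += 1
            for p in out[state]:
                hits.append((pos - len(p), p))
    return hits

codes = lzw_compress("abababcabab")
print(search_compressed(codes, ["abab", "bc"]))
```

The sketch holds only one decoded phrase in memory at a time rather than the whole text, which is the simplest reading of "minimal decompression"; it makes no attempt to reuse work across repeated phrases.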
Pages: 929 - 938
Page count: 10
Related papers
50 in total
  • [41] Manipulatable compressed string indexing technology for pattern matching
    Denzumi, Shuhei
    Arimura, Hiroki
    Sadakane, Kunihiko
    [J]. Journal of the Institute of Electronics, Information and Communication Engineers, 2014, 97 (12): : 1080 - 1085
  • [42] Faster Approximate Pattern Matching in Compressed Repetitive Texts
    Gagie, Travis
    Gawrychowski, Pawel
    Puglisi, Simon J.
    [J]. ALGORITHMS AND COMPUTATION, 2011, 7074 : 653 - +
  • [43] A Compression System for Unicode Files Using an Enhanced Lzw Method
    Anto, Rincy Thayyalakkal
    Ramachandran, Rajesh
    [J]. PERTANIKA JOURNAL OF SCIENCE AND TECHNOLOGY, 2020, 28 (04): : 1427 - 1444
  • [44] Modified LZW algorithm for efficient compressed text retrieval
    Zhang, N
    Tao, T
    Satya, RV
    Mukherjee, A
    [J]. ITCC 2004: INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY: CODING AND COMPUTING, VOL 2, PROCEEDINGS, 2004, : 224 - 228
  • [45] A fully compressed pattern matching algorithm for simple collage systems
    Inenaga, S
    Shinohara, A
    Takeda, M
    [J]. INTERNATIONAL JOURNAL OF FOUNDATIONS OF COMPUTER SCIENCE, 2005, 16 (06) : 1155 - 1166
  • [46] Parallel Pattern Matching over Brotli Compressed Network Traffic
    Sun, Xiuwen
    Zhang, Guangzheng
    Wu, Di
    Yu, Qingying
    Cui, Jie
    Zhong, Hong
    [J]. 2023 IEEE 22ND INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, BIGDATASE, CSE, EUC, ISCI 2023, 2024, : 477 - 484
  • [47] Fast Pattern Matching in Compressed Text using Wavelet Tree
    Mishra, Surya Prakash
    Prasad, Rajesh
    Singh, Gurmit
    [J]. IETE JOURNAL OF RESEARCH, 2018, 64 (01) : 87 - 99
  • [48] A Boyer-Moore type algorithm for compressed pattern matching
    Shibata, Y
    Matsumoto, T
    Takeda, M
    Shinohara, A
    Arikawa, S
    [J]. COMBINATORIAL PATTERN MATCHING, 2000, 1848 : 181 - 194
  • [49] Pattern Matching on Grammar-Compressed Strings in Linear Time
    Ganardi, Moses
    Gawrychowski, Pawel
    [J]. PROCEEDINGS OF THE 2022 ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, SODA, 2022, : 2833 - 2846
  • [50] Approximate pattern matching in LZ77-compressed texts
    Gagie, Travis
    Gawrychowski, Pawel
    Puglisi, Simon J.
    [J]. JOURNAL OF DISCRETE ALGORITHMS, 2015, 32 : 64 - 68