Improving semistatic compression via phrase-based modeling

Cited by: 2
Authors
Brisaboa, Nieves R. [1]
Farina, Antonio [1]
Navarro, Gonzalo [2]
Parama, Jose R. [1]
Affiliations
[1] Univ A Coruna, Database Lab, Fac Informat, La Coruna 15071, Spain
[2] Univ Chile, Dept Comp Sci, Santiago, Chile
Keywords
Text compression; Direct search; Algorithm
DOI
10.1016/j.ipm.2011.01.006
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In recent years, new semistatic word-based byte-oriented text compressors, such as Tagged Huffman and those based on Dense Codes, have shown that it is possible to perform fast direct search over compressed text and decompression of arbitrary text passages over collections reduced to around 30-35% of their original size. Much of their success is due to the use of words as source symbols and a byte-oriented target alphabet. This approach broke with traditional statistical compressors, which use characters as source symbols and a bit-oriented target alphabet. In this work we go one step beyond by using phrases as source symbols. We present two new semistatic modelers that we combined with a dense coding scheme to obtain two new compressors: Pair-Based End-Tagged Dense Code (PETDC), where source symbols can be either words or pairs of words, and Phrase-Based End-Tagged Dense Code (PhETDC), which considers words and sequences of words (phrases). PETDC compresses English texts to 28-29% and PhETDC to around 23%, outperforming the optimal byte-oriented zero-order prefix-free word-based semistatic compressor by up to 8 percentage points. Moreover, PETDC and PhETDC still permit random access and efficient direct searches using fast Boyer-Moore algorithms. (C) 2011 Elsevier Ltd. All rights reserved.
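The abstract builds on End-Tagged Dense Code (ETDC) as the coding scheme. The following is a minimal Python sketch of plain word-based ETDC only, assuming the usual formulation in which bytes 0-127 continue a codeword and bytes 128-255 end it. It does not implement the pair or phrase modelers of PETDC and PhETDC, and the function names (`encode_rank`, `decode_rank`, `compress`, `decompress`) are illustrative, not taken from the paper.

```python
from collections import Counter


def encode_rank(rank):
    """Return the codeword (bytes) for a 0-based frequency rank."""
    out = [128 + rank % 128]      # last byte: high bit set marks the end
    rank //= 128
    while rank > 0:
        rank -= 1                 # dense numbering: no codeword is wasted
        out.append(rank % 128)    # continuation bytes stay below 128
        rank //= 128
    return bytes(reversed(out))


def decode_rank(code):
    """Inverse of encode_rank for one complete codeword."""
    rank = 0
    for b in code[:-1]:
        rank = rank * 128 + b + 1
    return rank * 128 + (code[-1] - 128)


def compress(words):
    """Semistatic model: a first pass ranks words, a second pass encodes."""
    counts = Counter(words)
    # Most frequent first; ties broken alphabetically (an assumption here).
    vocab = sorted(counts, key=lambda w: (-counts[w], w))
    codes = {w: encode_rank(i) for i, w in enumerate(vocab)}
    return vocab, b"".join(codes[w] for w in words)


def decompress(vocab, data):
    """Split the byte stream at end-tagged bytes and map ranks to words."""
    words, start = [], 0
    for pos, b in enumerate(data):
        if b >= 128:              # end of a codeword
            words.append(vocab[decode_rank(data[start:pos + 1])])
            start = pos + 1
    return words


text = "to be or not to be".split()
vocab, data = compress(text)
restored = decompress(vocab, data)
```

Because only the final byte of a codeword has its high bit set, codeword boundaries can be recognized anywhere in the stream, which is the property that makes decompression from an arbitrary position and byte-level pattern matching feasible.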
Pages: 545-559
Page count: 15