Improving semistatic compression via phrase-based modeling

Cited by: 2
Authors
Brisaboa, Nieves R. [1 ]
Farina, Antonio [1 ]
Navarro, Gonzalo [2 ]
Parama, Jose R. [1 ]
Affiliations
[1] Univ A Coruna, Database Lab, Fac Informat, La Coruna 15071, Spain
[2] Univ Chile, Dept Comp Sci, Santiago, Chile
Keywords
Text compression; Direct search; Algorithm
DOI
10.1016/j.ipm.2011.01.006
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In recent years, new semistatic word-based byte-oriented text compressors, such as Tagged Huffman and those based on Dense Codes, have shown that it is possible to perform fast direct search over compressed text and decompression of arbitrary text passages over collections reduced to around 30-35% of their original size. Much of their success is due to the use of words as source symbols and a byte-oriented target alphabet. This approach broke with traditional statistical compressors, which use characters as source symbols and a bit-oriented target alphabet. In this work we go one step further by using phrases as source symbols. We present two new semistatic modelers that we combine with a dense coding scheme to obtain two new compressors: Pair-Based End-Tagged Dense Code (PETDC), where source symbols can be either words or pairs of words, and Phrase-Based End-Tagged Dense Code (PhETDC), which considers words and sequences of words (phrases). PETDC compresses English texts to 28-29% and PhETDC to around 23%, outperforming the optimal byte-oriented zero-order prefix-free word-based semistatic compressor by up to 8 percentage points. Moreover, PETDC and PhETDC still permit random access and efficient direct searches using fast Boyer-Moore algorithms. © 2011 Elsevier Ltd. All rights reserved.
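As a rough illustration of the dense coding scheme the abstract builds on, the sketch below implements plain word-based End-Tagged Dense Code in Python. It is not the PETDC or PhETDC modelers from the paper: the source symbols here are single words only. It also assumes the usual ETDC convention that the high bit marks the last byte of a codeword. All function names (`etdc_encode`, `etdc_decode`, `compress`, `decompress`) are hypothetical and chosen for this example.

```python
from collections import Counter


def etdc_encode(rank: int) -> bytes:
    """Return the codeword for the symbol with 0-based frequency rank.

    Assumed convention: the last byte has its high bit set (128..255),
    all preceding bytes have it clear (0..127).
    """
    out = [128 + rank % 128]      # final, tagged byte
    x = rank // 128
    while x > 0:
        x -= 1                    # the "- 1" keeps the code dense
        out.append(x % 128)       # continuation byte, high bit clear
        x //= 128
    return bytes(reversed(out))


def etdc_decode(codeword: bytes) -> int:
    """Inverse of etdc_encode for a single complete codeword."""
    acc = 0
    for b in codeword[:-1]:
        acc = acc * 128 + b + 1
    return acc * 128 + (codeword[-1] - 128)


def compress(words):
    """Semistatic two-pass scheme: count words, rank them, then encode."""
    freq = Counter(words)
    # Most frequent first; ties broken alphabetically for determinism.
    vocab = sorted(freq, key=lambda w: (-freq[w], w))
    rank = {w: i for i, w in enumerate(vocab)}
    return vocab, b"".join(etdc_encode(rank[w]) for w in words)


def decompress(vocab, data):
    """Split the byte stream at tagged bytes and map ranks back to words."""
    words, start = [], 0
    for i, b in enumerate(data):
        if b >= 128:              # high bit set: end of a codeword
            words.append(vocab[etdc_decode(data[start:i + 1])])
            start = i + 1
    return words


if __name__ == "__main__":
    text = "to be or not to be that is the question".split()
    vocab, data = compress(text)
    assert decompress(vocab, data) == text
    print(len(data), "bytes for", len(text), "words")
```

Because every codeword ends in a tagged byte, codeword boundaries can be found from any position in the compressed stream, which is the property that makes random access and byte-wise pattern matching feasible. A pair- or phrase-based modeler would change only how the vocabulary of source symbols is chosen before ranking, under the assumptions stated above.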
Pages: 545-559
Page count: 15