Improving semistatic compression via phrase-based modeling

Cited by: 2
Authors
Brisaboa, Nieves R. [1]
Farina, Antonio [1]
Navarro, Gonzalo [2]
Parama, Jose R. [1]
Affiliations
[1] Univ A Coruna, Database Lab, Fac Informat, La Coruna 15071, Spain
[2] Univ Chile, Dept Comp Sci, Santiago, Chile
Keywords
Text compression; Direct search; ALGORITHM;
DOI
10.1016/j.ipm.2011.01.006
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
In recent years, new semistatic word-based byte-oriented text compressors, such as Tagged Huffman and those based on Dense Codes, have shown that it is possible to perform fast direct search over compressed text and decompression of arbitrary text passages over collections reduced to around 30-35% of their original size. Much of their success is due to the use of words as source symbols and a byte-oriented target alphabet. This approach broke with traditional statistical compressors, which use characters as source symbols and a bit-oriented target alphabet. In this work we go one step beyond by using phrases as source symbols. We present two new semistatic modelers that we combined with a dense coding scheme to obtain two new compressors: Pair-Based End-Tagged Dense Code (PETDC), where source symbols can be either words or pairs of words, and Phrase-Based End-Tagged Dense Code (PhETDC), which considers words and sequences of words (phrases). PETDC compresses English texts to 28-29% and PhETDC to around 23%, outperforming the optimal byte-oriented zero-order prefix-free word-based semistatic compressor by up to 8 percentage points. Moreover, PETDC and PhETDC still permit random access and efficient direct searches using fast Boyer-Moore algorithms. (C) 2011 Elsevier Ltd. All rights reserved.
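The byte-oriented dense coding mentioned in the abstract can be illustrated with a minimal word-based sketch of End-Tagged Dense Code, the coder that PETDC and PhETDC build on. It assumes the usual formulation in which the last byte of every codeword has its high bit set and all earlier bytes do not. The function names are hypothetical, and the pair and phrase modelers are not shown because the abstract does not describe how pairs or phrases are selected.

```python
from collections import Counter

def etdc_encode(rank):
    """Codeword for the symbol with 0-based frequency rank `rank`."""
    out = [128 + rank % 128]       # last byte: high bit set marks the codeword end
    rank //= 128
    while rank > 0:
        rank -= 1
        out.append(rank % 128)     # earlier bytes always stay below 128
        rank //= 128
    return bytes(reversed(out))

def etdc_decode(data):
    """Turn a byte stream back into the sequence of ranks."""
    ranks, acc = [], 0
    for b in data:
        if b < 128:                # continuation byte
            acc = acc * 128 + b + 1
        else:                      # tagged byte closes the codeword
            ranks.append(acc * 128 + (b - 128))
            acc = 0
    return ranks

def compress(words):
    """Semistatic scheme: one pass to count, one pass to encode."""
    freq = Counter(words)
    # Most frequent symbols get the shortest codewords; ties broken alphabetically.
    vocab = [w for w, _ in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))]
    rank = {w: i for i, w in enumerate(vocab)}
    return vocab, b"".join(etdc_encode(rank[w]) for w in words)

def decompress(vocab, data):
    return [vocab[r] for r in etdc_decode(data)]

if __name__ == "__main__":
    words = "to be or not to be".split()
    vocab, data = compress(words)
    print(vocab, list(data))
    print(decompress(vocab, data))
```

Because every codeword ends in a tagged byte, codeword boundaries can be recognized from any position in the compressed stream, which is what makes direct byte-level pattern matching and decompression of arbitrary passages feasible. In a pair-based or phrase-based variant, the vocabulary would presumably also contain multi-word entries ranked alongside single words, while the coder itself stays the same.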
Pages: 545-559
Number of pages: 15