Sample Selection for Dictionary-Based Corpus Compression

Authors
Hoobin, Christopher [1]
Puglisi, Simon [1]
Zobel, Justin [2]
Affiliations
[1] RMIT Univ, Sch Comp Sci & Informat Technol, Melbourne, Vic, Australia
[2] Univ Melbourne, Dept Comp Sci & Software Engn, Melbourne, Vic, Australia
Keywords
Dictionary Compression; Random Access; Document Retrieval; Sampling
DOI
Not available
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Compression of large text corpora has the potential to drastically reduce both storage requirements and per-document access costs. Adaptive methods used for general-purpose compression are ineffective for this application, and historically the most successful methods have been based on word-based dictionaries, which allow use of global properties of the text. However, these methods depend on the text complying with assumptions about its content, and they lead to dictionaries of unpredictable size. In recent work we have described an LZ-like approach in which sampled blocks of a corpus are used as a dictionary against which the complete corpus is compressed, giving compression twice as effective as that of zlib. Here we explore how pre-processing can be used to eliminate redundancy in our sampled dictionary. Our experiments show that dictionary size can be reduced by 50% or more (to less than 0.1% of the collection size) with no significant effect on compression or access speed.
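The general idea of compressing documents against a dictionary of sampled blocks can be illustrated with a small sketch. This is not the authors' implementation: the function names (`sample_dictionary`, `factorize`, `decode`), the evenly spaced sampling policy, and the `(position, length)` factor encoding are all assumptions made for illustration, and the naive substring search stands in for whatever index a practical system would use.

```python
from typing import List, Tuple

Factor = Tuple[int, int]  # (dictionary position, length), or (byte value, 0) for a literal


def sample_dictionary(corpus: bytes, block_size: int, num_blocks: int) -> bytes:
    """Build a dictionary by concatenating blocks taken at evenly spaced offsets.

    Assumption: evenly spaced sampling; the abstract does not specify the policy.
    """
    if block_size <= 0 or num_blocks <= 0:
        raise ValueError("block_size and num_blocks must be positive")
    stride = max(len(corpus) // num_blocks, 1)
    blocks = [corpus[k * stride:k * stride + block_size] for k in range(num_blocks)]
    return b"".join(blocks)


def factorize(doc: bytes, dictionary: bytes) -> List[Factor]:
    """Greedily encode doc as references into the dictionary plus literal bytes."""
    factors: List[Factor] = []
    i = 0
    while i < len(doc):
        pos, length = -1, 0
        # Extend the match one byte at a time; stop when the longer prefix
        # no longer occurs anywhere in the dictionary.
        while i + length < len(doc):
            found = dictionary.find(doc[i:i + length + 1])
            if found < 0:
                break
            pos, length = found, length + 1
        if length == 0:
            factors.append((doc[i], 0))  # byte absent from dictionary: emit literal
            i += 1
        else:
            factors.append((pos, length))
            i += length
    return factors


def decode(factors: List[Factor], dictionary: bytes) -> bytes:
    """Rebuild a document from its factors; only the dictionary is needed."""
    out = bytearray()
    for pos, length in factors:
        if length == 0:
            out.append(pos)  # literal byte
        else:
            out += dictionary[pos:pos + length]
    return bytes(out)
```

Because each document is decoded from its own factors and the shared dictionary alone, a single document can be retrieved without decompressing its neighbours, which is the random-access property the keywords refer to. The search here is quadratic and only suitable for small inputs.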
Pages: 1137-1138
Page count: 2