Compressing Multisets with Large Alphabets

Authors
Severo, Daniel [1, 2, 3]
Townsend, James [4]
Khisti, Ashish [2]
Makhzani, Alireza [2, 3]
Ullrich, Karen [1]
Affiliations
[1] Meta AI, New York, NY USA
[2] Univ Toronto, Toronto, ON, Canada
[3] Vector Inst AI, Toronto, ON, Canada
[4] UCL, London, England
DOI
10.1109/DCC52660.2022.00040
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Current methods that compress multisets at an optimal rate have computational complexity that scales linearly with alphabet size, making them too slow to be practical in many real-world settings. We show how to convert a compression algorithm for sequences into one for multisets, in exchange for an additional complexity term that is quasi-linear in sequence length. This allows us to compress multisets of independent and identically distributed symbols at an optimal rate, with computational complexity decoupled from the alphabet size. The key insight is to avoid encoding the multiset directly, and instead compress a proxy sequence, using a technique called "bits-back coding". We demonstrate the method experimentally on two tasks that are intractable with previous optimal-rate methods: compression of multisets of images and of JavaScript Object Notation (JSON) files. Code for our experiments is available at https://github.com/facebookresearch/multiset-compression.
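The idea of compressing a proxy sequence with bits-back coding can be illustrated with a toy sketch. This is not the authors' implementation (see the repository linked above for that); it is a minimal illustration under simplifying assumptions: symbols are integers in [0, alphabet_size), the symbol model is uniform, and the compressor state is a single arbitrary-precision Python integer used as an ANS-style stack without renormalization. All function names are hypothetical. The encoder repeatedly uses bits already on the stack to sample a symbol from the remaining multiset without replacement, then encodes that symbol with the sequence model; the decoder runs the same steps in reverse and returns the borrowed bits.

```python
from collections import Counter


def _sample(state, counts, size):
    """ANS decode step: use bits from `state` to draw a symbol x with
    probability counts[x] / size. Returns the reduced state and x."""
    slot = state % size
    start = 0
    for x in sorted(counts):
        if slot < start + counts[x]:
            return counts[x] * (state // size) + (slot - start), x
        start += counts[x]
    raise ValueError("counts do not sum to size")


def _undo_sample(state, counts, size, x):
    """ANS encode step: exact inverse of _sample (counts must include x)."""
    start = sum(c for y, c in counts.items() if y < x)
    return (state // counts[x]) * size + start + state % counts[x]


def encode_multiset(symbols, alphabet_size, state=1):
    """Push a multiset of ints onto the integer stack `state`."""
    counts = Counter(symbols)
    for size in range(len(symbols), 0, -1):
        # Bits-back step: choose which element to encode next using
        # bits that are already on the stack.
        state, x = _sample(state, counts, size)
        counts[x] -= 1
        if counts[x] == 0:
            del counts[x]
        # Sequence-coder step: uniform model over the alphabet (assumption).
        state = state * alphabet_size + x
    return state


def decode_multiset(state, alphabet_size, n):
    """Pop n symbols; returns the multiset and the restored initial state."""
    counts = Counter()
    for size in range(1, n + 1):
        state, x = divmod(state, alphabet_size)
        counts[x] += 1
        # Return the bits that the encoder borrowed when sampling x.
        state = _undo_sample(state, counts, size, x)
    return counts, state


# Usage: the initial state stands in for previously compressed data.
symbols = [5, 200, 5, 17, 200, 5, 99, 3]
initial = 1 << 64
code = encode_multiset(symbols, 256, initial)
decoded, restored = decode_multiset(code, 256, len(symbols))
assert decoded == Counter(symbols) and restored == initial
```

Because the encoder only looks at symbol counts, any ordering of the same symbols yields the same code. The per-step work in this sketch is linear in the number of distinct symbols present; it makes no attempt to reproduce the quasi-linear complexity stated in the abstract.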
Pages: 322-331
Number of pages: 10
Related papers
  • [1] Compressing Multisets With Large Alphabets
    Severo, Daniel
    Townsend, James
    Khisti, Ashish
    Makhzani, Alireza
    Ullrich, Karen
    IEEE Journal on Selected Areas in Information Theory, 2022, 3 (04): 605-615
  • [2] Compressing Multisets with Large Alphabets using Bits-Back Coding
    Severo, Daniel
    Townsend, James
    Khisti, Ashish
    Makhzani, Alireza
    Ullrich, Karen
    arXiv, 2021
  • [3] Compressing Huffman Models on Large Alphabets
    Navarro, Gonzalo
    Ordonez, Alberto
    2013 DATA COMPRESSION CONFERENCE (DCC), 2013: 381-390
  • [4] Compressing Sets and Multisets of Sequences
    Steinruecken, Christian
    2014 DATA COMPRESSION CONFERENCE (DCC 2014), 2014: 427-427
  • [5] Compressing Sets and Multisets of Sequences
    Steinruecken, Christian
    IEEE TRANSACTIONS ON INFORMATION THEORY, 2015, 61 (03): 1485-1490
  • [6] Compressing multisets using tries
    Gripon, Vincent
    Rabbat, Michael
    Skachek, Vitaly
    Gross, Warren J.
    2012 IEEE INFORMATION THEORY WORKSHOP (ITW), 2012: 642-646
  • [7] Large alphabets and incompressibility
    Gagie, Travis
    INFORMATION PROCESSING LETTERS, 2006, 99 (06): 246-251
  • [8] Minimax Redundancy for Large Alphabets
    Szpankowski, Wojciech
    Weinberger, Marcelo J.
    2010 IEEE INTERNATIONAL SYMPOSIUM ON INFORMATION THEORY, 2010: 1488-1492
  • [9] On the repetition threshold for large alphabets
    Carpi, Arturo
    MATHEMATICAL FOUNDATIONS OF COMPUTER SCIENCE 2006, PROCEEDINGS, 2006, 4162: 226-237
  • [10] Relative redundancy for large alphabets
    Orlitsky, Alon
    Santhanam, Narayana
    Zhang, Junan
    2006 IEEE INTERNATIONAL SYMPOSIUM ON INFORMATION THEORY, VOLS 1-6, PROCEEDINGS, 2006: 2672-+