Efficient Neural Compression with Inference-time Decoding

Cited by: 0
Authors
Metz, Clement [1 ]
Bichler, Olivier [2 ]
Dupret, Antoine [3 ]
Affiliations
[1] Univ Paris Saclay, CEA List, Palaiseau, France
[2] CEA List, Palaiseau, France
[3] CEA Leti, Palaiseau, France
Keywords
neural network; quantization; entropy coding; ANS;
DOI
10.1109/ISCAS58744.2024.10558050
Chinese Library Classification (CLC)
TP39 [Computer applications];
Discipline codes
081203 ; 0835 ;
Abstract
This paper explores the combination of neural network quantization and entropy coding for memory footprint minimization. Edge deployment of quantized models is hampered by the harsh Pareto frontier of the accuracy-to-bitwidth tradeoff, which causes dramatic accuracy loss below a certain bitwidth. This loss can be alleviated by mixed-precision quantization, which allows more flexible bitwidth allocation. However, the benefits of standard mixed precision remain limited by the 1-bit frontier, which forces each parameter to be encoded on at least 1 bit of data. This paper introduces an approach that combines mixed precision, zero-point quantization and entropy coding to push the compression boundary of ResNets beyond the 1-bit frontier with an accuracy drop below 1% on the ImageNet benchmark. From an implementation standpoint, a compact decoder architecture offers reduced latency, allowing for inference-compatible decoding.
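To see why entropy coding can break the 1-bit frontier, consider that zero-point quantization concentrates most weights on a single symbol; the Shannon entropy of such a skewed distribution is then well below 1 bit per parameter, which an entropy coder such as ANS can approach. The sketch below is purely illustrative (the weight counts are invented, not taken from the paper) and only computes the entropy bound, not the ANS codec itself:

```python
import math
from collections import Counter

def entropy_bits_per_symbol(counts):
    """Shannon entropy of an empirical symbol distribution, in bits/symbol."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical ternary-quantized layer: zero-point quantization makes the
# zero symbol dominate, so the distribution is highly skewed.
weights = [0] * 900 + [1] * 50 + [-1] * 50
h = entropy_bits_per_symbol(Counter(weights))
print(f"{h:.3f} bits/parameter")  # ~0.569, well below the 1-bit frontier
```

A fixed-length ternary code would need 2 bits per weight (and any per-symbol code at least 1 bit), whereas the entropy bound here is about 0.57 bits per parameter; this gap is what the combination of zero-point quantization and entropy coding exploits.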
Pages: 5