Efficient Neural Compression with Inference-time Decoding

Cited by: 0
Authors
Metz, Clement [1 ]
Bichler, Olivier [2 ]
Dupret, Antoine [3 ]
Affiliations
[1] Univ Paris Saclay, CEA List, Palaiseau, France
[2] CEA List, Palaiseau, France
[3] CEA Leti, Palaiseau, France
Keywords
neural network; quantization; entropy coding; ANS;
DOI
10.1109/ISCAS58744.2024.10558050
Chinese Library Classification (CLC)
TP39 [Computer applications];
Discipline codes
081203 ; 0835 ;
Abstract
This paper explores the combination of neural network quantization and entropy coding for memory footprint minimization. Edge deployment of quantized models is hampered by the harsh Pareto frontier of the accuracy-to-bitwidth tradeoff, which causes dramatic accuracy loss below a certain bitwidth. This loss can be alleviated by mixed-precision quantization, which allows more flexible bitwidth allocation. However, the benefits of standard mixed precision remain limited by the 1-bit frontier, which forces each parameter to be encoded on at least 1 bit of data. This paper introduces an approach that combines mixed precision, zero-point quantization and entropy coding to push the compression boundary of ResNets beyond the 1-bit frontier with an accuracy drop below 1% on the ImageNet benchmark. From an implementation standpoint, a compact decoder architecture offers reduced latency, allowing for inference-compatible decoding.
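To see why entropy coding can break the 1-bit frontier, consider that zero-point quantization concentrates most weights on a single symbol; the Shannon entropy of such a skewed distribution is then well below 1 bit per parameter, which an entropy coder such as ANS can approach. The sketch below is purely illustrative (the weight counts are invented, not taken from the paper) and only computes the entropy bound, not the ANS codec itself:

```python
import math
from collections import Counter

def entropy_bits_per_symbol(counts):
    """Shannon entropy of an empirical symbol distribution, in bits/symbol."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical ternary-quantized layer: zero-point quantization makes the
# zero symbol dominate, so the distribution is highly skewed.
weights = [0] * 900 + [1] * 50 + [-1] * 50
h = entropy_bits_per_symbol(Counter(weights))
print(f"{h:.3f} bits/parameter")  # ~0.569, well below the 1-bit frontier
```

A fixed-length ternary code would need 2 bits per weight (and any per-symbol code at least 1 bit), whereas the entropy bound here is about 0.57 bits per parameter; this gap is what the combination of zero-point quantization and entropy coding exploits.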
Pages: 5