Ps and Qs: Quantization-Aware Pruning for Efficient Low Latency Neural Network Inference

Cited by: 20
Authors
Hawks, Benjamin [1]
Duarte, Javier [2]
Fraser, Nicholas J. [3]
Pappalardo, Alessandro [3]
Tran, Nhan [1,4]
Umuroglu, Yaman [3]
Affiliations
[1] Fermilab Natl Accelerator Lab, POB 500, Batavia, IL 60510 USA
[2] Univ Calif San Diego, La Jolla, CA 92093 USA
[3] Xilinx Res, Dublin, Ireland
[4] Northwestern Univ, Evanston, IL USA
Source
FRONTIERS IN ARTIFICIAL INTELLIGENCE
Funding
U.S. Department of Energy
Keywords
pruning; quantization; neural networks; generalizability; regularization; batch normalization; model compression; acceleration
DOI
10.3389/frai.2021.676564
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Efficient machine learning implementations optimized for inference in hardware have wide-ranging benefits, depending on the application, from lower inference latency to higher data throughput and reduced energy consumption. Two popular techniques for reducing computation in neural networks are pruning, removing insignificant synapses, and quantization, reducing the precision of the calculations. In this work, we explore the interplay between pruning and quantization during the training of neural networks for ultra-low-latency applications targeting high energy physics use cases. Techniques developed for this study have potential applications across many other domains. We study various configurations of pruning during quantization-aware training, which we term quantization-aware pruning, and the effect of techniques like regularization, batch normalization, and different pruning schemes on performance, computational complexity, and information content metrics. We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task. Further, quantization-aware pruning typically performs as well as or better than other neural architecture search techniques, such as Bayesian optimization, in terms of computational efficiency. Surprisingly, while networks with different training configurations can have similar performance for the benchmark application, the information content in the network can vary significantly, affecting its generalizability.
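As a rough illustration of the quantization-aware pruning idea summarized above (iterative magnitude pruning interleaved with quantization-aware training), the following is a minimal PyTorch sketch. It is not the implementation used in the paper: the FakeQuant straight-through quantizer, the QuantLinear layer, the 6-bit weight width, the layer sizes, the 30% per-stage pruning fraction, and the stand-in data are all illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune


    class FakeQuant(torch.autograd.Function):
        # Symmetric uniform fake-quantization with a straight-through estimator.
        @staticmethod
        def forward(ctx, w, bits):
            qmax = 2 ** (bits - 1) - 1
            scale = w.abs().max() / qmax + 1e-12
            return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

        @staticmethod
        def backward(ctx, grad_out):
            return grad_out, None  # pass the gradient straight through to w


    class QuantLinear(nn.Linear):
        # Linear layer whose weights are fake-quantized in every forward pass.
        def __init__(self, in_features, out_features, bits=6):
            super().__init__(in_features, out_features)
            self.bits = bits

        def forward(self, x):
            return nn.functional.linear(x, FakeQuant.apply(self.weight, self.bits), self.bias)


    model = nn.Sequential(QuantLinear(16, 64), nn.ReLU(), QuantLinear(64, 5))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # Quantization-aware pruning: after each quantization-aware fine-tuning stage,
    # magnitude-prune a further fraction of the weights that remain, then keep
    # training so the surviving quantized weights can adapt to the new mask.
    for stage in range(3):
        for layer in (model[0], model[2]):
            prune.l1_unstructured(layer, name="weight", amount=0.3)  # 30% of remaining weights
        for step in range(100):  # fine-tuning steps per stage (illustrative)
            x = torch.randn(32, 16)         # stand-in batch with 16 input features
            y = torch.randint(0, 5, (32,))  # stand-in labels for 5 classes
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()

    # Bake the accumulated pruning masks into the weight tensors before export.
    for layer in (model[0], model[2]):
        prune.remove(layer, "weight")

The sketch only captures the training-time interleaving of pruning with fake-quantized weights; the paper additionally studies regularization, batch normalization, and different pruning schemes on a high energy physics benchmark.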
Pages: 15
Related Papers (50 in total)
  • [1] Quantization-Aware Training With Dynamic and Static Pruning
    An, Sangho
    Shin, Jongyun
    Kim, Jangho
    IEEE ACCESS, 2025, 13 : 57476 - 57484
  • [2] Quantization-Aware Pruning Criterion for Industrial Applications
    Gil, Yoonhee
    Park, Jong-Hyeok
    Baek, Jongchan
    Han, Soohee
    IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, 2022, 69 (03) : 3203 - 3213
  • [3] Quantization-aware training for low precision photonic neural networks
    Kirtas, M.
    Oikonomou, A.
    Passalis, N.
    Mourgias-Alexandris, G.
    Moralis-Pegios, M.
    Pleros, N.
    Tefas, A.
    NEURAL NETWORKS, 2022, 155 : 561 - 573
  • [4] Quantization-aware Optimization Approach for CNNs Inference on CPUs
    Chen, Jiasong
    Xie, Zeming
    Liang, Weipeng
    Liu, Bosheng
    Zheng, Xin
    Wu, Jigang
    Xiong, Xiaoming
    29TH ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFERENCE, ASP-DAC 2024, 2024, : 878 - 883
  • [5] Low Precision Quantization-aware Training in Spiking Neural Networks with Differentiable Quantization Function
    Shymyrbay, Ayan
    Fouda, Mohammed E.
    Eltawil, Ahmed
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [6] QuantBayes: Weight Optimization for Memristive Neural Networks via Quantization-Aware Bayesian Inference
    Zhou, Yue
    Hu, Xiaofang
    Wang, Lidan
    Zhou, Guangdong
    Duan, Shukai
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I-REGULAR PAPERS, 2021, 68 (12) : 4851 - 4861
  • [7] Inference-aware convolutional neural network pruning
    Choudhary, Tejalal
    Mishra, Vipul
    Goswami, Anurag
    Sarangapani, Jagannathan
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2022, 135 : 44 - 56
  • [8] EFFICIENT INFERENCE OF IMAGE-BASED NEURAL NETWORK MODELS IN RECONFIGURABLE SYSTEMS WITH PRUNING AND QUANTIZATION
    Flich, Jose
    Medina, Laura
    Catalan, Izan
    Hernandez, Carles
    Bragagnolo, Andrea
    Auzanneau, Fabrice
    Briand, David
    2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 2491 - 2495
  • [9] Scaling Up Quantization-Aware Neural Architecture Search for Efficient Deep Learning on the Edge
    Lu, Yao
    Rodriguez, Hiram Rayo Torres
    Vogel, Sebastian
    van de Waterlaat, Nick
    Jancura, Pavol
    PROCEEDINGS 2023 IEEE/ACM INTERNATIONAL WORKSHOP ON COMPILERS, DEPLOYMENT, AND TOOLING FOR EDGE AI, CODAI 2023, 2023, : 1 - 5
  • [10] Comparative Study on Quantization-Aware Training of Memristor Crossbars for Reducing Inference Power of Neural Networks at The Edge
    Tien Van Nguyen
    An, Jiyong
    Min, Kyeong-Sik
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021,