Memory-Efficient Batch Normalization by One-Pass Computation for On-Device Training

Times Cited: 0
Authors
Dai, He [1 ]
Wang, Hang [1 ]
Zhang, Xuchong [2 ]
Sun, Hongbin [2 ]
Affiliations
[1] Xi An Jiao Tong Univ, Sch Microelect, Xian 710049, Shaanxi, Peoples R China
[2] Xi An Jiao Tong Univ, Inst Artificial Intelligence & Robot, Coll Artificial Intelligence, Xian 710049, Shaanxi, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Training; Systolic arrays; Backpropagation; Artificial neural networks; Micromechanical devices; Feedforward systems; Memory management; Memory-efficient accelerator; deep neural networks; batch normalization; on-device training; one-pass computation;
DOI
10.1109/TCSII.2024.3354738
Chinese Library Classification
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Classification Code
0808; 0809;
Abstract
Batch normalization (BN) has become ubiquitous in modern deep learning architectures because of its remarkable improvement in deep neural network (DNN) training performance. However, the two-pass computation in BN training, statistical estimation followed by element-wise normalization, requires two accesses to the input data, resulting in a large increase in off-chip memory traffic during DNN training. In this brief, we propose a novel accelerator, named one-pass normalizer (OPN), to achieve memory-efficient BN for on-device training. Specifically, in terms of dataflow, we propose one-pass computation based on sampling-based range normalization and sparse data recovery techniques to reduce BN off-chip memory access. Regarding the OPN circuit, we propose channel-wise constant extraction to achieve a compact design. Experimental results show that the one-pass computation reduces off-chip memory access of BN by 2.0x to 3.8x compared with previous state-of-the-art designs while maintaining training performance. Moreover, the channel-wise constant extraction reduces the gate count and power consumption of OPN by 56% and 73%, respectively.
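To make the memory-traffic argument concrete, the following sketch contrasts standard two-pass BN (two full reads of the input) with a one-pass variant that estimates per-channel statistics from a small sample and uses a min/max range in place of the standard deviation. This is a hedged illustration of the general "sampling-based range normalization" idea named in the abstract, not the paper's actual OPN dataflow; the function names, the sampling fraction, and the range-based scale are assumptions for illustration only.

```python
import numpy as np

def bn_two_pass(x, eps=1e-5):
    """Standard BN over a (batch, channels) tensor.

    Pass 1 reads x to compute per-channel mean/variance;
    pass 2 reads x again to normalize element-wise.
    """
    mean = x.mean(axis=0)          # first pass over x
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)   # second pass over x

def bn_one_pass_range(x, sample_frac=0.1, eps=1e-5, seed=0):
    """Illustrative one-pass variant (hypothetical, not the paper's OPN).

    Statistics come from a small on-chip sample, and the scale is a
    min/max range estimate, so the full tensor only needs one pass
    for the normalization itself.
    """
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    idx = rng.choice(n, size=max(1, int(n * sample_frac)), replace=False)
    sample = x[idx]                               # small sample, cheap to buffer
    mean = sample.mean(axis=0)
    # Range-based scale estimate standing in for the standard deviation.
    scale = (sample.max(axis=0) - sample.min(axis=0)) / 2.0 + eps
    return (x - mean) / scale                     # single pass over full x
```

The point of the contrast is that `bn_two_pass` touches every element of `x` twice (once for statistics, once for normalization), which on an accelerator translates into two rounds of off-chip reads, whereas the one-pass variant buffers only a sample for statistics and streams the full tensor once.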
Pages: 3186-3190
Number of pages: 5