Design of Processing-in-Memory With Triple Computational Path and Sparsity Handling for Energy-Efficient DNN Training

Cited by: 1
Authors
Han, Wontak [1 ]
Heo, Jaehoon [1 ]
Kim, Junsoo [1 ]
Lim, Sukbin [1 ]
Kim, Joo-Young [1 ]
Affiliation
[1] Korea Advanced Institute of Science and Technology (KAIST), Department of Electrical Engineering, Daejeon 34141, South Korea
Keywords
training; computational modeling; computer architecture; deep learning; circuits and systems; power demand; neurons; accelerator architecture; machine learning; processing-in-memory architecture; bit-serial operation; inference; sparsity handling; SRAM; energy-efficient architecture; deep neural networks; accelerator; macro
DOI: 10.1109/JETCAS.2022.3168852
Chinese Library Classification: TM (Electrical Engineering); TN (Electronics and Communication Technology)
Discipline Codes: 0808; 0809
Abstract
As machine learning (ML) and artificial intelligence (AI) have become mainstream technologies, many accelerators have been proposed to cope with their computation kernels. However, they access external memory frequently due to the large size of deep neural network models, suffering from the von Neumann bottleneck. Moreover, as privacy issues become more critical, on-device training is emerging as a solution. On-device training is challenging, however, because it must perform training, which requires far more computation and memory access than inference, under a limited power budget. In this paper, we present T-PIM, an energy-efficient processing-in-memory (PIM) architecture that supports end-to-end on-device training. Its macro design includes an 8T-SRAM cell-based PIM block that computes in-memory AND operations and three computational datapaths for end-to-end training. The three computational paths integrate arithmetic units for forward propagation, backward propagation, and gradient calculation with weight update, respectively, allowing the weight data stored in the memory to remain stationary. T-PIM also supports variable bit precision to cover various ML scenarios: fully variable input bit precision with 2-, 4-, 8-, or 16-bit weight precision for forward propagation, and the same input bit precision with 16-bit weight precision for backward propagation. In addition, T-PIM implements sparsity-handling schemes that skip computation for zero input data and turn off the arithmetic units for zero weight data, reducing both unnecessary computation and leakage power. Finally, we fabricate the T-PIM chip on a 5.04 mm² die in a 28-nm CMOS logic process. It operates at 50-280 MHz with a supply voltage of 0.75-1.05 V, dissipating 5.25-51.23 mW in inference and 6.10-37.75 mW in training. As a result, it achieves 17.90-161.08 TOPS/W energy efficiency for inference with 1-bit activations and 2-bit weights, and 0.84-7.59 TOPS/W for training with 8-bit activations/errors and 16-bit weights. In conclusion, T-PIM is the first PIM chip that supports end-to-end training, demonstrating a 2.02x performance improvement over the latest PIM chip that only partially supports training.
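The bit-serial, AND-based compute scheme and the two sparsity-handling ideas described in the abstract can be made concrete in software. The following is a minimal Python sketch under stated assumptions, not the paper's hardware or any published API: it reproduces an unsigned multiply-accumulate using only bitwise AND, bit-plane summation, and shift-adds, the way a bit-serial SRAM PIM block streams input bits against stored weight bit-columns. The zero-skip checks model T-PIM's sparsity handling in software (the real chip skips cycles for zero inputs and power-gates arithmetic units for zero weights); all function and variable names are illustrative assumptions.

```python
import numpy as np

def bit_serial_mac(inputs, weights, in_bits=8, w_bits=16):
    """Dot product of unsigned integer vectors using only AND, popcount,
    and shift-add, one input bit-plane per emulated cycle (hypothetical
    helper that mirrors a bit-serial PIM macro in software)."""
    acc = 0
    for t in range(in_bits):                 # stream input bits, LSB first
        x_plane = (inputs >> t) & 1          # current input bit-plane
        if not x_plane.any():                # input sparsity: skip the cycle
            continue                         # when the whole bit-plane is zero
        for b in range(w_bits):              # weight bit-columns held in SRAM
            w_plane = (weights >> b) & 1
            if not w_plane.any():            # weight sparsity: hardware would
                continue                     # power-gate these units instead
            ones = int(np.sum(x_plane & w_plane))  # in-memory AND + adder tree
            acc += ones << (t + b)           # weighted shift-add accumulation
    return acc

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=64)            # 8-bit unsigned activations
w = rng.integers(0, 4, size=64)              # 2-bit unsigned weights
assert bit_serial_mac(x, w, in_bits=8, w_bits=2) == int(x @ w)
```

Note how precision scales cost in this scheme: with 1-bit activations and 2-bit weights (the paper's peak-efficiency inference point), the outer loop runs once and the inner loop at most twice, which is why lowering bit precision translates so directly into energy savings in a bit-serial design.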
Pages: 354-366 (13 pages)
Related Papers (showing items [31]-[40] of 50)
• [31] Namiki, Shu; Hasama, Toshifumi; Ishikawa, Hiroshi, "Optical signal processing for energy-efficient dynamic optical path networks," 2010 36th European Conference and Exhibition on Optical Communication (ECOC), Vols 1-2, 2010.
• [32] Li, Shiju; Tang, Kevin; Lim, Jin; Lee, Chul-Ho; Kim, Jongryool, "Computational Storage for an Energy-Efficient Deep Neural Network Training System," Euro-Par 2023: Parallel Processing, 2023, vol. 14100, pp. 304-319.
• [33] Noel, J.-P.; Egloff, V.; Kooli, M.; Gauchi, R.; Portal, J.-M.; Charles, H.-P.; Vivet, P.; Giraud, B., "Computational SRAM Design Automation using Pushed-Rule Bitcells for Energy-Efficient Vector Processing," Proceedings of the 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE 2020), 2020, pp. 1187-1192.
• [34] Yue, Jinshan; Liu, Yongpan; Su, Fang; Li, Shuangchen; Yuan, Zhe; Wang, Zhibo; Sun, Wenyu; Li, Xueqing; Yang, Huazhong, "AERIS: Area/Energy-Efficient 1T2R ReRAM Based Processing-in-Memory Neural Network System-on-a-Chip," 24th Asia and South Pacific Design Automation Conference (ASP-DAC 2019), 2019, pp. 146-151.
• [35] Heo, Jaehoon; Kim, Jung-Hoon; Han, Wontak; Kim, Jaeuk; Kim, Joo-Young, "SP-PIM: A Super-Pipelined Processing-In-Memory Accelerator With Local Error Prediction for Area/Energy-Efficient On-Device Learning," IEEE Journal of Solid-State Circuits, 2024, 59(8): 2671-2683.
• [36] Choi, Yeongjae; Bae, Dongmyung; Sim, Jaehyeong; Choi, Seungkyu; Kim, Minhye; Kim, Lee-Sup, "Energy-Efficient Design of Processing Element for Convolutional Neural Network," IEEE Transactions on Circuits and Systems II: Express Briefs, 2017, 64(11): 1332-1336.
• [37] Mochizuki, Akira; Yube, Naoto; Hanyu, Takahiro, "Design of a Computational Nonvolatile RAM for a Greedy Energy-Efficient VLSI Processor," IECON 2015 - 41st Annual Conference of the IEEE Industrial Electronics Society, 2015, pp. 3283-3288.
• [38] Fadini, G.; Flayols, T.; Del Prete, A.; Mansard, N.; Soueres, P., "Computational design of energy-efficient legged robots: Optimizing for size and actuators," 2021 IEEE International Conference on Robotics and Automation (ICRA 2021), 2021, pp. 9898-9904.
• [39] Shin, Jaekang; Choi, Seungkyu; Ra, Jongwoo; Kim, Lee-Sup, "Algorithm/Architecture Co-Design for Energy-Efficient Acceleration of Multi-Task DNN," Proceedings of the 59th ACM/IEEE Design Automation Conference (DAC 2022), 2022, pp. 253-258.
• [40] Lee, Jinsu; Kang, Sanghoon; Lee, Jinmook; Shin, Dongjoo; Han, Donghyeon; Yoo, Hoi-Jun, "The Hardware and Algorithm Co-Design for Energy-Efficient DNN Processor on Edge/Mobile Devices," IEEE Transactions on Circuits and Systems I: Regular Papers, 2020, 67(10): 3458-3470.