Design of Processing-in-Memory With Triple Computational Path and Sparsity Handling for Energy-Efficient DNN Training

Cited by: 1
Authors
Han, Wontak [1 ]
Heo, Jaehoon [1 ]
Kim, Junsoo [1 ]
Lim, Sukbin [1 ]
Kim, Joo-Young [1 ]
Affiliation
[1] Korea Adv Inst Sci & Technol, Dept Elect Engn, Daejeon 34141, South Korea
Keywords
Training; Computational modeling; Computer architecture; Deep learning; Circuits and systems; Power demand; Neurons; Accelerator architecture; machine learning; processing-in-memory architecture; bit-serial operation; inference; training; sparsity handling; SRAM; energy-efficient architecture; DEEP NEURAL-NETWORKS; SRAM; ACCELERATOR; MACRO
DOI
10.1109/JETCAS.2022.3168852
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Discipline Classification Codes
0808; 0809
Abstract
As machine learning (ML) and artificial intelligence (AI) have become mainstream technologies, many accelerators have been proposed to cope with their computation kernels. However, these accelerators access external memory frequently because of the large size of deep neural network models, and thus suffer from the von Neumann bottleneck. Moreover, as privacy issues become more critical, on-device training is emerging as a solution; it is challenging, however, because training must run under a limited power budget while requiring far more computation and memory access than inference. In this paper, we present T-PIM, an energy-efficient processing-in-memory (PIM) architecture that supports end-to-end on-device training. Its macro design includes an 8T-SRAM cell-based PIM block that computes AND operations in memory and three computational datapaths for end-to-end training. The three computational paths integrate arithmetic units for forward propagation, backward propagation, and gradient calculation with weight update, respectively, allowing the weight data to remain stationary in memory. T-PIM also supports variable bit precision to cover various ML scenarios: fully variable input bit precision with 2-bit, 4-bit, 8-bit, or 16-bit weight precision for forward propagation, and the same input precision with 16-bit weight precision for backward propagation. In addition, T-PIM implements sparsity handling schemes that skip computation for zero-valued input data and turn off the arithmetic units for zero-valued weight data, reducing both unnecessary computation and leakage power. Finally, we fabricate the T-PIM chip on a 5.04 mm² die in a 28-nm CMOS logic process. It operates at 50-280 MHz with a supply voltage of 0.75-1.05 V, dissipating 5.25-51.23 mW in inference and 6.10-37.75 mW in training. As a result, it achieves 17.90-161.08 TOPS/W energy efficiency for inference with 1-bit activations and 2-bit weights, and 0.84-7.59 TOPS/W for training with 8-bit activations/errors and 16-bit weights. In conclusion, T-PIM is the first PIM chip that supports end-to-end training, demonstrating a 2.02 times performance improvement over the latest PIM chip that only partially supports training.
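The two mechanisms the abstract relies on, bit-serial computation built from in-memory AND operations and zero-skipping sparsity handling, can be illustrated with a short functional sketch. The following is a minimal Python model, not the T-PIM hardware; the function and variable names are ours, and it assumes unsigned activations for simplicity.

```python
# Minimal sketch (not T-PIM RTL) of two ideas from the abstract:
# (1) bit-serial multiply-accumulate decomposed into AND operations between
#     a single input bit and stored weights, the primitive an 8T-SRAM PIM
#     array can evaluate on its bitlines, and
# (2) input-sparsity handling that skips all-zero bit-planes entirely.

def bit_serial_mac(inputs, weights, input_bits=8):
    """Compute sum(inputs[i] * weights[i]) one input bit-plane at a time.

    Assumes unsigned activations; weights may be signed.
    """
    acc = 0
    for b in range(input_bits):                  # stream input bits LSB-first
        # Extract bit-plane b of every input (what gets broadcast per cycle).
        bit_plane = [(x >> b) & 1 for x in inputs]

        # Sparsity handling: an all-zero bit-plane contributes nothing, so
        # the cycle is skipped (no bitline activity, no adder-tree switching).
        if not any(bit_plane):
            continue

        # In-memory AND: bit AND weight gates each weight by one input bit;
        # an adder tree then reduces the surviving partial products.
        partial = sum(w for bit, w in zip(bit_plane, weights) if bit)

        acc += partial << b                      # weight by bit significance
    return acc

# Quick check against ordinary arithmetic.
xs = [0, 3, 0, 12, 0, 7]          # sparse 8-bit activations
ws = [5, -2, 9, 4, 1, 6]          # stored weights (e.g., 16-bit in training)
assert bit_serial_mac(xs, ws) == sum(x * w for x, w in zip(xs, ws))
```

In this toy run only four of the eight bit-planes carry any nonzero bit, so half the cycles are skipped outright; that cycle-level skipping is the kind of saving the sparsity-handling scheme in the abstract targets.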
Pages: 354-366
Page count: 13