Software-Hardware Co-Optimization on Partial-Sum Problem for PIM-based Neural Network Accelerator

Cited by: 0
Authors
Wu, Qizhe [1 ]
Tao, Linfeng [1 ]
Liang, Huawen [1 ]
Yuan, Wei [1 ]
Tian, Teng [1 ]
Xue, Shuang [1 ]
Jin, Xi [1 ]
Affiliations
[1] Univ Sci & Technol China, Chinese Acad Sci, Dept Phys, State Key Lab Particle Detect & Elect,Inst Microe, Hefei 230026, Peoples R China
Keywords
processing-in-memory; partial sum; memristor; neural network accelerator;
DOI
10.1109/HPEC49654.2021.9622798
CLC Number: TP3 [Computing technology; computer technology]
Discipline Code: 0812
Abstract
The crossbar architecture, composed of novel memristor devices, enables high-speed, energy-efficient processing-in-memory (PIM) for neural network computing. However, due to limitations of the manufacturing process, it is difficult to fabricate large arrays. Consequently, the neural network's vector-matrix multiplication (VMM) must split its operands across several arrays, compute partial sums, and then add the partial results together. The neural network (NN) training process, which is often influenced by device variations and ADC quantization noise in the PIM system, does not perceive this partial-sum process. As a result, when NN models are inferred directly on the PIM platform without taking the partial sum into account, accuracy suffers significantly, making it difficult to apply PIM computing to large-scale neural networks. Our work makes the following contributions: (i) We studied the partial-sum issue for crossbar architectures computing high-channel convolution (Conv) and drew three lessons from it. (ii) To address this issue, we propose techniques for avoiding or minimizing partial sums at the software and hardware levels, respectively. At the software level, we use group Conv rather than conventional Conv; at the hardware level, we present a new architecture adapted to depthwise separable Conv. Experiments were conducted with the CIFAR-10 dataset and the VGG8 network on an RRAM crossbar architecture. Results show improvements of 15.53% and 14.55% in accuracy, and 0.28x and 0.94x in energy efficiency, at the software and hardware levels, respectively, compared to the conventional PIM scheme.
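The partial-sum problem described in the abstract can be illustrated with a small sketch (not from the paper; array size, quantization step, and all names are illustrative assumptions): a VMM mapped onto crossbars with a limited number of rows must pass each sub-array's partial sum through an ADC before accumulation, so quantization error enters once per partial sum rather than once per output.

```python
# Sketch of the partial-sum effect in a split-crossbar VMM.
# `array_rows` models the fabrication limit on crossbar height;
# `quantize` is a toy uniform ADC model. All values are illustrative.

def quantize(x, step=0.5):
    """Toy ADC: round a partial sum to the nearest quantization step."""
    return round(x / step) * step

def vmm_ideal(vec, mat):
    """Full-precision VMM: one exact dot product per output column."""
    cols = len(mat[0])
    return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
            for j in range(cols)]

def vmm_split(vec, mat, array_rows=2, step=0.5):
    """VMM split across crossbars of `array_rows` rows: each sub-array's
    output is quantized by its ADC, then the partial sums are accumulated,
    so the error grows with the number of partial sums."""
    cols = len(mat[0])
    out = [0.0] * cols
    for start in range(0, len(vec), array_rows):
        for j in range(cols):
            partial = sum(vec[i] * mat[i][j]
                          for i in range(start, min(start + array_rows, len(vec))))
            out[j] += quantize(partial, step)  # ADC noise per partial sum
    return out

vec = [0.3, 0.7, 0.2, 0.9]
mat = [[1, 0], [0, 1], [1, 1], [0.5, 0.5]]
print("ideal:", vmm_ideal(vec, mat))
print("split:", vmm_split(vec, mat))
```

Reducing the number of partial sums per output, which is what group Conv and the depthwise-separable hardware mapping achieve, reduces how many times this per-array quantization error is injected.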
Pages: 7
Related Papers (36 total)
  • [31] LayCO: Achieving Least Lossy Accuracy for Most Efficient RRAM-Based Deep Neural Network Accelerator via Layer-Centric Co-Optimization
    Zhao, Shao-Feng
    Wang, Fang
    Liu, Bo
    Feng, Dan
    Liu, Yang
    JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 2023, 38 (02) : 328 - 347
  • [33] Hardware/Software co-design SoC-system for a Neural Network trained by Particle Swarm Optimization
    Hoshino, Yukinobu
    2017 IEEE 10TH INTERNATIONAL WORKSHOP ON COMPUTATIONAL INTELLIGENCE AND APPLICATIONS (IWCIA), 2017, : 1 - 1
  • [34] Algorithm-Hardware Co-Optimization and Deployment Method for Field-Programmable Gate-Array-Based Convolutional Neural Network Remote Sensing Image Processing
    Ni, Shuo
    Wei, Xin
    Zhang, Ning
    Chen, He
    REMOTE SENSING, 2023, 15 (24)
  • [35] A Hardware and Software Co-Design for Energy-Efficient Neural Network Accelerator With Multiplication-Less Folded-Accumulative PE for Radar-Based Hand Gesture Recognition
    Li, Fan
    Guan, Yunqi
    Ye, Wenbin
    IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2024, 32 (10) : 1964 - 1968
  • [36] WPU: A FPGA-based Scalable, Efficient and Software/Hardware Co-design Deep Neural Network Inference Acceleration Processor
    Xie, Xie
    Wu, Chang
    2021 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE BIG DATA AND INTELLIGENT SYSTEMS (HPBD&IS), 2021, : 1 - 5