Software-Hardware Co-Optimization on Partial-Sum Problem for PIM-based Neural Network Accelerator

Cited by: 0
Authors
Wu, Qizhe [1 ]
Tao, Linfeng [1 ]
Liang, Huawen [1 ]
Yuan, Wei [1 ]
Tian, Teng [1 ]
Xue, Shuang [1 ]
Jin, Xi [1 ]
Affiliations
[1] Univ Sci & Technol China, Chinese Acad Sci, Dept Phys, State Key Lab Particle Detect & Elect,Inst Microe, Hefei 230026, Peoples R China
Keywords
processing-in-memory; partial sum; memristor; neural network accelerator;
DOI
10.1109/HPEC49654.2021.9622798
CLC Number: TP3 [Computing technology; computer technology]
Discipline Code: 0812
Abstract
The crossbar architecture, composed of novel memristor devices, enables high-speed, energy-efficient processing-in-memory (PIM) for neural network computing. However, due to limitations of the manufacturing process, it is difficult to fabricate large arrays. Consequently, the neural network's vector-matrix multiplication (VMM) must split its operands across several arrays, compute partial sums, and then add the partial results together. The neural network (NN) training process, which is often influenced by device variations and ADC quantization noise in the PIM system, does not perceive this partial-sum process. As a result, when NN models are inferred directly on the PIM platform without taking the partial sum into account, accuracy suffers significantly, making it difficult to apply PIM computing to large-scale neural networks. Our work makes the following contributions: (i) We studied the partial-sum issue for crossbar architectures computing high-channel convolution (Conv) and drew three lessons from it. (ii) To address this issue, we propose techniques for avoiding or minimizing partial sums at the software and hardware levels, respectively. At the software level, we use group Conv rather than conventional Conv; at the hardware level, we present a new architecture adapted to depthwise separable Conv. Experiments were conducted with the CIFAR-10 dataset and the VGG8 network on an RRAM crossbar architecture. Results show improvements of 15.53% and 14.55% in accuracy, and 0.28x and 0.94x in energy efficiency, at the software and hardware levels, respectively, compared to the conventional PIM scheme.
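The partial-sum problem described in the abstract can be illustrated with a small sketch (not from the paper; array size, quantization step, and all names are illustrative assumptions): a VMM mapped onto crossbars with a limited number of rows must pass each sub-array's partial sum through an ADC before accumulation, so quantization error enters once per partial sum rather than once per output.

```python
# Sketch of the partial-sum effect in a split-crossbar VMM.
# `array_rows` models the fabrication limit on crossbar height;
# `quantize` is a toy uniform ADC model. All values are illustrative.

def quantize(x, step=0.5):
    """Toy ADC: round a partial sum to the nearest quantization step."""
    return round(x / step) * step

def vmm_ideal(vec, mat):
    """Full-precision VMM: one exact dot product per output column."""
    cols = len(mat[0])
    return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
            for j in range(cols)]

def vmm_split(vec, mat, array_rows=2, step=0.5):
    """VMM split across crossbars of `array_rows` rows: each sub-array's
    output is quantized by its ADC, then the partial sums are accumulated,
    so the error grows with the number of partial sums."""
    cols = len(mat[0])
    out = [0.0] * cols
    for start in range(0, len(vec), array_rows):
        for j in range(cols):
            partial = sum(vec[i] * mat[i][j]
                          for i in range(start, min(start + array_rows, len(vec))))
            out[j] += quantize(partial, step)  # ADC noise per partial sum
    return out

vec = [0.3, 0.7, 0.2, 0.9]
mat = [[1, 0], [0, 1], [1, 1], [0.5, 0.5]]
print("ideal:", vmm_ideal(vec, mat))
print("split:", vmm_split(vec, mat))
```

Reducing the number of partial sums per output, which is what group Conv and the depthwise-separable hardware mapping achieve, reduces how many times this per-array quantization error is injected.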
Pages: 7
Related Papers (36 total)
  • [31] LayCO: Achieving Least Lossy Accuracy for Most Efficient RRAM-Based Deep Neural Network Accelerator via Layer-Centric Co-Optimization
    Zhao, Shao-Feng
    Wang, Fang
    Liu, Bo
    Feng, Dan
    Liu, Yang
    JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 2023, 38 (02) : 328 - 347
  • [33] Hardware/Software co-design SoC-system for a Neural Network trained by Particle Swarm Optimization
    Hoshino, Yukinobu
    2017 IEEE 10TH INTERNATIONAL WORKSHOP ON COMPUTATIONAL INTELLIGENCE AND APPLICATIONS (IWCIA), 2017, : 1 - 1
  • [34] Algorithm-Hardware Co-Optimization and Deployment Method for Field-Programmable Gate-Array-Based Convolutional Neural Network Remote Sensing Image Processing
    Ni, Shuo
    Wei, Xin
    Zhang, Ning
    Chen, He
    REMOTE SENSING, 2023, 15 (24)
  • [35] A Hardware and Software Co-Design for Energy-Efficient Neural Network Accelerator With Multiplication-Less Folded-Accumulative PE for Radar-Based Hand Gesture Recognition
    Li, Fan
    Guan, Yunqi
    Ye, Wenbin
    IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2024, 32 (10) : 1964 - 1968
  • [36] WPU: A FPGA-based Scalable, Efficient and Software/Hardware Co-design Deep Neural Network Inference Acceleration Processor
    Xie, Xie
    Wu, Chang
    2021 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE BIG DATA AND INTELLIGENT SYSTEMS (HPBD&IS), 2021, : 1 - 5