SWPU: A 126.04 TFLOPS/W Edge-Device Sparse DNN Training Processor With Dynamic Sub-Structured Weight Pruning

Cited by: 3
Authors
Wang, Yang [1 ,2 ]
Qin, Yubin [1 ,2 ]
Liu, Leibo [1 ,2 ]
Wei, Shaojun [1 ,2 ]
Yin, Shouyi [1 ,2 ]
Affiliations
[1] Tsinghua Univ, Beijing Innovat Ctr Future Chip, Sch Integrated Circuits, Beijing 100084, Peoples R China
[2] Tsinghua Univ, Beijing Natl Res Ctr Informat Sci & Technol, Beijing 100084, Peoples R China
Keywords
Deep neural network (DNN); training processor; software-hardware co-design; sub-structured weight pruning;
DOI
10.1109/TCSI.2022.3184175
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Subject Classification Code
0808; 0809
Abstract
When deploying deep neural networks (DNNs), training on edge devices is a practical way to improve model adaptivity for various user-specific scenarios while avoiding privacy disclosure. However, the training computation is intolerable for edge devices. This has brought sparse DNN training (SDT), which reduces training computation through dynamic weight pruning, into the limelight. SDT generally follows one of two strategies defined by pruning granularity: structured or unstructured. Unfortunately, both suffer from limited training efficiency due to the gap between pruning granularity and hardware implementation. The former is hardware-friendly but has a low pruning ratio, indicating limited computation reduction. The latter has a high pruning ratio, but its unbalanced workload decreases utilization and its irregular sparsity distribution causes considerable sparsity-processing overhead. This paper proposes a software-hardware co-design that bridges this gap to improve the efficiency of SDT. On the algorithm side, a sub-structured pruning method, realized with hybrid shape-wise and line-wise pruning, achieves a high sparsity ratio while remaining hardware-friendly. On the hardware side, a sub-structured weight processing unit (SWPU) efficiently handles the hybrid sparsity with three techniques. First, SWPU dynamically reorders the computation sequence with Hamming-distance-based clustering, balancing the irregular workload. Second, SWPU performs runtime scheduling by exploiting the features of sub-structured sparse convolution through a detect-before-load controller, which skips redundant memory accesses and sparsity processing. Third, SWPU performs sparse convolution by compressing operands with spatial-disconnect log-based routing and recovering their locations with bi-directional switching, avoiding power-consuming routing logic. Synthesized in 28nm CMOS technology, SWPU supports a 0.56V-to-1.0V supply voltage with a maximum frequency of 675 MHz. It achieves a 50.1% higher pruning ratio than structured pruning and 1.53x higher energy efficiency than unstructured pruning. The peak energy efficiency of SWPU is 126.04 TFLOPS/W, outperforming the state-of-the-art training processor by 1.67x. When training a ResNet-18 model, SWPU reduces energy by 3.72x and offers a 4.69x speedup over previous sparse training processors.
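To make the hybrid shape-wise plus line-wise ("sub-structured") pruning idea concrete, below is a minimal illustrative Python/NumPy sketch. The function name, the pruning ratios, and the L1-norm saliency criterion are assumptions chosen for illustration only; they are not taken from the paper's actual pruning rule.

import numpy as np

def sub_structured_prune(weights, shape_ratio=0.5, line_ratio=0.5):
    """Hedged sketch of hybrid shape-wise + line-wise pruning.

    `weights` is a 4-D conv tensor (out_ch, in_ch, kh, kw). Shape-wise
    pruning zeroes whole kernel positions shared across all filters;
    line-wise pruning then zeroes whole input-channel "lines" per
    output channel. Ratios and the L1 criterion are illustrative
    assumptions, not the paper's exact method.
    """
    oc, ic, kh, kw = weights.shape
    pruned = weights.copy()

    # Shape-wise: rank each kernel position by its L1 norm summed over
    # all filters, then zero the weakest positions everywhere.
    pos_score = np.abs(weights).sum(axis=(0, 1)).ravel()      # (kh*kw,)
    n_drop = int(shape_ratio * pos_score.size)
    keep = np.ones(kh * kw, dtype=bool)
    keep[np.argsort(pos_score)[:n_drop]] = False
    pruned *= keep.reshape(1, 1, kh, kw)

    # Line-wise: within each output channel, zero the input-channel
    # lines with the smallest remaining L1 norm.
    line_score = np.abs(pruned).sum(axis=(2, 3))              # (oc, ic)
    n_lines = int(line_ratio * ic)
    for o in range(oc):
        pruned[o, np.argsort(line_score[o])[:n_lines]] = 0.0
    return pruned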
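The following is a hedged sketch of Hamming-distance-based reordering of per-row sparsity masks, in the spirit of the workload-balancing step mentioned in the abstract. The greedy nearest-neighbor chaining and all names are illustrative assumptions, not the processor's actual clustering logic.

import numpy as np

def hamming_reorder(masks):
    """Greedy Hamming-distance ordering of sparsity masks (a sketch).

    `masks` is a (rows, cols) boolean array, one sparsity bitmap per
    filter row. Rows are chained so that each next row is the closest
    (in Hamming distance) to the previous one, grouping similar
    patterns so parallel lanes see similar amounts of work.
    """
    n = masks.shape[0]
    remaining = list(range(1, n))
    order = [0]
    while remaining:
        last = masks[order[-1]]
        # Hamming distance = number of differing mask bits.
        dists = [(np.count_nonzero(masks[r] ^ last), r) for r in remaining]
        _, best = min(dists)
        order.append(best)
        remaining.remove(best)
    return order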
Pages: 4014-4027
Number of Pages: 14
Related Papers (3)
[1] Wang, Yang; Qin, Yubin; Deng, Dazheng; Wei, Jingchuan; Chen, Tianbao; Lin, Xinhan; Liu, Leibo; Wei, Shaojun; Yin, Shouyi. Trainer: An Energy-Efficient Edge-Device Training Processor Supporting Dynamic Weight Pruning. IEEE Journal of Solid-State Circuits, 2022, 57(10): 3164-3178.
[2] Wang, Yang; Qin, Yubin; Liu, Leibo; Wei, Shaojun; Yin, Shouyi. HPPU: An Energy-Efficient Sparse DNN Training Processor with Hybrid Weight Pruning. 2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS), 2021.
[3] Wang, Yang; Deng, Dazheng; Liu, Leibo; Wei, Shaojun; Yin, Shouyi. PL-NPU: An Energy-Efficient Edge-Device DNN Training Processor With Posit-Based Logarithm-Domain Computing. IEEE Transactions on Circuits and Systems I: Regular Papers, 2022, 69(10): 4042-4055.