SWM: A High-Performance Sparse-Winograd Matrix Multiplication CNN Accelerator

Cited by: 23
Authors
Wu, Di [1 ]
Fan, Xitian [2 ]
Cao, Wei [3 ]
Wang, Lingli [3 ]
Affiliations
[1] Fudan Univ, State Key Lab Applicat Specif Integrated Circuit, Shanghai 201203, Peoples R China
[2] Fudan Univ, Sch Comp Sci, Shanghai 201203, Peoples R China
[3] Fudan Univ, Sch Microelect, Shanghai 201203, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Convolution; Sparse matrices; Acceleration; Load modeling; Kernel; Inference algorithms; Very large scale integration; Convolutional neural network (CNN) acceleration; convolution partition; load balance; sparse; Winograd transformation; ARCHITECTURE;
DOI
10.1109/TVLSI.2021.3060041
Chinese Library Classification (CLC) number
TP3 [Computing Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Many convolutional neural network (CNN) accelerators have recently been proposed to exploit network sparsity and thereby reduce both computation and memory traffic. However, most accelerators cannot exploit the sparsity of both activations and weights. Those that do exploit both cannot maintain a stable load balance, because their static scheduling (SS) strategies are vulnerable to the sparsity distribution. In this work, a balanced compressed sparse row format and a dynamic scheduling strategy are proposed to improve the load balance. A set-associative structure is also presented to trade off load balance against hardware resource overhead. We propose SWM to accelerate CNN inference, supporting both sparse convolution and sparse fully connected (FC) layers. SWM provides Winograd adaptability for large convolution kernels and supports both 16-bit and 8-bit quantized CNNs. Owing to activation sharing, 8-bit processing can theoretically achieve twice the throughput of 16-bit processing at the same sparsity. The architecture is evaluated with VGG16 and ResNet50, achieving at most 7.6 TOP/s for sparse-Winograd convolution and 3 TOP/s for sparse matrix multiplication with 16-bit quantization on the Xilinx VCU1525 platform. SWM can process 310/725 images per second for VGG16/ResNet50 with 16-bit quantization. Compared with state-of-the-art works, our design achieves at least 1.53x speedup and 1.8x energy efficiency improvement.
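The central idea summarized in the abstract, performing the Winograd element-wise multiplication only at the non-zero positions of the transformed weights, can be illustrated with a small numerical sketch. The code below is not the paper's implementation: it assumes an F(2x2, 3x3) Winograd tiling, uses a simple magnitude threshold as a stand-in for the offline pruning the accelerator relies on, and the function name winograd_f2x2_3x3_sparse is hypothetical.

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) transform matrices (Lavin & Gray).
B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=np.float32)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float32)

def winograd_f2x2_3x3_sparse(tile4x4, kernel3x3, sparsity_threshold=0.0):
    """One 2x2 output tile of sparse-Winograd convolution (illustrative only).

    The 3x3 kernel is transformed into the 4x4 Winograd domain and pruned
    (here by a magnitude threshold, a stand-in for offline pruning). The
    element-wise product is then evaluated only at the remaining non-zero
    positions, mimicking the zero-skipping a sparse accelerator performs.
    """
    U = G @ kernel3x3 @ G.T                      # transformed weights
    U = np.where(np.abs(U) > sparsity_threshold, U, 0.0)  # pruned weights
    V = B_T @ tile4x4 @ B_T.T                    # transformed input tile
    M = np.zeros((4, 4), dtype=np.float32)
    for r, c in zip(*np.nonzero(U)):             # multiply non-zeros only
        M[r, c] = U[r, c] * V[r, c]
    return A_T @ M @ A_T.T                       # inverse transform: 2x2 output

# Tiny usage check against direct (valid) convolution on one tile.
rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4)).astype(np.float32)
g = rng.standard_normal((3, 3)).astype(np.float32)
ref = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                for i in range(2)])
assert np.allclose(winograd_f2x2_3x3_sparse(d, g), ref, atol=1e-4)
```

Skipping zeros in the transformed-weight domain is what couples pruning with the Winograd arithmetic reduction; the hardware described in the abstract must additionally balance those non-zeros across processing elements, which is where the balanced compressed sparse row format and the dynamic scheduling strategy come in.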
Pages: 936-949
Page count: 14
Related Papers
50 records in total
  • [1] SpWMM: A High-Performance Sparse-Winograd Matrix-Matrix Multiplication Accelerator for CNNs
    Wu, Di
    Cao, Wei
    Wang, Lingli
    2019 INTERNATIONAL CONFERENCE ON FIELD-PROGRAMMABLE TECHNOLOGY (ICFPT 2019), 2019, : 255 - 258
  • [2] A LOW-LATENCY SPARSE-WINOGRAD ACCELERATOR FOR CONVOLUTIONAL NEURAL NETWORKS
    Wang, Haonan
    Liu, Wenjian
    Xu, Tianyi
    Lin, Jun
    Wang, Zhongfeng
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 1448 - 1452
  • [3] A Dynamically Reconfigurable Accelerator Design Using a Sparse-Winograd Decomposition Algorithm for CNNs
    Zhao, Yunping
    Lu, Jianzhuang
    Chen, Xiaowen
    CMC-COMPUTERS MATERIALS & CONTINUA, 2021, 66 (01): 517 - 535
  • [4] High-Performance CNN Accelerator on FPGA Using Unified Winograd-GEMM Architecture
    Kala, S.
    Jose, Babita R.
    Mathew, Jimson
    Nalesh, S.
    IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 2019, 27 (12) : 2816 - 2828
  • [5] A High-Performance Accelerator for Floating-Point Matrix Multiplication
    Jia, Xun
    Wu, Guiming
    Xie, Xianghui
    2017 15TH IEEE INTERNATIONAL SYMPOSIUM ON PARALLEL AND DISTRIBUTED PROCESSING WITH APPLICATIONS AND 2017 16TH IEEE INTERNATIONAL CONFERENCE ON UBIQUITOUS COMPUTING AND COMMUNICATIONS (ISPA/IUCC 2017), 2017, : 396 - 402
  • [6] Winols: A Large-Tiling Sparse Winograd CNN Accelerator on FPGAs
    Xie, Kunpeng
    Lu, Ye
    He, Xinyu
    Yi, Dezhi
    Dong, Huijuan
    Chen, Yao
    ACM TRANSACTIONS ON ARCHITECTURE AND CODE OPTIMIZATION, 2024, 21 (02)
  • [7] High-Performance Winograd Based Accelerator Architecture for Convolutional Neural Network
    Vardhana, M.
    Pinto, Rohan
    IEEE COMPUTER ARCHITECTURE LETTERS, 2025, 24 (01) : 21 - 24
  • [8] Sparkle: A High Efficient Sparse Matrix Multiplication Accelerator for Deep Learning
    Xu, Shiyao
    Jiang, Jingfei
    Xu, Jinwei
    Liu, Chaorun
    He, Yuanhong
    Liu, Xiaohang
    Gao, Lei
    2022 IEEE 40TH INTERNATIONAL CONFERENCE ON COMPUTER DESIGN (ICCD 2022), 2022, : 479 - 486
  • [9] Anatomy of high-performance matrix multiplication
    Goto, Kazushige
    Van De Geijn, Robert A.
    ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2008, 34 (03):
  • [10] SparseX: A Library for High-Performance Sparse Matrix-Vector Multiplication on Multicore Platforms
    Elafrou, Athena
    Karakasis, Vasileios
    Gkountouvas, Theodoros
    Kourtis, Kornilios
    Goumas, Georgios
    Koziris, Nectarios
    ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2018, 44 (03):