An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers

Cited by: 19
Authors
Fang, Chao [1 ]
Zhou, Aojun [2 ]
Wang, Zhongfeng [1 ]
Affiliations
[1] Nanjing Univ, Sch Elect Sci & Engn, Nanjing 210008, Peoples R China
[2] Chinese Univ Hong Kong CUHK, CUHK Sensetime Joint Lab, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Algorithm-hardware codesign; hardware accelerator; model compression; pruning; Transformer; CNN ACCELERATOR; EFFICIENT
DOI
10.1109/TVLSI.2022.3197282
CLC Number
TP3 [Computing Technology, Computer Technology]
Subject Classification Code
0812
Abstract
The Transformer has become an indispensable staple in deep learning. However, deploying efficient Transformers in real-life applications is very challenging due to the immense number of parameters and operations in these models. To relieve this burden, exploiting sparsity is an effective approach to accelerating Transformers. Newly emerging Ampere graphics processing units (GPUs) leverage a 2:4 sparsity pattern to achieve model acceleration, but this fixed pattern can hardly meet the diverse algorithm and hardware constraints encountered when deploying models. By contrast, we propose an algorithm-hardware co-optimized framework that flexibly and efficiently accelerates Transformers by utilizing general N:M sparsity patterns. First, from an algorithm perspective, we propose a sparsity inheritance mechanism along with inherited dynamic pruning (IDP) to rapidly obtain a series of N:M sparse candidate Transformers. A model compression scheme is further proposed to significantly reduce the storage requirement for deployment. Second, from a hardware perspective, we present a flexible and efficient hardware architecture, namely STA, to achieve significant speedup when deploying N:M sparse Transformers. STA features not only a computing engine that unifies both sparse-dense and dense-dense matrix multiplications with high computational efficiency but also a scalable softmax module that eliminates the latency of intermediate off-chip data communication. Experimental results show that, compared with other methods, N:M sparse Transformers generated using IDP achieve an average accuracy improvement of 6.7% with high training efficiency. Moreover, STA achieves 14.47x and 11.33x speedups over an Intel i9-9900X CPU and an NVIDIA RTX 2080 Ti GPU, respectively, and performs inference 2.00x-19.47x faster than state-of-the-art field-programmable gate array (FPGA)-based Transformer accelerators.
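For readers unfamiliar with the N:M sparsity pattern referenced in the abstract, below is a minimal NumPy sketch of plain magnitude-based N:M pruning: in every group of M consecutive weights, the N largest-magnitude entries are kept and the rest are zeroed (the 2:4 case is the pattern accelerated by Ampere GPUs). This illustrates only the sparsity pattern itself, not the paper's IDP algorithm; the function name and grouping convention are illustrative assumptions.

import numpy as np

def nm_prune(weights, n=2, m=4):
    """Keep the n largest-magnitude weights in each group of m consecutive
    weights (grouped along the last axis) and zero out the rest."""
    assert weights.size % m == 0, "weight count must be divisible by m"
    flat = weights.reshape(-1, m)
    # Indices of the (m - n) smallest-magnitude weights in each group.
    drop = np.argsort(np.abs(flat), axis=1)[:, : m - n]
    pruned = flat.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(weights.shape)

# Example: prune a 4x8 weight matrix to the 2:4 pattern (50% zeros per group).
w = np.random.randn(4, 8).astype(np.float32)
w_sparse = nm_prune(w, n=2, m=4)
assert (np.count_nonzero(w_sparse.reshape(-1, 4), axis=1) <= 2).all()

Because every group of M weights contains exactly the same number of nonzeros, the workload per compute lane stays balanced, which is what makes N:M patterns hardware-friendly compared with unstructured sparsity.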
Pages: 1573-1586 (14 pages)
Related Papers (50 in total)
  • [1] You, Weijie; Wu, Chang. RSNN: A Software/Hardware Co-Optimized Framework for Sparse Convolutional Neural Networks on FPGAs. IEEE Access, 2021, 9: 949-960.
  • [2] Sadi, Fazle; Pileggi, Larry; Franchetti, Franz. Algorithm and Hardware Co-Optimized Solution for Large SpMV Problems. 2017 IEEE High Performance Extreme Computing Conference (HPEC), 2017.
  • [3] Park, Sunyoung; Byun, Wooseok; Je, Minkyu; Kim, Ji-Hoon. Algorithm-Hardware Co-Design for Wearable BCIs: An Evolution from Linear Algebra to Transformers. 2024 IEEE International Symposium on Circuits and Systems (ISCAS 2024), 2024.
  • [4] Hameed, Fazal; Khan, Asif Ali; Castrillon, Jeronimo. ALPHA: A Novel Algorithm-Hardware Co-Design for Accelerating DNA Seed Location Filtering. IEEE Transactions on Emerging Topics in Computing, 2022, 10(3): 1464-1475.
  • [5] Shang, Jiangwei; Zhang, Zhan; Zhang, Kun; Li, Chuanyou; Qian, Lei; Liu, Hongwei. An Algorithm/Hardware Co-Optimized Method to Accelerate CNNs with Compressed Convolutional Weights on FPGA. Concurrency and Computation: Practice and Experience, 2024, 36(11).
  • [6] Huang, Qijing; Wang, Dequan; Gao, Yizhao; Cai, Yaohui; Dong, Zhen; Wu, Bichen; Keutzer, Kurt; Wawrzynek, John. Algorithm-Hardware Co-Design for Deformable Convolution. Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS Edition (EMC2-NIPS 2019), 2019: 48-51.
  • [7] Sadi, Fazle; Akin, Berkin; Popovici, Doru T.; Hoe, James C.; Pileggi, Larry; Franchetti, Franz. Algorithm/Hardware Co-Optimized SAR Image Reconstruction with 3D-Stacked Logic in Memory. 2014 IEEE High Performance Extreme Computing Conference (HPEC), 2014.
  • [8] Peng, Hongwu; Huang, Shaoyi; Chen, Shiyang; Li, Bingbing; Geng, Tong; Li, Ang; Jiang, Weiwen; Wen, Wujie; Bi, Jinbo; Liu, Hang; Ding, Caiwen. A Length Adaptive Algorithm-Hardware Co-Design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining. Proceedings of the 59th ACM/IEEE Design Automation Conference (DAC 2022), 2022: 1135-1140.
  • [9] Yu, Yue; Mao, Wendong; Luo, Jiapeng; Wang, Zhongfeng. A Low-Latency Framework With Algorithm-Hardware Co-Optimization for 3-D Point Cloud. IEEE Transactions on Circuits and Systems II: Express Briefs, 2023, 70(11): 4221-4225.
  • [10] Zhang, Xinyi; Wu, Yawen; Zhou, Peipei; Tang, Xulong; Hu, Jingtong. Algorithm-Hardware Co-Design of Attention Mechanism on FPGA Devices. ACM Transactions on Embedded Computing Systems, 2021, 20(5).