An Algorithm-Hardware Co-Optimized Framework for Accelerating N:M Sparse Transformers

Cited by: 19
Authors
Fang, Chao [1 ]
Zhou, Aojun [2 ]
Wang, Zhongfeng [1 ]
Affiliations
[1] Nanjing Univ, Sch Elect Sci & Engn, Nanjing 210008, Peoples R China
[2] Chinese Univ Hong Kong CUHK, CUHK Sensetime Joint Lab, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Algorithm-hardware codesign; hardware accelerator; model compression; pruning; Transformer; CNN ACCELERATOR; EFFICIENT
DOI
10.1109/TVLSI.2022.3197282
CLC Number
TP3 [Computing Technology, Computer Technology]
Subject Classification Code
0812
Abstract
The Transformer has become an indispensable staple in deep learning. However, deploying efficient Transformers in real-life applications is very challenging due to the immense number of parameters and operations in these models. To relieve this burden, exploiting sparsity is an effective approach to accelerating Transformers. Newly emerging Ampere graphics processing units (GPUs) leverage a 2:4 sparsity pattern to achieve model acceleration, but this fixed pattern can hardly meet the diverse algorithm and hardware constraints encountered when deploying models. By contrast, we propose an algorithm-hardware co-optimized framework that flexibly and efficiently accelerates Transformers by utilizing general N:M sparsity patterns. First, from an algorithm perspective, we propose a sparsity inheritance mechanism along with inherited dynamic pruning (IDP) to rapidly obtain a series of N:M sparse candidate Transformers. A model compression scheme is further proposed to significantly reduce the storage requirement for deployment. Second, from a hardware perspective, we present a flexible and efficient hardware architecture, namely STA, to achieve significant speedup when deploying N:M sparse Transformers. STA features not only a computing engine that unifies both sparse-dense and dense-dense matrix multiplications with high computational efficiency but also a scalable softmax module that eliminates the latency of intermediate off-chip data communication. Experimental results show that, compared with other methods, N:M sparse Transformers generated using IDP achieve an average accuracy improvement of 6.7% with high training efficiency. Moreover, STA achieves 14.47x and 11.33x speedups over an Intel i9-9900X CPU and an NVIDIA RTX 2080 Ti GPU, respectively, and performs inference 2.00x-19.47x faster than state-of-the-art field-programmable gate array (FPGA)-based Transformer accelerators.
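For readers unfamiliar with the N:M sparsity pattern referenced in the abstract, below is a minimal NumPy sketch of plain magnitude-based N:M pruning: in every group of M consecutive weights, the N largest-magnitude entries are kept and the rest are zeroed (the 2:4 case is the pattern accelerated by Ampere GPUs). This illustrates only the sparsity pattern itself, not the paper's IDP algorithm; the function name and grouping convention are illustrative assumptions.

import numpy as np

def nm_prune(weights, n=2, m=4):
    """Keep the n largest-magnitude weights in each group of m consecutive
    weights (grouped along the last axis) and zero out the rest."""
    assert weights.size % m == 0, "weight count must be divisible by m"
    flat = weights.reshape(-1, m)
    # Indices of the (m - n) smallest-magnitude weights in each group.
    drop = np.argsort(np.abs(flat), axis=1)[:, : m - n]
    pruned = flat.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(weights.shape)

# Example: prune a 4x8 weight matrix to the 2:4 pattern (50% zeros per group).
w = np.random.randn(4, 8).astype(np.float32)
w_sparse = nm_prune(w, n=2, m=4)
assert (np.count_nonzero(w_sparse.reshape(-1, 4), axis=1) <= 2).all()

Because every group of M weights contains exactly the same number of nonzeros, the workload per compute lane stays balanced, which is what makes N:M patterns hardware-friendly compared with unstructured sparsity.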
Pages: 1573-1586 (14 pages)
Related Papers (50 in total)
  • [1] You, Weijie; Wu, Chang. RSNN: A Software/Hardware Co-Optimized Framework for Sparse Convolutional Neural Networks on FPGAs. IEEE Access, 2021, 9: 949-960.
  • [2] Sadi, Fazle; Pileggi, Larry; Franchetti, Franz. Algorithm and Hardware Co-Optimized Solution for Large SpMV Problems. 2017 IEEE High Performance Extreme Computing Conference (HPEC), 2017.
  • [3] Park, Sunyoung; Byun, Wooseok; Je, Minkyu; Kim, Ji-Hoon. Algorithm-Hardware Co-Design for Wearable BCIs: An Evolution from Linear Algebra to Transformers. 2024 IEEE International Symposium on Circuits and Systems (ISCAS 2024), 2024.
  • [4] Hameed, Fazal; Khan, Asif Ali; Castrillon, Jeronimo. ALPHA: A Novel Algorithm-Hardware Co-Design for Accelerating DNA Seed Location Filtering. IEEE Transactions on Emerging Topics in Computing, 2022, 10(3): 1464-1475.
  • [5] Shang, Jiangwei; Zhang, Zhan; Zhang, Kun; Li, Chuanyou; Qian, Lei; Liu, Hongwei. An Algorithm/Hardware Co-Optimized Method to Accelerate CNNs with Compressed Convolutional Weights on FPGA. Concurrency and Computation: Practice and Experience, 2024, 36(11).
  • [6] Huang, Qijing; Wang, Dequan; Gao, Yizhao; Cai, Yaohui; Dong, Zhen; Wu, Bichen; Keutzer, Kurt; Wawrzynek, John. Algorithm-Hardware Co-Design for Deformable Convolution. Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS Edition (EMC2-NIPS 2019), 2019: 48-51.
  • [7] Sadi, Fazle; Akin, Berkin; Popovici, Doru T.; Hoe, James C.; Pileggi, Larry; Franchetti, Franz. Algorithm/Hardware Co-Optimized SAR Image Reconstruction with 3D-Stacked Logic in Memory. 2014 IEEE High Performance Extreme Computing Conference (HPEC), 2014.
  • [8] Peng, Hongwu; Huang, Shaoyi; Chen, Shiyang; Li, Bingbing; Geng, Tong; Li, Ang; Jiang, Weiwen; Wen, Wujie; Bi, Jinbo; Liu, Hang; Ding, Caiwen. A Length Adaptive Algorithm-Hardware Co-Design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining. Proceedings of the 59th ACM/IEEE Design Automation Conference (DAC 2022), 2022: 1135-1140.
  • [9] Yu, Yue; Mao, Wendong; Luo, Jiapeng; Wang, Zhongfeng. A Low-Latency Framework With Algorithm-Hardware Co-Optimization for 3-D Point Cloud. IEEE Transactions on Circuits and Systems II: Express Briefs, 2023, 70(11): 4221-4225.
  • [10] Zhang, Xinyi; Wu, Yawen; Zhou, Peipei; Tang, Xulong; Hu, Jingtong. Algorithm-Hardware Co-Design of Attention Mechanism on FPGA Devices. ACM Transactions on Embedded Computing Systems, 2021, 20(5).