A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining

Cited by: 16
Authors
Peng, Hongwu [1 ]
Huang, Shaoyi [1 ]
Chen, Shiyang [2 ]
Li, Bingbing [1 ]
Geng, Tong [3 ]
Li, Ang [3 ]
Jiang, Weiwen [4 ]
Wen, Wujie [5 ]
Bi, Jinbo [1 ]
Liu, Hang [2 ]
Ding, Caiwen [1 ]
Affiliations
[1] Univ Connecticut, Storrs, CT 06269 USA
[2] Stevens Inst Technol, Hoboken, NJ USA
[3] Pacific Northwest Natl Lab, Richland, WA USA
[4] George Mason Univ, Fairfax, VA USA
[5] Lehigh Univ, Bethlehem, PA USA
Keywords
Transformer; Attention; BERT; Length adaptive; FPGA;
DOI
10.1145/3489517.3530585
CLC number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Transformers have been considered among the most important deep learning models since 2018, in part because they establish state-of-the-art (SOTA) records and could potentially replace existing Deep Neural Networks (DNNs). Despite these remarkable triumphs, the prolonged turnaround time of Transformer models is a widely recognized roadblock. The variety of sequence lengths imposes additional computing overhead, since inputs must be zero-padded to the maximum sentence length in the batch to accommodate parallel computing platforms. This paper targets the field-programmable gate array (FPGA) and proposes a coherent sequence-length-adaptive algorithm-hardware co-design for Transformer acceleration. In particular, we develop a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm. The proposed sparse attention operator brings attention-based models down to linear complexity and alleviates off-chip memory traffic. The proposed length-aware hardware resource scheduling algorithm dynamically allocates hardware resources to fill the pipeline slots and eliminates bubbles for NLP tasks. Experiments show that our design incurs very small accuracy loss and achieves 80.2x and 2.6x speedups over CPU and GPU implementations, respectively, as well as 4x higher energy efficiency than a state-of-the-art GPU accelerator optimized via cuBLAS GEMM.
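To illustrate how a sparse attention operator can reduce the quadratic cost of dense attention to linear complexity in sequence length, the sketch below implements a simple local-window variant: each query attends only to keys within a fixed window of positions. This is a generic, hedged example of the technique class named in the abstract, not the authors' actual hardware-friendly operator; the function name, window size, and plain-Python data layout are assumptions for illustration.

```python
import math

def windowed_sparse_attention(Q, K, V, window=2):
    """Illustrative local-window sparse attention.

    Each query i attends only to keys j with |i - j| <= window, so the
    cost is O(n * window * d) -- linear in sequence length n -- instead
    of the O(n^2 * d) of dense attention.

    Q, K, V: lists of n d-dimensional vectors (lists of floats).
    Returns a list of n d-dimensional output vectors.
    """
    n, d = len(Q), len(Q[0])
    out = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        # Scaled dot-product scores against only the local window of keys.
        scores = [sum(qe * ke for qe, ke in zip(Q[i], K[j])) / math.sqrt(d)
                  for j in range(lo, hi)]
        # Numerically stable softmax over the window.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of the windowed value vectors.
        out.append([sum(w * V[j][k] for w, j in zip(weights, range(lo, hi)))
                    for k in range(d)])
    return out
```

Because each row of the output is a convex combination of at most `2 * window + 1` value vectors, both compute and on-chip buffering per query stay constant as the sequence grows, which is what makes such operators attractive for a fixed-resource FPGA pipeline.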
Pages: 1135-1140
Page count: 6