A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining

Cited by: 16
Authors
Peng, Hongwu [1 ]
Huang, Shaoyi [1 ]
Chen, Shiyang [2 ]
Li, Bingbing [1 ]
Geng, Tong [3 ]
Li, Ang [3 ]
Jiang, Weiwen [4 ]
Wen, Wujie [5 ]
Bi, Jinbo [1 ]
Liu, Hang [2 ]
Ding, Caiwen [1 ]
Affiliations
[1] Univ Connecticut, Storrs, CT 06269 USA
[2] Stevens Inst Technol, Hoboken, NJ USA
[3] Pacific Northwest Natl Lab, Richland, WA USA
[4] George Mason Univ, Fairfax, VA USA
[5] Lehigh Univ, Bethlehem, PA USA
Keywords
Transformer; Attention; BERT; Length adaptive; FPGA;
DOI
10.1145/3489517.3530585
CLC number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Transformers have been considered among the most important deep learning models since 2018, in part because they establish state-of-the-art (SOTA) records and could potentially replace existing Deep Neural Networks (DNNs). Despite these remarkable triumphs, the prolonged turnaround time of Transformer models is a widely recognized roadblock. The variety of sequence lengths imposes additional computing overhead, since inputs must be zero-padded to the maximum sentence length in the batch to accommodate parallel computing platforms. This paper targets the field-programmable gate array (FPGA) and proposes a coherent sequence-length-adaptive algorithm-hardware co-design for Transformer acceleration. In particular, we develop a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm. The proposed sparse attention operator brings attention-based models down to linear complexity and alleviates off-chip memory traffic. The proposed length-aware hardware resource scheduling algorithm dynamically allocates hardware resources to fill the pipeline slots and eliminates bubbles for NLP tasks. Experiments show that our design incurs very small accuracy loss and achieves 80.2x and 2.6x speedups over CPU and GPU implementations, respectively, as well as 4x higher energy efficiency than a state-of-the-art GPU accelerator optimized via cuBLAS GEMM.
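To illustrate how a sparse attention operator can reduce the quadratic cost of dense attention to linear complexity in sequence length, the sketch below implements a simple local-window variant: each query attends only to keys within a fixed window of positions. This is a generic, hedged example of the technique class named in the abstract, not the authors' actual hardware-friendly operator; the function name, window size, and plain-Python data layout are assumptions for illustration.

```python
import math

def windowed_sparse_attention(Q, K, V, window=2):
    """Illustrative local-window sparse attention.

    Each query i attends only to keys j with |i - j| <= window, so the
    cost is O(n * window * d) -- linear in sequence length n -- instead
    of the O(n^2 * d) of dense attention.

    Q, K, V: lists of n d-dimensional vectors (lists of floats).
    Returns a list of n d-dimensional output vectors.
    """
    n, d = len(Q), len(Q[0])
    out = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        # Scaled dot-product scores against only the local window of keys.
        scores = [sum(qe * ke for qe, ke in zip(Q[i], K[j])) / math.sqrt(d)
                  for j in range(lo, hi)]
        # Numerically stable softmax over the window.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of the windowed value vectors.
        out.append([sum(w * V[j][k] for w, j in zip(weights, range(lo, hi)))
                    for k in range(d)])
    return out
```

Because each row of the output is a convex combination of at most `2 * window + 1` value vectors, both compute and on-chip buffering per query stay constant as the sequence grows, which is what makes such operators attractive for a fixed-resource FPGA pipeline.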
Pages: 1135-1140
Page count: 6