CTA: Hardware-Software Co-design for Compressed Token Attention Mechanism

Cited by: 2

Authors:

Wang, Haoran [1,2]

Xu, Haobo [1]

Wang, Ying [1,2,3]

Han, Yinhe [1,2,3]

Affiliations:

[1] Chinese Academy of Sciences, CICS, Institute of Computing Technology, Beijing, People's Republic of China

[2] University of Chinese Academy of Sciences, Beijing, People's Republic of China

[3] Zhejiang Lab, Hangzhou, People's Republic of China

Source:

2023 IEEE INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE COMPUTER ARCHITECTURE, HPCA | 2023

Funding:

National Natural Science Foundation of China

Keywords:

PRODUCT QUANTIZATION; EFFICIENT;

DOI:

10.1109/HPCA56546.2023.10070997

Chinese Library Classification (CLC) number:

TP3 [Computing Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

The attention mechanism is becoming an integral part of modern neural networks, bringing breakthroughs to Natural Language Processing (NLP) applications and even Computer Vision (CV) applications. Unfortunately, the superiority of the attention mechanism comes from its ability to model relations between any two positions in a long sequence, which incurs high inference overhead. For state-of-the-art AI workloads such as BERT or GPT-2, the attention mechanism is reported to account for up to 50% of the inference overhead. Previous works seek to alleviate this performance bottleneck by removing useless relations for each position and accelerating position-specific operations. However, these approaches must select from a sequence of relations once for each position, which amounts to frequent on-the-fly pruning and breaks the inherent parallelism of the attention mechanism. In this paper, we propose CTA, an algorithm-architecture co-designed solution that can substantially reduce the theoretical complexity of the attention mechanism, enabling significant speedup and energy savings. Inspired by the fact that the feature sequence encoded by the attention mechanism contains a large amount of semantic feature repetition, we propose a novel approximation scheme that efficiently removes that repetition and calculates attention only among the necessary features, thus reducing computation complexity quadratically. To exploit this algorithmic bonus and enable high-performance attention inference, we devise a specialized architecture that efficiently supports the proposed approximation scheme. Extensive experiments show that, on average, CTA achieves 27.7x speedup and 634.0x energy savings with no accuracy loss, and 44.2x speedup and 950.0x energy savings with around 1% accuracy loss, over an Nvidia V100-SXM2 GPU. CTA also achieves 22.8x speedup and 479.6x energy savings over an ELSA accelerator+GPU system.


Pages: 429-441

Page count: 13
