CTA: Hardware-Software Co-design for Compressed Token Attention Mechanism

Cited by: 2
Authors
Wang, Haoran [1, 2]
Xu, Haobo [1]
Wang, Ying [1, 2, 3]
Han, Yinhe [1, 2, 3]
Affiliations
[1] Chinese Acad Sci, CICS, Inst Comp Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Zhejiang Lab, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
PRODUCT QUANTIZATION; EFFICIENT;
DOI
10.1109/HPCA56546.2023.10070997
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
The attention mechanism is becoming an integral part of modern neural networks, bringing breakthroughs to Natural Language Processing (NLP) applications and even Computer Vision (CV) applications. Unfortunately, the strength of the attention mechanism comes from its ability to model relations between any two positions in a long sequence, which incurs high inference overhead. For state-of-the-art AI workloads such as BERT or GPT-2, the attention mechanism is reported to account for up to 50% of the inference overhead. Previous works seek to alleviate this performance bottleneck by removing useless relations for each position and accelerating position-specific operations. However, these approaches require selecting from a sequence of relations once for each position, which amounts to frequent on-the-fly pruning and breaks the inherent parallelism of the attention mechanism. In this paper, we propose CTA, an algorithm-architecture co-designed solution that can substantially reduce the theoretical complexity of the attention mechanism, enabling significant speedup and energy savings. Inspired by the fact that the feature sequence encoded by the attention mechanism contains a large amount of semantic feature repetition, we propose a novel approximation scheme that efficiently removes that repetition and calculates attention only among the necessary features, thus reducing computation complexity quadratically. To exploit this algorithmic bonus and enable high-performance attention inference, we devise a specialized architecture that efficiently supports the proposed approximation scheme. Extensive experiments show that, on average, CTA achieves 27.7x speedup and 634.0x energy savings with no accuracy loss, and 44.2x speedup and 950.0x energy savings with around 1% accuracy loss, over an Nvidia V100-SXM2 GPU. CTA also achieves 22.8x speedup and 479.6x energy savings over an ELSA accelerator+GPU system.
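The quadratic saving described in the abstract can be illustrated with a small sketch. This is not the CTA approximation scheme, whose details the abstract does not give. It covers only the idealized case where tokens repeat exactly, so that self-attention computed among the distinct tokens (with each key weighted by its repeat count) reproduces full attention. The function names `full_attention` and `dedup_attention` are hypothetical, and Q = K = V is assumed for simplicity.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def full_attention(x):
    """Reference softmax self-attention (Q = K = V = x): n * n scores."""
    out = []
    for qi in x:
        scores = [dot(qi, kj) for kj in x]
        m = max(scores)                       # subtract max for stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * vj[d] for wj, vj in zip(w, x)) / z
                    for d in range(len(qi))])
    return out

def dedup_attention(x):
    """Same result for exactly repeated tokens, using u * u scores,
    where u is the number of distinct tokens (illustrative only)."""
    index, uniq, counts, slot_of = {}, [], [], []
    for tok in x:
        key = tuple(tok)
        if key not in index:
            index[key] = len(uniq)
            uniq.append(list(tok))
            counts.append(0)
        counts[index[key]] += 1
        slot_of.append(index[key])
    compact = []
    for qi in uniq:
        scores = [dot(qi, kj) for kj in uniq]
        m = max(scores)
        # A key that occurs c times contributes c copies of its softmax term.
        w = [c * math.exp(s - m) for c, s in zip(counts, scores)]
        z = sum(w)
        compact.append([sum(wj * vj[d] for wj, vj in zip(w, uniq)) / z
                        for d in range(len(qi))])
    # Scatter each distinct token's output back to every position it occupies.
    return [compact[s] for s in slot_of], len(uniq)

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
reference = full_attention(x)
compressed, n_unique = dedup_attention(x)
```

Here six tokens contain three distinct ones, so the score computation drops from 36 to 9 dot products. A practical scheme would have to handle near-duplicates, which is where approximation and the accuracy trade-off reported in the abstract come in.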
Pages: 429-441
Number of pages: 13