HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers

Cited by: 25
Authors
Dong, Peiyan [1 ]
Sun, Mengshu [1 ]
Lu, Alec [2 ]
Xie, Yanyue [1 ]
Liu, Kenneth [2 ]
Kong, Zhenglun [1 ]
Meng, Xin [1 ]
Li, Zhengang [1 ]
Lin, Xue [1 ]
Fang, Zhenman [2 ]
Wang, Yanzhi [1 ]
Affiliations
[1] Northeastern Univ, Boston, MA 02115 USA
[2] Simon Fraser Univ, Burnaby, BC, Canada
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
Vision Transformer; FPGA Accelerator; Hardware and Software Co-design; Data-level Sparsity;
DOI
10.1109/HPCA56546.2023.10071047
CLC Number
TP3 [Computing technology, computer technology];
Subject Classification Code
0812;
Abstract
While vision transformers (ViTs) have continuously achieved new milestones in computer vision, their sophisticated network architectures with high computation and memory costs have impeded their deployment on resource-limited edge devices. In this paper, we propose HeatViT, a hardware-efficient image-adaptive token pruning framework for efficient yet accurate ViT acceleration on embedded FPGAs. Based on the inherent computational patterns in ViTs, we first adopt an effective, hardware-efficient, and learnable head-evaluation token selector, which can be progressively inserted before transformer blocks to dynamically identify and consolidate the non-informative tokens from input images. Moreover, we implement the token selector on hardware by adding miniature control logic that heavily reuses the existing hardware components built for the backbone ViT. To further improve hardware efficiency, we employ 8-bit fixed-point quantization and propose polynomial approximations, with a regularization effect on quantization error, for the frequently used nonlinear functions in ViTs. Compared to existing ViT pruning studies, at similar computation cost HeatViT achieves 0.7%-8.9% higher accuracy, while at similar model accuracy it achieves more than 28.4%-65.3% computation reduction, for various widely used ViTs, including DeiT-T, DeiT-S, DeiT-B, LV-ViT-S, and LV-ViT-M, on the ImageNet dataset. Compared to the baseline hardware accelerator, our implementations of HeatViT on the Xilinx ZCU102 FPGA achieve a 3.46x-4.89x speedup with a trivial resource-utilization overhead of 8%-11% more DSPs and 5%-8% more LUTs.
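The core idea the abstract describes — identify non-informative tokens and consolidate them rather than discard them — can be illustrated with a minimal NumPy sketch. This is not the paper's learnable head-evaluation selector: here the per-token importance scores are taken as given, and the pruned tokens are merged into a single summary token by a score-weighted average. The function name, the `keep_ratio` parameter, and the weighting scheme are illustrative assumptions.

```python
import numpy as np

def prune_and_consolidate(tokens, scores, keep_ratio=0.7):
    """Keep the highest-scoring tokens; merge the rest into one token.

    tokens: (N, D) array of token embeddings (class token excluded).
    scores: (N,) importance scores, e.g. from a small selector module.
    """
    n_keep = max(1, int(round(len(tokens) * keep_ratio)))
    order = np.argsort(scores)[::-1]       # indices by descending importance
    keep_idx = np.sort(order[:n_keep])     # preserve original token order
    drop_idx = np.sort(order[n_keep:])

    kept = tokens[keep_idx]
    if len(drop_idx) > 0:
        # Score-weighted average of the pruned tokens -> one summary token,
        # so their information is consolidated rather than discarded.
        w = scores[drop_idx]
        w = w / (w.sum() + 1e-8)
        package = (w[:, None] * tokens[drop_idx]).sum(axis=0, keepdims=True)
        kept = np.concatenate([kept, package], axis=0)
    return kept
```

Inserting such a step before a transformer block shrinks the token sequence (and thus the quadratic attention cost) for that block and all later ones, which is why the abstract reports computation reductions that grow with how early and aggressively tokens are pruned.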
Pages: 442-455
Page count: 14
Related Papers
50 records in total
  • [1] Adaptive Token Sampling for Efficient Vision Transformers
    Fayyaz, Mohsen
    Koohpayegani, Soroush Abbasi
    Jafari, Farnoush Rezaei
    Sengupta, Sunando
    Joze, Hamid Reza Vaezi
    Sommerlade, Eric
    Pirsiavash, Hamed
    Gall, Juergen
    COMPUTER VISION, ECCV 2022, PT XI, 2022, 13671 : 396 - 414
  • [2] Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation
    Tang, Quan
    Zhang, Bowen
    Liu, Jiajun
    Liu, Fagui
    Liu, Yifan
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 777 - 786
  • [3] An Attention-Based Token Pruning Method for Vision Transformers
    Luo, Kaicheng
    Li, Huaxiong
    Zhou, Xianzhong
    Huang, Bing
    ROUGH SETS, IJCRS 2022, 2022, 13633 : 274 - 288
  • [4] Learned Token Pruning for Transformers
    Kim, Sehoon
    Shen, Sheng
    Thorsley, David
    Gholami, Amir
    Kwon, Woosuk
    Hassoun, Joseph
    Keutzer, Kurt
    PROCEEDINGS OF THE 28TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2022, 2022, : 784 - 794
  • [5] DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification
    Rao, Yongming
    Zhao, Wenliang
    Liu, Benlin
    Lu, Jiwen
    Zhou, Jie
    Hsieh, Cho-Jui
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [6] Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers
    Wei, Siyuan
    Ye, Tianzhu
    Zhang, Shen
    Tang, Yao
    Liang, Jiajun
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 2092 - 2101
  • [7] Making Vision Transformers Efficient from A Token Sparsification View
    Chang, Shuning
    Wang, Pichao
    Lin, Ming
    Wang, Fan
    Zhang, David Junhao
    Jin, Rong
    Shou, Mike Zheng
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6195 - 6205
  • [8] ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers
    Norouzi, Narges
    Orlova, Svetlana
    de Geus, Daan
    Dubbelman, Gijs
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 15773 - 15782
  • [9] Make a Long Image Short: Adaptive Token Length for Vision Transformers
    Zhou, Qiqi
    Zhu, Yichen
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES: RESEARCH TRACK, ECML PKDD 2023, PT II, 2023, 14170 : 69 - 85
  • [10] Hardware-efficient color correlation-adaptive demosaicing with multifiltering
    Lee, Seung Hyun
    Choi, Dong Yoon
    Song, Byung Cheol
    JOURNAL OF ELECTRONIC IMAGING, 2019, 28 (01)