A Survey on Efficient Vision Transformers: Algorithms, Techniques, and Performance Benchmarking

Cited by: 5

Authors
Papa, Lorenzo [1 ,2 ]
Russo, Paolo [1 ]
Amerini, Irene [1 ]
Zhou, Luping [2 ]
Affiliations
[1] Sapienza Univ Rome, Dept Comp Control & Management Engn, I-00185 Rome, Italy
[2] Univ Sydney, Sch Elect & Informat Engn, Fac Engn, Sydney, NSW 2006, Australia
Keywords
Computer vision; computational efficiency; vision transformer;
DOI
10.1109/TPAMI.2024.3392941
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Vision Transformer (ViT) architectures are becoming increasingly popular and are widely employed in computer vision applications. Their main feature is the ability to extract global information through the self-attention mechanism, outperforming earlier convolutional neural networks. However, ViT deployment costs have grown steadily alongside model size, number of trainable parameters, and operations. Furthermore, the computational and memory cost of self-attention increases quadratically with image resolution. Generally speaking, these architectures are challenging to employ in real-world applications due to hardware and environmental restrictions, such as limited processing and computational capabilities. This survey therefore investigates the most efficient methodologies, which trade only slightly sub-optimal estimation performance for deployability. In more detail, four categories of efficiency techniques are analyzed: compact architectures, pruning, knowledge distillation, and quantization strategies. Moreover, a new metric called the Efficient Error Rate is introduced to normalize and compare the model features that affect hardware devices at inference time, such as the number of parameters, bits, FLOPs, and model size. In summary, this paper first mathematically defines the strategies used to make Vision Transformers efficient, then describes and discusses state-of-the-art methodologies, and finally analyzes their performance across different application scenarios. Toward the end of the paper, we also discuss open challenges and promising research directions.
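The quadratic scaling mentioned in the abstract can be illustrated with a rough, back-of-the-envelope cost model (a sketch only, not code from the surveyed paper; the 16-pixel patch size and 64-dimensional head are assumed values): the number of patch tokens N grows linearly with image area, so the N x N attention-score matrix grows quadratically with N.

```python
def attention_cost(image_hw, patch=16, dim=64):
    """Rough cost model for single-head self-attention over image patches.

    Returns (token count N, entries in the N x N score matrix,
    multiply-accumulate count of the Q @ K^T product).
    """
    h, w = image_hw
    n = (h // patch) * (w // patch)  # number of patch tokens
    scores = n * n                   # attention matrix entries (memory)
    flops = 2 * n * n * dim          # Q @ K^T multiply-adds (compute)
    return n, scores, flops

# Doubling the resolution quadruples the token count and multiplies
# the attention matrix (and its compute) by 16.
for side in (224, 448):
    n, scores, flops = attention_cost((side, side))
    print(f"{side}x{side}: N={n}, scores={scores}, flops={flops}")
```

At 224x224 this gives N = 196 tokens; at 448x448, N = 784 and the score matrix is 16x larger, matching the abstract's point about quadratic growth with resolution.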
Pages: 7682-7700 (19 pages)
Related Papers
50 records in total
  • [31] Ang, K.H.; Li, Y. An overview of benchmarking techniques for multi-objective evolutionary algorithms. Soft Computing and Industry: Recent Applications, 2002: 337-348.
  • [32] You, Haoran; Shi, Huihong; Guo, Yipin; Lin, Yingyan. ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformers. Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023.
  • [33] Chang, Shuning; Wang, Pichao; Lin, Ming; Wang, Fan; Zhang, David Junhao; Jin, Rong; Shou, Mike Zheng. Making Vision Transformers Efficient from A Token Sparsification View. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023: 6195-6205.
  • [34] Zhu, Chen; Ping, Wei; Xiao, Chaowei; Shoeybi, Mohammad; Goldstein, Tom; Anandkumar, Anima; Catanzaro, Bryan. Long-Short Transformer: Efficient Transformers for Language and Vision. Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 2021, 34.
  • [35] Benmeziane, Hadjer; Ouarnoughi, Hamza; El Maghraoui, Kaoutar; Niar, Smail. Real-time Style Transfer with Efficient Vision Transformers. Proceedings of the 5th International Workshop on Edge Systems, Analytics and Networking (EdgeSys'22), 2022: 31-36.
  • [36] Zheng, Xu; Luo, Yunhao; Zhou, Pengyuan; Wang, Lin. Distilling efficient Vision Transformers from CNNs for semantic segmentation. Pattern Recognition, 2025, 158.
  • [37] Kang, Ben; Chen, Xin; Wang, Dong; Peng, Houwen; Lu, Huchuan. Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking. 2023 IEEE/CVF International Conference on Computer Vision (ICCV 2023), 2023: 9578-9587.
  • [38] Son, Seungwoo; Ryu, Jegwang; Lee, Namhoon; Lee, Jaeho. The Role of Masking for Efficient Supervised Knowledge Distillation of Vision Transformers. Computer Vision - ECCV 2024, Pt. LXVII, 2025, 15125: 379-396.
  • [39] Huang, Lan; Zeng, Jia; Yu, Mengqiang; Ding, Weiping; Bai, Xingyu; Wang, Kangping. Efficient feature selection for pre-trained vision transformers. Computer Vision and Image Understanding, 2025, 254.
  • [40] Feng, Xin; Jiang, Youni; Yang, Xuejiao; Du, Ming; Li, Xin. Computer vision algorithms and hardware implementations: A survey. Integration, the VLSI Journal, 2019, 69: 309-320.