A Survey on Efficient Vision Transformers: Algorithms, Techniques, and Performance Benchmarking

Cited by: 5
Authors
Papa, Lorenzo [1 ,2 ]
Russo, Paolo [1 ]
Amerini, Irene [1 ]
Zhou, Luping [2 ]
Affiliations
[1] Sapienza Univ Rome, Dept Comp Control & Management Engn, I-00185 Rome, Italy
[2] Univ Sydney, Sch Elect & Informat Engn, Fac Engn, Sydney, NSW 2006, Australia
Keywords
Computer vision; computational efficiency; vision transformer;
DOI
10.1109/TPAMI.2024.3392941
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Vision Transformer (ViT) architectures are becoming increasingly popular and are widely employed in computer vision applications. Their main feature is the ability to extract global information through the self-attention mechanism, outperforming earlier convolutional neural networks. However, the deployment cost of ViTs has grown steadily with their size, number of trainable parameters, and operations. Furthermore, the computational and memory cost of self-attention increases quadratically with image resolution. Generally speaking, these architectures are challenging to employ in real-world applications because of hardware and environmental restrictions, such as limited processing and computational capabilities. Therefore, this survey investigates the most efficient methodologies that trade marginally sub-optimal estimation performance for lower resource usage. In more detail, four categories of efficiency strategies are analyzed: compact architectures, pruning, knowledge distillation, and quantization. Moreover, a new metric called Efficient Error Rate is introduced to normalize and compare the model features that affect hardware devices at inference time, such as the number of parameters, bits, FLOPs, and model size. In summary, this paper first mathematically defines the strategies used to make Vision Transformers efficient, then describes and discusses state-of-the-art methodologies, and analyzes their performance over different application scenarios. Toward the end of the paper, we also discuss open challenges and promising research directions.
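The abstract mentions two quantitative points: the quadratic growth of self-attention cost with image resolution, and a normalized efficiency metric (the Efficient Error Rate) built from parameters, bits, FLOPs, and model size. The Python sketch below only illustrates these two ideas; it is not taken from the paper. The patch size (16), embedding dimension (768), and the ViT-Base-like reference numbers are illustrative assumptions, and the toy score is a placeholder stand-in rather than the survey's actual EER definition.

```python
# Minimal sketch (not from the paper): (1) how dense self-attention cost scales
# quadratically with resolution, and (2) a toy normalized efficiency score in
# the spirit of the Efficient Error Rate described in the abstract.
# All constants below are illustrative assumptions.

def attention_cost(h_pixels: int, w_pixels: int, patch: int = 16, dim: int = 768) -> int:
    """Approximate multiply-adds of one dense self-attention layer: O(N^2 * d),
    where N is the number of patch tokens (projections are ignored)."""
    n_tokens = (h_pixels // patch) * (w_pixels // patch)
    # QK^T score matrix plus attention-weighted values: 2 * N^2 * d operations.
    return 2 * n_tokens * n_tokens * dim


def toy_efficiency_score(params: float, flops: float, bits: float, size_mb: float,
                         ref: dict) -> float:
    """Average of each cost factor normalized by a reference model; lower is
    cheaper. Placeholder for the paper's EER metric, whose exact weighting
    is defined in the survey itself."""
    ratios = [
        params / ref["params"],
        flops / ref["flops"],
        bits / ref["bits"],
        size_mb / ref["size_mb"],
    ]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Quadratic growth: doubling the resolution (4x the tokens) costs ~16x more.
    print(attention_cost(224, 224))   # ~5.9e7 ops per layer at 224x224
    print(attention_cost(448, 448))   # ~9.4e8 ops per layer at 448x448

    # Hypothetical ViT-Base-like reference vs. a pruned/quantized variant.
    ref = {"params": 86e6, "flops": 17.6e9, "bits": 32, "size_mb": 330}
    score = toy_efficiency_score(22e6, 4.6e9, 8, 22, ref)
    print(f"toy efficiency score: {score:.2f}")  # < 1.0 means cheaper than the reference
```

As a usage note, a score of roughly 0.21 for the hypothetical compressed variant would indicate it is about five times cheaper than the reference across the averaged cost factors, which is the kind of cross-model comparison the normalized metric is meant to enable.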
Pages: 7682-7700
Number of pages: 19
Related Papers
50 records in total
  • [1] Transformers in Vision: A Survey
    Khan, Salman
    Naseer, Muzammal
    Hayat, Munawar
    Zamir, Syed Waqas
    Khan, Fahad Shahbaz
    Shah, Mubarak
    ACM COMPUTING SURVEYS, 2022, 54 (10S)
  • [2] Efficient Transformers: A Survey
    Tay, Yi
    Dehghani, Mostafa
    Bahri, Dara
    Metzler, Donald
    ACM COMPUTING SURVEYS, 2023, 55 (06)
  • [3] Vision Transformers in Image Restoration: A Survey
    Ali, Anas M.
    Benjdira, Bilel
    Koubaa, Anis
    El-Shafai, Walid
    Khan, Zahid
    Boulila, Wadii
    SENSORS, 2023, 23 (05)
  • [4] Vision transformers for dense prediction: A survey
    Zuo, Shuangquan
    Xiao, Yun
    Chang, Xiaojun
    Wang, Xuanhong
    KNOWLEDGE-BASED SYSTEMS, 2022, 253
  • [5] A Comprehensive Survey of Transformers for Computer Vision
    Jamil, Sonain
    Piran, Md. Jalil
    Kwon, Oh-Jin
    DRONES, 2023, 7 (05)
  • [6] Patch Slimming for Efficient Vision Transformers
    Tang, Yehui
    Han, Kai
    Wang, Yunhe
    Xu, Chang
    Guo, Jianyuan
    Xu, Chao
    Tao, Dacheng
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 12155 - 12164
  • [7] Efficient Vision Transformers with Partial Attention
    Vo, Xuan-Thuy
    Nguyen, Duy-Linh
    Priadana, Adri
    Jo, Kang-Hyun
    COMPUTER VISION - ECCV 2024, PT LXXXIII, 2025, 15141 : 298 - 317
  • [8] A Survey on Efficient Training of Transformers
    Zhuang, Bohan
    Liu, Jing
    Pan, Zizheng
    He, Haoyu
    Weng, Yuetian
    Shen, Chunhua
    PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 6823 - 6831
  • [9] A survey of techniques for designing I/O-efficient algorithms
    Maheshwari, A
    Zeh, N
    ALGORITHMS FOR MEMORY HIERARCHIES: ADVANCED LECTURES, 2003, 2625 : 36 - 61
  • [10] Vision Transformers for Image Classification: A Comparative Survey
    Wang, Yaoli
    Deng, Yaojun
    Zheng, Yuanjin
    Chattopadhyay, Pratik
    Wang, Lipo
    TECHNOLOGIES, 2025, 13 (01)