Super Vision Transformer

Cited by: 8
Authors
Lin, Mingbao [1 ,2 ]
Chen, Mengzhao [1 ]
Zhang, Yuxin [1 ]
Shen, Chunhua [3 ]
Ji, Rongrong [1 ]
Cao, Liujuan [1 ]
Affiliations
[1] Minist Educ China, Sch Informat, Key Lab Multimedia Trusted Percept & Efficient Com, Xiamen, Peoples R China
[2] Tencent Youtu Lab, Shanghai, Peoples R China
[3] Zhejiang Univ, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Hardware efficiency; Supernet; Vision transformer;
DOI
10.1007/s11263-023-01861-3
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We attempt to reduce the computational cost of vision transformers (ViTs), which grows quadratically with the number of tokens. We present a novel training paradigm that trains only one ViT model at a time, yet can provide improved image recognition performance at various computational costs. The trained ViT model, termed super vision transformer (SuperViT), is empowered with the versatile ability to process incoming patches of multiple sizes as well as to preserve informative tokens at multiple keeping rates (the ratio of kept tokens), achieving good hardware efficiency at inference given that the available hardware resources often change over time. Experimental results on ImageNet demonstrate that our SuperViT can considerably reduce the computational cost of ViT models while even increasing performance. For example, we reduce the FLOPs of DeiT-S by 2x while increasing Top-1 accuracy by 0.2%, and by 0.7% for a 1.5x reduction. Our SuperViT also significantly outperforms existing studies on efficient vision transformers; for example, at the same FLOPs, our SuperViT surpasses the recent state-of-the-art EViT by 1.1% when both use DeiT-S as the backbone. The project of this work is made publicly available at https://github.com/lmbxmu/SuperViT.
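The keeping-rate idea described in the abstract can be sketched as follows. This is an illustrative reconstruction only, not the paper's actual code: the function name and the stand-in importance scores are assumptions, and the scoring mechanism in SuperViT itself may differ.

```python
import numpy as np

def keep_top_tokens(tokens, scores, keep_rate):
    """Keep the top `keep_rate` fraction of patch tokens by importance score.

    tokens: (n, dim) patch tokens for one image.
    scores: (n,) per-token importance (e.g., derived from attention).
    Returns the kept tokens, highest-scoring first.
    """
    n = tokens.shape[0]
    k = max(1, int(n * keep_rate))          # number of tokens to keep
    idx = np.argsort(scores)[::-1][:k]      # indices of the k highest scores
    return tokens[idx]

# A single set of transformer weights can then be run at several keeping
# rates (e.g. 1.0, 0.7, 0.5), trading accuracy for FLOPs at inference time.
tokens = np.random.randn(196, 384)          # 14x14 patches, DeiT-S embedding dim
scores = np.random.randn(196)               # stand-in importance scores
print(keep_top_tokens(tokens, scores, 0.5).shape)  # (98, 384)
```

Because attention cost is quadratic in the token count, halving the tokens in later layers reduces their attention FLOPs by roughly 4x, which is where the savings reported in the abstract come from.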
Pages: 3136-3151 (16 pages)