ResT: An Efficient Transformer for Visual Recognition

Cited by: 0
Authors
Zhang, Qing-Long [1]
Yang, Yu-Bin [1]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing 210023, Peoples R China
Keywords
DOI
Not available
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper presents an efficient multi-scale vision Transformer, called ResT, that capably serves as a general-purpose backbone for image recognition. Unlike existing Transformer methods, which employ standard Transformer blocks to process raw images at a fixed resolution, ResT has several advantages: (1) a memory-efficient multi-head self-attention is built, which compresses the memory with a simple depth-wise convolution and projects the interaction across the attention-head dimension while preserving the diversity of the heads; (2) positional encoding is constructed as spatial attention, which is more flexible and can handle input images of arbitrary size without interpolation or fine-tuning; (3) instead of straightforward tokenization at the beginning of each stage, the patch embedding is designed as a stack of overlapping strided convolutions on the token map. We comprehensively validate ResT on image classification and downstream tasks. Experimental results show that the proposed ResT outperforms recent state-of-the-art backbones by a large margin, demonstrating its potential as a strong backbone. The code and models will be made publicly available at https://github.com/wofmanaf/ResT.
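As a concrete illustration of the three points summarized in the abstract, the PyTorch sketch below shows one plausible shape of the described components. It is not the authors' implementation (see the linked repository): the module names EfficientSelfAttention, SpatialAttentionPE, and OverlappingPatchEmbed, the reduction ratio sr_ratio, the number of heads, and all kernel sizes are assumptions made for illustration only.

import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    # Memory-efficient multi-head self-attention: keys/values are computed on a
    # token map spatially compressed by a depth-wise convolution, and a 1x1
    # convolution projects the attention interaction across the head dimension.
    # Hyperparameters (sr_ratio, kernel sizes) are illustrative assumptions.
    def __init__(self, dim, num_heads=8, sr_ratio=2):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        # Depth-wise conv with stride compresses the (H, W) token map for K and V.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio + 1, stride=sr_ratio,
                            padding=sr_ratio // 2, groups=dim)
        self.sr_norm = nn.LayerNorm(dim)
        # 1x1 conv across the head dimension lets heads interact while each head
        # keeps its own attention map.
        self.head_mix = nn.Conv2d(num_heads, num_heads, kernel_size=1)

    def forward(self, x, H, W):
        B, N, C = x.shape  # x: (B, H*W, C) tokens of an H x W feature map
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        x_ = x.transpose(1, 2).reshape(B, C, H, W)
        x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)  # fewer tokens for K, V
        x_ = self.sr_norm(x_)
        k, v = (self.kv(x_)
                .reshape(B, -1, 2, self.num_heads, self.head_dim)
                .permute(2, 0, 3, 1, 4))
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N')
        attn = self.head_mix(attn)                     # cross-head projection
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

class SpatialAttentionPE(nn.Module):
    # Positional encoding realized as spatial attention: a depth-wise conv plus a
    # sigmoid gate, so it adapts to arbitrary input resolutions without
    # interpolating a fixed positional table.
    def __init__(self, dim):
        super().__init__()
        self.pa = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):  # x: (B, C, H, W)
        return x * torch.sigmoid(self.pa(x))

class OverlappingPatchEmbed(nn.Module):
    # Patch embedding via an overlapping strided convolution; a single conv
    # stands in here for the stack described in the abstract.
    def __init__(self, in_ch, dim, stride=2):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=3, stride=stride, padding=1)

    def forward(self, x):  # (B, in_ch, H, W) -> (B, dim, H/stride, W/stride)
        return self.proj(x)

# Quick shape check with assumed sizes
x = torch.randn(2, 14 * 14, 64)
attn = EfficientSelfAttention(dim=64, num_heads=4, sr_ratio=2)
print(attn(x, 14, 14).shape)  # torch.Size([2, 196, 64])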
Pages: 11
Related Papers
50 records in total
  • [31] Target Focused Shallow Transformer Framework for Efficient Visual Tracking
    Rahman, Md Maklachur
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 21, 2024, : 23409 - 23410
  • [32] Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering
    Yu, Zhou
    Jin, Zitian
    Yu, Jun
    Xu, Mingliang
    Wang, Hongbo
    Fan, Jianping
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 9543 - 9556
  • [33] VST++: Efficient and Stronger Visual Saliency Transformer
    Liu, Nian
    Luo, Ziyang
    Zhang, Ni
    Han, Junwei
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (11) : 7300 - 7316
  • [34] Efficient Mining of Optimal AND/OR Patterns for Visual Recognition
    Weng, Chaoqun
    Yuan, Junsong
    IEEE TRANSACTIONS ON MULTIMEDIA, 2015, 17 (05) : 626 - 635
  • [35] A Bayesian model for efficient visual search and recognition
    Elazary, Lior
    Itti, Laurent
    VISION RESEARCH, 2010, 50 (14) : 1338 - 1352
  • [36] Variable-hyperparameter visual transformer for efficient image inpainting
    Campana, Jose Luis Flores
    Decker, Luis Gustavo Lorgus
    Souza, Marcos Roberto e
    Maia, Helena de Almeida
    Pedrini, Helio
    COMPUTERS & GRAPHICS-UK, 2023, 113 : 57 - 68
  • [37] Food recognition via an efficient neural network with transformer grouping
    Sheng, Guorui
    Sun, Shuqi
    Liu, Chengxu
    Yang, Yancun
    INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2022, 37 (12) : 11465 - 11481
  • [38] Mixed Attention and Channel Shift Transformer for Efficient Action Recognition
    Lu, Xiusheng
    Hao, Yanbin
    Cheng, Lechao
    Zhao, Sicheng
    Li, Yutao
    Song, Mingli
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2025, 21 (03)
  • [39] Audio-Visual Action Recognition Using Transformer Fusion Network
    Kim, Jun-Hwa
    Won, Chee Sun
    APPLIED SCIENCES-BASEL, 2024, 14 (03):
  • [40] AVT: AU-ASSISTED VISUAL TRANSFORMER FOR FACIAL EXPRESSION RECOGNITION
    Jin, Rijin
    Zhao, Sirui
    Hao, Zhongkai
    Xu, Yifan
    Xu, Tong
    Chen, Enhong
    2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 2661 - 2665