ResT: An Efficient Transformer for Visual Recognition

Cited: 0
Authors
Zhang, Qing-Long [1 ]
Yang, Yu-Bin [1]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing 210023, Peoples R China
Keywords
DOI
Not available
CLC number
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper presents an efficient multi-scale vision Transformer, called ResT, that capably serves as a general-purpose backbone for image recognition. Unlike existing Transformer methods, which employ standard Transformer blocks to process raw images at a fixed resolution, our ResT has several advantages: (1) A memory-efficient multi-head self-attention is built, which compresses the memory with a simple depth-wise convolution and projects the interaction across the attention-head dimension while preserving the diversity of the multiple heads; (2) Positional encoding is constructed as spatial attention, which is more flexible and can handle input images of arbitrary size without interpolation or fine-tuning; (3) Instead of straightforward tokenization at the beginning of each stage, we design the patch embedding as a stack of overlapping strided convolutions on the token map. We comprehensively validate ResT on image classification and downstream tasks. Experimental results show that the proposed ResT outperforms recent state-of-the-art backbones by a large margin, demonstrating its potential as a strong backbone. The code and models will be made publicly available at https://github.com/wofmanaf/ResT.
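The three components described in the abstract can be illustrated in code. The sketch below is a reconstruction based only on the abstract, not the authors' released implementation (available at the linked repository): the module names EfficientMSA, PosEncodingAsAttention, and PatchEmbedStem, and hyper-parameters such as the compression stride and kernel sizes, are illustrative assumptions.

```python
# Illustrative PyTorch sketch of the ideas in the abstract (assumed details,
# not the official ResT code): compressed-memory multi-head self-attention,
# positional encoding as spatial attention, and overlapping patch embedding.
import torch
import torch.nn as nn


class EfficientMSA(nn.Module):
    """Self-attention whose keys/values are spatially compressed by a
    depth-wise strided convolution; a 1x1 convolution then mixes the
    attention map across the head dimension."""

    def __init__(self, dim, num_heads=8, sr_ratio=2):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        # Depth-wise conv that shrinks the token map before computing K and V.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio, groups=dim)
        self.norm = nn.LayerNorm(dim)
        # 1x1 conv projecting the interaction across attention heads.
        self.head_proj = nn.Conv2d(num_heads, num_heads, kernel_size=1)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        B, N, C = x.shape                                      # tokens as (B, H*W, C)
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        x_ = x.transpose(1, 2).reshape(B, C, H, W)             # back to a 2-D token map
        x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)     # compressed tokens (B, N', C)
        x_ = self.norm(x_)
        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]

        attn = (q @ k.transpose(-2, -1)) * self.scale          # (B, heads, N, N')
        attn = self.head_proj(attn)                            # interaction across heads
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


class PosEncodingAsAttention(nn.Module):
    """Positional encoding realized as spatial attention: a depth-wise conv
    plus sigmoid gate, so arbitrary input resolutions need no interpolation."""

    def __init__(self, dim):
        super().__init__()
        self.pa = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):                                      # x: (B, C, H, W)
        return x * torch.sigmoid(self.pa(x))


class PatchEmbedStem(nn.Module):
    """Patch embedding as a stack of overlapping strided convolutions."""

    def __init__(self, in_chans=3, dim=64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_chans, dim // 2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(dim // 2),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // 2, dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):
        return self.proj(x)                                    # 4x reduction with overlapping 3x3 kernels


if __name__ == "__main__":
    tokens = torch.randn(2, 14 * 14, 64)                       # toy 14x14 token grid
    attn = EfficientMSA(dim=64, num_heads=8, sr_ratio=2)
    print(attn(tokens, 14, 14).shape)                          # torch.Size([2, 196, 64])
```

The key efficiency point the sketch tries to convey is that attention cost drops from O(N^2) to O(N * N') because keys and values are computed on the convolutionally compressed token map, while the 1x1 convolution over the head dimension restores interaction among the (now lower-dimensional) heads.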
Pages: 11
Related papers
50 records in total
  • [21] Visual-Haptic-Kinesthetic Object Recognition with Multimodal Transformer
    Zhou, Xinyuan
    Lan, Shiyong
Wang, Wenwu
    Li, Xinyang
    Zhou, Siyuan
    Yang, Hongyu
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT VII, 2023, 14260 : 233 - 245
  • [22] DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition
    Jiao, Jiayu
    Tang, Yu-Ming
    Lin, Kun-Yu
    Gao, Yipeng
    Ma, Andy J.
    Wang, Yaowei
    Zheng, Wei-Shi
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 8906 - 8919
  • [23] Combining ArcFace and Visual Transformer Mechanisms for Biometric Periocular Recognition
    Manesco, Joao Renato Ribeiro
    Marana, Aparecido Nilceu
    IEEE LATIN AMERICA TRANSACTIONS, 2023, 21 (07) : 814 - 820
  • [24] Hybrid CNN-Transformer Features for Visual Place Recognition
    Wang, Yuwei
    Qiu, Yuanying
    Cheng, Peitao
    Zhang, Junyu
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (03) : 1109 - 1122
  • [25] DPT: Deformable Patch-based Transformer for Visual Recognition
    Chen, Zhiyang
    Zhu, Yousong
    Zhao, Chaoyang
    Hu, Guosheng
    Zeng, Wei
    Wang, Jinqiao
    Tang, Ming
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 2899 - 2907
  • [26] NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition
    Liu, Hao
    Jiang, Xinghua
    Li, Xin
    Bao, Zhimin
    Jiang, Deqiang
    Ren, Bo
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 12063 - 12072
  • [27] TSE DeepLab: An efficient visual transformer for medical image segmentation
    Yang, Jingdong
    Tu, Jun
    Zhang, Xiaolin
    Yu, Shaoqing
    Zheng, Xianyou
    BIOMEDICAL SIGNAL PROCESSING AND CONTROL, 2023, 80
  • [28] Adaptively bypassing vision transformer blocks for efficient visual tracking
    Yang, Xiangyang
    Zeng, Dan
    Wang, Xucheng
    Wu, You
    Ye, Hengzhou
    Zhao, Qijun
    Li, Shuiwang
    PATTERN RECOGNITION, 2025, 161
  • [29] Efficient visual transformer transferring from neural ODE perspective
    Niu, Hao
    Luo, Fengming
    Yuan, Bo
    Zhang, Yi
    Wang, Jianyong
    ELECTRONICS LETTERS, 2024, 60 (17)
  • [30] Structured Pruning for Efficient Visual Place Recognition
    Grainge, Oliver
    Milford, Michael
    Bodala, Indu
    Ramchurn, Sarvapali D.
    Ehsan, Shoaib
IEEE ROBOTICS AND AUTOMATION LETTERS, 2025, 10 (02) : 2024 - 2031