ResT: An Efficient Transformer for Visual Recognition

Citations: 0
Authors: Zhang, Qing-Long [1]; Yang, Yu-Bin [1]
Affiliations: [1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing 21023, Peoples R China
Keywords: (none listed)
DOI: not available
CLC number: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract
This paper presents an efficient multi-scale vision Transformer, called ResT, that capably serves as a general-purpose backbone for image recognition. Unlike existing Transformer methods, which employ standard Transformer blocks to process raw images at a fixed resolution, ResT has several advantages: (1) A memory-efficient multi-head self-attention is built, which compresses the memory with a simple depth-wise convolution and projects the interaction across the attention-head dimension while preserving the diversity of the multiple heads; (2) Positional encoding is constructed as spatial attention, which is more flexible and can handle input images of arbitrary size without interpolation or fine-tuning; (3) Instead of straightforward tokenization at the beginning of each stage, the patch embedding is designed as a stack of overlapping strided convolutions on the token map. We comprehensively validate ResT on image classification and downstream tasks. Experimental results show that ResT outperforms recent state-of-the-art backbones by a large margin, demonstrating its potential as a strong backbone. The code and models will be made publicly available at https://github.com/wofmanaf/ResT.
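The memory-efficient self-attention described in point (1) reduces cost by spatially compressing keys and values before attention, so the attention matrix is n × n' with n' ≪ n instead of n × n. A minimal NumPy sketch of that idea follows; average pooling stands in for the paper's strided depth-wise convolution, the cross-head projection is omitted, and the function name and interface are illustrative, not the authors' implementation:

```python
import numpy as np

def efficient_mhsa(x, h, w, num_heads, sr_ratio):
    """Self-attention over an (h, w) token grid flattened to x of shape (n, c),
    with keys/values down-sampled by sr_ratio in each spatial dimension."""
    n, c = x.shape
    d = c // num_heads
    # Queries keep full resolution: (heads, n, d).
    q = x.reshape(n, num_heads, d).transpose(1, 0, 2)
    # Down-sample keys/values spatially (average pooling stands in for
    # the paper's strided depth-wise convolution).
    grid = x.reshape(h, w, c)
    pooled = grid.reshape(h // sr_ratio, sr_ratio,
                          w // sr_ratio, sr_ratio, c).mean(axis=(1, 3))
    kv = pooled.reshape(-1, c)                      # (n', c), n' = n / sr_ratio**2
    k = kv.reshape(-1, num_heads, d).transpose(1, 0, 2)
    v = kv.reshape(-1, num_heads, d).transpose(1, 0, 2)
    # Scaled dot-product attention: (heads, n, n') instead of (heads, n, n).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    out = attn @ v                                  # (heads, n, d)
    return out.transpose(1, 0, 2).reshape(n, c)
```

With sr_ratio = 2 the attention map shrinks by a factor of 4, which is the source of the memory saving while the query (and thus the output) keeps the full token resolution.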
Pages: 11
Related Papers
50 items total
  • [41] Transformer-based Convolution-free Visual Place Recognition
    Urban, Anna
    Kwolek, Bogdan
    2022 17TH INTERNATIONAL CONFERENCE ON CONTROL, AUTOMATION, ROBOTICS AND VISION (ICARCV), 2022, : 161 - 166
  • [42] CSPFormer: A cross-spatial pyramid transformer for visual place recognition
    Li, Zhenyu
    Xu, Pengjie
    NEUROCOMPUTING, 2024, 580
  • [43] A PRE-TRAINED AUDIO-VISUAL TRANSFORMER FOR EMOTION RECOGNITION
    Minh Tran
    Soleymani, Mohammad
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4698 - 4702
  • [44] Facial expression recognition with grid-wise attention and visual transformer
    Huang, Qionghao
    Huang, Changqin
    Wang, Xizhe
    Jiang, Fan
    INFORMATION SCIENCES, 2021, 580 : 35 - 54
  • [45] Visual Speech Recognition in Natural Scenes Based on Spatial Transformer Networks
    Yu, Jin
    Wang, Shilin
    2020 IEEE 14TH INTERNATIONAL CONFERENCE ON ANTI-COUNTERFEITING, SECURITY, AND IDENTIFICATION (ASID), 2020, : 1 - 5
  • [46] Single visual model based on transformer for digital instrument reading recognition
    Li, Xiang
    Zeng, Changchang
    Yao, Yong
    Zhang, Sen
    Zhang, Haiding
    Yang, Suixian
    MEASUREMENT SCIENCE AND TECHNOLOGY, 2025, 36 (01)
  • [47] Multimodal Sparse Transformer Network for Audio-Visual Speech Recognition
    Song, Qiya
    Sun, Bin
    Li, Shutao
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (12) : 10028 - 10038
  • [48] Transformer-Prompted Network: Efficient Audio-Visual Segmentation via Transformer and Prompt Learning
    Wang, Yusen
    Qian, Xiaohong
    Zhou, Wujie
    IEEE SIGNAL PROCESSING LETTERS, 2025, 32 : 516 - 520
  • [49] DPT-tracker: Dual pooling transformer for efficient visual tracking
    Fang, Yang
    Xie, Bailian
    Khairuddin, Uswah
    Min, Zijian
    Jiang, Bingbing
    Li, Weisheng
    CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY, 2024, 9 (04) : 948 - 959
  • [50] DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition
    Liang, Yuxuan
    Zhou, Pan
    Zimmermann, Roger
    Yan, Shuicheng
    COMPUTER VISION, ECCV 2022, PT XXXIV, 2022, 13694 : 577 - 595