ResT: An Efficient Transformer for Visual Recognition

Cited: 0
Authors: Zhang, Qing-Long [1]; Yang, Yu-Bin [1]
Affiliations: [1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing 210023, Peoples R China
Keywords: (none listed)
DOI: Not available
CLC classification: TP18 [Theory of Artificial Intelligence]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract
This paper presents an efficient multi-scale vision Transformer, called ResT, that capably serves as a general-purpose backbone for image recognition. Unlike existing Transformer methods, which employ standard Transformer blocks to process raw images at a fixed resolution, ResT has several advantages: (1) a memory-efficient multi-head self-attention is built, which compresses the memory with a simple depth-wise convolution and projects the interaction across the attention-head dimension while preserving the diversity of the multiple heads; (2) positional encoding is constructed as spatial attention, which is more flexible and can handle input images of arbitrary size without interpolation or fine-tuning; (3) instead of straightforward tokenization at the beginning of each stage, the patch embedding is designed as a stack of overlapping convolution operations with stride applied to the token map. We comprehensively validate ResT on image classification and downstream tasks. Experimental results show that the proposed ResT outperforms recent state-of-the-art backbones by a large margin, demonstrating its potential as a strong backbone. The code and models will be made publicly available at https://github.com/wofmanaf/ResT.
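As a rough illustration of the three design points listed in the abstract, the sketch below shows how such modules might look in PyTorch. The class names (EfficientMSA, PositionalAttention, PatchEmbed), the spatial-reduction ratio sr_ratio, kernel sizes, and normalization choices are assumptions made for illustration only and are not taken from the authors' released code; consult the linked repository for the actual implementation.

```python
# Minimal sketch, assuming a PyTorch implementation; hyper-parameters are illustrative.
import torch
import torch.nn as nn

class EfficientMSA(nn.Module):
    """Memory-efficient multi-head self-attention (abstract, advantage 1):
    a strided depth-wise conv shrinks the key/value token map, and a 1x1 conv
    over the head dimension models interaction across attention heads."""
    def __init__(self, dim, num_heads=8, sr_ratio=2):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        # Depth-wise convolution with stride compresses the memory of attention.
        self.sr = (nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio, groups=dim)
                   if sr_ratio > 1 else nn.Identity())
        self.norm = nn.LayerNorm(dim)
        # Projects the interaction across the attention-head dimension.
        self.head_interaction = nn.Conv2d(num_heads, num_heads, kernel_size=1)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        B, N, C = x.shape  # tokens laid out as (batch, H*W, channels)
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        # Reshape tokens to a 2D map, down-sample keys/values, flatten back.
        x_ = x.transpose(1, 2).reshape(B, C, H, W)
        x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)
        x_ = self.norm(x_)
        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, heads, N, N')
        attn = self.head_interaction(attn)              # mix information across heads
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

class PositionalAttention(nn.Module):
    """Positional encoding built as spatial attention (advantage 2): a depth-wise
    conv produces a per-pixel gate, so arbitrary input sizes need no interpolation."""
    def __init__(self, dim):
        super().__init__()
        self.pa_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):  # x: (B, C, H, W) token map
        return x * self.sigmoid(self.pa_conv(x))

class PatchEmbed(nn.Module):
    """Patch embedding as an overlapping strided convolution on the token map (advantage 3)."""
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1)
        self.norm = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return self.norm(self.proj(x))
```

In this sketch the memory saving comes from the depth-wise convolution reducing the key/value sequence length from H*W to roughly (H/sr_ratio)*(W/sr_ratio), while the 1x1 convolution over the head dimension restores cross-head interaction that the compression would otherwise weaken.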
Pages: 11
Related Papers (50 records)
  • [1] FlexFormer: Flexible Transformer for efficient visual recognition
    Fan, Xinyi
    Liu, Huajun
    PATTERN RECOGNITION LETTERS, 2023, 169 : 95 - 101
  • [2] A Hybrid Visual Transformer for Efficient Deep Human Activity Recognition
    Djenouri, Youcef
    Belbachir, Ahmed Nabil
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 721 - 730
  • [3] ETR: An Efficient Transformer for Re-ranking in Visual Place Recognition
    Zhang, Hao
    Chen, Xin
    Jing, Heming
    Zheng, Yingbin
    Wu, Yuan
    Jin, Cheng
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 5654 - 5663
  • [4] Contextual Transformer Networks for Visual Recognition
    Li, Yehao
    Yao, Ting
    Pan, Yingwei
    Mei, Tao
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (02) : 1489 - 1500
  • [5] Efficient Visual Recognition
    Li Liu
    Matti Pietikäinen
    Jie Qin
    Wanli Ouyang
    Luc Van Gool
    International Journal of Computer Vision, 2020, 128 : 1997 - 2001
  • [6] Efficient Visual Recognition
    Liu, Li
    Pietikainen, Matti
    Qin, Jie
    Ouyang, Wanli
    Van Gool, Luc
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2020, 128 (8-9) : 1997 - 2001
  • [7] Recursive Spatial Transformer (ReST) for Alignment-Free Face Recognition
    Wu, Wanglong
    Kan, Meina
    Liu, Xin
    Yang, Yi
    Shan, Shiguang
    Chen, Xilin
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 3792 - 3800
  • [8] VTST: Efficient Visual Tracking With a Stereoscopic Transformer
    Gu, Fengwei
    Lu, Jun
    Cai, Chengtao
    Zhu, Qidan
    Ju, Zhaojie
    IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2024, 8 (03): 2401 - 2416
  • [9] Squeezeformer: An Efficient Transformer for Automatic Speech Recognition
    Kim, Sehoon
    Gholami, Amir
    Shaw, Albert
    Lee, Nicholas
    Mangalam, Karttikeya
    Malik, Jitendra
    Mahoney, Michael W.
    Keutzer, Kurt
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022,
  • [10] Context Transformer and Adaptive Method with Visual Transformer for Robust Facial Expression Recognition
    Xiong, Lingxin
    Zhang, Jicun
    Zheng, Xiaojia
    Wang, Yuxin
    APPLIED SCIENCES-BASEL, 2024, 14 (04):