FlexFormer: Flexible Transformer for efficient visual recognition

Cited by: 7
Authors
Fan, Xinyi [1 ]
Liu, Huajun [1 ]
Affiliations
[1] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing, Peoples R China
Keywords
Vision transformer; Frequency analysis; Image classification;
DOI
10.1016/j.patrec.2023.03.028
CLC classification
TP18 [Theory of artificial intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Vision Transformers have shown overwhelming superiority over convolutional neural networks in the computer vision community. Nevertheless, the understanding of multi-head self-attention, the de facto core ingredient of Transformers, is still limited, which has led to surging interest in explaining its underlying mechanism. A notable theory interprets that, unlike high-frequency-sensitive convolutions, self-attention behaves like a generalized spatial smoothing and blurs high spatial-frequency signals as depth increases. In this paper, we design a Conv-MSA structure to extract local contextual information efficiently and remedy this inherent drawback of self-attention. Accordingly, we propose a flexible transformer architecture named FlexFormer, whose computational complexity is linear in the input image size. Experimental results on several visual recognition benchmarks show that FlexFormer achieves state-of-the-art results on visual recognition tasks with fewer parameters and higher computational efficiency. (c) 2023 Elsevier B.V. All rights reserved.
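The "generalized spatial smoothing" interpretation cited in the abstract can be checked numerically: because each softmax-attention row is a convex (averaging) combination of tokens, stacking pure attention layers should attenuate the high spatial-frequency content of the token sequence. The sketch below is illustrative only and is not the paper's Conv-MSA implementation; the toy single-head attention with identity Q/K/V projections and the `high_freq_energy` measure are assumptions made for the demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Toy single-head self-attention with identity Q/K/V projections
    # (an assumption for illustration, not the paper's architecture).
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ x   # rows are convex averages of tokens

def high_freq_energy(x):
    # Energy in the upper half of the spatial-frequency spectrum
    # along the token (sequence) axis.
    spec = np.abs(np.fft.rfft(x, axis=0))
    return spec[spec.shape[0] // 2:].sum()

rng = np.random.default_rng(0)
n_tokens, dim = 64, 16
h = rng.standard_normal((n_tokens, dim))

energies = [high_freq_energy(h)]
for _ in range(4):                 # stack "layers" of pure attention
    h = self_attention(h)
    energies.append(high_freq_energy(h))

# Each averaging layer acts as a low-pass filter, so high-frequency
# energy shrinks with depth — the effect the Conv-MSA branch is
# designed to compensate for.
print(energies)
```

Running this shows the high-frequency energy collapsing across layers, which is the behavior the abstract describes as "blurring high spatial-frequency signals as depth increases".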
Pages: 95-101
Page count: 7
Related papers
50 items
  • [1] ResT: An Efficient Transformer for Visual Recognition
    Zhang, Qing-Long
    Yang, Yu-Bin
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [2] A Hybrid Visual Transformer for Efficient Deep Human Activity Recognition
    Djenouri, Youcef
    Belbachir, Ahmed Nabil
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 721 - 730
  • [3] ETR: An Efficient Transformer for Re-ranking in Visual Place Recognition
    Zhang, Hao
    Chen, Xin
    Jing, Heming
    Zheng, Yingbin
    Wu, Yuan
    Jin, Cheng
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 5654 - 5663
  • [4] Contextual Transformer Networks for Visual Recognition
    Li, Yehao
    Yao, Ting
    Pan, Yingwei
    Mei, Tao
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (02) : 1489 - 1500
  • [5] Efficient Visual Recognition
    Li Liu
    Matti Pietikäinen
    Jie Qin
    Wanli Ouyang
    Luc Van Gool
    International Journal of Computer Vision, 2020, 128 : 1997 - 2001
  • [6] Efficient Visual Recognition
    Liu, Li
    Pietikainen, Matti
    Qin, Jie
    Ouyang, Wanli
    Van Gool, Luc
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2020, 128 (8-9) : 1997 - 2001
  • [7] FLEXIBLE CODING IN VISUAL WORD RECOGNITION
    PUGH, K
    REXER, K
    KATZ, L
    BULLETIN OF THE PSYCHONOMIC SOCIETY, 1992, 30 (06) : 460 - 460
  • [8] VReBERT: A Simple and Flexible Transformer for Visual Relationship Detection
    Cui, Yu
    Farazi, Moshiur
    2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, : 4079 - 4086
  • [9] VTST: Efficient Visual Tracking With a Stereoscopic Transformer
    Gu, Fengwei
    Lu, Jun
    Cai, Chengtao
    Zhu, Qidan
    Ju, Zhaojie
    IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, 2024, 8 (03): : 2401 - 2416
  • [10] Squeezeformer: An Efficient Transformer for Automatic Speech Recognition
    Kim, Sehoon
    Gholami, Amir
    Shaw, Albert
    Lee, Nicholas
    Mangalam, Karttikeya
    Malik, Jitendra
    Mahoney, Michael W.
    Keutzer, Kurt
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022,