FlexFormer: Flexible Transformer for efficient visual recognition *

Cited by: 7
Authors
Fan, Xinyi [1]
Liu, Huajun [1]
Affiliations
[1] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing, Peoples R China
Keywords
Vision transformer; Frequency analysis; Image classification;
DOI
10.1016/j.patrec.2023.03.028
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Vision Transformers have shown overwhelming superiority over convolutional neural networks across computer vision communities. Nevertheless, the understanding of multi-head self-attention, the de facto ingredient of Transformers, is still limited, which has led to surging interest in explaining its core mechanism. A notable theory holds that, unlike high-frequency-sensitive convolutions, self-attention behaves like a generalized spatial smoothing and increasingly blurs high spatial-frequency signals as depth grows. In this paper, we design a Conv-MSA structure to extract efficient local contextual information and remedy this inherent drawback of self-attention. Accordingly, a flexible Transformer structure named FlexFormer is proposed, with computational complexity linear in the input image size. Experimental results on several visual recognition benchmarks show that our FlexFormer achieves state-of-the-art results on visual recognition tasks with fewer parameters and higher computational efficiency. (c) 2023 Elsevier B.V. All rights reserved.
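The sketch below illustrates the kind of Conv-MSA block the abstract describes, assuming a design that pairs a depthwise-convolution branch (to retain high-frequency local detail) with window-restricted multi-head self-attention (to keep cost linear in image size). All module names, shapes, fusion choices, and hyperparameters here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical Conv-MSA block sketch (assumed design, not the paper's code):
# a depthwise-convolution branch supplies high-frequency local context, while
# self-attention restricted to non-overlapping windows keeps the complexity
# linear in the number of tokens.
import torch
import torch.nn as nn


class ConvMSABlock(nn.Module):
    def __init__(self, dim=96, num_heads=4, window=7):
        super().__init__()
        self.window = window
        self.norm1 = nn.LayerNorm(dim)
        # Local (high-frequency) branch: 3x3 depthwise convolution.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # Global (smoothing) branch: attention within fixed-size windows.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # x: (B, C, H, W); H and W are assumed divisible by the window size.
        B, C, H, W = x.shape
        w = self.window
        local = self.dwconv(x)  # convolutional branch, same shape as x

        # Partition the feature map into (H/w * W/w) windows of w*w tokens each.
        t = x.view(B, C, H // w, w, W // w, w)
        t = t.permute(0, 2, 4, 3, 5, 1).reshape(B * (H // w) * (W // w), w * w, C)
        t = self.norm1(t)
        t, _ = self.attn(t, t, t)  # attention confined to each window

        # Reverse the window partition back to (B, C, H, W).
        t = t.reshape(B, H // w, W // w, w, w, C).permute(0, 5, 1, 3, 2, 4)
        t = t.reshape(B, C, H, W)

        x = x + local + t  # fuse convolutional and attention branches (assumed fusion)
        # Standard Transformer feed-forward applied token-wise.
        y = x.permute(0, 2, 3, 1)          # (B, H, W, C)
        y = y + self.mlp(self.norm2(y))
        return y.permute(0, 3, 1, 2)


if __name__ == "__main__":
    block = ConvMSABlock(dim=96, num_heads=4, window=7)
    out = block(torch.randn(2, 96, 56, 56))
    print(out.shape)  # torch.Size([2, 96, 56, 56])
```

Restricting attention to fixed-size windows is what makes the cost scale linearly with the number of tokens, and the depthwise convolution re-injects the high-frequency local detail that the attention branch tends to smooth away, matching the motivation stated in the abstract.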
Pages: 95-101
Page count: 7
Related papers
50 items in total
  • [21] Robust and Efficient Modulation Recognition with Pyramid Signal Transformer
    Su, He
    Fan, Xinyi
    Liu, Huajun
    2022 IEEE GLOBAL COMMUNICATIONS CONFERENCE (GLOBECOM 2022), 2022, : 1868 - 1874
  • [22] Vision Transformer for Fast and Efficient Scene Text Recognition
    Atienza, Rowel
    DOCUMENT ANALYSIS AND RECOGNITION - ICDAR 2021, PT I, 2021, 12821 : 319 - 334
  • [23] Visual-Haptic-Kinesthetic Object Recognition with Multimodal Transformer
    Zhou, Xinyuan
    Lan, Shiyong
Wang, Wenwu
    Li, Xinyang
    Zhou, Siyuan
    Yang, Hongyu
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT VII, 2023, 14260 : 233 - 245
  • [24] DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition
    Jiao, Jiayu
    Tang, Yu-Ming
    Lin, Kun-Yu
    Gao, Yipeng
    Ma, Andy J.
    Wang, Yaowei
    Zheng, Wei-Shi
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 8906 - 8919
  • [25] Combining ArcFace and Visual Transformer Mechanisms for Biometric Periocular Recognition
    Manesco, Joao Renato Ribeiro
    Marana, Aparecido Nilceu
    IEEE LATIN AMERICA TRANSACTIONS, 2023, 21 (07) : 814 - 820
  • [26] Visual communications of tomorrow: Natural, efficient, and flexible
    Konrad, J
    IEEE COMMUNICATIONS MAGAZINE, 2001, 39 (01) : 126 - 133
  • [27] Hybrid CNN-Transformer Features for Visual Place Recognition
    Wang, Yuwei
    Qiu, Yuanying
    Cheng, Peitao
    Zhang, Junyu
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (03) : 1109 - 1122
  • [28] DPT: Deformable Patch-based Transformer for Visual Recognition
    Chen, Zhiyang
    Zhu, Yousong
    Zhao, Chaoyang
    Hu, Guosheng
    Zeng, Wei
    Wang, Jinqiao
    Tang, Ming
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 2899 - 2907
  • [29] NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition
    Liu, Hao
    Jiang, Xinghua
    Li, Xin
    Bao, Zhimin
    Jiang, Deqiang
    Ren, Bo
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 12063 - 12072
  • [30] Flexible and efficient mobile optical mark recognition
    Sahin, Suhap
    Ilkin, Sumeyya
    JOURNAL OF ELECTRONIC IMAGING, 2018, 27 (03)