DiagSWin: A multi-scale vision transformer with diagonal-shaped windows for object detection and segmentation

Cited by: 0
Authors
Li, Ke [1]
Wang, Di [1]
Liu, Gang [1]
Zhu, Wenxuan [1]
Zhong, Haodi [1]
Wang, Quan [1]
Affiliations
[1] Xidian Univ, Key Lab Smart Human Comp Interact & Wearable Techn, Xian 710071, Peoples R China
Keywords
Vision transformer; Multi-scale; Diagonal-shaped windows; Object detection and semantic segmentation;
DOI
10.1016/j.neunet.2024.106653
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, Vision Transformer and its variants have demonstrated remarkable performance on various computer vision tasks, thanks to their competence in capturing global visual dependencies through self-attention. However, global self-attention suffers from high computational cost because its complexity is quadratic in the number of tokens, especially for high-resolution vision tasks (e.g., object detection and semantic segmentation). Many recent works have attempted to reduce the cost by applying fine-grained local attention, but these approaches cripple the long-range modeling power of the original self-attention mechanism. Furthermore, these approaches usually have similar receptive fields within each layer, which limits the ability of each self-attention layer to capture multi-scale features and degrades performance on images containing objects of different scales. To address these issues, we develop the Diagonal-shaped Window (DiagSWin) attention mechanism, which models attention in diagonal regions at hybrid scales in each attention layer. The key idea of DiagSWin attention is to inject multi-scale receptive field sizes into tokens: before computing the self-attention matrix, each token attends to its closest surrounding tokens at fine granularity and to tokens far away at coarse granularity. This mechanism effectively captures multi-scale context information while reducing computational complexity. With DiagSWin attention, we present a new variant of Vision Transformer models, called DiagSWin Transformers, and demonstrate their superiority in extensive experiments across various tasks. Specifically, the large DiagSWin Transformer achieves 84.4% Top-1 accuracy on ImageNet and outperforms the SOTA CSWin Transformer with 40% lower model size and computation cost. When employed as backbones, DiagSWin Transformers achieve significant improvements over the current SOTA modules. In addition, our DiagSWin-Base model yields 51.1 box mAP and 45.8 mask mAP on COCO for object detection and segmentation, and 52.3 mIoU on ADE20K for semantic segmentation.
Pages: 14
Related papers
50 in total
  • [1] Multi-Scale Vision Transformer for Defect Object Detection
    Lou, Liangshan
    Lu, Ke
    Xue, Jian
    Procedia Computer Science, 2023, 222 : 397 - 406
  • [2] Generative EO/IR multi-scale vision transformer for improved object detection
    Christian, Jonathan
    Bright, Max
    Summers, Jason
    Olson, Ashley
    Havens, Tim
    SYNTHETIC DATA FOR ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING: TOOLS, TECHNIQUES, AND APPLICATIONS II, 2024, 13035
  • [3] Grouped multi-scale vision transformer for medical image segmentation
    Zexuan Ji
    Zheng Chen
    Xiao Ma
    Scientific Reports, 15 (1)
  • [4] DeepFake detection with multi-scale convolution and vision transformer
    Lin, Hao
    Huang, Wenmin
    Luo, Weiqi
    Lu, Wei
    DIGITAL SIGNAL PROCESSING, 2023, 134
  • [5] Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation
    Gu, Jiaqi
    Kwon, Hyoukjun
    Wang, Dilin
    Ye, Wei
    Li, Meng
    Chen, Yu-Hsin
    Lai, Liangzhen
    Chandra, Vikas
    Pan, David Z.
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2022, : 12084 - 12093
  • [6] A Novel Multi-Scale Transformer for Object Detection in Aerial Scenes
    Lu, Guanlin
    He, Xiaohui
    Wang, Qiang
    Shao, Faming
    Wang, Hongwei
    Wang, Jinkang
    DRONES, 2022, 6 (08)
  • [7] FPDT: a multi-scale feature pyramidal object detection transformer
    Huang, Kailai
    Wen, Mi
    Wang, Chen
    Ling, Lina
    JOURNAL OF APPLIED REMOTE SENSING, 2023, 17 (02)
  • [8] ANGLE TOKENIZATION GUIDED MULTI-SCALE VISION TRANSFORMER FOR ORIENTED OBJECT DETECTION IN REMOTE SENSING IMAGERY
    Zhang, Cong
    Liu, Tianshan
    Lam, Kin-Man
    2022 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2022), 2022, : 3063 - 3066
  • [9] Multi-Scale Polar Object Detection Based on Computer Vision
    Ding, Shifeng
    Zeng, Dinghan
    Zhou, Li
    Han, Sen
    Li, Fang
    Wang, Qingkai
    WATER, 2023, 15 (19)
  • [10] A Simple Single-Scale Vision Transformer for Object Detection and Instance Segmentation
    Chen, Wuyang
    Du, Xianzhi
    Yang, Fan
    Beyer, Lucas
    Zhai, Xiaohua
    Lin, Tsung-Yi
    Chen, Huizhong
    Li, Jing
    Song, Xiaodan
    Wang, Zhangyang
    Zhou, Denny
    COMPUTER VISION, ECCV 2022, PT X, 2022, 13670 : 711 - 727