DiagSWin: A multi-scale vision transformer with diagonal-shaped windows for object detection and segmentation

Cited by: 0
Authors
Li, Ke [1 ]
Wang, Di [1 ]
Liu, Gang [1 ]
Zhu, Wenxuan [1 ]
Zhong, Haodi [1 ]
Wang, Quan [1 ]
Affiliation
[1] Xidian University, Key Laboratory of Smart Human-Computer Interaction and Wearable Technology, Xi'an 710071, People's Republic of China
Keywords
Vision transformer; Multi-scale; Diagonal-shaped windows; Object detection and semantic segmentation
DOI
10.1016/j.neunet.2024.106653
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Recently, Vision Transformer and its variants have demonstrated remarkable performance on various computer vision tasks, thanks to their competence in capturing global visual dependencies through self-attention. However, global self-attention suffers from quadratic computational overhead, which is especially costly for high-resolution vision tasks (e.g., object detection and semantic segmentation). Many recent works attempt to reduce this cost by applying fine-grained local attention, but such approaches cripple the long-range modeling power of the original self-attention mechanism. Furthermore, these approaches usually have similar receptive fields within each layer, limiting the ability of each self-attention layer to capture multi-scale features and degrading performance on images containing objects of different scales. To address these issues, we develop the Diagonal-shaped Window (DiagSWin) attention mechanism, which models attention over diagonal regions at hybrid scales per attention layer. The key idea of DiagSWin attention is to inject multi-scale receptive field sizes into tokens: before computing the self-attention matrix, each token attends to its closest surrounding tokens at fine granularity and to tokens far away at coarse granularity. This mechanism effectively captures multi-scale context information while reducing computational complexity. With DiagSWin attention, we present a new variant of Vision Transformer models, called DiagSWin Transformers, and demonstrate their superiority in extensive experiments across various tasks. Specifically, the large DiagSWin Transformer achieves 84.4% Top-1 accuracy on ImageNet and outperforms the SOTA CSWin Transformer with a 40% smaller model size and computation cost. When employed as backbones, DiagSWin Transformers achieve significant improvements over the current SOTA models.
In addition, our DiagSWin-Base model yields 51.1 box mAP and 45.8 mask mAP on COCO for object detection and segmentation, and 52.3 mIoU on ADE20K for semantic segmentation.
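As a rough illustration of the fine-near / coarse-far idea described in the abstract (not the paper's actual implementation), the mechanism can be sketched in 1-D with NumPy: each query attends to nearby keys at full resolution and to distant keys through average-pooled coarse summaries, shrinking the attention matrix. The function names, the pooling scheme, and the 1-D setting are all illustrative assumptions; the real DiagSWin mechanism operates on 2-D diagonal-shaped windows.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multiscale_window_attention(q, k, v, window=4, pool=4):
    """Toy 1-D attention: each query sees tokens within `window` at fine
    granularity, and distant tokens as average-pooled coarse summaries.
    q, k, v: (n, d) arrays. Returns an (n, d) array."""
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        # Pool the distant tokens on each side into coarse key/value tokens.
        coarse_k, coarse_v = [], []
        for a, b in ((0, lo), (hi, n)):
            for s in range(a, b, pool):
                e = min(b, s + pool)
                coarse_k.append(k[s:e].mean(axis=0))
                coarse_v.append(v[s:e].mean(axis=0))
        ks = [k[lo:hi]] + ([np.stack(coarse_k)] if coarse_k else [])
        vs = [v[lo:hi]] + ([np.stack(coarse_v)] if coarse_v else [])
        K = np.concatenate(ks, axis=0)   # far fewer rows than n when pooling
        V = np.concatenate(vs, axis=0)
        attn = softmax(q[i] @ K.T / np.sqrt(d))
        out[i] = attn @ V
    return out
```

With a window covering the whole sequence, no tokens are pooled and the sketch reduces to ordinary full self-attention, which makes the cost saving explicit: the pooled variant attends over roughly `2*window + n/pool` keys per query instead of `n`.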
Pages: 14