DiagSWin: A multi-scale vision transformer with diagonal-shaped windows for object detection and segmentation

Cited by: 0
Authors
Li, Ke [1 ]
Wang, Di [1 ]
Liu, Gang [1 ]
Zhu, Wenxuan [1 ]
Zhong, Haodi [1 ]
Wang, Quan [1 ]
Affiliation
[1] Xidian University, Key Laboratory of Smart Human-Computer Interaction and Wearable Technology, Xi'an 710071, People's Republic of China
Keywords
Vision transformer; Multi-scale; Diagonal-shaped windows; Object detection and semantic segmentation
DOI
10.1016/j.neunet.2024.106653
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Recently, Vision Transformer and its variants have demonstrated remarkable performance on various computer vision tasks, thanks to their competence in capturing global visual dependencies through self-attention. However, global self-attention suffers from quadratic computational overhead, which is especially costly for high-resolution vision tasks (e.g., object detection and semantic segmentation). Many recent works attempt to reduce this cost by applying fine-grained local attention, but such approaches cripple the long-range modeling power of the original self-attention mechanism. Furthermore, these approaches usually have similar receptive fields within each layer, limiting the ability of each self-attention layer to capture multi-scale features and degrading performance on images containing objects of different scales. To address these issues, we develop the Diagonal-shaped Window (DiagSWin) attention mechanism, which models attention over diagonal regions at hybrid scales per attention layer. The key idea of DiagSWin attention is to inject multi-scale receptive field sizes into tokens: before computing the self-attention matrix, each token attends to its closest surrounding tokens at fine granularity and to tokens far away at coarse granularity. This mechanism effectively captures multi-scale context information while reducing computational complexity. With DiagSWin attention, we present a new variant of Vision Transformer models, called DiagSWin Transformers, and demonstrate their superiority in extensive experiments across various tasks. Specifically, the large DiagSWin Transformer achieves 84.4% Top-1 accuracy on ImageNet and outperforms the SOTA CSWin Transformer with a 40% smaller model size and computation cost. When employed as backbones, DiagSWin Transformers achieve significant improvements over the current SOTA models.
In addition, our DiagSWin-Base model yields 51.1 box mAP and 45.8 mask mAP on COCO for object detection and segmentation, and 52.3 mIoU on ADE20K for semantic segmentation.
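As a rough illustration of the fine-near / coarse-far idea described in the abstract (not the paper's actual implementation), the mechanism can be sketched in 1-D with NumPy: each query attends to nearby keys at full resolution and to distant keys through average-pooled coarse summaries, shrinking the attention matrix. The function names, the pooling scheme, and the 1-D setting are all illustrative assumptions; the real DiagSWin mechanism operates on 2-D diagonal-shaped windows.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multiscale_window_attention(q, k, v, window=4, pool=4):
    """Toy 1-D attention: each query sees tokens within `window` at fine
    granularity, and distant tokens as average-pooled coarse summaries.
    q, k, v: (n, d) arrays. Returns an (n, d) array."""
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        # Pool the distant tokens on each side into coarse key/value tokens.
        coarse_k, coarse_v = [], []
        for a, b in ((0, lo), (hi, n)):
            for s in range(a, b, pool):
                e = min(b, s + pool)
                coarse_k.append(k[s:e].mean(axis=0))
                coarse_v.append(v[s:e].mean(axis=0))
        ks = [k[lo:hi]] + ([np.stack(coarse_k)] if coarse_k else [])
        vs = [v[lo:hi]] + ([np.stack(coarse_v)] if coarse_v else [])
        K = np.concatenate(ks, axis=0)   # far fewer rows than n when pooling
        V = np.concatenate(vs, axis=0)
        attn = softmax(q[i] @ K.T / np.sqrt(d))
        out[i] = attn @ V
    return out
```

With a window covering the whole sequence, no tokens are pooled and the sketch reduces to ordinary full self-attention, which makes the cost saving explicit: the pooled variant attends over roughly `2*window + n/pool` keys per query instead of `n`.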
Pages: 14