DiagSWin: A multi-scale vision transformer with diagonal-shaped windows for object detection and segmentation

Cited: 0
Authors
Li, Ke [1 ]
Wang, Di [1 ]
Liu, Gang [1 ]
Zhu, Wenxuan [1 ]
Zhong, Haodi [1 ]
Wang, Quan [1 ]
Affiliations
[1] Xidian Univ, Key Lab Smart Human Comp Interact & Wearable Techn, Xian 710071, Peoples R China
Keywords
Vision transformer; Multi-scale; Diagonal-shaped windows; Object detection and semantic segmentation;
DOI
10.1016/j.neunet.2024.106653
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, Vision Transformers and their variants have demonstrated remarkable performance on various computer vision tasks, thanks to their competence in capturing global visual dependencies through self-attention. However, global self-attention incurs high computational cost, since its complexity is quadratic in the number of tokens, especially for high-resolution vision tasks (e.g., object detection and semantic segmentation). Many recent works attempt to reduce this cost by applying fine-grained local attention, but such approaches cripple the long-range modeling power of the original self-attention mechanism. Furthermore, these approaches usually have similar receptive fields within each layer, which limits the ability of each self-attention layer to capture multi-scale features and degrades performance on images containing objects of different scales. To address these issues, we develop the Diagonal-shaped Window (DiagSWin) attention mechanism, which models attention in diagonal regions at hybrid scales per attention layer. The key idea of DiagSWin attention is to inject multi-scale receptive field sizes into tokens: before computing the self-attention matrix, each token attends to its closest surrounding tokens at fine granularity and to tokens far away at coarse granularity. This mechanism effectively captures multi-scale contextual information while reducing computational complexity. With DiagSWin attention, we present a new variant of Vision Transformer models, called DiagSWin Transformers, and demonstrate their superiority in extensive experiments across various tasks. Specifically, the large-size DiagSWin Transformer achieves 84.4% Top-1 accuracy on ImageNet and outperforms the SOTA CSWin Transformer with 40% smaller model size and computation cost. When employed as backbones, DiagSWin Transformers achieve significant improvements over the current SOTA models. In addition, our DiagSWin-Base model yields 51.1 box mAP and 45.8 mask mAP on COCO for object detection and segmentation, and 52.3 mIoU on ADE20K for semantic segmentation.
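The abstract's core idea is that each token attends to nearby tokens at fine granularity and to distant tokens at coarse (pooled) granularity. Below is a minimal 1-D NumPy sketch of that hybrid-scale attention pattern, not the paper's exact diagonal-window scheme; the `window` and `pool` hyperparameters and the 1-D token layout are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hybrid_scale_attention(x, window=4, pool=4):
    """Toy 1-D sketch of hybrid-scale attention (illustrative, not the
    paper's diagonal-window implementation): each query attends to nearby
    tokens at fine granularity and to average-pooled summaries of the rest.

    x: (n, d) token embeddings; n must be divisible by `pool` here.
    """
    n, d = x.shape
    out = np.zeros_like(x)
    # Coarse tokens: average-pool the sequence into n // pool summaries.
    coarse = x.reshape(n // pool, pool, d).mean(axis=1)      # (n/pool, d)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        fine = x[lo:hi]                                      # nearby, fine-grained
        # Skip coarse bins that overlap the fine window (avoid double counting).
        keep = np.array([not (lo // pool <= j <= (hi - 1) // pool)
                         for j in range(coarse.shape[0])])
        kv = np.vstack([fine, coarse[keep]]) if keep.any() else fine
        # Each query sees ~(2*window + 1) fine + n/pool coarse keys,
        # far fewer than the n keys of full global attention.
        attn = softmax(x[i] @ kv.T / np.sqrt(d))
        out[i] = attn @ kv
    return out
```

The cost per query drops from O(n) keys to O(window + n/pool), which is the complexity-reduction argument the abstract makes, while distant context is still summarized rather than discarded.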
Pages: 14
Related Papers
50 items in total
  • [31] Feature Enhancement for Multi-scale Object Detection
    Zheng, Huicheng
    Chen, Jiajie
    Chen, Lvran
    Li, Ye
    Yan, Zhiwei
    Neural Processing Letters, 2020, 51 : 1907 - 1919
  • [32] Multi-OCDTNet: A Novel Multi-Scale Object Context Dilated Transformer Network for Retinal Blood Vessel Segmentation
    Wu, Chengwei
    Guo, Min
    Ma, Miao
    Wang, Kaiguang
    INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2023, 37 (11)
  • [34] MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection
    Kim, Bumsoo
    Mun, Jonghwan
    On, Kyoung-Woon
    Shin, Minchul
    Lee, Junhyun
    Kim, Eun-Sol
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 19556 - 19565
  • [35] NLFFTNet: A non-local feature fusion transformer network for multi-scale object detection
    Zeng, Kai
    Ma, Qian
    Wu, Jiawen
    Xiang, Sijia
    Shen, Tao
    Zhang, Lei
    NEUROCOMPUTING, 2022, 493 : 15 - 27
  • [36] Multi-scale Cross-Modal Transformer Network for RGB-D Object Detection
    Xiao, Zhibin
    Xie, Pengwei
    Wang, Guijin
    MULTIMEDIA MODELING (MMM 2022), PT I, 2022, 13141 : 352 - 363
  • [37] AugDETR: Improving Multi-scale Learning for Detection Transformer
    Dong, Jinpeng
    Lin, Yutong
    Li, Chen
    Zhou, Sanping
    Zheng, Nanning
    COMPUTER VISION - ECCV 2024, PT XXIV, 2025, 15082 : 238 - 255
  • [38] Attention to the Scale: Deep Multi-Scale Salient Object Detection
    Zhang, Jing
    Dai, Yuchao
    Li, Bo
    He, Mingyi
    2017 INTERNATIONAL CONFERENCE ON DIGITAL IMAGE COMPUTING - TECHNIQUES AND APPLICATIONS (DICTA), 2017, : 105 - 111
  • [39] Multi-Scale Segmentation of Forest Areas and Tree Detection in LiDAR Images by the Attentive Vision Method
    Palenichka, Roman
    Doyon, Frederik
    Lakhssassi, Ahmed
    Zaremba, Marek B.
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2013, 6 (03) : 1313 - 1323
  • [40] MSA-MaxNet: Multi-Scale Attention Enhanced Multi-Axis Vision Transformer Network for Medical Image Segmentation
    Wu, Wei
    Huang, Junfeng
    Zhang, Mingxuan
    Li, Yichen
    Yu, Qijia
    Zhao, Qi
    JOURNAL OF CELLULAR AND MOLECULAR MEDICINE, 2024, 28 (24)