Dynamic multi-headed self-attention and multiscale enhancement vision transformer for object detection

Cited by: 0
Authors
Fang, Sikai [1 ]
Lu, Xiaofeng [1 ,2 ]
Huang, Yifan [1 ]
Sun, Guangling [1 ]
Liu, Xuefeng [1 ]
Affiliations
[1] Shanghai Univ, Sch Commun & Informat Engn, 99 Shangda Rd, Shanghai 200444, Peoples R China
[2] Shanghai Univ, Wenzhou Inst, Wenzhou, Peoples R China
Keywords
Dynamic gate; Multiscale; Object detection; Self-attention; Vision transformer;
DOI
10.1007/s11042-024-18234-8
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
The self-attention-based vision transformer has powerful feature extraction capabilities and has demonstrated competitive performance in several tasks. However, the conventional self-attention mechanism exhibits global perceptual properties that favor large-scale objects, so its detection performance at other scales still leaves room for improvement. To address this issue, the dynamic gate-assisted network (DGANet), a novel yet simple framework, is proposed to enhance the multiscale generalization capability of the vision transformer architecture. First, we design a dynamic multi-headed self-attention mechanism (DMH-SAM), which dynamically selects self-attention components and applies a local-to-global self-attention pattern, enabling the model to learn features of objects at different scales autonomously while reducing computational cost. Then, we propose a dynamic multiscale encoder (DMEncoder), which weights and encodes feature maps with different receptive fields to adaptively narrow the network's performance gap across object scales. Extensive ablation and comparison experiments demonstrate the effectiveness of the proposed method: its detection accuracy for small, medium, and large objects reaches 27.6, 47.4, and 58.5, respectively, outperforming state-of-the-art object detection methods while reducing model complexity by 23%.
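Note: the abstract describes two mechanisms, DMH-SAM (a dynamic gate that selects self-attention components) and DMEncoder (adaptive weighting of feature maps with different receptive fields). The paper's implementation is not reproduced in this record; the PyTorch sketch below only illustrates, under stated assumptions, how a lightweight dynamic gate might re-weight attention heads per input. All class and variable names (e.g. GatedMultiHeadSelfAttention) are hypothetical and are not the authors' code or API.

```python
# Illustrative sketch only -- not the authors' released implementation.
import torch
import torch.nn as nn


class GatedMultiHeadSelfAttention(nn.Module):
    """Multi-head self-attention whose heads are re-weighted by a dynamic gate.

    A lightweight gate predicts one weight per head from the mean-pooled
    tokens, so individual attention heads ("components") can be emphasised
    or suppressed for each input, loosely following the idea in the abstract.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Dynamic gate: pooled token features -> one weight per head in (0, 1).
        self.gate = nn.Sequential(nn.Linear(dim, num_heads), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each: (b, heads, n, head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale   # (b, heads, n, n)
        attn = attn.softmax(dim=-1)
        out = attn @ v                                  # (b, heads, n, head_dim)

        # Per-head gate computed from the mean-pooled tokens, broadcast over
        # token and channel dimensions to re-weight each head's output.
        g = self.gate(x.mean(dim=1))                    # (b, heads)
        out = out * g[:, :, None, None]

        out = out.transpose(1, 2).reshape(b, n, d)
        return self.proj(out)


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 256)                   # e.g. a 14x14 feature map
    block = GatedMultiHeadSelfAttention(dim=256, num_heads=8)
    print(block(tokens).shape)                          # torch.Size([2, 196, 256])
```

A DMEncoder-style module could apply the same gating idea one level up, predicting a weight per branch over feature maps produced with different receptive fields before encoding them; that part is not attempted in this sketch.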
Pages: 67213-67229
Page count: 17