Learning Cross-Attention Discriminators via Alternating Time-Space Transformers for Visual Tracking

Cited by: 8
Authors
Wang, Wuwei [1 ]
Zhang, Ke [2 ,3 ]
Su, Yu [2 ,3 ]
Wang, Jingyu [4 ]
Wang, Qi [4 ]
Affiliations
[1] Xian Univ Posts & Telecommun, Sch Automat, Xian 710121, Peoples R China
[2] Northwestern Polytech Univ, Sch Astronaut, Xian 710072, Peoples R China
[3] Northwestern Polytech Univ, Shenzhen Res Inst, Shenzhen 518057, Peoples R China
[4] Northwestern Polytech Univ, Sch Astronaut, Sch Artificial Intelligence Opt & Elect IOPEN, Xian 710072, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Target tracking; Visualization; Correlation; Computer architecture; Task analysis; Adaptation models; Cross-attention discriminator; multistage Transformers; spatiotemporal information; visual tracking; CORRELATION FILTERS; AWARE;
DOI
10.1109/TNNLS.2023.3282905
CLC Classification
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In the past few years, visual tracking methods based on convolutional neural networks (CNNs) have gained great popularity and success. However, the convolution operation of CNNs struggles to relate spatially distant information, which limits the discriminative power of trackers. Very recently, several Transformer-assisted tracking approaches have emerged to alleviate the above issue by combining CNNs with Transformers to enhance the feature representation. In contrast to the methods mentioned above, this article explores a pure Transformer-based model with a novel semi-Siamese architecture. Both the time-space self-attention module used to construct the feature extraction backbone and the cross-attention discriminator used to estimate the response map rely solely on attention, without convolution. Inspired by the recent vision transformers (ViTs), we propose the multistage alternating time-space Transformers (ATSTs) to learn robust feature representations. Specifically, temporal and spatial tokens at each stage are alternately extracted and encoded by separate Transformers. Subsequently, a cross-attention discriminator is proposed to directly generate response maps of the search region without additional prediction heads or correlation filters. Experimental results show that our ATST-based model attains favorable results against state-of-the-art convolutional trackers. Moreover, it performs comparably to recent "CNN + Transformer" trackers on various benchmarks while requiring significantly less training data.
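The cross-attention discriminator described in the abstract can be pictured with a minimal, purely illustrative sketch (not the authors' implementation): search-region tokens query template tokens through multi-head cross-attention, and a linear scoring layer turns the fused tokens into a single-channel response map with no convolution or separate prediction head. The module name, token dimensions, and the linear scoring head below are assumptions for illustration only.

# Illustrative sketch of a cross-attention discriminator (assumed design,
# not the ATST paper's code): search tokens attend to template tokens and
# are scored per position to form a convolution-free response map.
import torch
import torch.nn as nn

class CrossAttentionDiscriminator(nn.Module):  # hypothetical name
    def __init__(self, d_model: int = 256, num_heads: int = 8):
        super().__init__()
        # Queries come from the search region; keys/values from the template.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.score = nn.Linear(d_model, 1)  # per-token response score

    def forward(self, search_tokens, template_tokens, map_size):
        # search_tokens:   (B, H*W, d_model) tokens of the search region
        # template_tokens: (B, N,   d_model) tokens of the target template
        attended, _ = self.cross_attn(search_tokens, template_tokens, template_tokens)
        fused = self.norm(search_tokens + attended)   # residual connection + norm
        response = self.score(fused).squeeze(-1)      # (B, H*W)
        h, w = map_size
        return response.view(-1, h, w)                # (B, H, W) response map

if __name__ == "__main__":
    B, H, W, N, D = 2, 16, 16, 64, 256
    disc = CrossAttentionDiscriminator(d_model=D)
    search = torch.randn(B, H * W, D)
    template = torch.randn(B, N, D)
    print(disc(search, template, (H, W)).shape)  # torch.Size([2, 16, 16])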
Pages: 15156 - 15169
Page count: 14