Learning Cross-Attention Discriminators via Alternating TimeSpace Transformers for Visual Tracking

Cited: 8
Authors
Wang, Wuwei [1 ]
Zhang, Ke [2 ,3 ]
Su, Yu [2 ,3 ]
Wang, Jingyu [4 ]
Wang, Qi [4 ]
Affiliations
[1] Xian Univ Posts & Telecommun, Sch Automat, Xian 710121, Peoples R China
[2] Northwestern Polytech Univ, Sch Astronaut, Xian 710072, Peoples R China
[3] Northwestern Polytech Univ, Shenzhen Res Inst, Shenzhen 518057, Peoples R China
[4] Northwestern Polytech Univ, Sch Astronaut, Sch Artificial Intelligence Opt & Elect IOPEN, Xian 710072, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Target tracking; Visualization; Correlation; Computer architecture; Task analysis; Adaptation models; Cross-attention discriminator; multistage Transformers; spatiotemporal information; visual tracking; CORRELATION FILTERS; AWARE;
DOI
10.1109/TNNLS.2023.3282905
CLC Classification Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In the past few years, visual tracking methods based on convolutional neural networks (CNNs) have gained great popularity and success. However, the convolution operation of CNNs struggles to relate spatially distant information, which limits the discriminative power of trackers. Very recently, several Transformer-assisted tracking approaches have emerged to alleviate this issue by combining CNNs with Transformers to enhance the feature representation. In contrast to these methods, this article explores a pure Transformer-based model with a novel semi-Siamese architecture. Both the time-space self-attention module used to construct the feature extraction backbone and the cross-attention discriminator used to estimate the response map rely solely on attention, without convolution. Inspired by recent vision transformers (ViTs), we propose multistage alternating time-space Transformers (ATSTs) to learn robust feature representations. Specifically, temporal and spatial tokens at each stage are alternately extracted and encoded by separate Transformers. A cross-attention discriminator is then proposed to generate response maps of the search region directly, without additional prediction heads or correlation filters. Experimental results show that our ATST-based model attains favorable results against state-of-the-art convolutional trackers. Moreover, it performs comparably to recent "CNN + Transformer" trackers on various benchmarks while requiring significantly less training data.
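The cross-attention discriminator described in the abstract can be sketched in miniature: queries come from search-region tokens, keys and values from template tokens, and the attended output is reduced to a per-location score that forms the response map. The following is a minimal NumPy sketch under assumed token shapes and an illustrative scoring step, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_response(search_tokens, template_tokens):
    """Score each search-region location by attending to template tokens.

    search_tokens:   (N_s, d) flattened search-region features
    template_tokens: (N_t, d) flattened template features
    Returns a length-N_s response vector (reshape to the search grid).
    """
    d = search_tokens.shape[-1]
    # Queries come from the search region; keys/values from the template.
    scores = search_tokens @ template_tokens.T / np.sqrt(d)  # (N_s, N_t)
    attn = softmax(scores, axis=-1)
    context = attn @ template_tokens                         # (N_s, d)
    # Illustrative scoring step (an assumption, not the paper's formula):
    # similarity between each search token and its attended template
    # context acts as the per-location response value.
    return (search_tokens * context).sum(axis=-1)            # (N_s,)

# Toy example: 16x16 search grid, 8x8 template, 64-dim tokens.
rng = np.random.default_rng(0)
search = rng.standard_normal((16 * 16, 64))
template = rng.standard_normal((8 * 8, 64))
response_map = cross_attention_response(search, template).reshape(16, 16)
print(response_map.shape)  # (16, 16)
```

The point of the sketch is the data flow: no convolution, no separate prediction head — the response map falls directly out of the attention between search and template tokens.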
Pages: 15156-15169
Page count: 14
Related Papers
50 records in total
  • [31] Semantic Image Synthesis via Class-Adaptive Cross-Attention
    Fontanini, Tomaso
    Ferrari, Claudio
    Lisanti, Giuseppe
    Bertozzi, Massimo
    Prati, Andrea
    IEEE ACCESS, 2025, 13 : 10326 - 10339
  • [33] Bidirectional feature fusion via cross-attention transformer for chrysanthemum classification
    Chen, Yifan
    Yang, Xichen
    Yan, Hui
    Liu, Jia
    Jiang, Jian
    Mao, Zhongyuan
    Wang, Tianshu
    PATTERN ANALYSIS AND APPLICATIONS, 2025, 28 (02)
  • [34] RGB-Sonar Tracking Benchmark and Spatial Cross-Attention Transformer Tracker
    Li, Yunfeng
    Wang, Bo
    Sun, Jiuran
    Wu, Xueyi
    Li, Ye
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2025, 35 (03) : 2260 - 2275
  • [35] Robust Image Watermarking based on Cross-Attention and Invariant Domain Learning
    Dasgupta, Agnibh
    Thong, Xin
    2023 INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE AND COMPUTATIONAL INTELLIGENCE, CSCI 2023, 2023, : 1125 - 1132
  • [36] Visual object tracking via non-local correlation attention learning
    Gao, Long
    Liu, Pan
    Ning, Jifeng
    Li, Yunsong
    KNOWLEDGE-BASED SYSTEMS, 2022, 254
  • [37] Improving Pneumonia Localization via Cross-Attention on Medical Images and Reports
    Bhalodia, Riddhish
    Hatamizadeh, Ali
    Tam, Leo
    Xu, Ziyue
    Wang, Xiaosong
    Turkbey, Evrim
    Xu, Daguang
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2021, PT II, 2021, 12902 : 571 - 581
  • [38] Dual Cross-Attention for Video Object Segmentation via Uncertainty Refinement
    Hong, Jiahao
    Zhang, Wei
    Feng, Zhiwei
    Zhang, Wenqiang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 7710 - 7725
  • [40] Hyperspectral Image Classification via Cascaded Spatial Cross-Attention Network
    Zhang, Bo
    Chen, Yaxiong
    Xiong, Shengwu
    Lu, Xiaoqiang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2025, 34 : 899 - 913