Learning Cross-Attention Discriminators via Alternating Time-Space Transformers for Visual Tracking

Cited: 8
Authors
Wang, Wuwei [1 ]
Zhang, Ke [2 ,3 ]
Su, Yu [2 ,3 ]
Wang, Jingyu [4 ]
Wang, Qi [4 ]
Affiliations
[1] Xian Univ Posts & Telecommun, Sch Automat, Xian 710121, Peoples R China
[2] Northwestern Polytech Univ, Sch Astronaut, Xian 710072, Peoples R China
[3] Northwestern Polytech Univ, Shenzhen Res Inst, Shenzhen 518057, Peoples R China
[4] Northwestern Polytech Univ, Sch Astronaut, Sch Artificial Intelligence Opt & Elect IOPEN, Xian 710072, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Target tracking; Visualization; Correlation; Computer architecture; Task analysis; Adaptation models; Cross-attention discriminator; multistage Transformers; spatiotemporal information; visual tracking; CORRELATION FILTERS; AWARE;
DOI
10.1109/TNNLS.2023.3282905
CLC Number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In the past few years, visual tracking methods based on convolutional neural networks (CNNs) have gained great popularity and success. However, the convolution operation of CNNs struggles to relate spatially distant information, which limits the discriminative power of trackers. Very recently, several Transformer-assisted tracking approaches have emerged to alleviate this issue by combining CNNs with Transformers to enhance the feature representation. In contrast to the methods mentioned above, this article explores a pure Transformer-based model with a novel semi-Siamese architecture. Both the time-space self-attention module used to construct the feature extraction backbone and the cross-attention discriminator used to estimate the response map leverage attention alone, without convolution. Inspired by recent vision Transformers (ViTs), we propose multistage alternating time-space Transformers (ATSTs) to learn robust feature representations. Specifically, temporal and spatial tokens at each stage are alternately extracted and encoded by separate Transformers. A cross-attention discriminator is then proposed to generate response maps of the search region directly, without additional prediction heads or correlation filters. Experimental results show that our ATST-based model attains favorable results against state-of-the-art convolutional trackers. Moreover, it shows performance comparable to recent "CNN + Transformer" trackers on various benchmarks while requiring significantly less training data.
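The abstract describes two attention-only components: a backbone stage that alternates temporal and spatial self-attention over token sequences, and a cross-attention discriminator that scores search-region tokens against template tokens. The PyTorch sketch below is a minimal illustration of both mechanisms; the module names (AlternatingTimeSpaceBlock, CrossAttentionDiscriminator), tensor shapes, and hyperparameters are assumptions for illustration only, not the authors' released implementation.

```python
# Minimal sketch of the two ideas named in the abstract, under assumed
# shapes and hyperparameters. Not the paper's actual implementation.
import torch
import torch.nn as nn


class AlternatingTimeSpaceBlock(nn.Module):
    """One stage: temporal self-attention, then spatial self-attention."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, C) = batch, frames, spatial tokens, channels.
        B, T, N, C = x.shape
        # Temporal step: each spatial location attends across the T frames.
        t = x.permute(0, 2, 1, 3).reshape(B * N, T, C)
        h = self.norm1(t)
        t = t + self.temporal_attn(h, h, h)[0]
        x = t.reshape(B, N, T, C).permute(0, 2, 1, 3)
        # Spatial step: the N tokens within each frame attend to each other.
        s = x.reshape(B * T, N, C)
        h = self.norm2(s)
        s = s + self.spatial_attn(h, h, h)[0]
        return s.reshape(B, T, N, C)


class CrossAttentionDiscriminator(nn.Module):
    """Search tokens query template tokens; a linear head scores each token."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, search: torch.Tensor, template: torch.Tensor) -> torch.Tensor:
        # search: (B, Ns, C), template: (B, Nt, C).
        fused = search + self.cross_attn(search, template, template)[0]
        return self.score(fused).squeeze(-1)  # (B, Ns) token-wise response


if __name__ == "__main__":
    x = torch.randn(2, 4, 64, 256)  # 2 clips, 4 frames, 8x8 tokens, dim 256
    feats = AlternatingTimeSpaceBlock(256)(x)
    resp = CrossAttentionDiscriminator(256)(feats[:, -1], feats[:, 0])
    print(resp.shape)  # torch.Size([2, 64])
```

The token-wise scores can be reshaped to the spatial grid of the search region to form a 2-D response map; the paper's actual staging, tokenization, and training details are given in the full text.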
Pages: 15156-15169
Number of pages: 14
Related Papers
50 records in total
  • [21] Cross-Attention of Disentangled Modalities for 3D Human Mesh Recovery with Transformers
    Cho, Junhyeong
    Youwang, Kim
    Oh, Tae-Hyun
    COMPUTER VISION - ECCV 2022, PT I, 2022, 13661 : 342 - 359
  • [22] Ship Detection in SAR Images via Cross-Attention Mechanism
    Lv, Yilong
    Li, Min
    CANADIAN JOURNAL OF REMOTE SENSING, 2022, 48 (06) : 764 - 778
  • [23] Multi-Granularity Cross-Attention Network for Visual Question Answering
    Wang, Yue
    Gao, Wei
    Cheng, Xinzhou
    Wang, Xin
    Zhao, Huiying
    Xie, Zhipu
    Xu, Lexi
    2023 IEEE 22ND INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, BIGDATASE, CSE, EUC, ISCI 2023, 2024, : 2098 - 2103
  • [24] Learning Cross-Attention Point Transformer With Global Porous Sampling
    Duan, Yueqi
    Sun, Haowen
    Yan, Juncheng
    Lu, Jiwen
    Zhou, Jie
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 6283 - 6297
  • [25] Unsupervised Cross-Domain Rumor Detection with Contrastive Learning and Cross-Attention
    Ran, Hongyan
    Jia, Caiyan
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 11, 2023, : 13510 - 13518
  • [26] Learning attention modules for visual tracking
    Wang, Jun
    Meng, Chenchen
    Deng, Chengzhi
    Wang, Yuanyun
    SIGNAL IMAGE AND VIDEO PROCESSING, 2022, 16 (08) : 2149 - 2156
  • [28] Adaptive Multi-Feature Fusion Visual Target Tracking Based on Siamese Neural Network with Cross-Attention Mechanism
    Zhou, Qian
    Xia, Haoran
    Yan, Hongzheng
    Yang, Ming
    Chen, Shidong
    2022 22ND IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2022), 2022, : 307 - 316
  • [29] An efficient object tracking based on multi-head cross-attention transformer
    Dai, Jiahai
    Li, Huimin
    Jiang, Shan
    Yang, Hongwei
    EXPERT SYSTEMS, 2025, 42 (02)
  • [30] Context and Attribute-Aware Sequential Recommendation via Cross-Attention
    Rashed, Ahmed
    Elsayed, Shereen
    Schmidt-Thieme, Lars
    PROCEEDINGS OF THE 16TH ACM CONFERENCE ON RECOMMENDER SYSTEMS, RECSYS 2022, 2022, : 71 - 80