Learning Cross-Attention Discriminators via Alternating Time-Space Transformers for Visual Tracking

Cited by: 8
Authors
Wang, Wuwei [1 ]
Zhang, Ke [2 ,3 ]
Su, Yu [2 ,3 ]
Wang, Jingyu [4 ]
Wang, Qi [4 ]
Affiliations
[1] Xian Univ Posts & Telecommun, Sch Automat, Xian 710121, Peoples R China
[2] Northwestern Polytech Univ, Sch Astronaut, Xian 710072, Peoples R China
[3] Northwestern Polytech Univ, Shenzhen Res Inst, Shenzhen 518057, Peoples R China
[4] Northwestern Polytech Univ, Sch Astronaut, Sch Artificial Intelligence Opt & Elect IOPEN, Xian 710072, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Target tracking; Visualization; Correlation; Computer architecture; Task analysis; Adaptation models; Cross-attention discriminator; multistage Transformers; spatiotemporal information; visual tracking; CORRELATION FILTERS; AWARE;
DOI
10.1109/TNNLS.2023.3282905
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In the past few years, visual tracking methods based on convolutional neural networks (CNNs) have gained great popularity and success. However, the convolution operation of CNNs struggles to relate spatially distant information, which limits the discriminative power of trackers. Very recently, several Transformer-assisted tracking approaches have emerged to alleviate this issue by combining CNNs with Transformers to enhance the feature representation. In contrast to the methods mentioned above, this article explores a pure Transformer-based model with a novel semi-Siamese architecture. Both the time-space self-attention module used to construct the feature extraction backbone and the cross-attention discriminator used to estimate the response map solely leverage attention without convolution. Inspired by the recent vision transformers (ViTs), we propose the multistage alternating time-space Transformers (ATSTs) to learn robust feature representations. Specifically, temporal and spatial tokens at each stage are alternately extracted and encoded by separate Transformers. Subsequently, a cross-attention discriminator is proposed to directly generate response maps of the search region without additional prediction heads or correlation filters. Experimental results show that our ATST-based model attains favorable results against state-of-the-art convolutional trackers. Moreover, it shows comparable performance with recent "CNN + Transformer" trackers on various benchmarks while requiring significantly less training data.
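The abstract describes two attention-only components: a backbone stage that alternates temporal and spatial self-attention over separate Transformers, and a cross-attention discriminator in which search-region tokens attend to template tokens to produce a response map directly. The following is a minimal PyTorch sketch of that idea only; the module names, dimensions, single-stage depth, and scoring head are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AlternatingTimeSpaceStage(nn.Module):
    """One hypothetical ATST stage: temporal self-attention followed by
    spatial self-attention, each handled by its own Transformer layer."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.temporal = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.spatial = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, N, C) -- batch, frames, spatial tokens, channels
        B, T, N, C = x.shape
        # Temporal attention: tokens at the same spatial location attend across frames.
        t = x.permute(0, 2, 1, 3).reshape(B * N, T, C)
        t = self.temporal(t).reshape(B, N, T, C).permute(0, 2, 1, 3)
        # Spatial attention: tokens within each frame attend to one another.
        s = t.reshape(B * T, N, C)
        return self.spatial(s).reshape(B, T, N, C)

class CrossAttentionDiscriminator(nn.Module):
    """Search-region tokens (queries) attend to template tokens (keys/values);
    attended features are projected to one response score per search token."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, search_tokens, template_tokens):
        attended, _ = self.cross(search_tokens, template_tokens, template_tokens)
        return self.score(attended).squeeze(-1)  # (B, N_search) response map

# Toy usage with made-up sizes: 2 frames, an 8x8 token grid, 64-dim tokens.
if __name__ == "__main__":
    stage = AlternatingTimeSpaceStage(dim=64)
    disc = CrossAttentionDiscriminator(dim=64)
    template = stage(torch.randn(1, 2, 64, 64))  # (B, T, N, C)
    search = stage(torch.randn(1, 2, 64, 64))
    response = disc(search[:, -1], template[:, -1])  # latest frame's tokens
    print(response.shape)  # torch.Size([1, 64])
```

In the paper's pipeline this response map replaces the correlation filters or prediction heads of conventional trackers; here the linear scoring head merely stands in for whatever output mapping the authors use.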
Pages: 15156-15169
Number of pages: 14
Related Papers
50 records in total
  • [41] Visual Question Answering in Remote Sensing with Cross-Attention and Multimodal Information Bottleneck
    Songara, Jayesh
    Pande, Shivam
    Choudhury, Shabnam
    Banerjee, Biplab
    Velmurugan, Rajbabu
    IGARSS 2023 - 2023 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, 2023, : 6278 - 6281
  • [42] Cross-attention Based Text-image Transformer for Visual Question Answering
    Rezapour, M.
    RECENT ADVANCES IN COMPUTER SCIENCE AND COMMUNICATIONS, 2024, 17 (04) : 72 - 78
  • [43] Dynamic flexible flow shop scheduling via cross-attention networks and multi-agent reinforcement learning
    Zheng, Jinlong
    Zhao, Yixin
    Li, Yinya
    Li, Jianfeng
    Wang, Liangeng
    Yuan, Di
    JOURNAL OF MANUFACTURING SYSTEMS, 2025, 80 : 395 - 411
  • [44] Image-text multimodal classification via cross-attention contextual transformer with modality-collaborative learning
    Shi, Qianyao
    Xu, Wanru
    Miao, Zhenjiang
    JOURNAL OF ELECTRONIC IMAGING, 2024, 33 (04)
  • [45] Spatial Prior-Guided Bi-Directional Cross-Attention Transformers for Tooth Instance Segmentation
    Li, Pengcheng
    Gao, Chenqiang
    Lian, Chunfeng
    Meng, Deyu
    IEEE TRANSACTIONS ON MEDICAL IMAGING, 2024, 43 (11) : 3936 - 3948
  • [46] Word2Pix: Word to Pixel Cross-Attention Transformer in Visual Grounding
    Zhao, Heng
    Zhou, Joey Tianyi
    Ong, Yew-Soon
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (02) : 1523 - 1533
  • [47] Cross-Attention Fusion of Visual and Geometric Features for Large-Vocabulary Arabic Lipreading
    Daou, Samar
    Ben-Hamadou, Achraf
    Rekik, Ahmed
    Kallel, Abdelaziz
    TECHNOLOGIES, 2025, 13 (01)
  • [48] Audio-Visual Person Verification based on Recursive Fusion of Joint Cross-Attention
    Praveen, R. Gnana
    Alam, Jahangir
    2024 IEEE 18TH INTERNATIONAL CONFERENCE ON AUTOMATIC FACE AND GESTURE RECOGNITION, FG 2024, 2024,
  • [49] A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition
    Praveen, R. Gnana
    de Melo, Wheidima Carneiro
    Ullah, Nasib
    Aslam, Haseeb
    Zeeshan, Osama
    Denorme, Theo
    Pedersoli, Marco
    Koerich, Alessandro L.
    Bacon, Simon
    Cardinal, Patrick
    Granger, Eric
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022, : 2485 - 2494
  • [50] Hierarchical Deep Reinforcement Learning with Cross-attention and Planning for Autonomous Roundabout Navigation
    Montgomery, Bennet
    Muise, Christian
    Givigi, Sidney
    2024 IEEE CANADIAN CONFERENCE ON ELECTRICAL AND COMPUTER ENGINEERING, CCECE 2024, 2024, : 417 - 423