MixFormer: End-to-End Tracking with Iterative Mixed Attention

Cited by: 347
Authors
Cui, Yutao [1]
Jiang, Cheng [1]
Wang, Limin [1]
Wu, Gangshan [1]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
DOI
10.1109/CVPR52688.2022.01324
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Tracking often uses a multi-stage pipeline of feature extraction, target information integration, and bounding box estimation. To simplify this pipeline and unify feature extraction with target information integration, we present a compact tracking framework, termed MixFormer, built upon transformers. Our core design exploits the flexibility of attention operations through a Mixed Attention Module (MAM) that performs feature extraction and target information integration simultaneously. This synchronous modeling scheme makes it possible to extract target-specific discriminative features and to perform extensive communication between the target and the search area. Based on MAM, we build our MixFormer tracking framework simply by stacking multiple MAMs with progressive patch embedding and placing a localization head on top. In addition, to handle multiple target templates during online tracking, we devise an asymmetric attention scheme in MAM to reduce computational cost, and propose an effective score prediction module to select high-quality templates. Our MixFormer sets a new state-of-the-art performance on five tracking benchmarks: LaSOT, TrackingNet, VOT2020, GOT-10k, and UAV123. In particular, our MixFormer-L achieves an NP score of 79.9% on LaSOT and 88.9% on TrackingNet, and an EAO of 0.555 on VOT2020. We also perform in-depth ablation studies to demonstrate the effectiveness of simultaneous feature extraction and information integration. Code and trained models are publicly available at https://github.com/MCG-NJU/MixFormer.
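The mixed attention idea in the abstract can be illustrated with a toy sketch. The code below is not the authors' implementation (see the repository linked above for that). It assumes that target (template) tokens and search tokens are concatenated into one sequence for attention, and that the asymmetric variant simply stops template tokens from attending to search tokens. Learned query/key/value projections, multiple heads, and patch embedding are omitted, and the names `attend` and `mixed_attention` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def mixed_attention(template, search, asymmetric=False):
    """Toy mixed attention; tokens serve as their own queries, keys, values."""
    joint = template + search  # concatenated template + search tokens
    # Search tokens attend to both template and search tokens.
    search_out = attend(search, joint, joint)
    # Symmetric: template tokens also attend to the search region.
    # Asymmetric (assumption): template tokens attend to template tokens only.
    context = template if asymmetric else joint
    template_out = attend(template, context, context)
    return template_out, search_out


# Hypothetical 2-D tokens: two template tokens, three search tokens.
template = [[1.0, 0.0], [0.0, 1.0]]
search = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]
t_asym, s_asym = mixed_attention(template, search, asymmetric=True)
t_sym, s_sym = mixed_attention(template, search, asymmetric=False)
```

Under this assumption, the asymmetric template output does not depend on the search region, so it could be computed once per template and reused across frames, which is one plausible reading of how the scheme reduces computational cost.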
Pages: 13598 - 13608
Number of pages: 11
Related papers
50 in total
  • [21] Gigapixel end-to-end training using streaming and attention
    Dooper, Stephan
    Pinckaers, Hans
    Aswolinskiy, Witali
    Hebeda, Konnie
    Jarkman, Sofia
    van der Laak, Jeroen
    Litjens, Geert
    BIGPICTURE Consortium
    MEDICAL IMAGE ANALYSIS, 2023, 88
  • [22] End-to-End Attention Pooling for Histopathology Image Classification
    Liu, Juan
    Zuo, Zhiqun
    Chen, Yuqi
    Xiao, Di
    Pang, Baochuan
    Cao, Dehua
    Wuhan Daxue Xuebao (Xinxi Kexue Ban)/Geomatics and Information Science of Wuhan University, 2024, 49 (07) : 1070 - 1078
  • [23] End-to-end Spatiotemporal Attention Model for Autonomous Driving
    Zhao, Ruijie
    Zhang, Yanxin
    Huang, Zhiqing
    Yin, Chenkun
    PROCEEDINGS OF 2020 IEEE 4TH INFORMATION TECHNOLOGY, NETWORKING, ELECTRONIC AND AUTOMATION CONTROL CONFERENCE (ITNEC 2020), 2020 : 2649 - 2653
  • [24] End-to-end multitask Siamese network with residual hierarchical attention for real-time object tracking
    Huang, Wenhui
    Gu, Jason
    Ma, Xin
    Li, Yibin
    APPLIED INTELLIGENCE, 2020, 50 (06) : 1908 - 1921
  • [25] End-to-end DeepNCC framework for robust visual tracking
    Dai, Kaiheng
    Wang, Yuehuan
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2020, 70
  • [26] DensSiam: End-to-End Densely-Siamese Network with Self-Attention Model for Object Tracking
    Abdelpakey, Mohamed H.
    Shehata, Mohamed S.
    Mohamed, Mostafa M.
    ADVANCES IN VISUAL COMPUTING, ISVC 2018, 2018, 11241 : 463 - 473
  • [27] An Improved End-to-End Multi-Target Tracking Method Based on Transformer Self-Attention
    Hong, Yong
    Li, Deren
    Luo, Shupei
    Chen, Xin
    Yang, Yi
    Wang, Mi
    REMOTE SENSING, 2022, 14 (24)
  • [28] End-to-end multitask Siamese network with residual hierarchical attention for real-time object tracking
    Wenhui Huang
    Jason Gu
    Xin Ma
    Yibin Li
    Applied Intelligence, 2020, 50 : 1908 - 1921
  • [29] Provenance Tracking for End-to-End Machine Learning Pipelines
    Grafberger, Stefan
    Groth, Paul
    Schelter, Sebastian
    COMPANION OF THE WORLD WIDE WEB CONFERENCE, WWW 2023, 2023 : 1512 - 1512
  • [30] End-to-end deep metric network for visual tracking
    Tian, Shengjing
    Shen, Shuwei
    Tian, Guoqiang
    Liu, Xiuping
    Yin, Baocai
    VISUAL COMPUTER, 2020, 36 (06) : 1219 - 1232