MixFormer: End-to-End Tracking with Iterative Mixed Attention

Cited by: 347
Authors
Cui, Yutao [1]
Jiang, Cheng [1]
Wang, Limin [1]
Wu, Gangshan [1]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China
DOI
10.1109/CVPR52688.2022.01324
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Tracking often uses a multi-stage pipeline of feature extraction, target information integration, and bounding box estimation. To simplify this pipeline and unify the process of feature extraction and target information integration, we present a compact tracking framework, termed MixFormer, built upon transformers. Our core design is to utilize the flexibility of attention operations, and we propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration. This synchronous modeling scheme allows the model to extract target-specific discriminative features and perform extensive communication between the target and the search area. Based on MAM, we build our MixFormer tracking framework simply by stacking multiple MAMs with progressive patch embedding and placing a localization head on top. In addition, to handle multiple target templates during online tracking, we devise an asymmetric attention scheme in MAM to reduce computational cost, and propose an effective score prediction module to select high-quality templates. Our MixFormer sets a new state-of-the-art performance on five tracking benchmarks, including LaSOT, TrackingNet, VOT2020, GOT-10k, and UAV123. In particular, our MixFormer-L achieves an NP score of 79.9% on LaSOT and 88.9% on TrackingNet, and an EAO of 0.555 on VOT2020. We also perform in-depth ablation studies to demonstrate the effectiveness of simultaneous feature extraction and information integration. Code and trained models are publicly available at https://github.com/MCG-NJU/MixFormer.
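The mixed attention idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `mixed_attention`, the single-head layout, the shared projection matrices `w_q`, `w_k`, `w_v`, and the reading of the asymmetric scheme as "target queries do not attend to search keys" are all assumptions made for illustration. Target and search tokens are concatenated and attended jointly, so feature extraction (self-attention) and information integration (cross-attention) happen in the same operation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; entries set to -inf receive zero weight."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mixed_attention(target, search, w_q, w_k, w_v, asymmetric=False):
    """Single-head attention over concatenated target and search tokens.

    target: (n_t, d) target-template tokens
    search: (n_s, d) search-region tokens
    w_q, w_k, w_v: (d, d) projection matrices (hypothetical, shared by both)
    asymmetric: if True, target queries see only target keys (assumed reading
                of the asymmetric scheme); search queries still see everything.
    """
    n_t = target.shape[0]
    tokens = np.concatenate([target, search], axis=0)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if asymmetric:
        # Block the target-to-search direction only.
        scores[:n_t, n_t:] = -np.inf
    out = softmax(scores, axis=-1) @ v
    return out[:n_t], out[n_t:]


rng = np.random.default_rng(0)
d = 8
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
target = rng.standard_normal((4, d))
search = rng.standard_normal((9, d))
target_out, search_out = mixed_attention(target, search, w_q, w_k, w_v, asymmetric=True)
```

Under this assumed masking, the target outputs do not depend on the search tokens, so in principle they could be computed once per template and reused across frames, which is one plausible source of the reduced cost the abstract mentions.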
Pages: 13598-13608
Number of pages: 11