MixFormer: End-to-End Tracking with Iterative Mixed Attention

Cited by: 347
Authors
Cui, Yutao [1 ]
Jiang, Cheng [1 ]
Wang, Limin [1 ]
Wu, Gangshan [1 ]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China
DOI
10.1109/CVPR52688.2022.01324
CLC number
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Tracking often uses a multi-stage pipeline of feature extraction, target information integration, and bounding box estimation. To simplify this pipeline and unify the processes of feature extraction and target information integration, we present a compact tracking framework, termed MixFormer, built upon transformers. Our core design is to utilize the flexibility of attention operations and propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration. This synchronous modeling scheme allows us to extract target-specific discriminative features and perform extensive communication between the target and search area. Based on MAM, we build our MixFormer tracking framework simply by stacking multiple MAMs with progressive patch embedding and placing a localization head on top. In addition, to handle multiple target templates during online tracking, we devise an asymmetric attention scheme in MAM to reduce computational cost and propose an effective score prediction module to select high-quality templates. Our MixFormer sets new state-of-the-art performance on five tracking benchmarks, including LaSOT, TrackingNet, VOT2020, GOT-10k, and UAV123. In particular, our MixFormer-L achieves an NP score of 79.9% on LaSOT, 88.9% on TrackingNet, and an EAO of 0.555 on VOT2020. We also perform in-depth ablation studies to demonstrate the effectiveness of simultaneous feature extraction and information integration. Code and trained models are publicly available at https://github.com/MCG-NJU/MixFormer.
Pages: 13598-13608 (11 pages)
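The asymmetric mixed-attention idea described in the abstract — template (target) tokens attending only to themselves while search-area tokens attend to the concatenation of both, so the template branch stays cheap during online tracking — can be illustrated with a minimal single-head NumPy sketch. This is a hypothetical illustration, not the authors' implementation; the function and variable names (`mixed_attention`, `Wq`, `Wk`, `Wv`) are assumptions, and real MAM layers are multi-head and interleaved with patch embedding.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mixed_attention(target, search, Wq, Wk, Wv):
    """Single-head asymmetric mixed attention (illustrative sketch only).

    target: (Nt, d) template tokens; search: (Ns, d) search-area tokens.
    Asymmetric scheme: target queries attend only to target keys, while
    search queries attend to both target and search keys (the "mixing").
    """
    q_t, q_s = target @ Wq, search @ Wq
    k_t, k_s = target @ Wk, search @ Wk
    v_t, v_s = target @ Wv, search @ Wv
    d = Wq.shape[1]
    # Template branch: self-attention only, so its features can be
    # computed once and reused across frames during online tracking.
    out_t = softmax(q_t @ k_t.T / np.sqrt(d)) @ v_t
    # Search branch: attends to the concatenation of template and
    # search tokens, fusing target information into search features.
    k_all = np.concatenate([k_t, k_s], axis=0)
    v_all = np.concatenate([v_t, v_s], axis=0)
    out_s = softmax(q_s @ k_all.T / np.sqrt(d)) @ v_all
    return out_t, out_s
```

In a full MAM this operation replaces the symmetric joint attention over all tokens; dropping the target-to-search attention rows reduces cost roughly in proportion to the number of template tokens, which matters when multiple online templates are kept.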