MixFormer: End-to-End Tracking with Iterative Mixed Attention

Cited by: 347
Authors
Cui, Yutao [1 ]
Jiang, Cheng [1 ]
Wang, Limin [1 ]
Wu, Gangshan [1 ]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China
DOI
10.1109/CVPR52688.2022.01324
CLC number
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Tracking often uses a multi-stage pipeline of feature extraction, target information integration, and bounding box estimation. To simplify this pipeline and unify the processes of feature extraction and target information integration, we present a compact tracking framework, termed MixFormer, built upon transformers. Our core design is to utilize the flexibility of attention operations and propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration. This synchronous modeling scheme allows us to extract target-specific discriminative features and perform extensive communication between the target and search area. Based on MAM, we build our MixFormer tracking framework simply by stacking multiple MAMs with progressive patch embedding and placing a localization head on top. In addition, to handle multiple target templates during online tracking, we devise an asymmetric attention scheme in MAM to reduce computational cost and propose an effective score prediction module to select high-quality templates. Our MixFormer sets new state-of-the-art performance on five tracking benchmarks, including LaSOT, TrackingNet, VOT2020, GOT-10k, and UAV123. In particular, our MixFormer-L achieves an NP score of 79.9% on LaSOT, 88.9% on TrackingNet, and an EAO of 0.555 on VOT2020. We also perform in-depth ablation studies to demonstrate the effectiveness of simultaneous feature extraction and information integration. Code and trained models are publicly available at https://github.com/MCG-NJU/MixFormer.
Pages: 13598-13608 (11 pages)
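The asymmetric mixed-attention idea described in the abstract — template (target) tokens attending only to themselves while search-area tokens attend to the concatenation of both, so the template branch stays cheap during online tracking — can be illustrated with a minimal single-head NumPy sketch. This is a hypothetical illustration, not the authors' implementation; the function and variable names (`mixed_attention`, `Wq`, `Wk`, `Wv`) are assumptions, and real MAM layers are multi-head and interleaved with patch embedding.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mixed_attention(target, search, Wq, Wk, Wv):
    """Single-head asymmetric mixed attention (illustrative sketch only).

    target: (Nt, d) template tokens; search: (Ns, d) search-area tokens.
    Asymmetric scheme: target queries attend only to target keys, while
    search queries attend to both target and search keys (the "mixing").
    """
    q_t, q_s = target @ Wq, search @ Wq
    k_t, k_s = target @ Wk, search @ Wk
    v_t, v_s = target @ Wv, search @ Wv
    d = Wq.shape[1]
    # Template branch: self-attention only, so its features can be
    # computed once and reused across frames during online tracking.
    out_t = softmax(q_t @ k_t.T / np.sqrt(d)) @ v_t
    # Search branch: attends to the concatenation of template and
    # search tokens, fusing target information into search features.
    k_all = np.concatenate([k_t, k_s], axis=0)
    v_all = np.concatenate([v_t, v_s], axis=0)
    out_s = softmax(q_s @ k_all.T / np.sqrt(d)) @ v_all
    return out_t, out_s
```

In a full MAM this operation replaces the symmetric joint attention over all tokens; dropping the target-to-search attention rows reduces cost roughly in proportion to the number of template tokens, which matters when multiple online templates are kept.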