DMTrack: learning deformable masked visual representations for single object tracking

Cited by: 0

Authors:

Abdelaziz, Omar [1]

Shehata, Mohamed [1]

Affiliation:

[1] Univ British Columbia, Dept Comp Sci Math Phys & Stat, 3333 Univ Way, Kelowna, BC V1V1V7, Canada

Source:

SIGNAL IMAGE AND VIDEO PROCESSING | 2025, Vol. 19, No. 1

Keywords:

Single object tracking; Deformable convolutions; Vision transformers; One-stream trackers;

DOI:

10.1007/s11760-024-03713-0

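Chinese Library Classification:

TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];

Discipline classification codes:

0808; 0809;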

Abstract:

Single object tracking remains challenging because it requires localizing an arbitrary object in a sequence of frames, given only its appearance in the first frame. Many trackers, especially those built on the Vision Transformer (ViT) backbone, have achieved superior performance. However, the gap between the performance metrics measured on the training data and those measured on the test data is still large. To alleviate this issue, we propose a deformable masking module for transformer-based trackers. The deformable masking module is injected after each layer of the ViT. First, it masks out complete vectors of the output representations of the ViT layer. The masked representations are then fed into a deformable convolution to reconstruct new, reliable representations. The output of the last ViT layer is fused with all intermediate outputs of the deformable masking modules to produce a final robust attentional feature map. We extensively evaluate our model, named DMTrack, on seven different tracking benchmarks. Our model outperforms the previous state-of-the-art techniques (+2%) while having fewer parameters (-92.4%). Moreover, our model matches the performance of models with far more parameters, indicating the effectiveness of our training strategy.
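The first step described in the abstract, masking out complete vectors of a layer's output, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_token_vectors`, the `mask_ratio` parameter, the random selection of tokens, and the use of zeros as the mask value are all assumptions made for illustration. The deformable convolution and the fusion step are not shown.

```python
import random


def mask_token_vectors(tokens, mask_ratio, rng):
    """Zero out whole token vectors (rows) chosen at random.

    tokens: list of equal-length lists of floats, one per token (hypothetical layout).
    mask_ratio: fraction of tokens to mask (hypothetical parameter).
    rng: a random.Random instance, passed in for reproducibility.
    Returns the masked copy and the sorted indices of the masked tokens.
    """
    n = len(tokens)
    n_masked = int(round(n * mask_ratio))
    masked_idx = set(rng.sample(range(n), n_masked))
    out = [
        [0.0] * len(vec) if i in masked_idx else list(vec)
        for i, vec in enumerate(tokens)
    ]
    return out, sorted(masked_idx)


# Toy input: 8 tokens, each a 4-dimensional vector with non-zero entries.
rng = random.Random(0)
tokens = [[float(i + 1)] * 4 for i in range(8)]
masked, idx = mask_token_vectors(tokens, 0.25, rng)
```

Each masked token is replaced in full rather than entry by entry, which is the sense of "complete vectors" assumed here; a reconstruction stage would then have to rebuild those tokens from their unmasked neighbours.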

Pages: 15