DMTrack: learning deformable masked visual representations for single object tracking

Cited: 0
Authors
Abdelaziz, Omar [1 ]
Shehata, Mohamed [1 ]
Institutions
[1] Univ British Columbia, Dept Comp Sci Math Phys & Stat, 3333 Univ Way, Kelowna, BC V1V1V7, Canada
Keywords
Single object tracking; Deformable convolutions; Vision transformers; One-stream trackers;
DOI
10.1007/s11760-024-03713-0
CLC classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Subject classification codes
0808; 0809;
Abstract
Single object tracking remains challenging because it requires localizing an arbitrary object in a sequence of frames, given only its appearance in the first frame of the sequence. Many trackers, especially those leveraging a Vision Transformer (ViT) backbone, have achieved superior performance. However, the gap between the performance metrics measured on the training data and those on the test data is still large. To alleviate this issue, we propose a deformable masking module for transformer-based trackers. The deformable masking module is injected after each layer of the ViT. First, it masks out complete vectors of the output representations of the ViT layer. The masked representations are then fed into a deformable convolution to reconstruct new, reliable representations. The output of the last layer of the ViT is fused with all intermediate outputs of the deformable masking modules to produce a final robust attentional feature map. We extensively evaluate the performance of our model, named DMTrack, on seven different tracking benchmarks. Our model outperforms previous state-of-the-art techniques by +2% while having 92.4% fewer parameters. Moreover, our model matches the performance of much larger models, indicating the effectiveness of our training strategy.
Pages: 15