RGB-Sonar Tracking Benchmark and Spatial Cross-Attention Transformer Tracker

Cited: 0
Authors
Li, Yunfeng [1 ]
Wang, Bo [1 ]
Sun, Jiuran [1 ]
Wu, Xueyi [1 ]
Li, Ye [1 ]
Affiliations
[1] Harbin Engn Univ, Natl Key Lab Autonomous Marine Vehicle Technol, Harbin 150001, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
RGB-sonar tracking; spatial cross attention; transformer network;
DOI
10.1109/TCSVT.2024.3497214
CLC Number
TM [Electrical Engineering]; TN [Electronic and Communication Technology];
Discipline Code
0808; 0809;
Abstract
Underwater cameras and sonar are naturally complementary in the underwater environment, and combining information from the two modalities enables better observation of underwater targets. However, this problem has received little attention in previous research. Therefore, this paper introduces a new and challenging RGB-Sonar (RGB-S) tracking task and investigates how to achieve efficient tracking of an underwater target through the interaction of the RGB and sonar modalities. Specifically, we first propose an RGBS50 benchmark dataset containing 50 sequences and more than 87,000 high-quality annotated bounding boxes. Experimental results show that the RGBS50 benchmark poses significant challenges to currently popular single object tracking (SOT) trackers. Second, we propose two RGB-S trackers, called SCANet and SCANet-Refine. They include a spatial cross-attention module (SCAM) consisting of a novel spatial cross-attention layer, an attention refinement module, and two independent global integration modules. The spatial cross-attention is used to overcome the spatial misalignment between RGB and sonar images. Third, we propose an SOT data-based RGB-S simulation training method (SRST) to overcome the lack of RGB-S training datasets. It converts RGB images into sonar-like saliency images to construct pseudo-data pairs, enabling the model to learn the semantic structure of RGB-S data. Comprehensive experiments show that the proposed spatial cross-attention effectively achieves the interaction between the RGB and sonar modalities, and that SCANet and SCANet-Refine achieve state-of-the-art performance on the proposed benchmark. The code is available at https://github.com/LiYunfengLYF/RGBS50.
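The core mechanism the abstract describes — letting each position in one modality attend over all spatial positions of the other to compensate for misalignment — can be illustrated with a minimal NumPy sketch of generic spatial cross-attention. This is a textbook cross-attention layer, not the authors' exact SCAM implementation; the function name, shapes, and the choice of sonar-as-query / RGB-as-key-value are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_cross_attention(f_rgb, f_sonar):
    """Generic spatial cross-attention: sonar tokens query RGB tokens.

    f_rgb:   (N_rgb, C)  flattened RGB feature map (one token per spatial position)
    f_sonar: (N_son, C)  flattened sonar feature map
    Returns: (N_son, C)  sonar features enriched with RGB context; because every
             sonar position attends over ALL RGB positions, no pixelwise
             alignment between the two images is required.
    """
    q = f_sonar                    # queries from sonar positions
    k, v = f_rgb, f_rgb            # keys/values from RGB positions
    d_k = f_rgb.shape[1]
    attn = softmax(q @ k.T / np.sqrt(d_k), axis=-1)  # (N_son, N_rgb)
    return attn @ v                # weighted RGB context per sonar position

# Toy usage: a 4x4 RGB map and a 3x3 sonar map, both with 8 channels, flattened.
rng = np.random.default_rng(0)
out = spatial_cross_attention(rng.normal(size=(16, 8)), rng.normal(size=(9, 8)))
print(out.shape)  # (9, 8): one fused token per sonar position
```

In practice such a layer would use learned query/key/value projections and be applied symmetrically in both directions; the point here is only that attention over flattened spatial tokens sidesteps the RGB-sonar misalignment problem the paper targets.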
Pages: 2260-2275
Page count: 16
Related Papers
50 records
  • [31] Cross-Attention Spectral-Spatial Network for Hyperspectral Image Classification
    Yang, Kai
    Sun, Hao
    Zou, Chunbo
    Lu, Xiaoqiang
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2022, 60
  • [32] Hyperspectral Image Classification via Cascaded Spatial Cross-Attention Network
    Zhang, Bo
    Chen, Yaxiong
    Xiong, Shengwu
    Lu, Xiaoqiang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2025, 34 : 899 - 913
  • [33] Word2Pix: Word to Pixel Cross-Attention Transformer in Visual Grounding
    Zhao, Heng
    Zhou, Joey Tianyi
    Ong, Yew-Soon
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (02) : 1523 - 1533
  • [34] TSMCF: Transformer-Based SAR and Multispectral Cross-Attention Fusion for Cloud Removal
    Zhu, Hongming
    Wang, Zeju
    Han, Letong
    Xu, Manxin
    Li, Weiqi
    Liu, Qin
    Liu, Sicong
    Du, Bowen
    IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2025, 18 : 6710 - 6720
  • [35] Learning Cross-Attention Discriminators via Alternating TimeSpace Transformers for Visual Tracking
    Wang, Wuwei
    Zhang, Ke
    Su, Yu
    Wang, Jingyu
    Wang, Qi
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (11) : 15156 - 15169
  • [36] Spatio-spectral Cross-Attention Transformer for Hyperspectral image and Multispectral image fusion
    Qin, Xilei
    Song, Huihui
    Fan, Jiaqing
    Zhang, Kaihua
    REMOTE SENSING LETTERS, 2023, 14 (12) : 1303 - 1314
  • [37] Cross-attention Spatio-temporal Context Transformer for Semantic Segmentation of Historical Maps
    Wu, Sidi
    Chen, Yizi
    Schindler, Konrad
    Hurni, Lorenz
    31ST ACM SIGSPATIAL INTERNATIONAL CONFERENCE ON ADVANCES IN GEOGRAPHIC INFORMATION SYSTEMS, ACM SIGSPATIAL GIS 2023, 2023, : 106 - 114
  • [38] Remote sensing image change detection based on swin transformer and cross-attention mechanism
    Yan, Weidong
    Cao, Li
    Yan, Pei
    Zhu, Chaosheng
    Wang, Mengtian
    EARTH SCIENCE INFORMATICS, 2025, 18 (01)
  • [39] MedTrans: Intelligent Computing for Medical Diagnosis Using Multiscale Cross-Attention Vision Transformer
    Xu, Yang
    Hong, Yuan
    Li, Xinchen
    Hu, Mu
    IEEE ACCESS, 2024, 12 : 146575 - 146586
  • [40] Reducing carbon emissions in the architectural design process via transformer with cross-attention mechanism
    Li, Huadong
    Yang, Xia
    Zhu, Hai Luo
    FRONTIERS IN ECOLOGY AND EVOLUTION, 2023, 11