RGB-Sonar Tracking Benchmark and Spatial Cross-Attention Transformer Tracker

Cited by: 0
Authors
Li, Yunfeng [1 ]
Wang, Bo [1 ]
Sun, Jiuran [1 ]
Wu, Xueyi [1 ]
Li, Ye [1 ]
Affiliations
[1] Harbin Engn Univ, Natl Key Lab Autonomous Marine Vehicle Technol, Harbin 150001, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
RGB-sonar tracking; spatial cross attention; transformer network;
DOI
10.1109/TCSVT.2024.3497214
CLC number
TM [Electrical engineering]; TN [Electronic technology, communication technology];
Discipline code
0808 ; 0809 ;
Abstract
Underwater cameras and sonar are naturally complementary in the underwater environment, and combining information from the two modalities enables better observation of underwater targets. However, this problem has received little attention in previous research. This paper therefore introduces a new and challenging RGB-Sonar (RGB-S) tracking task and investigates how to achieve efficient tracking of an underwater target through the interaction of the RGB and sonar modalities. Specifically, we first propose the RGBS50 benchmark dataset, containing 50 sequences and more than 87,000 high-quality annotated bounding boxes. Experimental results show that RGBS50 poses significant challenges to currently popular single-object tracking (SOT) trackers. Second, we propose two RGB-S trackers, called SCANet and SCANet-Refine. They include a spatial cross-attention module (SCAM) consisting of a novel spatial cross-attention layer, an attention refinement module, and two independent global integration modules. The spatial cross-attention is used to overcome the spatial misalignment between RGB and sonar images. Third, we propose an SOT-data-based RGB-S simulation training method (SRST) to overcome the lack of RGB-S training datasets: it converts RGB images into sonar-like saliency images to construct pseudo data pairs, enabling the model to learn the semantic structure of RGB-S data. Comprehensive experiments show that the proposed spatial cross-attention effectively achieves interaction between the RGB and sonar modalities, and that SCANet and SCANet-Refine achieve state-of-the-art performance on the proposed benchmark. The code is available at https://github.com/LiYunfengLYF/RGBS50.
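The core idea in the abstract, letting each RGB feature location attend to every sonar feature location so the two modalities can interact despite spatial misalignment, can be illustrated with a minimal NumPy sketch of one direction of scaled dot-product cross-attention. This is an illustration only, not the authors' SCAM implementation: the function name, feature shapes, and the absence of learned projections, refinement, and global integration modules are all simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_cross_attention(rgb_feat, sonar_feat):
    """One direction of spatial cross-attention: RGB tokens query sonar tokens.

    rgb_feat:   (N_rgb, d)   flattened RGB feature tokens
    sonar_feat: (N_sonar, d) flattened sonar feature tokens
    Returns fused features of shape (N_rgb, d).
    """
    d = rgb_feat.shape[-1]
    # Queries come from one modality, keys/values from the other, so each
    # RGB location can attend to every sonar location regardless of where
    # the target sits in either image (i.e., no spatial alignment assumed).
    attn = softmax(rgb_feat @ sonar_feat.T / np.sqrt(d))  # (N_rgb, N_sonar)
    return attn @ sonar_feat

rng = np.random.default_rng(0)
rgb = rng.standard_normal((16, 8))    # e.g. a 4x4 RGB feature map, 8 channels
sonar = rng.standard_normal((25, 8))  # e.g. a 5x5 sonar feature map, 8 channels
fused = spatial_cross_attention(rgb, sonar)
print(fused.shape)  # (16, 8)
```

In the paper's setting this would run in both directions (RGB→sonar and sonar→RGB) with learned query/key/value projections; the sketch keeps only the attention arithmetic that makes the cross-modal interaction spatially unconstrained.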
Pages: 2260-2275 (16 pages)
Related papers (50 total)
  • [41] DCCAT: Dual-Coordinate Cross-Attention Transformer for thrombus segmentation on coronary OCT. Chu, Miao; De Maria, Giovanni Luigi; Dai, Ruobing; Benenati, Stefano; Yu, Wei; Zhong, Jiaxin; Kotronias, Rafail; Walsh, Jason; Andreaggi, Stefano; Zuccarelli, Vittorio; Chai, Jason; Channon, Keith; Banning, Adrian; Tu, Shengxian. MEDICAL IMAGE ANALYSIS, 2024, 97.
  • [42] Dual cross-attention Transformer network for few-shot image semantic segmentation. Liu, Yu; Guo, Yingchun; Zhu, Ye; Yu, Ming. CHINESE JOURNAL OF LIQUID CRYSTALS AND DISPLAYS, 2024, 39 (11): 1494-1505.
  • [43] Twins transformer: Cross-attention based two-branch transformer network for rotating bearing fault diagnosis. Li, Jie; Bao, Yu; Liu, Wenxin; Ji, Pengxiang; Wang, Lekang; Wang, Zhongbing. MEASUREMENT, 2023, 223.
  • [44] Multi-level Cross-attention Siamese Network For Visual Object Tracking. Zhang, Jianwei; Wang, Jingchao; Zhang, Huanlong; Miao, Mengen; Cai, Zengyu; Chen, Fuguo. KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS, 2022, 16 (12): 3976-3990.
  • [45] Invisible gas detection: An RGB-thermal cross attention network and a new benchmark. Wang, Jue; Lin, Yuxiang; Zhao, Qi; Luo, Dong; Chen, Shuaibao; Chen, Wei; Peng, Xiaojiang. COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 248.
  • [46] DTCA: Dual-Branch Transformer with Cross-Attention for EEG and Eye Movement Data Fusion. Zhang, Xiaoshan; Shi, Enze; Yu, Sigang; Zhang, Shu. MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2024, PT II, 2024, 15002: 141-151.
  • [47] CAT: A Simple yet Effective Cross-Attention Transformer for One-Shot Object Detection. Lin, Wei-Dong; Deng, Yu-Yan; Gao, Yang; Wang, Ning; Liu, Ling-Qiao; Zhang, Lei; Wang, Peng. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 2024, 39 (02): 460-471.
  • [48] Cross-Attention Fusion Learning of Transformer-CNN Features for Person Re-Identification. Xiang, Jun; Zhang, Jincheng; Jiang, Xiaoping; Hou, Jianhua. COMPUTER ENGINEERING AND APPLICATIONS, 2024, 60 (16): 94-104.
  • [49] CAF-ViT: A cross-attention based Transformer network for underwater acoustic target recognition. Dong, Wenfeng; Fu, Jin; Zou, Nan; Zhao, Chunpeng; Miao, Yixin; Shen, Zheng. OCEAN ENGINEERING, 2025, 318.
  • [50] KMT-PLL: K-Means Cross-Attention Transformer for Partial Label Learning. Fan, Jinfu; Huang, Linqing; Gong, Chaoyu; You, Yang; Gan, Min; Wang, Zhongjie. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2025, 36 (02): 2789-2800.