A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text-Image Retrieval in Remote Sensing

Cited by: 3

Authors:

Zhang, Xiong [1]

Li, Weipeng [1]

Wang, Xu [1]

Wang, Luyao [1]

Zheng, Fuzhong [1]

Wang, Long [1]

Zhang, Haisu [1]

Affiliations:

[1] National University of Defense Technology, School of Information and Communications, Wuhan 430074, People's Republic of China

Source:

REMOTE SENSING | 2023 / Vol. 15 / Issue 18

Funding:

National Natural Science Foundation of China;

Keywords:

cross-modal retrieval; remote sensing images; fusion encoding method; joint representation; contrastive learning;

DOI:

10.3390/rs15184637

Chinese Library Classification (CLC) number:

X [Environmental Science, Safety Science];

Subject classification codes:

08 ; 0830 ;

Abstract:

In recent years, interest in remote sensing image-text cross-modal retrieval has grown, driven by the rapid development of space information technology and the significant increase in the volume of remote sensing image data. Remote sensing images have unique characteristics that make the cross-modal retrieval task challenging. First, their semantics are fine-grained: an image can be divided into multiple basic units of semantic expression, and different combinations of these units can generate diverse text descriptions. Second, the images vary in resolution, color, and perspective. To address these challenges, this paper proposes a multi-task guided fusion encoder (MTGFE) based on the multimodal fusion encoding method, whose advanced performance has been demonstrated in the cross-modal retrieval of natural images. The model is trained jointly on three tasks, namely image-text matching (ITM), masked language modeling (MLM), and the newly introduced multi-view joint representations contrast (MVJRC), which enhances its capability to capture fine-grained correlations between remote sensing images and texts. Specifically, the MVJRC task is designed to improve the model's consistency in joint representation expression and fine-grained correlation, particularly for remote sensing images that differ significantly in resolution, color, and angle. Furthermore, to address the computational complexity associated with large-scale fusion models and improve retrieval efficiency, this paper proposes a retrieval filtering method, which achieves higher retrieval efficiency while minimizing accuracy loss. Extensive experiments on four public datasets validate the effectiveness of the proposed method.
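The abstract does not say how the retrieval filtering method works. The sketch below shows one common way to get that kind of trade-off, a two-stage "filter, then re-rank" search, and should not be read as the authors' implementation. It assumes a cheap embedding similarity is available for pre-selecting candidates and that an expensive fusion-encoder score is computed only for the top k survivors. The names `filter_then_rerank`, `toy_fusion_score`, and all numbers are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def filter_then_rerank(
    query_vec: Vector,
    image_vecs: List[Vector],
    fusion_score: Callable[[int], float],
    k: int,
) -> List[Tuple[int, float]]:
    """Two-stage retrieval: cheap filter first, expensive re-ranking second.

    fusion_score(i) is a stand-in (an assumption, not the paper's API) for
    the fusion-encoder matching score between the query text and image i.
    """
    # Stage 1: rank all images by a cheap embedding similarity, keep top k.
    coarse = sorted(
        range(len(image_vecs)),
        key=lambda i: dot(query_vec, image_vecs[i]),
        reverse=True,
    )[:k]
    # Stage 2: call the expensive scorer only on the k surviving candidates.
    scored = [(i, fusion_score(i)) for i in coarse]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: four image embeddings and one text-query embedding.
images = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.3]]
query = [1.0, 0.0]

# Hypothetical fusion-encoder scores; `calls` records which images were scored.
calls: List[int] = []
toy_scores = {0: 0.60, 1: 0.95, 2: 0.99, 3: 0.40}


def toy_fusion_score(i: int) -> float:
    calls.append(i)
    return toy_scores[i]


ranked = filter_then_rerank(query, images, toy_fusion_score, k=2)
print(ranked)      # [(1, 0.95), (0, 0.6)]
print(len(calls))  # 2 expensive calls instead of 4
```

In this toy run, image 2 has the highest fusion score but is dropped by the filter. That is the accuracy loss a filtering step has to keep small, and a larger `k` reduces it at the cost of more fusion-encoder calls.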

Pages: 22
