A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text-Image Retrieval in Remote Sensing

Cited by: 3
Authors
Zhang, Xiong [1]
Li, Weipeng [1]
Wang, Xu [1]
Wang, Luyao [1]
Zheng, Fuzhong [1]
Wang, Long [1]
Zhang, Haisu [1]
Affiliations
[1] Natl Univ Def Technol, Sch Informat & Commun, Wuhan 430074, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
cross-modal retrieval; remote sensing images; fusion encoding method; joint representation; contrastive learning;
DOI
10.3390/rs15184637
Chinese Library Classification number
X [Environmental science, safety science];
Discipline classification code
08; 0830;
Abstract
Interest in remote sensing image-text cross-modal retrieval has grown in recent years, driven by the rapid development of space information technology and the sharp increase in the volume of remote sensing image data. Remote sensing images have characteristics that make cross-modal retrieval challenging. First, their semantics are fine-grained: an image can be divided into multiple basic units of semantic expression, and different combinations of these units can generate diverse text descriptions. Second, the images vary in resolution, color, and perspective. To address these challenges, this paper proposes a multi-task guided fusion encoder (MTGFE) based on the multimodal fusion encoding method, whose advantages have been demonstrated in cross-modal retrieval of natural images. The model is trained jointly on three tasks, image-text matching (ITM), masked language modeling (MLM), and the newly introduced multi-view joint representations contrast (MVJRC), which enhances its ability to capture fine-grained correlations between remote sensing images and texts. Specifically, the MVJRC task is designed to improve the consistency of the model's joint representations and fine-grained correlations, particularly for remote sensing images that differ significantly in resolution, color, and angle. Furthermore, to reduce the computational cost of large-scale fusion models and improve retrieval efficiency, this paper proposes a retrieval filtering method that achieves higher retrieval efficiency while minimizing accuracy loss. Extensive experiments on four public datasets validate the effectiveness of the proposed method.
Pages: 22
Related papers
50 in total
  • [41] Cross-Modal Image-Text Retrieval with Semantic Consistency
    Chen, Hui
    Ding, Guiguang
    Lin, Zijin
    Zhao, Sicheng
    Han, Jungong
    [J]. PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 1749 - 1757
  • [42] Rethinking Benchmarks for Cross-modal Image-text Retrieval
    Chen, Weijing
    Yao, Linli
    Jin, Qin
    [J]. PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023, 2023, : 1241 - 1251
  • [43] Multisensor Fusion and Explicit Semantic Preserving-Based Deep Hashing for Cross-Modal Remote Sensing Image Retrieval
    Sun, Yuxi
    Feng, Shanshan
    Ye, Yunming
    Li, Xutao
    Kang, Jian
    Huang, Zhichao
    Luo, Chuyao
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2022, 60
  • [44] Multi-view visual semantic embedding for cross-modal image–text retrieval
    Li, Zheng
    Guo, Caili
    Wang, Xin
    Zhang, Hao
    Hu, Lin
    [J]. Pattern Recognition, 2025, 159
  • [45] CROSS-MODAL DEEP METRIC LEARNING WITH MULTI-TASK REGULARIZATION
    Huang, Xin
    Peng, Yuxin
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2017, : 943 - 948
  • [46] Global-Local Information Soft-Alignment for Cross-Modal Remote-Sensing Image-Text Retrieval
    Hu, Gang
    Wen, Zaidao
    Lv, Yafei
    Zhang, Jianting
    Wu, Qian
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62 : 1 - 15
  • [47] Plasticity-Stability Preserving Multi-Task Learning for Remote Sensing Image Retrieval
    Sumbul, Gencer
    Demir, Begum
    [J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60
  • [48] Plasticity-Stability Preserving Multi-Task Learning for Remote Sensing Image Retrieval
    Sumbul, Gencer
    Demir, Beguem
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2022, 60
  • [49] Colour image cross-modal retrieval method based on multi-modal visual data fusion
    Liu, Xiangyuan
    [J]. International Journal of Computational Intelligence Studies, 2023, 12 (1-2) : 118 - 129
  • [50] A Cross-Modal Guiding and Fusion Method for Multi-Modal RSVP-based Image Retrieval
    Mao, Jiayu
    Qiu, Shuang
    Li, Dan
    Wei, Wei
    He, Huiguang
    [J]. 2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021,