A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text-Image Retrieval in Remote Sensing

Cited by: 3
Authors
Zhang, Xiong [1]
Li, Weipeng [1]
Wang, Xu [1]
Wang, Luyao [1]
Zheng, Fuzhong [1]
Wang, Long [1]
Zhang, Haisu [1]
Affiliations
[1] National University of Defense Technology, School of Information and Communications, Wuhan 430074, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
cross-modal retrieval; remote sensing images; fusion encoding method; joint representation; contrastive learning
DOI
10.3390/rs15184637
Chinese Library Classification number
X [Environmental Science, Safety Science]
Subject classification code
08; 0830
Abstract
In recent years, interest in remote sensing image-text cross-modal retrieval has grown, driven by the rapid development of space information technology and the sharp increase in the volume of remote sensing image data. Remote sensing images have characteristics that make cross-modal retrieval challenging. First, their semantics are fine-grained: an image can be divided into multiple basic units of semantic expression, and different combinations of these units yield diverse text descriptions. Second, the images vary in resolution, color, and perspective. To address these challenges, this paper proposes a multi-task guided fusion encoder (MTGFE) based on the multimodal fusion encoding method, whose effectiveness has been demonstrated in cross-modal retrieval of natural images. Jointly training the model on three tasks, image-text matching (ITM), masked language modeling (MLM), and the newly introduced multi-view joint representations contrast (MVJRC), enhances its ability to capture fine-grained correlations between remote sensing images and texts. Specifically, the MVJRC task is designed to improve the consistency of the model's joint representations and fine-grained correlations, particularly for remote sensing images that differ significantly in resolution, color, and angle. Furthermore, to reduce the computational cost of large-scale fusion models and improve retrieval efficiency, this paper proposes a retrieval filtering method that achieves higher retrieval efficiency while minimizing accuracy loss. Extensive experiments on four public datasets validate the effectiveness of the proposed method.
Pages: 22
Related papers
50 in total
  • [31] MCRN: A Multi-source Cross-modal Retrieval Network for remote sensing
    Yuan, Zhiqiang
    Zhang, Wenkai
    Tian, Changyuan
    Mao, Yongqiang
    Zhou, Ruixue
    Wang, Hongqi
    Fu, Kun
    Sun, Xian
    [J]. International Journal of Applied Earth Observation and Geoinformation, 2022, 115
  • [32] Multi-task hierarchical convolutional network for visual-semantic cross-modal retrieval
    Ji, Zhong
    Lin, Zhigang
    Wang, Haoran
    Pang, Yanwei
    Li, Xuelong
    [J]. Pattern Recognition, 2024, 151
  • [33] Knowledge-Aware Text-Image Retrieval for Remote Sensing Images
    Mi, Li
    Dai, Xianjie
    Castillo-Navarro, Javiera
    Tuia, Devis
    [J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62
  • [34] A fusion-based contrastive learning model for cross-modal remote sensing retrieval
    Li, Haoran
    Xiong, Wei
    Cui, Yaqi
    Xiong, Zhenyu
    [J]. International Journal of Remote Sensing, 2022, 43 (9): 3359-3386
  • [35] Integrating Multisubspace Joint Learning With Multilevel Guidance for Cross-Modal Retrieval of Remote Sensing Images
    Chen, Yaxiong
    Huang, Jirui
    Xiong, Shengwu
    Lu, Xiaoqiang
    [J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 1-17
  • [36] Masking-Based Cross-Modal Remote Sensing Image-Text Retrieval via Dynamic Contrastive Learning
    Zhao, Zuopeng
    Miao, Xiaoran
    He, Chen
    Hu, Jianfeng
    Min, Bingbing
    Gao, Yumeng
    Liu, Ying
    Pharksuwan, Kanyaphakphachsorn
    [J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62: 1-15
  • [37] Robust Cross-Modal Remote Sensing Image Retrieval via Maximal Correlation Augmentation
    Wang, Zhuoyue
    Wang, Xueqian
    Li, Gang
    Li, Chengxi
    [J]. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62
  • [38] A Novel Self-Supervised Cross-Modal Image Retrieval Method in Remote Sensing
    Sumbul, Gencer
    Mueller, Markus
    Demir, Beguem
    [J]. 2022 IEEE International Conference on Image Processing (ICIP), 2022: 2426-2430
  • [39] Unsupervised Contrastive Hashing for Cross-Modal Retrieval in Remote Sensing
    Mikriukov, Georgii
    Ravanbakhsh, Mahdyar
    Demir, Begum
    [J]. 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022: 4463-4467
  • [40] Cross-modal Image-Text Retrieval with Multitask Learning
    Luo, Junyu
    Shen, Ying
    Ao, Xiang
    Zhao, Zhou
    Yang, Min
    [J]. Proceedings of the 28th ACM International Conference on Information & Knowledge Management (CIKM '19), 2019: 2309-2312