A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text-Image Retrieval in Remote Sensing

Cited by: 3
Authors
Zhang, Xiong [1 ]
Li, Weipeng [1 ]
Wang, Xu [1 ]
Wang, Luyao [1 ]
Zheng, Fuzhong [1 ]
Wang, Long [1 ]
Zhang, Haisu [1 ]
Affiliations
[1] Natl Univ Def Technol, Sch Informat & Commun, Wuhan 430074, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
cross-modal retrieval; remote sensing images; fusion encoding method; joint representation; contrastive learning;
DOI
10.3390/rs15184637
CLC Classification Code
X [Environmental Science, Safety Science];
Discipline Classification Code
08 ; 0830 ;
Abstract
In recent years, remote sensing image-text cross-modal retrieval has attracted growing interest, driven by the rapid development of space information technology and the sharp increase in the volume of remote sensing image data. Remote sensing images have unique characteristics that make the cross-modal retrieval task challenging. First, their semantics are fine-grained: an image can be divided into multiple basic units of semantic expression, and different combinations of these units can generate diverse text descriptions. In addition, the images exhibit large variations in resolution, color, and viewing angle. To address these challenges, this paper proposes a multi-task guided fusion encoder (MTGFE) built on the multimodal fusion encoding approach, whose effectiveness has already been demonstrated in cross-modal retrieval of natural images. By jointly training the model on three tasks, image-text matching (ITM), masked language modeling (MLM), and the newly introduced multi-view joint representations contrast (MVJRC), we enhance its ability to capture fine-grained correlations between remote sensing images and texts. In particular, the MVJRC task is designed to improve the consistency of the model's joint representations and fine-grained correlations for remote sensing images that differ substantially in resolution, color, and angle. Furthermore, to address the computational cost of large-scale fusion models and improve retrieval efficiency, the paper proposes a retrieval filtering method that achieves higher retrieval efficiency with minimal loss of accuracy. Extensive experiments on four public datasets validate the effectiveness of the proposed method.
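The abstract does not give implementation details, but the training objective it describes (a weighted combination of ITM, MLM, and a contrastive MVJRC term) follows a common multi-task pattern. The following is a minimal NumPy sketch, not the paper's actual code: `info_nce` stands in for the contrastive term between joint representations of two augmented views, and the function names and loss weights are illustrative assumptions.

```python
import numpy as np

def info_nce(za, zb, temperature=0.07):
    """Symmetric InfoNCE contrastive loss between two batches of joint
    representations (row i of za and row i of zb come from the same pair)."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)   # L2-normalize rows
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature                      # pairwise similarities
    idx = np.arange(len(za))                              # matches on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)              # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()                     # cross-entropy on diagonal

    return 0.5 * (xent(logits) + xent(logits.T))          # both retrieval directions

def total_loss(l_itm, l_mlm, za, zb, w_itm=1.0, w_mlm=1.0, w_mvjrc=1.0):
    """Hypothetical weighted sum of the three task losses."""
    return w_itm * l_itm + w_mlm * l_mlm + w_mvjrc * info_nce(za, zb)
```

When the two views produce identical, well-separated representations, the contrastive term is near zero; mismatched views drive it up, which is the consistency pressure MVJRC is described as providing.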
Pages: 22
Related Papers
50 records in total
  • [1] Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information
    Yuan, Zhiqiang
    Zhang, Wenkai
    Tian, Changyuan
    Rong, Xuee
    Zhang, Zhengyuan
    Wang, Hongqi
    Fu, Kun
    Sun, Xian
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2022, 60
  • [2] Text-Image Matching for Cross-Modal Remote Sensing Image Retrieval via Graph Neural Network
    Yu, Hongfeng
    Yao, Fanglong
    Lu, Wanxuan
    Liu, Nayu
    Li, Peiguang
    You, Hongjian
    Sun, Xian
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2023, 16 : 812 - 824
  • [3] MGAN: Attempting a Multimodal Graph Attention Network for Remote Sensing Cross-Modal Text-Image Retrieval
    Wang, Zhiming
    Dong, Zhihua
    Yang, Xiaoyu
    Wang, Zhiguo
    Yin, Guangqiang
    [J]. PROCEEDINGS OF THE 13TH INTERNATIONAL CONFERENCE ON COMPUTER ENGINEERING AND NETWORKS, VOL II, CENET 2023, 2024, 1126 : 261 - 273
  • [4] Hypersphere-Based Remote Sensing Cross-Modal Text-Image Retrieval via Curriculum Learning
    Zhang, Weihang
    Li, Jihao
    Li, Shuoke
    Chen, Jialiang
    Zhang, Wenkai
    Gao, Xin
    Sun, Xian
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61
  • [5] An End-to-End Framework Based on Vision-Language Fusion for Remote Sensing Cross-Modal Text-Image Retrieval
    He, Liu
    Liu, Shuyan
    An, Ran
    Zhuo, Yudong
    Tao, Jian
    [J]. MATHEMATICS, 2023, 11 (10)
  • [6] Exploring Uni-Modal Feature Learning on Entities and Relations for Remote Sensing Cross-Modal Text-Image Retrieval
    Zhang, Shun
    Li, Yupeng
    Mei, Shaohui
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61
  • [7] Improving text-image cross-modal retrieval with contrastive loss
    Zhang, Chumeng
    Yang, Yue
    Guo, Junbo
    Jin, Guoqing
    Song, Dan
    Liu, An An
    [J]. MULTIMEDIA SYSTEMS, 2023, 29 (02) : 569 - 575
  • [9] A Jointly Guided Deep Network for Fine-Grained Cross-Modal Remote Sensing Text-Image Retrieval
    Yang, Lei
    Feng, Yong
    Zhou, Mingling
    Xiong, Xiancai
    Wang, Yongheng
    Qiang, Baohua
    [J]. JOURNAL OF CIRCUITS SYSTEMS AND COMPUTERS, 2023, 32 (13)
  • [10] CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval
    Wang, Zihao
    Liu, Xihui
    Li, Hongsheng
    Sheng, Lu
    Yan, Junjie
    Wang, Xiaogang
    Shao, Jing
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 5763 - 5772