An End-to-End Framework Based on Vision-Language Fusion for Remote Sensing Cross-Modal Text-Image Retrieval

Cited by: 5
Authors
He, Liu [1]
Liu, Shuyan [1]
An, Ran [1]
Zhuo, Yudong [1]
Tao, Jian [1]
Affiliations
[1] China Aeropolytechnol Estab, Dept Big Data Res & Applicat Technol, Beijing 100028, Peoples R China
Keywords
remote sensing cross-modal text-image retrieval; vision-language fusion; multi-modal learning; multitask optimization
DOI
10.3390/math11102279
Chinese Library Classification number
O1 [Mathematics];
Discipline classification code
0701 ; 070101 ;
Abstract
Remote sensing cross-modal text-image retrieval (RSCTIR) has recently attracted extensive attention because it enables fast extraction of remote sensing image information and flexible human-computer interaction. Traditional RSCTIR methods mainly focus on improving uni-modal feature extraction for each modality separately, and most rely on pre-trained object detectors to obtain better local feature representations. This approach not only lacks multi-modal interaction information but also creates a training gap between the pre-trained object detector and the retrieval task. In this paper, we propose an end-to-end RSCTIR framework based on vision-language fusion (EnVLF), consisting of two uni-modal (vision and language) encoders and a multi-modal encoder, which can be optimized by multitask training. Specifically, to achieve an end-to-end training process, we introduce a vision transformer module for image local features instead of a pre-trained object detector. Through semantic alignment of visual and text features, the vision transformer module achieves the same performance as pre-trained object detectors for image local features. In addition, the trained multi-modal encoder can improve the top-one and top-five ranking performance after retrieval processing. Experiments on the commonly used RSICD and RSITMD datasets demonstrate that EnVLF can obtain state-of-the-art retrieval performance.
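The abstract describes retrieval with uni-modal encoders followed by a multi-modal encoder that improves the top-ranked results. One common way to arrange such a pipeline is retrieve-then-rerank, sketched below in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `cosine`, `retrieve`, and `rerank`, the toy embeddings, and the `toy_fusion_scores` table (a stand-in for a multi-modal matching score) are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, image_vecs, k):
    """Stage 1: rank every image by similarity of the uni-modal embeddings
    and keep the indices of the k best matches."""
    order = sorted(range(len(image_vecs)),
                   key=lambda i: cosine(query_vec, image_vecs[i]),
                   reverse=True)
    return order[:k]

def rerank(candidates, fusion_score):
    """Stage 2: reorder only the shortlisted candidates using a
    (hypothetical) multi-modal matching score."""
    return sorted(candidates, key=fusion_score, reverse=True)

# Toy data: one text embedding and four image embeddings (assumed values).
text_vec = [1.0, 0.0]
image_vecs = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.6], [1.0, 0.0]]

shortlist = retrieve(text_vec, image_vecs, k=3)

# Stand-in for the fusion encoder's score of (text, image i); assumed values.
toy_fusion_scores = {0: 0.2, 1: 0.1, 2: 0.9, 3: 0.5}
final = rerank(shortlist, lambda i: toy_fusion_scores[i])

print(shortlist)  # indices kept by stage 1
print(final)      # same indices, reordered by stage 2
```

In this arrangement the more expensive joint scorer is applied only to the shortlist, so it can change the order of the top results without being run against the whole image collection.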
Pages: 17
Related papers
50 in total
  • [1] Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information
    Yuan, Zhiqiang
    Zhang, Wenkai
    Tian, Changyuan
    Rong, Xuee
    Zhang, Zhengyuan
    Wang, Hongqi
    Fu, Kun
    Sun, Xian
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2022, 60
  • [2] A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text-Image Retrieval in Remote Sensing
    Zhang, Xiong
    Li, Weipeng
    Wang, Xu
    Wang, Luyao
    Zheng, Fuzhong
    Wang, Long
    Zhang, Haisu
    [J]. REMOTE SENSING, 2023, 15 (18)
  • [3] Hypersphere-Based Remote Sensing Cross-Modal Text-Image Retrieval via Curriculum Learning
    Zhang, Weihang
    Li, Jihao
    Li, Shuoke
    Chen, Jialiang
    Zhang, Wenkai
    Gao, Xin
    Sun, Xian
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61
  • [4] Text-Image Matching for Cross-Modal Remote Sensing Image Retrieval via Graph Neural Network
    Yu, Hongfeng
    Yao, Fanglong
    Lu, Wanxuan
    Liu, Nayu
    Li, Peiguang
    You, Hongjian
    Sun, Xian
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2023, 16 : 812 - 824
  • [5] MGAN: Attempting a Multimodal Graph Attention Network for Remote Sensing Cross-Modal Text-Image Retrieval
    Wang, Zhiming
    Dong, Zhihua
    Yang, Xiaoyu
    Wang, Zhiguo
    Yin, Guangqiang
    [J]. PROCEEDINGS OF THE 13TH INTERNATIONAL CONFERENCE ON COMPUTER ENGINEERING AND NETWORKS, VOL II, CENET 2023, 2024, 1126 : 261 - 273
  • [6] Exploring Uni-Modal Feature Learning on Entities and Relations for Remote Sensing Cross-Modal Text-Image Retrieval
    Zhang, Shun
    Li, Yupeng
    Mei, Shaohui
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61
  • [7] Multimodal Image Fusion Framework for End-to-End Remote Sensing Image Registration
    Li, Liangzhi
    Han, Ling
    Ding, Mingtao
    Cao, Hongye
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61
  • [8] Improving text-image cross-modal retrieval with contrastive loss
    Zhang, Chumeng
    Yang, Yue
    Guo, Junbo
    Jin, Guoqing
    Song, Dan
    Liu, An An
    [J]. MULTIMEDIA SYSTEMS, 2023, 29 (02) : 569 - 575
  • [9] A Jointly Guided Deep Network for Fine-Grained Cross-Modal Remote Sensing Text-Image Retrieval
    Yang, Lei
    Feng, Yong
    Zhou, Mingling
    Xiong, Xiancai
    Wang, Yongheng
    Qiang, Baohua
    [J]. JOURNAL OF CIRCUITS SYSTEMS AND COMPUTERS, 2023, 32 (13)