An End-to-End Framework Based on Vision-Language Fusion for Remote Sensing Cross-Modal Text-Image Retrieval

Cited by: 5
Authors
He, Liu [1]
Liu, Shuyan [1]
An, Ran [1]
Zhuo, Yudong [1]
Tao, Jian [1]
Affiliations
[1] China Aeropolytechnol Estab, Dept Big Data Res & Applicat Technol, Beijing 100028, Peoples R China
Keywords
remote sensing cross-modal text-image retrieval; vision-language fusion; multi-modal learning; multitask optimization
DOI
10.3390/math11102279
Chinese Library Classification (CLC) number
O1 [Mathematics]
Discipline classification code
0701; 070101
Abstract
Remote sensing cross-modal text-image retrieval (RSCTIR) has recently attracted extensive attention due to its advantages of fast extraction of remote sensing image information and flexible human-computer interaction. Traditional RSCTIR methods mainly focus on improving the performance of uni-modal feature extraction separately, and most rely on pre-trained object detectors to obtain better local feature representations. As a result, they not only lack multi-modal interaction information but also introduce a training gap between the pre-trained object detector and the retrieval task. In this paper, we propose an end-to-end RSCTIR framework based on vision-language fusion (EnVLF), consisting of two uni-modal (vision and language) encoders and a multi-modal encoder, which can be optimized by multitask training. Specifically, to achieve an end-to-end training process, we introduce a vision transformer module for image local features instead of a pre-trained object detector. Through semantic alignment of visual and text features, the vision transformer module achieves the same performance as pre-trained object detectors for image local features. In addition, the trained multi-modal encoder can improve the top-one and top-five ranking performance after retrieval processing. Experiments on the common RSICD and RSITMD datasets demonstrate that EnVLF obtains state-of-the-art retrieval performance.
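The abstract suggests a two-stage flow: the uni-modal encoders produce an initial ranking, and the multi-modal encoder then refines the top results. The following is a minimal sketch of that idea under stated assumptions, not the authors' implementation. The function names, the cosine-similarity first stage, and the `fusion_score` callable standing in for the multi-modal encoder are all hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, List, Sequence, Set

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query_vec: Vector,
    image_vecs: Sequence[Vector],
    fusion_score: Callable[[int], float],
    k: int = 5,
) -> List[int]:
    """Rank images for one text query, then re-order only the top-k.

    Stage 1 (assumed): rank every image by cosine similarity between the
    text embedding and each image embedding from the uni-modal encoders.
    Stage 2 (assumed): re-score the k best candidates with a joint scorer,
    a stand-in for a multi-modal encoder; the tail keeps its stage-1 order.
    """
    coarse = sorted(
        range(len(image_vecs)),
        key=lambda i: cosine(query_vec, image_vecs[i]),
        reverse=True,
    )
    head, tail = coarse[:k], coarse[k:]
    head = sorted(head, key=fusion_score, reverse=True)
    return head + tail


def recall_at_k(ranking: Sequence[int], relevant: Set[int], k: int) -> float:
    """Return 1.0 if any relevant item appears in the first k positions, else 0.0."""
    return 1.0 if any(i in relevant for i in ranking[:k]) else 0.0


if __name__ == "__main__":
    # Toy embeddings and joint scores, made up for this example.
    images = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    query = [1.0, 0.0]
    joint = {0: 0.2, 1: 0.9, 2: 1.0}
    ranking = retrieve_then_rerank(query, images, joint.__getitem__, k=2)
    print(ranking, recall_at_k(ranking, {1}, 1))
```

Because only the first k candidates are re-scored, a re-ranking step of this kind can change top-one and top-five results while leaving the rest of the ranking as the first stage produced it.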
Pages: 17
Related papers
50 in total
  • [21] A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing
    Cheng, Qimin
    Zhou, Yuzhuo
    Fu, Peng
    Xu, Yuan
    Zhang, Liang
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, 2021, 14 : 4284 - 4297
  • [22] A fusion-based contrastive learning model for cross-modal remote sensing retrieval
    Li, Haoran
    Xiong, Wei
    Cui, Yaqi
    Xiong, Zhenyu
    [J]. INTERNATIONAL JOURNAL OF REMOTE SENSING, 2022, 43 (09) : 3359 - 3386
  • [23] A Texture and Saliency Enhanced Image Learning Method for Cross-Modal Remote Sensing Image-Text Retrieval
    Yang, Rui
    Zhang, Di
    Guo, YanHe
    Wang, Shuang
    [J]. IGARSS 2023 - 2023 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, 2023, : 4895 - 4898
  • [24] Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning
    Li, Zhengxin
    Zhao, Wenzhe
    Du, Xuanyi
    Zhou, Guangyao
    Zhang, Songlin
    [J]. REMOTE SENSING, 2024, 16 (01)
  • [25] Transformer vision-language tracking via proxy token guided cross-modal fusion
    Zhao, Haojie
    Wang, Xiao
    Wang, Dong
    Lu, Huchuan
    Ruan, Xiang
    [J]. PATTERN RECOGNITION LETTERS, 2023, 168 : 10 - 16
  • [26] Learning Text-image Joint Embedding for Efficient Cross-modal Retrieval with Deep Feature Engineering
    Xie, Zhongwei
    Liu, Ling
    Wu, Yanzhao
    Zhong, Luo
    Li, Lin
    [J]. ACM TRANSACTIONS ON INFORMATION SYSTEMS, 2022, 40 (04)
  • [27] Multi-Scale Interactive Transformer for Remote Sensing Cross-Modal Image-Text Retrieval
    Wang, Yijing
    Ma, Jingjing
    Li, Mingteng
    Tang, Xu
    Han, Xiao
    Jiao, Licheng
    [J]. 2022 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2022), 2022, : 839 - 842
  • [28] Interacting-Enhancing Feature Transformer for Cross-Modal Remote-Sensing Image and Text Retrieval
    Tang, Xu
    Wang, Yijing
    Ma, Jingjing
    Zhang, Xiangrong
    Liu, Fang
    Jiao, Licheng
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61
  • [29] Masking-Based Cross-Modal Remote Sensing Image-Text Retrieval via Dynamic Contrastive Learning
    Zhao, Zuopeng
    Miao, Xiaoran
    He, Chen
    Hu, Jianfeng
    Min, Bingbing
    Gao, Yumeng
    Liu, Ying
    Pharksuwan, Kanyaphakphachsorn
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62 : 1 - 15
  • [30] An End-to-End Adversarial Hashing Method for Unsupervised Multispectral Remote Sensing Image Retrieval
    Chen, Xuelei
    Lu, Cunyue
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2020, : 1536 - 1540