An End-to-End Framework Based on Vision-Language Fusion for Remote Sensing Cross-Modal Text-Image Retrieval

Cited by: 5
Authors
He, Liu [1 ]
Liu, Shuyan [1 ]
An, Ran [1 ]
Zhuo, Yudong [1 ]
Tao, Jian [1 ]
Affiliations
[1] China Aeropolytechnol Estab, Dept Big Data Res & Applicat Technol, Beijing 100028, Peoples R China
Keywords
remote sensing cross-modal text-image retrieval; vision-language fusion; multi-modal learning; multitask optimization
DOI
10.3390/math11102279
Chinese Library Classification
O1 [Mathematics]
Discipline classification code
0701; 070101
Abstract
Remote sensing cross-modal text-image retrieval (RSCTIR) has recently attracted extensive attention because it enables fast extraction of information from remote sensing images and flexible human-computer interaction. Traditional RSCTIR methods mainly focus on improving uni-modal feature extraction separately, and most rely on pre-trained object detectors to obtain better local feature representations. As a result, they lack multi-modal interaction information and suffer from a training gap between the pre-trained object detector and the retrieval task. In this paper, we propose an end-to-end RSCTIR framework based on vision-language fusion (EnVLF), consisting of two uni-modal (vision and language) encoders and a multi-modal encoder, which can be optimized by multitask training. Specifically, to achieve an end-to-end training process, we introduce a vision transformer module to extract image local features instead of a pre-trained object detector. Through semantic alignment of visual and text features, the vision transformer module achieves the same performance as pre-trained object detectors for image local features. In addition, the trained multi-modal encoder can improve the top-one and top-five ranking performance after retrieval processing. Experiments on the common RSICD and RSITMD datasets demonstrate that EnVLF obtains state-of-the-art retrieval performance.
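The abstract describes a first retrieval pass followed by a multi-modal encoder that improves the top-one and top-five ranks, but it does not give the details. The following is only a minimal sketch of such a two-stage scheme, under the assumption that the first pass ranks images by cosine similarity of uni-modal embeddings and that the multi-modal encoder yields a per-candidate fusion score. The names `coarse_rank`, `rerank_top_k`, and `fusion_score` are hypothetical and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def coarse_rank(text_emb, image_embs):
    """Stage 1 (assumed): order image indices by similarity to the text embedding."""
    scores = [cosine(text_emb, emb) for emb in image_embs]
    return sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)


def rerank_top_k(order, fusion_score, k=5):
    """Stage 2 (assumed): re-order only the first k candidates by a fusion score.

    `fusion_score` is a stand-in for a multi-modal encoder that scores one
    (text, image) pair; here it is any callable taking an image index.
    """
    head = sorted(order[:k], key=fusion_score, reverse=True)
    return head + order[k:]


# Toy data: four 2-D image embeddings and one text embedding (illustrative only).
image_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
text_emb = [1.0, 0.0]

# Made-up fusion scores standing in for the multi-modal encoder's output.
toy_fusion = {0: 0.2, 1: 0.9, 2: 0.99, 3: 0.5}

first_pass = coarse_rank(text_emb, image_embs)
final = rerank_top_k(first_pass, lambda i: toy_fusion[i], k=2)
print(first_pass, final)
```

Re-scoring only the first k candidates keeps the more expensive pairwise scoring bounded, while the cheaper embedding comparison handles the full gallery; whether EnVLF is organized exactly this way is not stated in the abstract.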
Pages: 17
Related papers
50 in total
  • [41] Cross-Modal Multitask Transformer for End-to-End Multimodal Aspect-Based Sentiment Analysis
    Yang, Li
    Na, Jin-Cheon
    Yu, Jianfei
    [J]. INFORMATION PROCESSING & MANAGEMENT, 2022, 59 (05)
  • [42] Consistency Center-Based Deep Cross-Modal Hashing for Multisource Remote Sensing Image Retrieval
    Sun, Yuxi
    Ye, Yunming
    Kang, Jian
    Fernandez-Beltran, Ruben
    Li, Xutao
    Xiong, Zhenyu
    Huang, Xu
    Plaza, Antonio
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61
  • [43] A NOVEL SELF-SUPERVISED CROSS-MODAL IMAGE RETRIEVAL METHOD IN REMOTE SENSING
    Sumbul, Gencer
    Mueller, Markus
    Demir, Beguem
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022, : 2426 - 2430
  • [44] Robust Cross-Modal Remote Sensing Image Retrieval via Maximal Correlation Augmentation
    Wang, Zhuoyue
    Wang, Xueqian
    Li, Gang
    Li, Chengxi
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62
  • [45] Cross-modal Transfer Learning via Multi-grained Alignment for End-to-End Spoken Language Understanding
    Zhu, Yi
    Wang, Zexun
    Liu, Hang
    Wang, Peiying
    Feng, Mingchao
    Chen, Meng
    He, Xiaodong
    [J]. INTERSPEECH 2022, 2022, : 1131 - 1135
  • [46] PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language Pre-training via Prompting
    Guo, Zixin
    Wang, Tzu-Jui Julius
    Pehlivan, Selen
    Radman, Abduljalil
    Laaksonen, Jorma
    [J]. PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023, 2023, : 2261 - 2265
  • [47] COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
    Lu, Haoyu
    Fei, Nanyi
    Huo, Yuqi
    Gao, Yizhao
    Lu, Zhiwu
    Wen, Ji-Rong
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 15671 - 15680
  • [48] Image-text bidirectional learning network based cross-modal retrieval
    Li, Zhuoyi
    Lu, Huibin
    Fu, Hao
    Gu, Guanghua
    [J]. NEUROCOMPUTING, 2022, 483 : 148 - 159
  • [49] Remote sensing image description based on word embedding and end-to-end deep learning
    Yuan Wang
    Hongbing Ma
    Kuerban Alifu
    Yalong Lv
    [J]. Scientific Reports, 11
  • [50] Remote sensing image description based on word embedding and end-to-end deep learning
    Wang, Yuan
    Ma, Hongbing
    Alifu, Kuerban
    Lv, Yalong
    [J]. SCIENTIFIC REPORTS, 2021, 11 (01)