Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning

Cited by: 1
Authors
Li, Zhengxin [1 ,2 ,3 ]
Zhao, Wenzhe [1 ,2 ]
Du, Xuanyi [1 ,3 ]
Zhou, Guangyao [1 ,2 ]
Zhang, Songlin [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Aerosp Informat Res Inst, Beijing 100190, Peoples R China
[2] Chinese Acad Sci, Key Lab Spatial Informat Proc & Applicat Syst Tech, Beijing 100190, Peoples R China
[3] Univ Chinese Acad Sci, Sch Elect Elect & Commun Engn, Beijing 101408, Peoples R China
Keywords
semantic retrieving; attention mechanism; image captioning; remote sensing;
DOI: 10.3390/rs16010196
CLC number: X [Environmental Science, Safety Science]
Discipline codes: 08; 0830
Abstract
Two-stage remote sensing image captioning (RSIC) methods have achieved promising results by incorporating additional pre-trained remote sensing tasks to extract supplementary information and improve caption quality. However, these methods face limitations in semantic comprehension, as pre-trained detectors/classifiers are constrained by predefined labels, leading to an oversight of the intricate and diverse details present in remote sensing images (RSIs). Additionally, handling auxiliary remote sensing tasks separately can make it difficult to ensure seamless integration and alignment with the captioning process. To address these problems, we propose a novel cross-modal retrieval and semantic refinement (CRSR) RSIC method. Specifically, we employ a cross-modal retrieval model to retrieve sentences relevant to each image. The words in these retrieved sentences are then treated as primary semantic information, providing valuable supplementary information for the captioning process. To further enhance caption quality, we introduce a semantic refinement module that refines the primary semantic information, helping to filter out misleading information and emphasize visually salient semantic information. A Transformer Mapper network is introduced to expand the representation of image features beyond the retrieved supplementary information with learnable queries. Both the refined semantic tokens and visual features are integrated and fed into a cross-modal decoder for caption generation. Through extensive experiments, we demonstrate the superiority of our CRSR method over existing state-of-the-art approaches on the RSICD, UCM-Captions, and Sydney-Captions datasets.
Pages: 20
Related papers (50 total)
  • [41] SAM: cross-modal semantic alignments module for image-text retrieval
    Park, Pilseo
    Jang, Soojin
    Cho, Yunsung
    Kim, Youngbin
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (04) : 12363 - 12377
  • [42] Learning Hierarchical Semantic Correspondences for Cross-Modal Image-Text Retrieval
    Zeng, Sheng
    Liu, Changhong
    Zhou, Jun
    Chen, Yong
    Jiang, Aiwen
    Li, Hanxi
    [J]. PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2022, 2022, : 239 - 248
  • [43] MCRN: A Multi-source Cross-modal Retrieval Network for remote sensing
    Yuan, Zhiqiang
    Zhang, Wenkai
    Tian, Changyuan
    Mao, Yongqiang
    Zhou, Ruixue
    Wang, Hongqi
    Fu, Kun
    Sun, Xian
    [J]. INTERNATIONAL JOURNAL OF APPLIED EARTH OBSERVATION AND GEOINFORMATION, 2022, 115
  • [45] A Fine-Grained Semantic Alignment Method Specific to Aggregate Multi-Scale Information for Cross-Modal Remote Sensing Image Retrieval
    Zheng, Fuzhong
    Wang, Xu
    Wang, Luyao
    Zhang, Xiong
    Zhu, Hongze
    Wang, Long
    Zhang, Haisu
    [J]. SENSORS, 2023, 23 (20)
  • [46] Exploring Uni-Modal Feature Learning on Entities and Relations for Remote Sensing Cross-Modal Text-Image Retrieval
    Zhang, Shun
    Li, Yupeng
    Mei, Shaohui
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61
  • [47] Cross-Modal Contrastive Learning With Spatiotemporal Context for Correlation-Aware Multiscale Remote Sensing Image Retrieval
    Zhu, Lilu
    Wang, Yang
    Hu, Yanfeng
    Su, Xiaolu
    Fu, Kun
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62
  • [48] MGAN: Attempting a Multimodal Graph Attention Network for Remote Sensing Cross-Modal Text-Image Retrieval
    Wang, Zhiming
    Dong, Zhihua
    Yang, Xiaoyu
    Wang, Zhiguo
    Yin, Guangqiang
    [J]. PROCEEDINGS OF THE 13TH INTERNATIONAL CONFERENCE ON COMPUTER ENGINEERING AND NETWORKS, VOL II, CENET 2023, 2024, 1126 : 261 - 273
  • [49] A Fusion Encoder with Multi-Task Guidance for Cross-Modal Text-Image Retrieval in Remote Sensing
    Zhang, Xiong
    Li, Weipeng
    Wang, Xu
    Wang, Luyao
    Zheng, Fuzhong
    Wang, Long
    Zhang, Haisu
    [J]. REMOTE SENSING, 2023, 15 (18)
  • [50] Hypersphere-Based Remote Sensing Cross-Modal Text-Image Retrieval via Curriculum Learning
    Zhang, Weihang
    Li, Jihao
    Li, Shuoke
    Chen, Jialiang
    Zhang, Wenkai
    Gao, Xin
    Sun, Xian
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61