Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning

Cited by: 1
Authors
Li, Zhengxin [1, 2, 3]
Zhao, Wenzhe [1, 2]
Du, Xuanyi [1, 3]
Zhou, Guangyao [1, 2]
Zhang, Songlin [1, 2]
Affiliations
[1] Chinese Acad Sci, Aerosp Informat Res Inst, Beijing 100190, Peoples R China
[2] Chinese Acad Sci, Key Lab Spatial Informat Proc & Applicat Syst Tech, Beijing 100190, Peoples R China
[3] Univ Chinese Acad Sci, Sch Elect Elect & Commun Engn, Beijing 101408, Peoples R China
Keywords
semantic retrieving; attention mechanism; image captioning; remote sensing
DOI
10.3390/rs16010196
Chinese Library Classification (CLC) number
X [Environmental Science, Safety Science]
Discipline classification code
08; 0830
Abstract
Two-stage remote sensing image captioning (RSIC) methods have achieved promising results by incorporating additional pre-trained remote sensing tasks to extract supplementary information and improve caption quality. However, these methods are limited in semantic comprehension: pre-trained detectors and classifiers are constrained by predefined labels, so they overlook the intricate and diverse details present in remote sensing images (RSIs). Additionally, handling the auxiliary remote sensing tasks separately makes it difficult to integrate and align them seamlessly with the captioning process. To address these problems, we propose a novel cross-modal retrieval and semantic refinement (CRSR) RSIC method. Specifically, we employ a cross-modal retrieval model to retrieve relevant sentences for each image. The words in these retrieved sentences are then treated as primary semantic information, providing valuable supplementary information for the captioning process. To further enhance caption quality, we introduce a semantic refinement module that refines the primary semantic information, filtering out misleading information and emphasizing visually salient semantic information. A Transformer Mapper network with learnable queries is introduced to expand the representation of image features beyond the retrieved supplementary information. Both the refined semantic tokens and the visual features are integrated and fed into a cross-modal decoder for caption generation. Through extensive experiments, we demonstrate the superiority of our CRSR method over existing state-of-the-art approaches on the RSICD, UCM-Captions, and Sydney-Captions datasets.
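
The retrieval step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the cross-modal retrieval model scores sentences, so this example assumes that images and sentences are already embedded in a shared vector space and ranked by cosine similarity. The function names (`cosine`, `retrieve_semantic_words`), the toy embeddings, and the parameter `k` are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_semantic_words(image_vec, sentence_bank, k=2):
    """Return the de-duplicated words of the k sentences most similar to the image.

    sentence_bank is a list of (sentence, embedding) pairs. The embeddings are
    assumed to come from some pre-trained cross-modal model (hypothetical here).
    """
    ranked = sorted(
        sentence_bank,
        key=lambda item: cosine(image_vec, item[1]),
        reverse=True,
    )
    words = []
    for sentence, _ in ranked[:k]:
        for word in sentence.lower().split():
            if word not in words:  # keep first occurrence only
                words.append(word)
    return words

# Toy data: hand-written 3-dimensional embeddings, for illustration only.
image_vec = [0.9, 0.1, 0.0]
sentence_bank = [
    ("many planes are parked at the airport", [1.0, 0.0, 0.0]),
    ("a river runs through the forest", [0.0, 1.0, 0.0]),
    ("several planes near a terminal", [0.8, 0.2, 0.1]),
]
words = retrieve_semantic_words(image_vec, sentence_bank, k=2)
print(words)
```

In this sketch the returned word list plays the role of the "primary semantic information"; the semantic refinement module, Transformer Mapper, and cross-modal decoder described in the abstract are learned components and are not modeled here.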
Pages: 20