Multi-Source Interactive Stair Attention for Remote Sensing Image Captioning

Cited by: 17
Authors
Zhang, Xiangrong [1]
Li, Yunpeng [1]
Wang, Xin [1]
Liu, Feixiang [1]
Wu, Zhaoji [1]
Cheng, Xina [1]
Jiao, Licheng [1]
Affiliations
[1] Xidian Univ, Sch Artificial Intelligence, Key Lab Intelligent Percept & Image Understanding, Minist Educ, Xian 710071, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
remote sensing image captioning; cross-modal interaction; attention mechanism; semantic information; encoder-decoder; TRANSFORMER; NETWORK;
DOI
10.3390/rs15030579
Chinese Library Classification
X [Environmental Science, Safety Science]
Discipline classification codes
08; 0830
Abstract
The aim of remote sensing image captioning (RSIC) is to describe a given remote sensing image (RSI) using coherent sentences. Most existing attention-based methods model the coherence through an LSTM-based decoder, which dynamically infers a word vector from preceding sentences. However, these methods are indirectly guided through the confusion of attentive regions, as (1) the weighted average in the attention mechanism distracts the word vector from capturing pertinent visual regions and (2) there are few constraints or rewards for learning long-range transitions. In this paper, we propose a multi-source interactive stair attention mechanism that separately models the semantics of preceding sentences and visual regions of interest. Specifically, the multi-source interaction takes previous semantic vectors as queries and applies an attention mechanism on regional features to acquire the next word vector, which reduces immediate hesitation by considering linguistics. The stair attention divides the attentive weights into three levels (the core region, the surrounding region, and other regions), so that all regions in the search scope are focused on differently. Then, a CIDEr-based reward reinforcement learning scheme is devised to enhance the quality of the generated sentences. Comprehensive experiments on widely used benchmarks (i.e., the Sydney-Captions, UCM-Captions, and RSICD data sets) demonstrate the superiority of the proposed model over state-of-the-art models in terms of coherence, while maintaining high accuracy.
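The abstract gives no formulas, so the following is only a rough sketch of the general idea of three-level ("stair") attention, not the authors' formulation. It assumes the regional features lie on a row-major grid, that the core region is the one with the highest ordinary attention weight, that the surrounding region is its immediate grid neighbourhood, and that each level is rescaled by a fixed factor before renormalising. The function names, the `level_scales` values, and the neighbourhood rule are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, regions):
    """Scaled dot-product attention of one query over region feature vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * r for q, r in zip(query, region)) / scale
              for region in regions]
    return softmax(scores)


def stair_reweight(weights, grid_w, level_scales=(1.0, 0.5, 0.1)):
    """Hypothetical three-level reweighting: core, surrounding, other.

    Assumption: regions are laid out row-major on a grid of width grid_w,
    and "surrounding" means Chebyshev distance 1 from the core cell.
    """
    core = max(range(len(weights)), key=weights.__getitem__)
    core_row, core_col = divmod(core, grid_w)
    core_scale, near_scale, far_scale = level_scales
    staired = []
    for idx, w in enumerate(weights):
        row, col = divmod(idx, grid_w)
        dist = max(abs(row - core_row), abs(col - core_col))
        if dist == 0:
            staired.append(w * core_scale)
        elif dist == 1:
            staired.append(w * near_scale)
        else:
            staired.append(w * far_scale)
    total = sum(staired)
    return [w / total for w in staired]


def attend(query, regions, grid_w):
    """Return stair-reweighted attention weights and the context vector."""
    weights = stair_reweight(attention_weights(query, regions), grid_w)
    dim = len(regions[0])
    context = [sum(w * region[d] for w, region in zip(weights, regions))
               for d in range(dim)]
    return weights, context


# Toy usage: a 3x3 grid of 2-dimensional region features and one query
# standing in for the previous semantic vector (all values are made up).
regions = [[0.1 * i, 1.0] for i in range(9)]
query = [1.0, 0.0]
weights, context = attend(query, regions, grid_w=3)
```

In this toy setting the reweighting concentrates mass on the core cell and its neighbours and suppresses distant cells, which is one plausible reading of "all regions in the search scope are focused on differently".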
Pages: 22