Improving Remote Sensing Image Captioning by Combining Grid Features and Transformer

Cited by: 15
Authors
Zhuang, Shuo [1,2]
Wang, Ping [3]
Wang, Gang [2,3]
Wang, Di [3]
Chen, Jinyong [2]
Gao, Feng [2]
Affiliations
[1] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230601, Peoples R China
[2] CETC Key Lab Aerosp Informat Applicat, Shijiazhuang 050081, Hebei, Peoples R China
[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
Keywords
Feature extraction; Transformers; Decoding; Visualization; Training; Measurement; Semantics; Convolutional neural network (CNN); image captioning; remote sensing; transformer; MODELS;
DOI
10.1109/LGRS.2021.3135711
Chinese Library Classification (CLC) number
P3 [Geophysics]; P59 [Geochemistry]
Discipline classification code
0708; 070902
Abstract
Remote sensing image captioning (RSIC), which describes image content in natural language, is of great significance for image understanding. Existing methods are mainly based on deep learning and rely on an encoder-decoder model to generate sentences. In the decoding process, a recurrent neural network (RNN) or long short-term memory (LSTM) network is normally applied to generate image captions sequentially. In this letter, a transformer encoder-decoder is combined with grid features to improve RSIC performance. First, a pretrained convolutional neural network (CNN) is used to extract grid-based visual features, which are encoded as vector representations. Then, the transformer outputs semantic descriptions to bridge visual features and natural language. In addition, the self-critical sequence training (SCST) strategy is applied to further optimize the image captioning model and improve the quality of the generated sentences. Extensive experiments are conducted on three public datasets: RSICD, UCM-Captions, and Sydney-Captions. Experimental results demonstrate the effectiveness of the SCST strategy, and the proposed method achieves superior performance compared with state-of-the-art image captioning approaches on the RSICD dataset.
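The two steps the abstract names, turning a CNN feature map into grid tokens and optimizing with SCST, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `flatten_grid` and `scst_loss`, the nested-list data layout, and the use of a greedy-decoded caption as the reward baseline (the usual SCST formulation) are all assumptions made here for illustration.

```python
import math


def flatten_grid(feature_map):
    """Turn a C x H x W feature map (nested lists) into H*W tokens of length C.

    Hypothetical helper: each spatial cell of the CNN output becomes one
    token that a transformer encoder could attend over.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [feature_map[c][i][j] for c in range(channels)]
        for i in range(height)
        for j in range(width)
    ]


def scst_loss(sample_log_probs, sample_reward, greedy_reward):
    """Self-critical loss for one sampled caption (assumed standard form).

    sample_log_probs: log-probability of each word in the sampled caption.
    sample_reward:    score of the sampled caption (e.g. a caption metric).
    greedy_reward:    score of the greedy-decoded caption, used as baseline.
    """
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the likelihood of captions that beat the baseline
    # and lowers the likelihood of captions that score below it.
    return -advantage * sum(sample_log_probs)


# Toy 2-channel, 2 x 2 feature map: four grid cells, so four tokens.
toy_map = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
tokens = flatten_grid(toy_map)

# Sampled caption of two words, each with probability 0.5.
loss = scst_loss([math.log(0.5), math.log(0.5)], sample_reward=1.0, greedy_reward=0.5)
```

In this toy setting the sampled caption scores above the greedy baseline, so the loss is positive and a gradient step would increase the probability of the sampled words. When the two rewards are equal, the loss is zero and the sample contributes no gradient.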
Pages: 5
Related papers
50 in total
  • [21] Exploring better image captioning with grid features
    Yan, Jie
    Xie, Yuxiang
    Guo, Yanming
    Wei, Yingmei
    Luan, Xidao
    [J]. COMPLEX & INTELLIGENT SYSTEMS, 2024, 10 (03) : 3541 - 3556
  • [22] On Combining Multiple Features for Hyperspectral Remote Sensing Image Classification
    Zhang, Lefei
    Zhang, Liangpei
    Tao, Dacheng
    Huang, Xin
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2012, 50 (03) : 879 - 893
  • [23] Remote Sensing Image Segmentation by Combining Spectral and Texture Features
    Yuan, Jiangye
    Wang, DeLiang
    Li, Rongxing
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2014, 52 (01) : 16 - 24
  • [24] Enhanced Transformer for Remote-Sensing Image Captioning with Positional-Channel Semantic Fusion
    Zhao, An
    Yang, Wenzhong
    Chen, Danny
    Wei, Fuyuan
    [J]. ELECTRONICS, 2024, 13 (18)
  • [25] Improving Stylized Image Captioning with Better Use of Transformer
    Tan, Yutong
    Lin, Zheng
    Liu, Huan
    Zuo, Fan
    [J]. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT III, 2022, 13531 : 347 - 358
  • [26] Remote-sensing image retrieval by combining image visual and semantic features
    Wang, M.
    Wan, Q. M.
    Gu, L. B.
    Song, T. Y.
    [J]. INTERNATIONAL JOURNAL OF REMOTE SENSING, 2013, 34 (12) : 4200 - 4223
  • [27] Meta captioning: A meta learning based remote sensing image captioning framework
    Yang, Qiaoqiao
    Ni, Zihao
    Ren, Peng
    [J]. ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING, 2022, 186 : 190 - 200
  • [28] Intensive Positioning Network for Remote Sensing Image Captioning
    Wang, Shengsheng
    Chen, Jiawei
    Wang, Guangyao
    [J]. INTELLIGENCE SCIENCE AND BIG DATA ENGINEERING, 2018, 11266 : 567 - 576
  • [29] Incorporating object counts into remote sensing image captioning
    Ni, Zihao
    Zong, Zhaoyun
    Ren, Peng
    [J]. INTERNATIONAL JOURNAL OF DIGITAL EARTH, 2024, 17 (01)
  • [30] Structural Representative Network for Remote Sensing Image Captioning
    Sharma, Jaya
    Divya, Peketi
    Sravani, Yenduri
    Shekar, B. H.
    Mohan, Krishna C.
    [J]. FIFTEENTH INTERNATIONAL CONFERENCE ON MACHINE VISION, ICMV 2022, 2023, 12701