Visual enhanced gLSTM for image captioning

Cited by: 15
Authors
Zhang, Jing [1 ]
Li, Kangkang [1 ]
Wang, Zhenkun [1 ]
Zhao, Xianwen [1 ]
Wang, Zhe [1 ]
Affiliations
[1] East China Univ Sci & Technol, Dept Comp Sci & Engn, Shanghai 200237, Peoples R China
Keywords
Image caption; Visual enhanced-gLSTM; Bag of; Region of interest; Salient region;
DOI
10.1016/j.eswa.2021.115462
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405
Abstract
To reduce the negative impact of the vanishing gradient on the guiding long short-term memory (gLSTM) model in image captioning, we propose a visual enhanced gLSTM model for image caption generation. Visual features of an image's region of interest (RoI) are extracted and used as guiding information in gLSTM, so that visual information about the RoI is injected into gLSTM to generate more accurate image captions. Two visual enhancement methods are proposed, based on the region and on the entire image respectively: in the former, visual features of the most important semantic region are extracted by a CNN, while in the latter full-image visual features are extracted as visual words; both guide the LSTM toward generating the most important semantic words. The visual features and text features of similar images are then projected into a common semantic space by canonical correlation analysis to obtain the visual enhancement guiding information, which is added to each memory cell of gLSTM when generating caption words. Compared with the original gLSTM, the visual enhanced gLSTM focuses on important semantic regions, which is more in line with human perception of images. Experiments on the Flickr8k dataset show that the proposed method produces more accurate captions and outperforms the baseline gLSTM algorithm and other popular image captioning methods.
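The guiding mechanism the abstract describes can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the dimensions, initialization, and the stand-in guidance vector `g` (which in the paper would be the CCA-projected RoI visual features) are all illustrative assumptions. The key idea shown is that the same guidance vector enters every gate of the LSTM cell at every time step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GuidedLSTMCell:
    """Toy gLSTM-style cell: each gate receives an extra guidance input g."""

    def __init__(self, input_dim, hidden_dim, guide_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One weight block per gate, over the concatenation [x | h_prev | g].
        self.W = rng.normal(0.0, 0.1, (4 * hidden_dim, input_dim + hidden_dim + guide_dim))
        self.b = np.zeros(4 * hidden_dim)
        self.hidden_dim = hidden_dim

    def step(self, x, h_prev, c_prev, g):
        z = self.W @ np.concatenate([x, h_prev, g]) + self.b
        H = self.hidden_dim
        i = sigmoid(z[0:H])        # input gate
        f = sigmoid(z[H:2 * H])    # forget gate
        o = sigmoid(z[2 * H:3 * H])  # output gate
        u = np.tanh(z[3 * H:4 * H])  # candidate cell update
        c = f * c_prev + i * u     # guidance g has influenced every gate above
        h = o * np.tanh(c)
        return h, c

# Toy usage: the same guidance vector is fed at every decoding step.
cell = GuidedLSTMCell(input_dim=8, hidden_dim=16, guide_dim=4)
g = np.ones(4)                     # stand-in for CCA-projected RoI features
h, c = np.zeros(16), np.zeros(16)
for t in range(5):
    x = np.zeros(8)                # stand-in for the previous word embedding
    h, c = cell.step(x, h, c, g)
print(h.shape)  # (16,)
```

In the paper's setting, `g` would be obtained by projecting CNN features of the salient region and text features of similar images into a common semantic space via canonical correlation analysis; here it is simply a constant vector to keep the sketch self-contained.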
Pages: 9
Related papers (50 in total)
  • [31] Image Captioning with Text-Based Visual Attention
    He, Chen
    Hu, Haifeng
    NEURAL PROCESSING LETTERS, 2019, 49 (01) : 177 - 185
  • [32] DIFNet: Boosting Visual Information Flow for Image Captioning
    Wu, Mingrui
    Zhang, Xuying
    Sun, Xiaoshuai
    Zhou, Yiyi
    Chen, Chao
    Gu, Jiaxin
    Sun, Xing
    Ji, Rongrong
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 17999 - 18008
  • [33] Visual contextual relationship augmented transformer for image captioning
    Su, Qiang
    Hu, Junbo
    Li, Zhixin
    APPLIED INTELLIGENCE, 2024, 54 : 4794 - 4813
  • [34] Aligned visual semantic scene graph for image captioning
    Zhao, Shanshan
    Li, Lixiang
    Peng, Haipeng
    DISPLAYS, 2022, 74
  • [35] Visual News: Benchmark and Challenges in News Image Captioning
    Liu, Fuxiao
    Wang, Yinghan
    Wang, Tianlu
    Ordonez, Vicente
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 6761 - 6771
  • [36] Combine Visual Features and Scene Semantics for Image Captioning
    Li Z.-X.
    Wei H.-Y.
    Huang F.-C.
    Zhang C.-L.
    Ma H.-F.
    Shi Z.-Z.
    Li, Zhi-Xin (lizx@gxnu.edu.cn), Science Press, 43 : 1624 - 1640
  • [37] Adaptive Semantic-Enhanced Transformer for Image Captioning
    Zhang, Jing
    Fang, Zhongjun
    Sun, Han
    Wang, Zhe
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (02) : 1785 - 1796
  • [38] Relational Attention with Textual Enhanced Transformer for Image Captioning
    Song, Lifei
    Shi, Yiwen
    Xiao, Xinyu
    Zhang, Chunxia
    Xiang, Shiming
    PATTERN RECOGNITION AND COMPUTER VISION, PT III, 2021, 13021 : 151 - 163
  • [39] Visual Analytics for Efficient Image Exploration and User-Guided Image Captioning
    Li, Yiran
    Wang, Junpeng
    Aboagye, Prince
    Yeh, Chin-Chia Michael
    Zheng, Yan
    Wang, Liang
    Zhang, Wei
    Ma, Kwan-Liu
    IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2024, 30 (06) : 2875 - 2887
  • [40] A Visual Attention-Based Model for Bengali Image Captioning
    Das B.
    Pal R.
    Majumder M.
    Phadikar S.
    Sekh A.A.
    SN Computer Science, 4 (2)