Visual enhanced gLSTM for image captioning

Cited by: 15
Authors
Zhang, Jing [1 ]
Li, Kangkang [1 ]
Wang, Zhenkun [1 ]
Zhao, Xianwen [1 ]
Wang, Zhe [1 ]
Affiliations
[1] East China Univ Sci & Technol, Dept Comp Sci & Engn, Shanghai 200237, Peoples R China
Keywords
Image caption; Visual enhanced-gLSTM; Bag of visual words; Region of interest; Salient region
DOI
10.1016/j.eswa.2021.115462
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
To reduce the negative impact of gradient diminishing on the guiding long short-term memory (gLSTM) model in image captioning, we propose a visual enhanced gLSTM model for image caption generation. In this paper, visual features of an image's region of interest (RoI) are extracted and used as guiding information in gLSTM, so that the visual information of the RoI is injected into gLSTM to generate more accurate image captions. Two visual enhancement methods, based on the salient region and on the entire image respectively, are proposed: visual features of the important semantic region are extracted by a CNN, and full-image visual features are extracted with visual words, both guiding the LSTM to generate the most important semantic words. The visual features and text features of similar images are then projected into a common semantic space by canonical correlation analysis to obtain the visual enhancement guiding information, which is added to each memory cell of the gLSTM when generating caption words. Compared with the original gLSTM method, the visual enhanced gLSTM model focuses on the important semantic regions, which is more in line with human perception of images. Experiments on the Flickr8k dataset show that the proposed method achieves more accurate image captions and outperforms the baseline gLSTM algorithm and other popular image captioning methods.
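The abstract describes two computations: a gLSTM cell whose gates all receive a guide vector, and a CCA projection that produces that guide from RoI visual features. Below is a minimal NumPy/scikit-learn sketch of one guided decoding step, assuming the standard gLSTM gate equations; the dimensions, parameter names (W, U, V, b), and the random stand-in features are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glstm_step(x, h_prev, c_prev, g, W, U, V, b):
    """One gLSTM step: unlike a plain LSTM, the guide vector g
    is added to every gate and to the candidate memory content."""
    pre = {k: W[k] @ x + U[k] @ h_prev + V[k] @ g + b[k]
           for k in "ifoc"}
    i, f, o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
    c = f * c_prev + i * np.tanh(pre["c"])   # guided memory cell
    h = o * np.tanh(c)                       # new hidden state
    return h, c

# Guide vector: fit CCA on paired visual/text features of similar
# images (random stand-ins here), then project a new image's RoI
# CNN feature into the learned common semantic space.
d_v, d_t, d_g = 512, 300, 16
visual_feats = rng.standard_normal((100, d_v))   # stand-in CNN features
text_feats = rng.standard_normal((100, d_t))     # stand-in text features
cca = CCA(n_components=d_g).fit(visual_feats, text_feats)
roi_feat = rng.standard_normal(d_v)              # RoI feature, test image
g = cca.transform(roi_feat[None, :])[0]          # guidance in common space

# One decoding step with the guided cell.
d_x, d_h = 128, 256
W = {k: rng.standard_normal((d_h, d_x)) * 0.01 for k in "ifoc"}
U = {k: rng.standard_normal((d_h, d_h)) * 0.01 for k in "ifoc"}
V = {k: rng.standard_normal((d_h, d_g)) * 0.01 for k in "ifoc"}
b = {k: np.zeros(d_h) for k in "ifoc"}
x = rng.standard_normal(d_x)                     # current word embedding
h, c = glstm_step(x, np.zeros(d_h), np.zeros(d_h), g, W, U, V, b)
print(h.shape, c.shape)                          # (256,) (256,)
```

Because g re-enters every gate at every time step, the semantic guidance is not carried solely through the recurrent state, which is the mechanism the abstract credits with reducing the effect of diminishing gradients.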
Pages: 9
Related Papers
50 items in total
  • [41] Boosting convolutional image captioning with semantic content and visual relationship
    Bai, Cong
    Zheng, Anqi
    Huang, Yuan
    Pan, Xiang
    Chen, Nan
    DISPLAYS, 2021, 70
  • [42] Multi-level Visual Fusion Networks for Image Captioning
    Zhou, Dongming
    Zhang, Canlong
    Li, Zhixin
    Wang, Zhiwen
    2020 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2020
  • [43] Neuraltalk+: neural image captioning with visual assistance capabilities
    Sharma, H.
    Padha, D.
    MULTIMEDIA TOOLS AND APPLICATIONS, 2025, 84 (10) : 6843 - 6871
  • [44] Aligning Linguistic Words and Visual Semantic Units for Image Captioning
    Guo, Longteng
    Liu, Jing
    Tang, Jinhui
    Li, Jiangwei
    Luo, Wei
    Lu, Hanqing
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 765 - 773
  • [45] Social Image Captioning: Exploring Visual Attention and User Attention
    Wang, Leiquan
    Chu, Xiaoliang
    Zhang, Weishan
    Wei, Yiwei
    Sun, Weichen
    Wu, Chunlei
    SENSORS, 2018, 18 (02)
  • [46] Local-global visual interaction attention for image captioning
    Wang, Changzhi
    Gu, Xiaodong
    DIGITAL SIGNAL PROCESSING, 2022, 130
  • [47] VIXEN: Visual Text Comparison Network for Image Difference Captioning
    Black, Alexander
    Shi, Jing
    Fan, Yifei
    Bui, Tu
    Collomosse, John
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 2, 2024, : 846 - 854
  • [48] A Comparative Study on Deep CNN Visual Encoders for Image Captioning
    Arun, M.
    Arivazhagan, S.
    Harinisri, R.
    Raghavi, P. S.
    COMPUTER VISION AND IMAGE PROCESSING, CVIP 2023, PT III, 2024, 2011 : 14 - 26
  • [49] Image Captioning using Visual Attention and Detection Transformer Model
    Eluri, Yaswanth
    Vinutha, N.
    Jeevika, M.
    Sree, Sai Bhavya N.
    Abhiram, G. Surya
    10TH INTERNATIONAL CONFERENCE ON ELECTRONICS, COMPUTING AND COMMUNICATION TECHNOLOGIES, CONECCT 2024, 2024
  • [50] Image Captioning Based on Visual Relevance and Context Dual Attention
    Liu, M.-F.
    Shi, Q.
    Nie, L.-Q.
    RUAN JIAN XUE BAO/JOURNAL OF SOFTWARE, 2022, 33 (09)