Visual enhanced gLSTM for image captioning

Cited: 15
Authors
Zhang, Jing [1 ]
Li, Kangkang [1 ]
Wang, Zhenkun [1 ]
Zhao, Xianwen [1 ]
Wang, Zhe [1 ]
Institutions
[1] East China Univ Sci & Technol, Dept Comp Sci & Engn, Shanghai 200237, Peoples R China
Keywords
Image caption; Visual enhanced-gLSTM; Bag of; Region of interest; Salient region;
DOI
10.1016/j.eswa.2021.115462
CLC number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
To reduce the negative impact of diminishing gradients on the guiding long short-term memory (gLSTM) model in image captioning, we propose a visual enhanced gLSTM model for image caption generation. In this paper, visual features of an image's region of interest (RoI) are extracted and used as guiding information in the gLSTM, so that visual information of the RoI is injected into the gLSTM to generate more accurate image captions. Two visual enhancement methods are proposed, based on the salient region and on the entire image respectively. In the first, visual features of the most important semantic region are extracted by a CNN; in the second, full-image visual features are extracted via visual words; both guide the LSTM toward generating the most important semantic words. The visual features and text features of similar images are then projected into a common semantic space by canonical correlation analysis to obtain the visual enhancement guiding information, which is added to each memory cell of the gLSTM when generating caption words. Compared with the original gLSTM method, the visual enhanced gLSTM model focuses on important semantic regions, which is more in line with human perception of images. Experiments on the Flickr8k dataset show that the proposed method generates more accurate image captions and outperforms the baseline gLSTM algorithm and other popular image captioning methods.
Pages: 9
Related papers
50 records
  • [21] Input enhanced asymmetric transformer for image captioning
    Chenhao Zhu
    Xia Ye
    Qiduo Lu
    Signal, Image and Video Processing, 2023, 17 : 1419 - 1427
  • [22] RVAIC: Refined visual attention for improved image captioning
    Al-Qatf, Majjed
    Hawbani, Ammar
    Wang, XingFu
    Abdusallam, Amr
    Alsamhi, Saeed
    Alhabib, Mohammed
    Curry, Edward
    Journal of Intelligent and Fuzzy Systems, 2024, 46 (02): : 3447 - 3459
  • [25] Image captioning in Bengali language using visual attention
    Masud, Adiba
    Hosen, Md. Biplob
    Habibullah, Md.
    Anannya, Mehrin
    Kaiser, M. Shamim
    PLOS ONE, 2025, 20 (02):
  • [26] Image Captioning With Visual-Semantic Double Attention
    He, Chen
    Hu, Haifeng
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2019, 15 (01)
  • [27] Visual contextual relationship augmented transformer for image captioning
    Su, Qiang
    Hu, Junbo
    Li, Zhixin
    APPLIED INTELLIGENCE, 2024, 54 (06) : 4794 - 4813
  • [28] Image Captioning with Text-Based Visual Attention
    Chen He
    Haifeng Hu
    Neural Processing Letters, 2019, 49 : 177 - 185
  • [29] VISUAL SALIENCY FOR IMAGE CAPTIONING IN NEW MULTIMEDIA SERVICES
    Cornia, Marcella
    Baraldi, Lorenzo
    Serra, Giuseppe
    Cucchiara, Rita
    2017 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO WORKSHOPS (ICMEW), 2017,
  • [30] A visual question answering model based on image captioning
    Zhou, Kun
    Liu, Qiongjie
    Zhao, Dexin
    MULTIMEDIA SYSTEMS, 2024, 30 (06)