GVA: guided visual attention approach for automatic image caption generation

Cited: 3
Authors
Hossen, Md. Bipul [1 ]
Ye, Zhongfu [1 ]
Abdussalam, Amr [1 ]
Hossain, Md. Imran [2 ]
Affiliations
[1] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230027, Anhui, Peoples R China
[2] Pabna Univ Sci & Technol, Dept ICE, Pabna 6600, Bangladesh
Keywords
Image captioning; Faster R-CNN; LSTM; Up-down model; Encoder-decoder framework;
DOI
10.1007/s00530-023-01249-w
CLC Classification
TP [Automation Technology, Computer Technology];
Discipline Code
0812 ;
Abstract
Automated image caption generation with attention mechanisms focuses on visual features of the image, including objects, attributes, actions, and scenes, to understand and produce more detailed captions; this task has attracted great attention in the multimedia field. However, deciding which aspects of an image to highlight for better captioning remains a challenge. Most advanced captioning models utilize only one attention module to assign attention weights to visual vectors, which may not be enough to create an informative caption. To tackle this issue, we propose an innovative and well-designed Guided Visual Attention (GVA) approach that incorporates an additional attention mechanism to re-adjust the attention weights on the visual feature vectors and feed the resulting context vector to the language LSTM. Utilizing the first-level attention module as guidance for the GVA module and re-weighting the attention weights significantly enhances the caption's quality. Recently, deep neural networks have allowed the encoder-decoder architecture to make use of the visual attention mechanism, where Faster R-CNN is used for extracting features in the encoder and a visual attention-based LSTM is applied in the decoder. Extensive experiments have been conducted on both the MS-COCO and Flickr30k benchmark datasets. Compared with state-of-the-art methods, our approach achieved an average improvement of 2.4% on BLEU@1 and 13.24% on CIDEr for the MS-COCO dataset, as well as 4.6% on BLEU@1 and 12.48% on CIDEr for the Flickr30k dataset, based on cross-entropy optimization. These results demonstrate the clear superiority of our proposed approach over existing methods on standard evaluation metrics. The implementation code can be found here: (https://github.com/mdbipu/GVA).
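The two-level attention described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the exact form of the guidance (here, element-wise re-weighting of the second-level scores by the first-level weights) and all shapes and parameter names are assumptions for illustration only.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def guided_visual_attention(features, h_att, h_lang, W1, W2):
    """Sketch of a two-level (guided) visual attention step.

    features : (k, d) region features, e.g. from Faster R-CNN
    h_att    : (d,) attention-LSTM hidden state (first-level query)
    h_lang   : (d,) language-LSTM hidden state (second-level query)
    W1, W2   : (d, d) projection matrices (hypothetical parameters)
    """
    # First-level attention: standard up-down-style weights over regions
    alpha1 = softmax(features @ (W1 @ h_att))
    # Second-level (guided) attention: first-level weights act as guidance
    # by re-weighting the new scores (assumed guidance form)
    alpha2 = softmax((features @ (W2 @ h_lang)) * alpha1)
    # Context vector fed to the language LSTM
    context = alpha2 @ features
    return context, alpha1, alpha2

rng = np.random.default_rng(0)
k, d = 36, 8  # 36 regions (a common Faster R-CNN setting), toy feature dim
features = rng.standard_normal((k, d))
h_att = rng.standard_normal(d)
h_lang = rng.standard_normal(d)
W1 = rng.standard_normal((d, d))
W2 = rng.standard_normal((d, d))
ctx, a1, a2 = guided_visual_attention(features, h_att, h_lang, W1, W2)
```

Both attention distributions sum to one, and the re-weighted second-level weights determine the final context vector, mirroring how the GVA module re-adjusts the first module's attention before the language LSTM.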
Pages: 16
Related Papers
50 in total
  • [41] Automatic Image Caption Generation Based on Some Machine Learning Algorithms
    Predic, Bratislav
    Manic, Dasa
    Saracevic, Muzafer
    Karabasevic, Darjan
    Stanujkic, Dragisa
    MATHEMATICAL PROBLEMS IN ENGINEERING, 2022, 2022
  • [42] Attention-based Visual-Audio Fusion for Video Caption Generation
    Guo, Ningning
    Liu, Huaping
    Jiang, Linhua
    2019 IEEE 4TH INTERNATIONAL CONFERENCE ON ADVANCED ROBOTICS AND MECHATRONICS (ICARM 2019), 2019, : 839 - 844
  • [43] Mind's Eye: A Recurrent Visual Representation for Image Caption Generation
    Chen, Xinlei
    Zitnick, C. Lawrence
    2015 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2015, : 2422 - 2431
  • [44] TVPRNN for image caption generation
    Yang, Liang
    Hu, Haifeng
    ELECTRONICS LETTERS, 2017, 53 (22) : 1471 - +
  • [45] Language of Gleam: Impressionism Artwork Automatic Caption Generation for People with Visual Impairments
    Lee, Dongmyeong
    Hwang, Hyegyeong
    Jabbar, Muhammad Shahid
    Cho, Jun-Dong
    THIRTEENTH INTERNATIONAL CONFERENCE ON MACHINE VISION (ICMV 2020), 2021, 11605
  • [46] Image Caption with Endogenous–Exogenous Attention
    Teng Wang
    Haifeng Hu
    Chen He
    Neural Processing Letters, 2019, 50 : 431 - 443
  • [47] Attention based sequence-to-sequence framework for auto image caption generation
    Khan, Rashid
    Islam, M. Shujah
    Kanwal, Khadija
    Iqbal, Mansoor
    Hossain, Md Imran
    Ye, Zhongfu
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2022, 43 (01) : 159 - 170
  • [48] CNN image caption generation
    Li Y.
    Cheng H.
    Liang X.
    Guo Q.
    Qian Y.
    Xi'an Dianzi Keji Daxue Xuebao/Journal of Xidian University, 2019, 46 (02): : 152 - 157
  • [49] Spatial Relational Attention Using Fully Convolutional Networks for Image Caption Generation
    Jiang, Teng
    Gong, Liang
    Yang, Yupu
    INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE AND APPLICATIONS, 2020, 19 (02)
  • [50] Enhancing image caption generation through context-aware attention mechanism
    Bhuiyan, Ahatesham
    Hossain, Eftekhar
    Hoque, Mohammed Moshiul
    Dewan, M. Ali Akber
    HELIYON, 2024, 10 (17)