GVA: guided visual attention approach for automatic image caption generation

Cited by: 3
Authors
Hossen, Md. Bipul [1]
Ye, Zhongfu [1]
Abdussalam, Amr [1]
Hossain, Md. Imran [2]
Affiliations
[1] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei 230027, Anhui, Peoples R China
[2] Pabna Univ Sci & Technol, Dept ICE, Pabna 6600, Bangladesh
Keywords
Image captioning; Faster R-CNN; LSTM; Up-down model; Encoder-decoder framework
DOI
10.1007/s00530-023-01249-w
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Automated image caption generation with attention mechanisms focuses on visual features of the image, including objects, attributes, actions, and scenes, to understand the image and produce more detailed captions, and it has attracted considerable attention in the multimedia field. However, deciding which aspects of an image to highlight for better captioning remains a challenge. Most advanced captioning models use only one attention module to assign attention weights to visual vectors, which may not be enough to create an informative caption. To tackle this issue, we propose a Guided Visual Attention (GVA) approach that incorporates an additional attention mechanism to re-adjust the attention weights on the visual feature vectors and feeds the resulting context vector to the language LSTM. Using the first-level attention module as guidance for the GVA module and re-weighting the attention weights significantly enhances caption quality. Recently, deep neural networks have allowed the encoder-decoder architecture to make use of visual attention mechanisms, where Faster R-CNN is used to extract features in the encoder and a visual attention-based LSTM is applied in the decoder. Extensive experiments were conducted on both the MS-COCO and Flickr30K benchmark datasets. Compared with state-of-the-art methods, our approach achieved an average improvement of 2.4% on BLEU@1 and 13.24% on CIDEr for the MS-COCO dataset, as well as 4.6% on BLEU@1 and 12.48% on CIDEr for the Flickr30K dataset, based on cross-entropy optimization. These results demonstrate the superiority of our proposed approach over existing methods on standard evaluation metrics. The implementation code is available at https://github.com/mdbipu/GVA.
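The two-level attention idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that): the dot-product scoring, the additive way the first-level context guides the second level, and the names `guided_attention`, `regions`, and `hidden` are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights, vectors):
    """Context vector: sum over i of weights[i] * vectors[i]."""
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]


def guided_attention(regions, hidden):
    """Hypothetical two-level attention over region feature vectors."""
    # First level: score each region against the decoder hidden state.
    alpha = softmax([dot(r, hidden) for r in regions])
    context = weighted_sum(alpha, regions)

    # Second level (assumed form): the first-level context acts as guidance
    # by being added to the query before the regions are re-scored.
    guide = [h + c for h, c in zip(hidden, context)]
    beta = softmax([dot(r, guide) for r in regions])

    # The re-weighted context is what would be passed on to the language LSTM.
    guided_context = weighted_sum(beta, regions)
    return alpha, beta, guided_context


# Toy input: three region features of dimension 2 and one hidden state.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden = [0.5, 0.1]
alpha, beta, guided_context = guided_attention(regions, hidden)
```

In this toy input the second-level weights concentrate more strongly on the third region than the first-level weights do, which is the kind of re-adjustment the abstract refers to. A real model would use learned projections and Faster R-CNN region features rather than raw dot products.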
Pages: 16