Self-Enhanced Attention for Image Captioning

Cited by: 0
Authors
Sun, Qingyu [1 ]
Zhang, Juan [1 ]
Fang, Zhijun [1 ]
Gao, Yongbin [1 ]
Affiliations
[1] Shanghai Univ Engn Sci, Sch Elect & Elect Engn, Shanghai, Peoples R China
Keywords
Image captioning; Visual model; Language model; Attention mechanism; CIDEr score
DOI
10.1007/s11063-024-11527-x
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Image captioning, the task of automatically generating textual descriptions of image content, has attracted increasing attention from researchers. Recently, Transformers have become the preferred language model in image captioning systems: their self-attention mechanism avoids the gradient accumulation issues and the risk of gradient explosion commonly associated with RNNs. However, when the input features to the self-attention mechanism belong to different categories, the mechanism may fail to highlight the important features effectively. To address this issue, this paper proposes a novel attention mechanism, Self-Enhanced Attention (SEA), which replaces self-attention in the decoder of the Transformer model. After computing the attention weight matrix, SEA further adjusts the matrix according to its own distribution so that important features are highlighted effectively, as illustrated in the sketch below. To evaluate SEA, we conducted experiments on the COCO dataset, comparing results across different visual models and training strategies. The experiments show that models using SEA achieve significantly higher CIDEr scores than models without it, indicating that the proposed mechanism successfully addresses the challenge of effectively highlighting important features.
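The abstract does not give SEA's exact formulation, so the following is a minimal PyTorch sketch of one plausible reading: compute standard scaled dot-product attention, then rescale the weight matrix based on its own distribution (here, amplifying weights above each row's mean before renormalizing). The class name SelfEnhancedAttention and the enhance_scale parameter are hypothetical illustrations, not taken from the paper.

# Minimal sketch of a "self-enhanced" attention layer, assuming PyTorch.
# The distribution-based rescaling rule below is an assumption; the paper's
# exact adjustment is not specified in this record.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfEnhancedAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, enhance_scale: float = 1.5):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_head = d_model // n_heads
        self.n_heads = n_heads
        self.enhance_scale = enhance_scale  # hypothetical amplification factor
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq_len, d_head)
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        # standard scaled dot-product attention weights
        attn = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_head), dim=-1)

        # "self-enhancement" (assumed rule): adjust the weight matrix based on
        # its own distribution -- weights above each row's mean are amplified,
        # then each row is renormalized to sum to one.
        mean = attn.mean(dim=-1, keepdim=True)
        boost = torch.where(attn > mean, attn * self.enhance_scale, attn)
        attn = boost / boost.sum(dim=-1, keepdim=True)

        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(out)

A quick usage check: sea = SelfEnhancedAttention(d_model=512, n_heads=8); y = sea(torch.randn(2, 20, 512)) yields a (2, 20, 512) tensor. Renormalizing after the amplification keeps each row a valid attention distribution, which is one simple way to sharpen important weights without breaking the softmax structure.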
Pages: 18