Cross modification attention-based deliberation model for image captioning

Cited by: 0
Authors
Zheng Lian
Yanan Zhang
Haichang Li
Rui Wang
Xiaohui Hu
Affiliations
[1] University of Chinese Academy of Sciences
[2] Institute of Software, Chinese Academy of Sciences
Source
Applied Intelligence | 2023 / Vol. 53, No. 5
Keywords
Image captioning; Two-pass decoding; Deliberation; Attention mechanism; Reinforcement learning;
DOI
Not available
Abstract
The two-pass decoding framework has been shown to considerably improve the performance of image captioning models. However, most existing two-pass models use the coarse captions to assist refinement only through a conventional attention module. Such limited interaction cannot provide adequate support for producing higher-quality image descriptions. In this paper, we propose a novel Cross Modification Attention (CMA) module that exploits the complementarity of images and their corresponding coarse captions to supply more reliable features for refinement. Specifically, CMA extends conventional attention mechanisms with a hierarchical gating network that mutually modifies the attended vectors of the visual and linguistic modalities. It thereby makes the visual semantic representation less ambiguous and filters out misleading information from the coarse captions. To cooperate with CMA in feature interaction, we further explore a general two-pass decoding framework in which the drafting and deliberation models share only the image encoders, rather than the whole drafting network as in previous methods. Our framework provides visual features that tightly couple the two decoding processes and ensures efficient joint optimization of the two-pass models. Moreover, we treat the coarse captions as a baseline when optimizing the deliberation model and employ a potential-oriented reward shaping strategy for reinforcement learning to specifically improve the quality of refinement. Experiments on the Flickr30K and MS COCO datasets demonstrate that our Cross Modification Attention-based Deliberation Model (CMA-DM) obtains significant improvements over single-pass decoding baselines and achieves competitive performance on the MS COCO online test server.
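The abstract gives no equations, so the following is only a minimal sketch of the general idea of gated cross-modal modification, not the authors' CMA module. It assumes plain dot-product attention over image regions and over draft-caption words, followed by a single-level sigmoid gate computed from the concatenated attended vectors (the paper describes a hierarchical gating network, which is not reproduced here). All function names, the weight matrices, and the toy inputs are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys):
    """Dot-product attention: weighted average of the rows of `keys`."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(keys[0])
    return [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(dim)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, vec):
    """Matrix-vector product; `weights` is a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def cross_modify(att_visual, att_textual, w_visual_gate, w_textual_gate):
    """Assumed single-level gate: each modality is rescaled elementwise by a
    gate that sees BOTH attended vectors, so each one can modify the other."""
    joint = att_visual + att_textual  # list concatenation
    gate_v = [sigmoid(z) for z in linear(w_visual_gate, joint)]
    gate_t = [sigmoid(z) for z in linear(w_textual_gate, joint)]
    mod_visual = [g * v for g, v in zip(gate_v, att_visual)]
    mod_textual = [g * t for g, t in zip(gate_t, att_textual)]
    return mod_visual, mod_textual


# Toy inputs (hypothetical): two image regions, two draft-caption words,
# and a decoder state, all 2-dimensional.
regions = [[1.0, 0.0], [0.0, 1.0]]
draft_words = [[0.5, 0.5], [1.0, -1.0]]
state = [1.0, 0.0]

att_visual = attend(state, regions)
att_textual = attend(state, draft_words)

# Zero gate weights give a gate of 0.5 everywhere; trained weights would
# instead learn which visual or linguistic components to suppress.
zero_weights = [[0.0] * 4 for _ in range(2)]
mod_visual, mod_textual = cross_modify(
    att_visual, att_textual, zero_weights, zero_weights
)
print(mod_visual, mod_textual)
```

In a trained model the gate weights would be learned jointly with the decoder, which is what would let the gate down-weight misleading draft words or ambiguous visual features; the zero weights above only demonstrate the data flow.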
Pages: 5910-5933
Page count: 23