Learning Distinct and Representative Modes for Image Captioning

Cited by: 0
Authors
Chen, Qi [1]
Deng, Chaorui [1]
Wu, Qi [1]
Affiliations
[1] Univ Adelaide, Australian Inst Machine Learning, Adelaide, Australia
Keywords
DOI
Not available
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Over the years, state-of-the-art (SoTA) image captioning methods have achieved promising results on some evaluation metrics (e.g., CIDEr). However, recent findings show that the captions generated by these methods tend to be biased toward the "average" caption that captures only the most general mode (a.k.a. language pattern) in the training corpus, i.e., the so-called mode collapse problem. As a result, the generated captions are limited in diversity and usually less informative than natural image descriptions written by humans. In this paper, we seek to avoid this problem by proposing a Discrete Mode Learning (DML) paradigm for image captioning. Our key idea is to explore the rich modes in the training caption corpus to learn a set of "mode embeddings", and then use them to control the mode of the captions generated by existing image captioning models. Specifically, the proposed DML optimizes a dual architecture that consists of an image-conditioned discrete variational autoencoder (CdVAE) branch and a mode-conditioned image captioning (MIC) branch. The CdVAE branch maps each image caption to one of the mode embeddings stored in a learned codebook, and is trained with a purely non-autoregressive generation objective to make the modes distinct and representative. The MIC branch can be obtained by simply modifying an existing image captioning model, where the mode embedding is added to the original word embeddings as the control signal. In the experiments, we apply the proposed DML to two widely used image captioning models, Transformer and AoANet. The results show that the learned mode embeddings successfully enable these models to generate high-quality image captions with different modes, leading to better performance in both diversity and quality on the MSCOCO dataset.
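The abstract describes two mechanisms that a short sketch can make concrete: a VQ-VAE-style codebook that quantizes a caption encoding to its nearest mode embedding, and a captioning decoder that receives that mode embedding added to its word embeddings as a control signal. The PyTorch sketch below is a minimal illustration of those two pieces, not the authors' implementation; all module names, dimensions, and hyperparameters (ModeCodebook, MICDecoder, num_modes=64, etc.) are assumptions made for illustration.

```python
# Hypothetical sketch of the two mechanisms named in the abstract:
# (1) a discrete codebook of mode embeddings with nearest-neighbor lookup,
# (2) a decoder that adds the mode embedding to its word embeddings.
# Module names and sizes are illustrative, not from the paper's code.
import torch
import torch.nn as nn


class ModeCodebook(nn.Module):
    """Learned codebook of K mode embeddings; a pooled caption encoding is
    quantized to its nearest code, as in a VQ-VAE-style discrete bottleneck."""

    def __init__(self, num_modes: int = 64, dim: int = 512):
        super().__init__()
        self.codes = nn.Embedding(num_modes, dim)

    def forward(self, caption_feat: torch.Tensor) -> torch.Tensor:
        # caption_feat: (batch, dim) pooled encoding of a ground-truth caption.
        dists = torch.cdist(caption_feat, self.codes.weight)  # (batch, K)
        idx = dists.argmin(dim=-1)                            # (batch,)
        quantized = self.codes(idx)                           # (batch, dim)
        # Straight-through estimator so gradients flow back to the encoder.
        return caption_feat + (quantized - caption_feat).detach()


class MICDecoder(nn.Module):
    """Mode-conditioned captioning decoder: the mode embedding is simply
    added to the word embeddings as a control signal."""

    def __init__(self, vocab_size: int, dim: int = 512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, tokens, image_feats, mode_emb):
        # tokens: (batch, seq); image_feats: (batch, regions, dim);
        # mode_emb: (batch, dim), broadcast over the sequence positions.
        x = self.word_emb(tokens) + mode_emb.unsqueeze(1)
        h = self.decoder(x, image_feats)
        return self.out(h)                                    # (batch, seq, vocab)
```

Roughly, at training time the codebook branch would supply mode_emb from the ground-truth caption, while at inference one would pick a codebook index directly to select which mode the generated caption should follow.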
Pages: 14