Learning Distinct and Representative Modes for Image Captioning

Cited by: 0
Authors
Chen, Qi [1]
Deng, Chaorui [1]
Wu, Qi [1]
Affiliations
[1] Univ Adelaide, Australian Inst Machine Learning, Adelaide, Australia
Keywords
DOI
Not available
CLC classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Over the years, state-of-the-art (SoTA) image captioning methods have achieved promising results on some evaluation metrics (e.g., CIDEr). However, recent findings show that the captions generated by these methods tend to be biased toward the "average" caption that captures only the most general mode (a.k.a. language pattern) in the training corpus, i.e., the so-called mode collapse problem. As a result, the generated captions are limited in diversity and usually less informative than the natural image descriptions written by humans. In this paper, we seek to avoid this problem by proposing a Discrete Mode Learning (DML) paradigm for image captioning. Our innovative idea is to explore the rich modes in the training caption corpus to learn a set of "mode embeddings", and further use them to control the mode of the captions generated by existing image captioning models. Specifically, the proposed DML optimizes a dual architecture that consists of an image-conditioned discrete variational autoencoder (CdVAE) branch and a mode-conditioned image captioning (MIC) branch. The CdVAE branch maps each image caption to one of the mode embeddings stored in a learned codebook, and is trained with a pure non-autoregressive generation objective to make the modes distinct and representative. The MIC branch can be obtained by simply modifying an existing image captioning model, where the mode embedding is added to the original word embeddings as the control signal. In the experiments, we apply the proposed DML to two widely used image captioning models, Transformer and AoANet. The results show that the learned mode embeddings successfully enable these models to generate high-quality image captions with different modes, leading to better diversity and quality on the MSCOCO dataset.
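To make the two-branch design described in the abstract more concrete, below is a minimal PyTorch sketch of the core mechanism: a learned codebook of mode embeddings, a VQ-style nearest-neighbour assignment standing in for the CdVAE branch, and a captioning decoder whose word embeddings are shifted by the selected mode embedding (the MIC branch). All module names (ModeCodebook, ModeConditionedCaptioner), dimensions, and the straight-through quantization are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the Discrete Mode Learning idea (hypothetical code,
# not taken from the paper's released implementation).
import torch
import torch.nn as nn


class ModeCodebook(nn.Module):
    """A learned codebook of K 'mode embeddings' (illustrative component)."""

    def __init__(self, num_modes: int = 64, dim: int = 512):
        super().__init__()
        self.codebook = nn.Embedding(num_modes, dim)

    def quantize(self, caption_feat: torch.Tensor) -> torch.Tensor:
        # Assign each caption feature to its nearest codebook entry
        # (a VQ-VAE-style discrete assignment, assumed for the CdVAE branch).
        dists = torch.cdist(caption_feat, self.codebook.weight)   # (B, K)
        idx = dists.argmin(dim=-1)                                # (B,)
        mode_emb = self.codebook(idx)                             # (B, dim)
        # Straight-through estimator so gradients still reach the encoder.
        return caption_feat + (mode_emb - caption_feat).detach()


class ModeConditionedCaptioner(nn.Module):
    """MIC branch: an ordinary captioning decoder whose word embeddings
    are shifted by the selected mode embedding (the control signal)."""

    def __init__(self, vocab_size: int = 10000, dim: int = 512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, tokens, image_feats, mode_emb):
        # Add the mode embedding to every word embedding, as the abstract describes.
        x = self.word_emb(tokens) + mode_emb.unsqueeze(1)
        h = self.decoder(tgt=x, memory=image_feats)
        return self.out(h)


if __name__ == "__main__":
    B, T, N, D = 2, 12, 36, 512
    codebook = ModeCodebook()
    captioner = ModeConditionedCaptioner()
    caption_feat = torch.randn(B, D)      # pooled caption feature from a caption encoder
    image_feats = torch.randn(B, N, D)    # region/grid image features
    tokens = torch.randint(0, 10000, (B, T))
    mode_emb = codebook.quantize(caption_feat)
    logits = captioner(tokens, image_feats, mode_emb)
    print(logits.shape)                   # torch.Size([2, 12, 10000])
```

At inference time, one would simply pick different codebook entries as the mode embedding to steer an existing captioner toward different language patterns; the straight-through trick above is only needed during training to keep the discrete assignment differentiable.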
Pages: 14
Related papers
50 records in total
  • [21] Wang, Zhen; Xiao, Jun; Zhuang, Yueting; Gao, Fei; Shao, Jian; Chen, Long. Learning Combinatorial Prompts for Universal Controllable Image Captioning. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025, 133(01): 129-150
  • [22] Agarwal, Govind; Jindal, Kritika; Chowdhury, Abishi; Singh, Vishal K.; Pal, Amrit. Image and Video Captioning for Apparels Using Deep Learning. IEEE ACCESS, 2024, 12: 113138-113150
  • [23] Yan, Zhaojie. Reinforcement Learning Transformer for Image Captioning Generation Model. FIFTEENTH INTERNATIONAL CONFERENCE ON MACHINE VISION, ICMV 2022, 2023, 12701
  • [24] Zhu, Peipei; Wang, Xiao; Zhu, Lin; Sun, Zhenglong; Zheng, Wei-Shi; Wang, Yaowei; Chen, Changwen. Prompt-Based Learning for Unpaired Image Captioning. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26: 379-393
  • [25] Wang, Yanhui; Xu, Ning; Liu, An-An; Li, Wenhui; Zhang, Yongdong. High-Order Interaction Learning for Image Captioning. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32(07): 4417-4430
  • [26] Devi, P. R.; Thrivikraman, V.; Kashyap, D.; Shylaja, S. S. Image Captioning using Reinforcement Learning with BLUDEr Optimization. PATTERN RECOGNITION AND IMAGE ANALYSIS, 2020, 30(04): 607-613
  • [27] Zeng, Chao; Kwong, Sam; Zhao, Tiesong; Wang, Hanli. Contrastive semantic similarity learning for image captioning evaluation. INFORMATION SCIENCES, 2022, 609: 913-930
  • [28] Hosseinzadeh, Mehrdad; Wang, Yang. Image Change Captioning by Learning from an Auxiliary Task. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021: 2724-2733
  • [29] Wu, Xinxiao; Zhao, Wentian; Luo, Jiebo. Learning Cooperative Neural Modules for Stylized Image Captioning. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2022, 130(09): 2305-2320
  • [30] Zhao, Wei; Xu, Wei; Yang, Min; Ye, Jianbo; Zhao, Zhou; Feng, Yabing; Qiao, Yu. Dual Learning for Cross-domain Image Captioning. CIKM'17: PROCEEDINGS OF THE 2017 ACM CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, 2017: 29-38