Learning Distinct and Representative Modes for Image Captioning

Cited by: 0
Authors
Chen, Qi [1]
Deng, Chaorui [1]
Wu, Qi [1]
Affiliations
[1] Univ Adelaide, Australian Inst Machine Learning, Adelaide, Australia
Keywords
DOI
Not available
CLC classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Over the years, state-of-the-art (SoTA) image captioning methods have achieved promising results on some evaluation metrics (e.g., CIDEr). However, recent findings show that the captions generated by these methods tend to be biased toward the "average" caption that captures only the most general mode (a.k.a. language pattern) in the training corpus, i.e., the so-called mode collapse problem. As a result, the generated captions are limited in diversity and usually less informative than the natural image descriptions written by humans. In this paper, we seek to avoid this problem by proposing a Discrete Mode Learning (DML) paradigm for image captioning. Our innovative idea is to explore the rich modes in the training caption corpus to learn a set of "mode embeddings", and further use them to control the mode of the captions generated by existing image captioning models. Specifically, the proposed DML optimizes a dual architecture that consists of an image-conditioned discrete variational autoencoder (CdVAE) branch and a mode-conditioned image captioning (MIC) branch. The CdVAE branch maps each image caption to one of the mode embeddings stored in a learned codebook, and is trained with a pure non-autoregressive generation objective to make the modes distinct and representative. The MIC branch can be obtained by simply modifying an existing image captioning model, where the mode embedding is added to the original word embeddings as the control signal. In the experiments, we apply the proposed DML to two widely used image captioning models, Transformer and AoANet. The results show that the learned mode embeddings successfully enable these models to generate high-quality image captions with different modes, leading to better diversity and quality on the MSCOCO dataset.
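To make the two-branch design described in the abstract more concrete, below is a minimal PyTorch sketch of the core mechanism: a learned codebook of mode embeddings, a VQ-style nearest-neighbour assignment standing in for the CdVAE branch, and a captioning decoder whose word embeddings are shifted by the selected mode embedding (the MIC branch). All module names (ModeCodebook, ModeConditionedCaptioner), dimensions, and the straight-through quantization are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the Discrete Mode Learning idea (hypothetical code,
# not taken from the paper's released implementation).
import torch
import torch.nn as nn


class ModeCodebook(nn.Module):
    """A learned codebook of K 'mode embeddings' (illustrative component)."""

    def __init__(self, num_modes: int = 64, dim: int = 512):
        super().__init__()
        self.codebook = nn.Embedding(num_modes, dim)

    def quantize(self, caption_feat: torch.Tensor) -> torch.Tensor:
        # Assign each caption feature to its nearest codebook entry
        # (a VQ-VAE-style discrete assignment, assumed for the CdVAE branch).
        dists = torch.cdist(caption_feat, self.codebook.weight)   # (B, K)
        idx = dists.argmin(dim=-1)                                # (B,)
        mode_emb = self.codebook(idx)                             # (B, dim)
        # Straight-through estimator so gradients still reach the encoder.
        return caption_feat + (mode_emb - caption_feat).detach()


class ModeConditionedCaptioner(nn.Module):
    """MIC branch: an ordinary captioning decoder whose word embeddings
    are shifted by the selected mode embedding (the control signal)."""

    def __init__(self, vocab_size: int = 10000, dim: int = 512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, tokens, image_feats, mode_emb):
        # Add the mode embedding to every word embedding, as the abstract describes.
        x = self.word_emb(tokens) + mode_emb.unsqueeze(1)
        h = self.decoder(tgt=x, memory=image_feats)
        return self.out(h)


if __name__ == "__main__":
    B, T, N, D = 2, 12, 36, 512
    codebook = ModeCodebook()
    captioner = ModeConditionedCaptioner()
    caption_feat = torch.randn(B, D)      # pooled caption feature from a caption encoder
    image_feats = torch.randn(B, N, D)    # region/grid image features
    tokens = torch.randint(0, 10000, (B, T))
    mode_emb = codebook.quantize(caption_feat)
    logits = captioner(tokens, image_feats, mode_emb)
    print(logits.shape)                   # torch.Size([2, 12, 10000])
```

At inference time, one would simply pick different codebook entries as the mode embedding to steer an existing captioner toward different language patterns; the straight-through trick above is only needed during training to keep the discrete assignment differentiable.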
Pages: 14
Related papers
50 records in total
  • [21] Wang, Zhen; Xiao, Jun; Zhuang, Yueting; Gao, Fei; Shao, Jian; Chen, Long. Learning Combinatorial Prompts for Universal Controllable Image Captioning. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2025, 133(01): 129-150
  • [22] Agarwal, Govind; Jindal, Kritika; Chowdhury, Abishi; Singh, Vishal K.; Pal, Amrit. Image and Video Captioning for Apparels Using Deep Learning. IEEE ACCESS, 2024, 12: 113138-113150
  • [23] Yan, Zhaojie. Reinforcement Learning Transformer for Image Captioning Generation Model. FIFTEENTH INTERNATIONAL CONFERENCE ON MACHINE VISION, ICMV 2022, 2023, 12701
  • [24] Zhu, Peipei; Wang, Xiao; Zhu, Lin; Sun, Zhenglong; Zheng, Wei-Shi; Wang, Yaowei; Chen, Changwen. Prompt-Based Learning for Unpaired Image Captioning. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26: 379-393
  • [25] Wang, Yanhui; Xu, Ning; Liu, An-An; Li, Wenhui; Zhang, Yongdong. High-Order Interaction Learning for Image Captioning. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32(07): 4417-4430
  • [26] Devi, P. R.; Thrivikraman, V.; Kashyap, D.; Shylaja, S. S. Image Captioning using Reinforcement Learning with BLUDEr Optimization. PATTERN RECOGNITION AND IMAGE ANALYSIS, 2020, 30(04): 607-613
  • [27] Zeng, Chao; Kwong, Sam; Zhao, Tiesong; Wang, Hanli. Contrastive semantic similarity learning for image captioning evaluation. INFORMATION SCIENCES, 2022, 609: 913-930
  • [28] Hosseinzadeh, Mehrdad; Wang, Yang. Image Change Captioning by Learning from an Auxiliary Task. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021: 2724-2733
  • [29] Wu, Xinxiao; Zhao, Wentian; Luo, Jiebo. Learning Cooperative Neural Modules for Stylized Image Captioning. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2022, 130(09): 2305-2320
  • [30] Zhao, Wei; Xu, Wei; Yang, Min; Ye, Jianbo; Zhao, Zhou; Feng, Yabing; Qiao, Yu. Dual Learning for Cross-domain Image Captioning. CIKM'17: PROCEEDINGS OF THE 2017 ACM CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, 2017: 29-38