Image Captioning with Masked Diffusion Model

Cited by: 0

Authors:

Tian, Weidong [1]

Xu, Wenzheng [1]

Zhao, Junxiang [1]

Zhao, Zhongqiu [1,2,3]

Affiliations:

[1] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei, Peoples R China

[2] HFUT, Intelligent Mfg Inst, Hefei, Peoples R China

[3] Guangxi Acad Sci, Nanning, Guangxi, Peoples R China

Source:

ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT VIII, ICIC 2024 | 2024 / Vol. 14869

Funding:

National Natural Science Foundation of China

Keywords:

Image Captioning; Diffusion Model; Time Varying Mask; Features Fusion; CLIP;

DOI:

10.1007/978-981-97-5603-2_18

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Some image captioning models adopt a non-autoregressive approach to independently generate each word, thereby speeding up the generation process. However, this generation method often sacrifices the quality of the generated captions. This paper aims to address this issue by proposing a novel diffusion model based on a non-autoregressive approach for image captioning tasks. Our model integrates a time-varying masking mechanism, gradually adding masks in the reverse diffusion process to guide image features selectively. Additionally, to further enhance the quality of generation, we introduce the CLIP model and fuse it with regional features to incorporate semantic information into the image features. This comprehensive utilization of visual and semantic information aids in generating richer and more accurate caption descriptions. To validate the performance of our model, we conducted extensive experiments and ablation studies on the MSCOCO benchmark. The experimental results demonstrate that our masked diffusion model combined with the CLIP model achieves highly competitive performance in caption generation tasks. Not only does it significantly improve generation speed, but it also yields satisfactory results in terms of generation quality. This study highlights the potential applications and importance of our approach in the field of image captioning.


Pages: 216-227

Number of pages: 12

