Image Captioning with Masked Diffusion Model

Times Cited: 0
Authors
Tian, Weidong [1 ]
Xu, Wenzheng [1 ]
Zhao, Junxiang [1 ]
Zhao, Zhongqiu [1 ,2 ,3 ]
Affiliations
[1] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei, Peoples R China
[2] HFUT, Intelligent Mfg Inst, Hefei, Peoples R China
[3] Guangxi Acad Sci, Nanning, Guangxi, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image Captioning; Diffusion Model; Time-Varying Mask; Feature Fusion; CLIP;
DOI
10.1007/978-981-97-5603-2_18
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Some image captioning models adopt a non-autoregressive approach that generates each word independently, thereby speeding up generation. However, this generation scheme often sacrifices caption quality. This paper addresses the issue by proposing a novel non-autoregressive diffusion model for image captioning. Our model integrates a time-varying masking mechanism that gradually adds masks during the reverse diffusion process to selectively guide the use of image features. Additionally, to further improve generation quality, we introduce the CLIP model and fuse its embeddings with regional features, injecting semantic information into the image representation. This combined use of visual and semantic information helps generate richer and more accurate captions. To validate the model, we conduct extensive experiments and ablation studies on the MSCOCO benchmark. The results demonstrate that our masked diffusion model combined with CLIP achieves highly competitive performance on caption generation: it significantly improves generation speed while maintaining satisfactory caption quality. This study highlights the potential applications and importance of our approach in the field of image captioning.
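The abstract describes two concrete mechanisms: fusing CLIP's global semantic embedding with regional visual features, and a time-varying mask that changes how much of the visual input the denoiser sees at each reverse-diffusion step. The sketch below illustrates how such components could be wired up in PyTorch; the module names, feature dimensions, the linear mask schedule (and its direction), and the prefix-based masking are illustrative assumptions, not the authors' implementation.

# Minimal sketch of the two mechanisms described in the abstract (not the authors' code).
# Assumptions: CLIP yields a 512-d global vector, a detector yields 36 regional features of
# 2048-d, and the masked fraction grows linearly as the reverse step t moves from T toward 0.
import torch
import torch.nn as nn

class ClipRegionFusion(nn.Module):
    """Fuse a CLIP global embedding with per-region features via concatenation + projection."""
    def __init__(self, clip_dim=512, region_dim=2048, hidden_dim=512):
        super().__init__()
        self.proj = nn.Linear(clip_dim + region_dim, hidden_dim)

    def forward(self, clip_feat, region_feats):
        # clip_feat:    (B, clip_dim)      global semantic vector from CLIP
        # region_feats: (B, N, region_dim) N detected-region features
        B, N, _ = region_feats.shape
        clip_rep = clip_feat.unsqueeze(1).expand(B, N, -1)   # broadcast CLIP vector to each region
        fused = torch.cat([clip_rep, region_feats], dim=-1)  # (B, N, clip_dim + region_dim)
        return self.proj(fused)                              # (B, N, hidden_dim)

def time_varying_mask(region_count, t, T):
    """Boolean visibility mask; more regions are hidden as t approaches 0 (assumed linear schedule)."""
    mask_ratio = 1.0 - t / T                                 # 0 at t = T, approaching 1 at t = 0
    n_masked = min(region_count - 1, int(round(mask_ratio * region_count)))
    visible = torch.ones(region_count, dtype=torch.bool)
    visible[:n_masked] = False                               # a real model might rank or sample regions
    return visible

# Toy usage for one reverse-diffusion step t:
fusion = ClipRegionFusion()
clip_feat = torch.randn(2, 512)
region_feats = torch.randn(2, 36, 2048)
visual = fusion(clip_feat, region_feats)                     # (2, 36, 512)
keep = time_varying_mask(region_count=36, t=200, T=1000)     # late step: most regions hidden
visible_feats = visual[:, keep, :]                           # features exposed to the caption denoiser

Concatenation followed by a linear projection is only one plausible fusion strategy; an attention-based fusion of CLIP and regional features would fit the same interface.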
Pages: 216-227
Number of Pages: 12