CgT-GAN: CLIP-guided Text GAN for Image Captioning

Cited by: 3
Authors
Yu, Jiarui [1]
Li, Haoran [1]
Hao, Yanbin [1]
Zhu, Bin [2]
Xu, Tong [1]
He, Xiangnan [1]
Affiliations
[1] Univ Sci & Technol China, Hefei, Peoples R China
[2] Singapore Management Univ, Bras Basah, Singapore
Keywords
Image captioning; CLIP; Reinforcement learning; GAN
DOI
10.1145/3581783.3611891
CLC Classification
TP18 [Artificial intelligence theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
The large-scale visual-language pre-trained model, Contrastive Language-Image Pre-training (CLIP), has significantly improved image captioning for scenarios without human-annotated image-caption pairs. Recent advanced CLIP-based image captioning without human annotations follows a text-only training paradigm, i.e., reconstructing text from shared embedding space. Nevertheless, these approaches are limited by the training/inference gap or huge storage requirements for text embeddings. Given that it is trivial to obtain images in the real world, we propose CLIP-guided text GAN (CgT-GAN), which incorporates images into the training process to enable the model to "see" real visual modality. Particularly, we use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus and CLIP-based reward to provide semantic guidance. The caption generator is jointly rewarded based on the caption naturalness to human language calculated from the GAN's discriminator and the semantic guidance reward computed by the CLIP-based reward module. In addition to the cosine similarity as the semantic guidance reward (i.e., CLIP-cos), we further introduce a novel semantic guidance reward called CLIP-agg, which aligns the generated caption with a weighted text embedding by attentively aggregating the entire corpus. Experimental results on three subtasks (ZS-IC, In-UIC and Cross-UIC) show that CgT-GAN outperforms state-of-the-art methods significantly across all metrics. Code is available at https://github.com/Lihr747/CgtGAN.
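A minimal sketch of the joint reward described in the abstract, not the authors' released code: CLIP-cos scores the generated caption by its cosine similarity to the image in CLIP embedding space, CLIP-agg scores it against an attention-weighted aggregation of corpus text embeddings, and the generator's reward mixes the semantic score with the discriminator's naturalness score. The stand-in embeddings, the image-conditioned softmax attention used for CLIP-agg, and the mixing weight alpha and temperature are assumptions for illustration only.

```python
# Illustrative sketch of the CgT-GAN reward signals (assumed forms, not the paper's code).
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def clip_cos_reward(image_emb, caption_emb):
    """CLIP-cos: cosine similarity between the CLIP image embedding and the
    CLIP text embedding of the generated caption."""
    return float(l2_normalize(image_emb) @ l2_normalize(caption_emb))

def clip_agg_reward(image_emb, caption_emb, corpus_embs, temperature=0.07):
    """CLIP-agg: align the caption with a weighted aggregation of corpus text
    embeddings. The attention weights are assumed to come from image-to-corpus
    similarity; the temperature is illustrative."""
    corpus_embs = l2_normalize(corpus_embs)            # (N, d) corpus text embeddings
    scores = corpus_embs @ l2_normalize(image_emb)     # (N,) image-to-corpus similarity
    weights = np.exp(scores / temperature)
    weights /= weights.sum()                           # softmax attention over the corpus
    aggregated = weights @ corpus_embs                 # (d,) weighted text embedding
    return float(l2_normalize(aggregated) @ l2_normalize(caption_emb))

def joint_reward(naturalness, semantic_reward, alpha=0.5):
    """Combine the discriminator's naturalness score with the CLIP-based
    semantic guidance reward; the linear mixing weight alpha is an assumption."""
    return alpha * semantic_reward + (1.0 - alpha) * naturalness

# Toy usage with random stand-ins for CLIP embeddings (d = 512, as in CLIP ViT-B).
rng = np.random.default_rng(0)
img, cap = rng.normal(size=512), rng.normal(size=512)
corpus = rng.normal(size=(1000, 512))
r_sem = clip_agg_reward(img, cap, corpus)   # or clip_cos_reward(img, cap)
print(joint_reward(naturalness=0.8, semantic_reward=r_sem))
```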
Pages: 2252-2263
Number of pages: 12