Shifted Diffusion for Text-to-image Generation

Cited by: 15
Authors
Zhou, Yufan [1 ]
Liu, Bingchen [2 ]
Zhu, Yizhe [2 ]
Yang, Xiao [2 ]
Chen, Changyou [1 ]
Xu, Jinhui [1 ]
Affiliations
[1] SUNY Buffalo, Buffalo, NY 14260 USA
[2] ByteDance, Beijing, Peoples R China
Source
2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2023
DOI
10.1109/CVPR52729.2023.00979
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present Corgi, a novel method for text-to-image generation. Corgi is based on our proposed shifted diffusion model, which achieves better image embedding generation from input text. Unlike the baseline diffusion model used in DALL-E 2, our method seamlessly encodes prior knowledge of the pre-trained CLIP model in its diffusion process by designing a new initialization distribution and a new transition step of the diffusion. Compared to the strong DALL-E 2 baseline, our method performs better at generating image embeddings from text in terms of both efficiency and effectiveness, resulting in better text-to-image generation. Extensive large-scale experiments, assessed with both quantitative metrics and human evaluation, indicate that our method has a stronger generation ability than existing ones. Furthermore, our model enables semi-supervised and language-free training for text-to-image generation, where only part or none of the images in the training dataset have an associated caption. Trained with only 1.7% of the images being captioned, our semi-supervised model obtains FID results comparable to DALL-E 2 on zero-shot text-to-image generation evaluated on MS-COCO. Corgi also achieves new state-of-the-art results across different datasets on downstream language-free text-to-image generation tasks, outperforming the previous method, Lafite, by a large margin.
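The abstract's "new initialization distribution and new transition step" can be illustrated with a minimal sketch: instead of a standard DDPM forward process that drifts embeddings toward N(0, I), each transition is shifted so the terminal distribution matches the statistics of CLIP image embeddings. This is not the authors' implementation; the isotropic covariance, the schedule values, and the synthetic prior statistics below are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, T = 512, 1000  # embedding dimension and number of diffusion steps (illustrative)

# Prior statistics of CLIP image embeddings. Here they are synthetic stand-ins;
# in practice they would be estimated from a large corpus of image embeddings.
mu = rng.normal(size=D) * 0.1
sigma = 0.05  # assume an isotropic covariance sigma^2 * I for simplicity

betas = np.linspace(1e-4, 0.02, T)  # standard linear noise schedule

def shifted_forward_step(x_prev, t):
    """One forward step of a shifted diffusion.

    The mean is shifted toward mu (mu is a fixed point of the mean update),
    so after many steps x concentrates near N(mu, ~sigma^2 I) instead of
    the standard N(0, I)."""
    beta = betas[t]
    shift = (1.0 - np.sqrt(1.0 - beta)) * mu
    mean = np.sqrt(1.0 - beta) * x_prev + shift
    return mean + np.sqrt(beta) * sigma * rng.normal(size=x_prev.shape)

# Diffuse a (synthetic) clean image embedding toward the shifted prior.
x0 = rng.normal(size=D)
x = x0.copy()
for t in range(T):
    x = shifted_forward_step(x, t)
```

Because the terminal distribution already lies near the CLIP embedding manifold, the reverse (generative) process starts from an informed prior rather than pure noise, which is the intuition behind the efficiency gains the abstract claims.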
Pages: 10157 - 10166
Number of pages: 10
Related Papers
50 records in total
  • [21] AltDiffusion: A Multilingual Text-to-Image Diffusion Model
    Ye, Fulong
    Liu, Guang
    Wu, Xinya
    Wu, Ledell
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 7, 2024, : 6648 - 6656
  • [22] Controlling Text-to-Image Diffusion by Orthogonal Finetuning
    Qiu, Zeju
    Liu, Weiyang
    Feng, Haiwen
    Xue, Yuxuan
    Feng, Yao
    Liu, Zhen
    Zhang, Dan
    Weller, Adrian
    Schoelkopf, Bernhard
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [23] Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines
    Toker, Michael
    Orgad, Hadas
    Ventura, Mor
    Arad, Dana
    Belinkov, Yonatan
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 9713 - 9728
  • [24] Zero-Shot Text-to-Image Generation
    Ramesh, Aditya
    Pavlov, Mikhail
    Goh, Gabriel
    Gray, Scott
    Voss, Chelsea
    Radford, Alec
    Chen, Mark
    Sutskever, Ilya
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021, 139
  • [25] Dense Text-to-Image Generation with Attention Modulation
    Kim, Yunji
    Lee, Jiyoung
    Kim, Jin-Hwa
    Ha, Jung-Woo
    Zhu, Jun-Yan
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 7667 - 7677
  • [26] Visual Programming for Text-to-Image Generation and Evaluation
    Cho, Jaemin
    Zala, Abhay
    Bansal, Mohit
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [27] MirrorGAN: Learning Text-to-image Generation by Redescription
    Qiao, Tingting
    Zhang, Jing
    Xu, Duanqing
    Tao, Dacheng
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 1505 - 1514
  • [28] StyleDrop: Text-to-Image Generation in Any Style
    Sohn, Kihyuk
    Ruiz, Nataniel
    Lee, Kimin
    Chin, Daniel Castro
    Blok, Irina
    Chang, Huiwen
    Barber, Jarred
    Jiang, Lu
    Entis, Glenn
    Li, Yuanzhen
    Hao, Yuan
    Essa, Irfan
    Rubinstein, Michael
    Krishnan, Dilip
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [29] A taxonomy of prompt modifiers for text-to-image generation
    Oppenlaender, Jonas
    BEHAVIOUR & INFORMATION TECHNOLOGY, 2024, 43 (15) : 3763 - 3776
  • [30] SINE: SINgle Image Editing with Text-to-Image Diffusion Models
    Zhang, Zhixing
    Han, Ligong
    Ghosh, Arnab
    Metaxas, Dimitris
    Ren, Jian
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6027 - 6037