CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion

Authors
Zheng, Wendi [1, 2]
Teng, Jiayan [1, 2]
Yang, Zhuoyi [1, 2]
Wang, Weihan [1, 2]
Chen, Jidong [1]
Gu, Xiaotao [2]
Dong, Yuxiao [1]
Ding, Ming [2]
Tang, Jie [1]
Affiliations
[1] Tsinghua University, Beijing, People's Republic of China
[2] Zhipu AI, Beijing, People's Republic of China
Keywords
Text-to-Image Generation; Diffusion Models
DOI
10.1007/978-3-031-72980-5_1
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recent advancements in text-to-image generative systems have been largely driven by diffusion models. However, single-stage text-to-image diffusion models still face challenges in computational efficiency and the refinement of image details. To tackle these issues, we propose CogView3, an innovative cascaded framework that enhances the performance of text-to-image diffusion. CogView3 is the first model to implement relay diffusion in text-to-image generation, executing the task by first creating low-resolution images and subsequently applying relay-based super-resolution. This methodology not only produces competitive text-to-image outputs but also greatly reduces both training and inference costs. Our experimental results demonstrate that CogView3 outperforms SDXL, the current state-of-the-art open-source text-to-image diffusion model, by 77.0% in human evaluations, while requiring only about 1/2 of the inference time. The distilled variant of CogView3 achieves comparable performance while using only 1/10 of the inference time of SDXL.
Pages: 1 / 22
Page count: 22