MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask

Cited by: 0
Authors
Zhou, Yupeng [1 ,2 ]
Zhou, Daquan [2 ]
Wang, Yaxing [1 ]
Feng, Jiashi [2 ]
Hou, Qibin [1 ]
Affiliations
[1] Nankai Univ, VCIP, CS, Tianjin 300350, Peoples R China
[2] ByteDance, Singapore, Singapore
Keywords
Diffusion model; Text-to-image generation; Conditional mask
DOI
10.1007/s11263-024-02294-2
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent advances in diffusion models have showcased their impressive capacity to generate visually striking images. However, ensuring a close match between the generated image and the given prompt remains a persistent challenge. In this work, we identify a crucial factor behind the erroneous generation of objects and their attributes: inadequate cross-modality relation learning between the prompt and the generated images. To better align the prompt and the image content, we augment the cross-attention with an adaptive mask, conditioned on the attention maps and the prompt embeddings, that dynamically adjusts the contribution of each text token to the image features. This mechanism explicitly reduces the ambiguity in the semantic embeddings produced by the text encoder, boosting text-to-image consistency in the synthesized images. Our method, termed MaskDiffusion, is training-free and hot-pluggable into popular pre-trained diffusion models. When applied to latent diffusion models, MaskDiffusion substantially enhances their ability to generate objects and their attributes correctly, with negligible computation overhead compared to the original diffusion models. Our project page is https://github.com/HVision-NKU/MaskDiffusion.
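The mechanism the abstract describes, a cross-attention whose per-token contribution is gated by a mask conditioned on the attention maps and the prompt embeddings, can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' implementation: the function name `masked_cross_attention`, the mean-salience statistic, the embedding-norm rescaling, and the `tau`/`sharpness` parameters are all assumptions made here; the actual code is on the project page linked above.

```python
import torch


def masked_cross_attention(q, k, v, prompt_emb, tau=0.5, sharpness=10.0):
    """Cross-attention with an adaptive per-token mask (illustrative sketch).

    q:          (B, N, d) image-feature queries
    k, v:       (B, L, d) text-token keys / values
    prompt_emb: (B, L, d) text-encoder embeddings of the prompt

    The gating rule below (mean received attention, rescaled by the
    token-embedding norm, against a relative threshold `tau`) is an
    assumption, not the paper's exact formulation.
    """
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5   # (B, N, L)
    attn = logits.softmax(dim=-1)

    # Salience of each text token: the attention it receives,
    # averaged over all image locations.
    salience = attn.mean(dim=1)                   # (B, L)

    # Also condition on the prompt embeddings: tokens with small
    # embedding norms (e.g. padding) are down-weighted.
    emb_scale = prompt_emb.norm(dim=-1)           # (B, L)
    emb_scale = emb_scale / emb_scale.amax(dim=-1, keepdim=True)

    # Soft mask in (0, 1): keep tokens whose weighted salience exceeds
    # a fraction `tau` of the mean weighted salience.
    score = salience * emb_scale
    mask = torch.sigmoid(sharpness * (score - tau * score.mean(dim=-1, keepdim=True)))

    # Fold the mask into the logits and renormalize, so the masked
    # attention still sums to one over the text tokens.
    masked_logits = logits + torch.log(mask + 1e-8).unsqueeze(1)  # (B, N, L)
    return masked_logits.softmax(dim=-1) @ v      # (B, N, d)
```

Because the mask only rescales the attention logits before the softmax, such a wrapper can in principle be attached to the cross-attention layers of a pre-trained UNet at inference time, which is what makes the approach training-free and hot-pluggable.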
Pages: 2805 - 2824
Page count: 20
Related Papers
50 records in total
  • [1] Text-to-image via mask anchor points
    Baraheem, Samah S.
    Nguyen, Tam V.
    PATTERN RECOGNITION LETTERS, 2020, 133 : 25 - 32
  • [2] Text-to-Image Generation Method Based on Image-Text Semantic Consistency
    Xue Z.
    Xu Z.
    Lang C.
    Feng S.
    Wang T.
    Li Y.
    JISUANJI YANJIU YU FAZHAN/COMPUTER RESEARCH AND DEVELOPMENT, 2023, 60 (09) : 2180 - 2190
  • [3] Adding Conditional Control to Text-to-Image Diffusion Models
    Zhang, Lvmin
    Rao, Anyi
    Agrawala, Maneesh
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023 : 3813 - 3824
  • [4] Getting it Right: Improving Spatial Consistency in Text-to-Image Models
    Chatterjee, Agneet
    Stan, Gabriela Ben Melech
    Aflalo, Estelle
    Paul, Sayak
    Ghosh, Dhruba
    Gokhale, Tejas
    Schmidt, Ludwig
    Hajishirzi, Hannaneh
    Lal, Vasudev
    Baral, Chitta
    Yang, Yezhou
    COMPUTER VISION-ECCV 2024, PT XXII, 2025, 15080 : 204 - 222
  • [5] Optimizing and interpreting the latent space of the conditional text-to-image GANs
    Zhang, Zhenxing
    Schomaker, Lambert
    NEURAL COMPUTING & APPLICATIONS, 2024, 36 (05) : 2549 - 2572
  • [6] Enhanced Text-to-Image Synthesis Conditional Generative Adversarial Networks
    Tan, Yong Xuan
    Lee, Chin Poo
    Neo, Mai
    Lim, Kian Ming
    Lim, Jit Yan
    IAENG INTERNATIONAL JOURNAL OF COMPUTER SCIENCE, 2022, 49 (01) : 1 - 7
  • [7] MagicFusion: Boosting Text-to-Image Generation Performance by Fusing Diffusion Models
    Zhao, Jing
    Zheng, Heliang
    Wang, Chaoyue
    Lan, Long
    Yang, Wenjing
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023 : 22535 - 22545
  • [8] From text to mask: Localizing entities using the attention of text-to-image diffusion models
    Xiao, Changming
    Yang, Qi
    Zhou, Feng
    Zhang, Changshui
    NEUROCOMPUTING, 2024, 610
  • [9] Semantic Similarity Distance: Towards better text-image consistency metric in text-to-image generation
    Tan, Zhaorui
    Yang, Xi
    Ye, Zihan
    Wang, Qiufeng
    Yan, Yuyao
    Nguyen, Anh
    Huang, Kaizhu
    PATTERN RECOGNITION, 2023, 144