Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models

Cited by: 1
Authors
Zhang, Yasi [1 ]
Yu, Peiyu [1 ]
Wu, Ying Nian [1 ]
Affiliations
[1] University of California, Los Angeles, Department of Statistics & Data Science, Los Angeles, CA 90095 USA
Keywords
Attention Map Alignment; Energy-Based Models; Text-to-Image Diffusion Models;
DOI
10.1007/978-3-031-72946-1_4
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Text-to-image diffusion models have shown great success in generating high-quality text-guided images. Yet, these models may still fail to semantically align generated images with the provided text prompts, leading to problems like incorrect attribute binding and/or catastrophic object neglect. Given the pervasive object-oriented structure underlying text prompts, we introduce a novel object-conditioned Energy-Based Attention Map Alignment (EBAMA) method to address the aforementioned problems. We show that an object-centric attribute binding loss naturally emerges by approximately maximizing the log-likelihood of a z-parameterized energy-based model with the help of the negative sampling technique. We further propose an object-centric intensity regularizer to prevent excessive shifts of objects' attention toward their attributes. Extensive qualitative and quantitative experiments, including human evaluation, on several challenging benchmarks demonstrate the superior performance of our method over previous strong counterparts. With better-aligned attention maps, our approach shows great promise in further enhancing the text-controlled image editing ability of diffusion models. The code is available at https://github.com/YasminZhang/EBAMA.
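To make the abstract's idea concrete, the following is a minimal sketch of a contrastive, object-centric binding loss over cross-attention maps, in the spirit described above: an object token's attention map is pulled toward its own attribute's map and pushed away from other objects' attributes (the "negative samples"), with a simple regularizer discouraging attribute attention mass from exceeding the object's. This is an illustrative assumption, not the authors' exact EBAMA formulation; the function name, the cosine-similarity energy, and the regularizer form are all hypothetical choices for exposition.

```python
import numpy as np

def binding_loss(attn, obj_idx, attr_idx, neg_attr_idx, lam=0.1, tau=1.0):
    """Sketch of an object-centric attribute binding loss.

    attn:         (H, W, T) array of cross-attention maps over T text tokens
    obj_idx:      token index of the object noun
    attr_idx:     token index of the attribute bound to that object
    neg_attr_idx: token indices of other objects' attributes (negative samples)
    lam, tau:     regularizer weight and softmax temperature (illustrative)
    """
    def unit(i):
        # Flatten one token's spatial attention map to a unit vector.
        a = attn[:, :, i].ravel().astype(float)
        return a / (np.linalg.norm(a) + 1e-8)

    obj = unit(obj_idx)
    pos = obj @ unit(attr_idx)                      # agreement with own attribute
    negs = [obj @ unit(j) for j in neg_attr_idx]    # agreement with negatives

    # Negative-sampling-style objective: treat similarities as energies and
    # minimize the NLL of picking the correct attribute among the candidates.
    logits = np.array([pos, *negs]) / tau
    nll = -(logits[0] - np.log(np.exp(logits).sum()))

    # Intensity regularizer (assumed form): penalize the attribute's raw
    # attention mass overtaking the object's, to limit attention shift.
    reg = lam * max(0.0, attn[:, :, attr_idx].sum() - attn[:, :, obj_idx].sum())
    return float(nll + reg)
```

In use, a correctly bound object-attribute pair should score a lower loss than a mismatched one, since its attribute map overlaps the object map while the negatives do not.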
Pages: 55-71 (17 pages)