Modulating Pretrained Diffusion Models for Multimodal Image Synthesis

Cited by: 3
Authors
Ham, Cusuh [1]
Hays, James [1]
Lu, Jingwan [2]
Singh, Krishna Kumar [2]
Zhang, Zhifei [2]
Hinz, Tobias [2]
Affiliations
[1] Georgia Inst Technol, Atlanta, GA 30332 USA
[2] Adobe Res, San Francisco, CA USA
Source
PROCEEDINGS OF SIGGRAPH 2023 CONFERENCE PAPERS, SIGGRAPH 2023 | 2023
Keywords
image synthesis; image generation; multimodal synthesis; neural networks; diffusion models
DOI
10.1145/3588432.3591549
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We present multimodal conditioning modules (MCM) for enabling conditional image synthesis using pretrained diffusion models. Previous multimodal synthesis works rely on training networks from scratch or fine-tuning pretrained networks, both of which are computationally expensive for large, state-of-the-art diffusion models. Our method uses pretrained networks but does not require any updates to the diffusion network's parameters. MCM is a small module trained to modulate the diffusion network's predictions during sampling using 2D modalities (e.g., semantic segmentation maps, sketches) that were unseen during the original training of the diffusion model. We show that MCM enables user control over the spatial layout of the image and leads to increased control over the image generation process. Training MCM is cheap as it does not require gradients from the original diffusion net, consists of only approximately 1% of the number of parameters of the base diffusion model, and is trained using only a limited number of training examples. We evaluate our method on unconditional and text-conditional models to demonstrate the improved control over the generated images and their alignment with respect to the conditioning inputs.
Pages: 11
Related papers
50 in total
  • [31] Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering
    Jiang, Jingjing
    Liu, Ziyi
    Zheng, Nanning
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2023, 132 (1) : 185 - 207
  • [32] Learning Adapters for Text-Guided Portrait Stylization with Pretrained Diffusion Models
    Yang, Mintu
    Hou, Xianxu
    Li, Hao
    Shen, Linlin
    Fan, Lixin
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT I, 2024, 14425 : 247 - 258
  • [33] ∞-Brush: Controllable Large Image Synthesis with Diffusion Models in Infinite Dimensions
    Le, Minh-Quan
    Graikos, Alexandros
    Yellapragada, Srikar
    Gupta, Rajarsi
    Saltz, Joel
    Samaras, Dimitris
    COMPUTER VISION - ECCV 2024, PT XXXII, 2025, 15090 : 385 - 401
  • [34] Measurement Guidance in Diffusion Models: Insight from Medical Image Synthesis
    Luo, Yimin
    Yang, Qinyu
    Fan, Yuheng
    Qi, Haikun
    Xia, Menghan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (12) : 7983 - 7997
  • [35] High-Fidelity Guided Image Synthesis with Latent Diffusion Models
    Singh, Jaskirat
    Gould, Stephen
    Zheng, Liang
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 5997 - 6006
  • [36] Layout-Agnostic Scene Text Image Synthesis with Diffusion Models
    Zhangli, Qilong
    Jiang, Jindong
    Liu, Di
    Yu, Licheng
    Dai, Xiaoliang
    Ramchandani, Ankit
    Pang, Guan
    Metaxas, Dimitris N.
    Krishnan, Praveen
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024, 2024, : 7496 - 7506
  • [37] Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering
    Jingjing Jiang
    Ziyi Liu
    Nanning Zheng
    International Journal of Computer Vision, 2024, 132 : 185 - 207
  • [38] Integrating Multimodal Information in Large Pretrained Transformers
    Rahman, Wasifur
    Hasan, Md Kamrul
    Lee, Sangwu
    Zadeh, Amir
    Mao, Chengfeng
    Morency, Louis-Philippe
    Hoque, Ehsan
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 2359 - 2369
  • [39] Prostate Image Classification Using Pretrained Models: GoogLeNet and ResNet-50
    Jusman, Yessi
    Nurkholid, Muhammad Ahdan Fawwaz
    Utomo, Feriandri
    2021 15TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING AND COMMUNICATION SYSTEMS (ICSPCS), 2021,
  • [40] VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning
    Chen, Jun
    Guo, Han
    Yi, Kai
    Li, Boyang
    Elhoseiny, Mohamed
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 18009 - 18019