Improving text-to-image generation with object layout guidance

Cited by: 9
Authors
Zakraoui, Jezia [1 ]
Saleh, Moutaz [1 ]
Al-Maadeed, Somaya [1 ]
Jaam, Jihad Mohammed [1 ]
Affiliations
[1] Qatar Univ, Dept Comp Sci & Engn, Doha 2713, Qatar
Keywords
Image generation; Text processing; Scene graph; Object layout; Conditioning augmentation; StackGAN;
DOI
10.1007/s11042-021-11038-0
Chinese Library Classification (CLC)
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
The automatic generation of realistic images directly from a story text is a very challenging problem: it cannot be addressed with a single image-generation approach, mainly because of the semantic complexity of the story text. In this work, we propose a new approach that decomposes story visualization into three phases: semantic text understanding, object layout prediction, and image generation and refinement. We start by simplifying the text into scene graph triples that encode the semantic relationships between the story objects. We then introduce an object layout module that captures the features of these objects from the corresponding scene graph. Specifically, the object layout module aggregates individual object features from the scene graph together with averaged or likelihood object features produced by a graph convolutional neural network. All these features are concatenated to form semantic triples that are fed to the image generation framework. For image generation, we adopt a scene-graph-to-image framework as stage-I and refine its output with a StackGAN as stage-II, conditioned on the object layout module and the stage-I image. Our approach renders object details in high-resolution images while keeping the image structure consistent with the input text. To evaluate our approach, we use the COCO dataset and compare against three baselines, namely sg2im, StackGAN, and AttnGAN, in terms of image quality and user evaluation. The results show that our object-layout-guided approach significantly outperforms these baselines in both the accuracy of semantic matching and the realism of the images generated for the story sentences.
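As a concrete illustration of the pipeline described in the abstract, the sketch below shows how an object layout module of this kind might aggregate per-object features from a scene-graph graph convolutional network, concatenate them with an averaged object feature, and produce a conditioning vector for a StackGAN-style stage-II refiner. This is a minimal PyTorch sketch under assumed dimensions and layer choices (SimpleSceneGraphGCN, ObjectLayoutModule, and all hyperparameters are hypothetical); it is not the authors' implementation.

# Hedged sketch (not the paper's released code): scene-graph message passing
# followed by an object layout module that builds a stage-II conditioning vector.
import torch
import torch.nn as nn


class SimpleSceneGraphGCN(nn.Module):
    """One message-passing step over (subject, predicate, object) triples."""

    def __init__(self, obj_dim, pred_dim, hidden):
        super().__init__()
        # Combines the subject, predicate, and object embeddings of each triple.
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * obj_dim + pred_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obj_dim))

    def forward(self, obj_feats, pred_feats, edges):
        # obj_feats: (num_objs, obj_dim); pred_feats: (num_triples, pred_dim)
        # edges: (num_triples, 2) long tensor of (subject, object) indices per triple.
        subj = obj_feats[edges[:, 0]]
        obj = obj_feats[edges[:, 1]]
        msg = self.edge_mlp(torch.cat([subj, pred_feats, obj], dim=-1))
        # Deliberately simple update: add each triple's message back onto its subject node.
        updated = obj_feats.clone()
        updated.index_add_(0, edges[:, 0], msg)
        return updated


class ObjectLayoutModule(nn.Module):
    """Builds a conditioning vector for a stage-II refiner from GCN object features."""

    def __init__(self, obj_dim, cond_dim):
        super().__init__()
        self.project = nn.Linear(2 * obj_dim, cond_dim)

    def forward(self, gcn_obj_feats):
        pooled = gcn_obj_feats.mean(dim=0, keepdim=True)      # averaged object feature
        pooled = pooled.expand_as(gcn_obj_feats)              # broadcast to every object
        per_obj = torch.cat([gcn_obj_feats, pooled], dim=-1)  # concatenate, as in the abstract
        return self.project(per_obj).mean(dim=0)              # single conditioning vector


if __name__ == "__main__":
    torch.manual_seed(0)
    num_objs, num_triples = 5, 4
    obj_feats = torch.randn(num_objs, 128)     # object embeddings (assumed size)
    pred_feats = torch.randn(num_triples, 64)  # predicate embeddings (assumed size)
    edges = torch.randint(0, num_objs, (num_triples, 2))
    gcn = SimpleSceneGraphGCN(128, 64, 256)
    layout = ObjectLayoutModule(128, 100)
    cond = layout(gcn(obj_feats, pred_feats, edges))
    print(cond.shape)  # torch.Size([100]); would feed stage-II conditioning augmentation

In the paper's pipeline this conditioning vector would be combined with the stage-I output image before stage-II refinement; the pooling and projection choices above are simply the most direct reading of the abstract, not a claim about the authors' exact design.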
Pages: 27423 - 27443
Page count: 21
Related papers
50 records in total
  • [1] Improving text-to-image generation with object layout guidance
    Jezia Zakraoui
    Moutaz Saleh
    Somaya Al-Maadeed
    Jihad Mohammed Jaam
    Multimedia Tools and Applications, 2021, 80 : 27423 - 27443
  • [2] Background Layout Generation and Object Knowledge Transfer for Text-to-Image Generation
    Chen, Zhuowei
    Mao, Zhendong
    Fang, Shancheng
    Hu, Bo
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4327 - 4335
  • [3] LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation
    Qu, Leigang
    Wu, Shengqiong
    Fei, Hao
    Nie, Liqiang
    Chua, Tat-Seng
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 643 - 654
  • [4] Text-to-Image Synthesis via Aesthetic Layout
    Baraheem, Samah Saeed
    Le, Trung-Nghia
    Nguyen, Tam V.
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 4485 - 4487
  • [5] Layout-Bridging Text-to-Image Synthesis
    Liang, Jiadong
    Pei, Wenjie
    Lu, Feng
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (12) : 7438 - 7451
  • [6] Controllable Text-to-Image Generation
    Li, Bowen
    Qi, Xiaojuan
    Lukasiewicz, Thomas
    Torr, Philip H. S.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [7] Surgical text-to-image generation
    Nwoye, Chinedu Innocent
    Bose, Rupak
    Elgohary, Kareem
    Arboit, Lorenzo
    Carlino, Giorgio
    Lavanchy, Joel L.
    Mascagni, Pietro
    Padoy, Nicolas
    PATTERN RECOGNITION LETTERS, 2025, 190 : 73 - 80
  • [8] Text-to-Image Generation Method Based on Object Enhancement and Attention Maps
    Huang, Yongsen
    Cai, Xiaodong
    An, Yuefan
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2025, 16 (01) : 961 - 968
  • [9] Expressive Text-to-Image Generation with Rich Text
    Ge, Songwei
    Park, Taesung
    Zhu, Jun-Yan
    Huang, Jia-Bin
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 7511 - 7522
  • [10] Masked-attention diffusion guidance for spatially controlling text-to-image generation
    Endo, Yuki
    VISUAL COMPUTER, 2024, 40 (09) : 6033 - 6045