Lightweight dynamic conditional GAN with pyramid attention for text-to-image synthesis

Cited by: 34
|
Authors
Gao, Lianli [1 ]
Chen, Daiyuan [1 ]
Zhao, Zhou [2 ]
Shao, Jie [1 ]
Shen, Heng Tao [1 ]
Affiliations
[1] Univ Elect Sci & Technol China, Dept Comp Sci, Chengdu 611731, Peoples R China
[2] Zhejiang Univ, Sch Comp Sci, Hangzhou, Peoples R China
Keywords
Text-to-image synthesis; Conditional generative adversarial network (CGAN); Network complexity; Disentanglement process; Entanglement process; Information compensation; Pyramid attentive fusion;
DOI
10.1016/j.patcog.2020.107384
CLC classification number
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The text-to-image synthesis task aims to generate photographic images conditioned on semantic text descriptions. To ensure the sharpness and fidelity of generated images, this task tends to generate high-resolution images (e.g., 128² or 256²). However, as the resolution increases, the network parameters and complexity increase dramatically. Recent works introduce network structures with extensive parameters and heavy computation to guarantee the production of high-resolution images. As a result, these models suffer from unstable training and high training cost. To tackle these issues, in this paper we propose an effective information-compensation-based approach, namely Lightweight Dynamic Conditional GAN (LD-CGAN). LD-CGAN is a compact and structured single-stream network consisting of one generator and two independent discriminators, which regularize and generate 64² and 128² images in one feed-forward process. Specifically, the generator of LD-CGAN is composed of three major components: (1) Conditional Embedding (CE), an automatic unsupervised learning process that disentangles integrated semantic attributes in the text space; (2) the Conditional Manipulating Module (CM-M) in the Conditional Manipulating Block (CM-B), designed to continuously provide the image features with compensation information (i.e., the disentangled attributes); and (3) the Pyramid Attention Refine Block (PAR-B), which enriches multi-scale features by capturing spatial importance across multi-scale contexts. Experiments conducted on two benchmark datasets, CUB and Oxford-102, indicate that our generated 128² images achieve performance comparable to the 256² images generated by state-of-the-art methods on two evaluation metrics: Inception Score (IS) and Visual-Semantic Similarity (VS). Compared with the current state-of-the-art HDGAN, our LD-CGAN significantly decreases the number of parameters and computation time, by 86.8% and 94.9%, respectively. (C) 2020 Elsevier Ltd. All rights reserved.
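To make the pyramid attentive fusion idea concrete, the PyTorch sketch below re-weights a feature map with spatial attention computed at several pooled scales and fuses the results back at full resolution. This is a minimal illustration under our own assumptions (the module name, the choice of scales, and the residual fusion layout are illustrative), not the authors' implementation of PAR-B.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidAttentiveFusion(nn.Module):
    """Illustrative sketch of pyramid attentive fusion: per-scale spatial
    attention over pooled feature maps, fused with a 1x1 conv (assumed design,
    not the paper's exact architecture)."""

    def __init__(self, channels: int, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        # One 1x1 conv per scale produces a single-channel spatial importance map.
        self.attn_convs = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in scales]
        )
        self.fuse = nn.Conv2d(channels * len(scales), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        branches = []
        for scale, conv in zip(self.scales, self.attn_convs):
            # Downsample to a coarser context, score spatial importance there,
            # then upsample the re-weighted features back to full size.
            pooled = F.avg_pool2d(x, kernel_size=scale) if scale > 1 else x
            attn = torch.sigmoid(conv(pooled))  # (B, 1, h/s, w/s)
            weighted = pooled * attn
            branches.append(
                F.interpolate(weighted, size=(h, w), mode='bilinear',
                              align_corners=False)
            )
        # Concatenate all scales and project back to the input width; the
        # residual connection lets attention refine rather than replace features.
        return x + self.fuse(torch.cat(branches, dim=1))

# Usage: refine a 64x64 feature map before the next upsampling stage.
feat = torch.randn(2, 128, 64, 64)
refined = PyramidAttentiveFusion(128)(feat)
assert refined.shape == feat.shape

Because each attention branch adds only a single-channel 1x1 convolution per scale plus one fusion layer, a block of this kind stays cheap in parameters, which is consistent with the lightweight design goal the abstract describes.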
Pages: 11
Related papers
50 records in total
  • [21] Unsupervised text-to-image synthesis
    Dong, Yanlong
    Zhang, Ying
    Ma, Lin
    Wang, Zhi
    Luo, Jiebo
    PATTERN RECOGNITION, 2021, 110
  • [23] Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models
    Wang, Ruichen
    Chen, Zekang
    Chen, Chen
    Ma, Jian
    Lu, Haonan
    Lin, Xiaodong
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 6, 2024, : 5544 - 5552
  • [24] Counterfactual GAN for debiased text-to-image synthesis
    Kong, Xianghua
    Xu, Ning
    Sun, Zefang
    Shen, Zhewen
    Zheng, Bolun
    Yan, Chenggang
    Cao, Jinbo
    Kang, Rongbao
    Liu, An-An
    MULTIMEDIA SYSTEMS, 2025, 31 (1)
  • [25] Dense Text-to-Image Generation with Attention Modulation
    Kim, Yunji
    Lee, Jiyoung
    Kim, Jin-Hwa
    Ha, Jung-Woo
    Zhu, Jun-Yan
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 7667 - 7677
  • [26] Adding Conditional Control to Text-to-Image Diffusion Models
    Zhang, Lvmin
    Rao, Anyi
    Agrawala, Maneesh
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 3813 - 3824
  • [27] MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask
    Zhou, Yupeng
    Zhou, Daquan
    Wang, Yaxing
    Feng, Jiashi
    Hou, Qibin
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, : 2805 - 2824
  • [28] DE-GAN: Text-to-image synthesis with dual and efficient fusion model
    Jiang, Bin
    Zeng, Weiyuan
    Yang, Chao
    Wang, Renjun
    Zhang, Bolin
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (8) : 23839 - 23852
  • [30] Hybrid Attention Driven Text-to-Image Synthesis via Generative Adversarial Networks
    Cheng, Qingrong
    Gu, Xiaodong
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2019: WORKSHOP AND SPECIAL SESSIONS, 2019, 11731 : 483 - 495