TT2INet: Text to Photo-realistic Image Synthesis with Transformer as Text Encoder

Cited: 0
Authors
Zhu, Jianwei [1 ]
Li, Zhixin [1 ]
Ma, Huifang [2 ]
Affiliations
[1] Guangxi Normal Univ, Guangxi Key Lab Multisource Informat Min & Secur, Guilin 541004, Peoples R China
[2] Northwest Normal Univ, Coll Comp Sci & Engn, Lanzhou 730070, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformer; Generative Adversarial Networks (GANs); spectral normalization; self-attention;
DOI
10.1109/IJCNN52387.2021.9534074
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
A text-to-image (T2I) generation method is evaluated mainly on two aspects: the quality and diversity of the generated images, and the semantic consistency between the generated images and the input sentences. Text feature extraction is a crucial part of this task. In this paper, we propose a Transformer-based Text-to-Image Network (TT2INet). We use a pre-trained Transformer model (ALBERT) to extract sentence feature vectors and word feature vectors from the input sentences as the basis for Generative Adversarial Networks (GANs) to generate images. In addition, we add a self-attention mechanism and spectral normalization to the model. The self-attention mechanism lets the model attend to more local features when generating images, and spectral normalization makes GAN training more stable. The Inception Scores of our method on the Oxford-102, CUB and COCO datasets are 3.90, 4.89 and 26.53, and the R-precision scores are 92.55, 87.72 and 92.29, respectively.
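As a rough illustration of the text-encoding step described in the abstract, the sketch below shows how a pre-trained ALBERT model can produce both word-level and sentence-level feature vectors for a caption. This is a minimal sketch assuming the HuggingFace transformers library and the albert-base-v2 checkpoint; the checkpoint, pooling strategy, and feature dimensions actually used by the authors may differ.

```python
import torch
from transformers import AlbertTokenizer, AlbertModel

# Assumption: albert-base-v2 stands in for the paper's ALBERT checkpoint.
tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
encoder = AlbertModel.from_pretrained("albert-base-v2")

caption = "this bird has a red head and a short pointed beak"
inputs = tokenizer(caption, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)

# Word-level features: one 768-d vector per (sub)token, the kind of
# fine-grained conditioning signal a T2I GAN can attend over.
word_features = outputs.last_hidden_state   # shape (1, seq_len, 768)
# Sentence-level feature: pooled representation of the whole caption.
sentence_feature = outputs.pooler_output    # shape (1, 768)
```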
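The abstract also names two stabilizing components: a self-attention mechanism, so the generator attends to local features, and spectral normalization, to stabilize GAN training. The PyTorch sketch below combines both in a SAGAN-style attention block; this is an assumed, generic formulation for illustration, not necessarily the exact block used in TT2INet.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm


class SelfAttention(nn.Module):
    """SAGAN-style self-attention over spatial feature maps, with
    spectrally normalized 1x1 convolutions (assumed design)."""

    def __init__(self, channels):
        super().__init__()
        self.query = spectral_norm(nn.Conv2d(channels, channels // 8, 1))
        self.key = spectral_norm(nn.Conv2d(channels, channels // 8, 1))
        self.value = spectral_norm(nn.Conv2d(channels, channels, 1))
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (b, h*w, c//8)
        k = self.key(x).flatten(2)                    # (b, c//8, h*w)
        attn = torch.softmax(q @ k, dim=-1)           # (b, h*w, h*w)
        v = self.value(x).flatten(2)                  # (b, c, h*w)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                   # residual connection


# Usage: drop the block into a generator at an intermediate resolution.
feat = torch.randn(2, 64, 32, 32)
print(SelfAttention(64)(feat).shape)  # torch.Size([2, 64, 32, 32])
```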
Pages: 8