Compositional Transformers for Scene Generation

Cited by: 0
Authors
Hudson, Drew A. [1]
Zitnick, C. Lawrence [2]
Affiliations
[1] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
[2] Facebook Inc, Facebook AI Res, Menlo Pk, CA USA
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We introduce the GANformer2 model, an iterative object-oriented transformer, explored for the task of generative modeling. The network incorporates strong and explicit structural priors, to reflect the compositional nature of visual scenes, and synthesizes images through a sequential process. It operates in two stages: a fast and lightweight planning phase, where we draft a high-level scene layout, followed by an attention-based execution phase, where the layout is being refined, evolving into a rich and detailed picture. Our model moves away from conventional black-box GAN architectures that feature a flat and monolithic latent space towards a transparent design that encourages efficiency, controllability and interpretability. We demonstrate GANformer2's strengths and qualities through a careful evaluation over a range of datasets, from multi-object CLEVR scenes to the challenging COCO images, showing it successfully achieves state-of-the-art performance in terms of visual quality, diversity and consistency. Further experiments demonstrate the model's disentanglement and provide a deeper insight into its generative process, as it proceeds step-by-step from a rough initial sketch, to a detailed layout that accounts for objects' depths and dependencies, and up to the final high-resolution depiction of vibrant and intricate real-world scenes. See https://github.com/dorarad/gansformer for model implementation.
Pages: 15