Compositional Transformers for Scene Generation

Cited by: 0

Authors:

Hudson, Drew A. [1]

Zitnick, C. Lawrence [2]

Affiliations:

[1] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA

[2] Facebook Inc, Facebook AI Res, Menlo Pk, CA USA

Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021) | 2021 / Vol. 34

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We introduce the GANformer2 model, an iterative object-oriented transformer, explored for the task of generative modeling. The network incorporates strong and explicit structural priors, to reflect the compositional nature of visual scenes, and synthesizes images through a sequential process. It operates in two stages: a fast and lightweight planning phase, where we draft a high-level scene layout, followed by an attention-based execution phase, where the layout is being refined, evolving into a rich and detailed picture. Our model moves away from conventional black-box GAN architectures that feature a flat and monolithic latent space towards a transparent design that encourages efficiency, controllability and interpretability. We demonstrate GANformer2's strengths and qualities through a careful evaluation over a range of datasets, from multi-object CLEVR scenes to the challenging COCO images, showing it successfully achieves state-of-the-art performance in terms of visual quality, diversity and consistency. Further experiments demonstrate the model's disentanglement and provide a deeper insight into its generative process, as it proceeds step-by-step from a rough initial sketch, to a detailed layout that accounts for objects' depths and dependencies, and up to the final high-resolution depiction of vibrant and intricate real-world scenes. See https://github.com/dorarad/gansformer for model implementation.


Pages: 15
