Textual-Visual Logic Challenge: Understanding and Reasoning in Text-to-Image Generation

Cited by: 0
|
Authors
Xiong, Peixi [1 ]
Kozuch, Michael [1 ]
Jain, Nilesh [1 ]
Affiliations
[1] Intel Labs, Portland, OR 97229 USA
Source
COMPUTER VISION - ECCV 2024, PT V | 2025 / Vol. 15063
Keywords
Text-to-Image Generation; Structural Reasoning; Relational Understanding
DOI
10.1007/978-3-031-72652-1_19
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Text-to-image generation plays a pivotal role in computer vision and natural language processing by translating textual descriptions into visual representations. However, understanding complex relations in detailed text prompts filled with rich relational content remains a significant challenge. To address this, we introduce a novel task: Logic-Rich Text-to-Image generation. Unlike conventional image generation tasks that rely on short and structurally simple natural language inputs, our task focuses on intricate text inputs abundant in relational information. To tackle these complexities, we collect the Textual-Visual Logic dataset, designed to evaluate the performance of text-to-image generation models across diverse and complex scenarios. Furthermore, we propose a baseline model as a benchmark for this task. Our model comprises three key components: a relation understanding module, a multi-modality fusion module, and a negative pair discriminator. These components enhance the model's ability to handle disturbances in informative tokens and to prioritize relational elements during image generation. Code is available at https://github.com/IntelLabs/Textual-Visual-Logic-Challenge.
Pages: 318-334
Page count: 17
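
The abstract names three baseline components: a relation understanding module, a multi-modality fusion module, and a negative pair discriminator. The following is a minimal PyTorch-style sketch of how such components could be wired together; all class names, dimensions, and design choices are illustrative assumptions, not the authors' implementation (the linked repository contains the actual code).

# Hypothetical sketch of the three components named in the abstract.
# Class names, dimensions, and wiring are assumptions for illustration only.
import torch
import torch.nn as nn


class RelationUnderstanding(nn.Module):
    # Encodes a logic-rich prompt so relational tokens can attend to the
    # entities they connect (assumed transformer-encoder design).
    def __init__(self, vocab_size=30522, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, token_ids):                   # (B, T) int64
        return self.encoder(self.embed(token_ids))  # (B, T, d_model)


class MultiModalityFusion(nn.Module):
    # Injects the relational text features into image latents via
    # cross-attention with a residual connection (assumed design).
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, image_latents, text_feats):   # (B, N, d), (B, T, d)
        fused, _ = self.cross_attn(image_latents, text_feats, text_feats)
        return image_latents + fused


class NegativePairDiscriminator(nn.Module):
    # Scores how well an image matches its prompt; training would contrast
    # matched pairs against mismatched (negative) pairs (assumed design).
    def __init__(self, d_model=512):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))

    def forward(self, image_feat, text_feat):       # pooled (B, d) each
        return self.score(torch.cat([image_feat, text_feat], dim=-1))


# Toy forward pass with random inputs, just to show the wiring.
text = RelationUnderstanding()(torch.randint(0, 30522, (2, 16)))
latents = torch.randn(2, 64, 512)                   # e.g. flattened image tokens
fused = MultiModalityFusion()(latents, text)
score = NegativePairDiscriminator()(fused.mean(1), text.mean(1))
print(fused.shape, score.shape)                     # (2, 64, 512), (2, 1)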