Graph neural networks in vision-language image understanding: a survey

Cited by: 1

Authors:

Senior, Henry [1]

Slabaugh, Gregory [1]

Yuan, Shanxin [1]

Rossi, Luca [2]

Affiliations:

[1] Queen Mary Univ London, Digital Environm Res Inst, New Rd, London E1 1HH, England

[2] Hong Kong Polytech Univ, Dept Elect & Elect Engn, Hung Hom, Hong Kong, Peoples R China

Source:

VISUAL COMPUTER | 2024

Funding:

UK Engineering and Physical Sciences Research Council (EPSRC);

Keywords:

Graph neural networks; Image captioning; Visual question answering; Image retrieval; RETRIEVAL; KNOWLEDGE;

DOI:

10.1007/s00371-024-03343-0

Chinese Library Classification (CLC) number:

TP31 [Computer software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

2D image understanding is a complex problem within computer vision, but it holds the key to providing human-level scene comprehension. It goes further than identifying the objects in an image, and instead, it attempts to understand the scene. Solutions to this problem form the underpinning of a range of tasks, including image captioning, visual question answering (VQA), and image retrieval. Graphs provide a natural way to represent the relational arrangement between objects in an image, and thus, in recent years graph neural networks (GNNs) have become a standard component of many 2D image understanding pipelines, becoming a core architectural component, especially in the VQA group of tasks. In this survey, we review this rapidly evolving field and we provide a taxonomy of graph types used in 2D image understanding approaches, a comprehensive list of the GNN models used in this domain, and a roadmap of future potential developments. To the best of our knowledge, this is the first comprehensive survey that covers image captioning, visual question answering, and image retrieval techniques that focus on using GNNs as the main part of their architecture.

Pages: 26
