Graph neural networks in vision-language image understanding: a survey

Cited by: 1
Authors
Senior, Henry [1]
Slabaugh, Gregory [1]
Yuan, Shanxin [1]
Rossi, Luca [2]
Affiliations
[1] Queen Mary Univ London, Digital Environm Res Inst, New Rd, London E1 1HH, England
[2] Hong Kong Polytech Univ, Dept Elect & Elect Engn, Hung Hom, Hong Kong, Peoples R China
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
Graph neural networks; Image captioning; Visual question answering; Image retrieval; RETRIEVAL; KNOWLEDGE
DOI
10.1007/s00371-024-03343-0
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
2D image understanding is a complex problem within computer vision, but it holds the key to providing human-level scene comprehension. It goes beyond identifying the objects in an image and instead attempts to understand the scene. Solutions to this problem underpin a range of tasks, including image captioning, visual question answering (VQA), and image retrieval. Graphs provide a natural way to represent the relational arrangement between objects in an image, and thus, in recent years, graph neural networks (GNNs) have become a core architectural component of many 2D image understanding pipelines, especially in the VQA group of tasks. In this survey, we review this rapidly evolving field and provide a taxonomy of graph types used in 2D image understanding approaches, a comprehensive list of the GNN models used in this domain, and a roadmap of potential future developments. To the best of our knowledge, this is the first comprehensive survey that covers image captioning, visual question answering, and image retrieval techniques that use GNNs as the main part of their architecture.
Pages: 26