From image to language: A critical analysis of Visual Question Answering (VQA) approaches, challenges, and opportunities

Cited by: 3
Authors
Ishmam, Md. Farhan [1, 2]
Shovon, Md. Sakib Hossain [2, 3]
Mridha, M. F. [2, 3]
Dey, Nilanjan [4]
Affiliations
[1] Islamic Univ Technol, Dept Comp Sci & Engn, Dhaka, Bangladesh
[2] Adv Machine Intelligence Res Lab, Dhaka, Bangladesh
[3] Amer Int Univ, Dept Comp Sci & Engn, Dhaka, Bangladesh
[4] Techno Int New Town, Dept Comp Sci & Engn, Kolkata, India
Keywords
Visual Question Answering; Vision language pre-training; Multimodal learning; Multimodal large language models; FEATURE-EXTRACTION; NETWORK; ATTENTION; KNOWLEDGE; TOOL
DOI
10.1016/j.inffus.2024.102270
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The multimodal task of Visual Question Answering (VQA), encompassing elements of Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers to questions on any visual input. Over time, the scope of VQA has expanded from datasets focusing on an extensive collection of natural images to datasets featuring synthetic images, video, 3D environments, and various other visual inputs. The emergence of large pre-trained networks has shifted the field from early VQA approaches relying on feature extraction and fusion schemes to vision language pre-training (VLP) techniques. However, there is a lack of comprehensive surveys that encompass both traditional VQA architectures and contemporary VLP-based methods. Furthermore, the VLP challenges through the lens of VQA have not been thoroughly explored, leaving room for potential open problems to emerge. Our work presents a survey in the domain of VQA that delves into the intricacies of VQA datasets and methods over the field's history, introduces a detailed taxonomy to categorize the facets of VQA, and highlights the recent trends, challenges, and scopes for improvement. We further generalize VQA to multimodal question answering, explore tasks related to VQA, and present a set of open problems for future investigation. The work aims to guide both beginners and experts by shedding light on the potential avenues of research and expanding the boundaries of the field.
Pages: 32