From image to language: A critical analysis of Visual Question Answering (VQA) approaches, challenges, and opportunities

Cited by: 3
Authors
Ishmam, Md. Farhan [1, 2]
Shovon, Md. Sakib Hossain [2, 3]
Mridha, M. F. [2, 3]
Dey, Nilanjan [4]
Affiliations
[1] Islamic Univ Technol, Dept Comp Sci & Engn, Dhaka, Bangladesh
[2] Adv Machine Intelligence Res Lab, Dhaka, Bangladesh
[3] Amer Int Univ, Dept Comp Sci & Engn, Dhaka, Bangladesh
[4] Techno Int New Town, Dept Comp Sci & Engn, Kolkata, India
Keywords
Visual Question Answering; Vision language pre-training; Multimodal learning; Multimodal large language models; FEATURE-EXTRACTION; NETWORK; ATTENTION; KNOWLEDGE; TOOL;
DOI
10.1016/j.inffus.2024.102270
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The multimodal task of Visual Question Answering (VQA), encompassing elements of Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers to questions on any visual input. Over time, the scope of VQA has expanded from datasets focusing on an extensive collection of natural images to datasets featuring synthetic images, video, 3D environments, and various other visual inputs. The emergence of large pre-trained networks has shifted VQA from early approaches relying on feature extraction and fusion schemes to vision language pre-training (VLP) techniques. However, there is a lack of comprehensive surveys that encompass both traditional VQA architectures and contemporary VLP-based methods. Furthermore, the challenges of VLP through the lens of VQA have not been thoroughly explored, leaving room for potential open problems to emerge. Our work presents a survey in the domain of VQA that delves into the intricacies of VQA datasets and methods over the field's history, introduces a detailed taxonomy to categorize the facets of VQA, and highlights the recent trends, challenges, and scopes for improvement. We further generalize VQA to multimodal question answering, explore tasks related to VQA, and present a set of open problems for future investigation. The work aims to guide both beginners and experts by shedding light on the potential avenues of research and expanding the boundaries of the field.
Pages: 32