From image to language: A critical analysis of Visual Question Answering (VQA) approaches, challenges, and opportunities

Cited by: 3

Authors:

Ishmam, Md. Farhan ^{[1,2]}

Shovon, Md. Sakib Hossain ^{[2,3]}

Mridha, M. F. ^{[2,3]}

Dey, Nilanjan ^{[4]}

Affiliations:

[1] Islamic Univ Technol, Dept Comp Sci & Engn, Dhaka, Bangladesh

[2] Adv Machine Intelligence Res Lab, Dhaka, Bangladesh

[3] Amer Int Univ, Dept Comp Sci & Engn, Dhaka, Bangladesh

[4] Techno Int New Town, Dept Comp Sci & Engn, Kolkata, India

Source:

INFORMATION FUSION | 2024 / Vol. 106

Keywords:

Visual Question Answering; Vision language pre-training; Multimodal learning; Multimodal large language models; FEATURE-EXTRACTION; NETWORK; ATTENTION; KNOWLEDGE; TOOL;

DOI:

10.1016/j.inffus.2024.102270

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The multimodal task of Visual Question Answering (VQA), encompassing elements of Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers to questions on any visual input. Over time, the scope of VQA has expanded from datasets focusing on an extensive collection of natural images to datasets featuring synthetic images, video, 3D environments, and various other visual inputs. The emergence of large pre-trained networks has shifted the early VQA approaches relying on feature extraction and fusion schemes to vision language pre-training (VLP) techniques. However, there is a lack of comprehensive surveys that encompass both traditional VQA architectures and contemporary VLP-based methods. Furthermore, the VLP challenges through the lens of VQA have not been thoroughly explored, leaving room for potential open problems to emerge. Our work presents a survey in the domain of VQA that delves into the intricacies of VQA datasets and methods over the field's history, introduces a detailed taxonomy to categorize the facets of VQA, and highlights the recent trends, challenges, and scopes for improvement. We further generalize VQA to multimodal question answering, explore tasks related to VQA, and present a set of open problems for future investigation. The work aims to guide both beginners and experts by shedding light on the potential avenues of research and expanding the boundaries of the field.


Pages: 32
