From image to language: A critical analysis of Visual Question Answering (VQA) approaches, challenges, and opportunities

Citations: 3
Authors
Ishmam, Md. Farhan [1, 2]
Shovon, Md. Sakib Hossain [2, 3]
Mridha, M. F. [2, 3]
Dey, Nilanjan [4]
Affiliations
[1] Islamic Univ Technol, Dept Comp Sci & Engn, Dhaka, Bangladesh
[2] Adv Machine Intelligence Res Lab, Dhaka, Bangladesh
[3] Amer Int Univ, Dept Comp Sci & Engn, Dhaka, Bangladesh
[4] Techno Int New Town, Dept Comp Sci & Engn, Kolkata, India
Keywords
Visual Question Answering; Vision language pre-training; Multimodal learning; Multimodal large language models; FEATURE-EXTRACTION; NETWORK; ATTENTION; KNOWLEDGE; TOOL;
DOI
10.1016/j.inffus.2024.102270
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
The multimodal task of Visual Question Answering (VQA), encompassing elements of Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers to questions on any visual input. Over time, the scope of VQA has expanded from datasets focusing on an extensive collection of natural images to datasets featuring synthetic images, video, 3D environments, and various other visual inputs. The emergence of large pre-trained networks has shifted the early VQA approaches relying on feature extraction and fusion schemes to vision language pre-training (VLP) techniques. However, there is a lack of comprehensive surveys that encompass both traditional VQA architectures and contemporary VLP-based methods. Furthermore, the VLP challenges through the lens of VQA haven't been thoroughly explored, leaving room for potential open problems to emerge. Our work presents a survey in the domain of VQA that delves into the intricacies of VQA datasets and methods over the field's history, introduces a detailed taxonomy to categorize the facets of VQA, and highlights the recent trends, challenges, and scopes for improvement. We further generalize VQA to multimodal question answering, explore tasks related to VQA, and present a set of open problems for future investigation. The work aims to guide both beginners and experts by shedding light on the potential avenues of research and expanding the boundaries of the field.
Pages: 32
Related papers
50 in total
  • [21] Context-VQA: Towards Context-Aware and Purposeful Visual Question Answering
    Naik, Nandita
    Potts, Christopher
    Kreiss, Elisa
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 2813 - 2817
  • [22] Event-Oriented Visual Question Answering: The E-VQA Dataset and Benchmark
    Yang, Zhenguo
    Xiang, Jiale
    You, Jiuxiang
    Li, Qing
    Liu, Wenyin
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2023, 35 (10) : 10210 - 10223
  • [24] Image captioning improved visual question answering
    Sharma, Himanshu
    Jalal, Anand Singh
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (24) : 34775 - 34796
  • [25] A Critical Analysis of Benchmarks, Techniques, and Models in Medical Visual Question Answering
    Al-Hadhrami, Suheer
    Menai, Mohamed El Bachir
    Al-Ahmadi, Saad
    Alnafessah, Ahmed
    [J]. IEEE ACCESS, 2023, 11 : 136507 - 136540
  • [26] RESCUENET-VQA: A LARGE-SCALE VISUAL QUESTION ANSWERING BENCHMARK FOR DAMAGE ASSESSMENT
    Sarkar, Argho
    Rahnemoonfar, Maryam
    [J]. IGARSS 2023 - 2023 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, 2023, : 1150 - 1153
  • [27] An Analysis of Visual Question Answering Algorithms
    Kafle, Kushal
    Kanan, Christopher
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 1983 - 1991
  • [28] Visual question answering: Datasets, algorithms, and future challenges
    Kafle, Kushal
    Kanan, Christopher
    [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2017, 163 : 3 - 20
  • [29] Multiview Language Bias Reduction for Visual Question Answering
    Li, Pengju
    Tan, Zhiyi
    Bao, Bing-Kun
    [J]. IEEE MULTIMEDIA, 2023, 30 (01) : 91 - 99
  • [30] An Empirical Study on the Language Modal in Visual Question Answering
    Peng, Daowan
    Wei, Wei
    Mao, Xian-Ling
    Fu, Yuanyuan
    Chen, Dangyang
    [J]. PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 4109 - 4117