Component Analysis for Visual Question Answering Architectures

Authors
Kolling, Camila [1]
Wehrmann, Jonatas [1]
Barros, Rodrigo C. [1]
Affiliation
[1] Pontificia Univ Catolica Rio Grande do Sul, Machine Intelligence & Robot Res Grp, Sch Technol, Av Ipiranga 6681, BR-90619900 Porto Alegre, RS, Brazil
Keywords
Visual Question Answering; Computer Vision; Natural Language Processing
DOI
10.1109/ijcnn48605.2020.9206679
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent research advances in Computer Vision and Natural Language Processing have introduced novel tasks that are paving the way toward solving AI-complete problems. One of these tasks is Visual Question Answering (VQA). A VQA system takes an image and a free-form, open-ended natural-language question about that image, and produces a natural-language answer as output. The task has drawn great attention from the scientific community, which has generated a plethora of approaches aimed at improving VQA predictive accuracy. Most of them comprise three major components: (i) independent representation learning of images and questions; (ii) feature fusion, so that the model can use information from both sources to answer visual questions; and (iii) generation of the correct answer in natural language. With so many approaches introduced recently, the real contribution of each component to the final performance of the model has become unclear. The main goal of this paper is to provide a comprehensive analysis of the impact of each component in VQA models. Our extensive set of experiments covers both visual and textual elements, as well as the combination of these representations in the form of fusion and attention mechanisms. Our major contribution is to identify the core components for training VQA models so as to maximize their predictive performance.
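The three-component pipeline described in the abstract can be illustrated with a toy, untrained sketch. This is not the authors' model and does not reproduce any configuration they evaluate. It makes several assumptions. Image region features are precomputed vectors standing in for CNN outputs. The question is encoded as the mean of its word embeddings. Question-guided dot-product soft attention pools the image regions. Fusion is an element-wise (Hadamard) product, which is one common and simple choice. Answer "generation" is treated as classification over a fixed answer list. All function names, dimensions, and weights are hypothetical, and the weights are random.

```python
import math
import random
from typing import Dict, List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def encode_question(tokens: Sequence[str], embeddings: Dict[str, Vector], dim: int) -> Vector:
    """Component (i), text side: mean of word embeddings; unknown words are skipped."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


def attend(regions: Sequence[Vector], query: Vector) -> Vector:
    """Question-guided soft attention: weight each region by softmax(<region, query>).

    Assumes every region vector has the same dimension as the query vector.
    """
    weights = softmax([sum(r_i * q_i for r_i, q_i in zip(r, query)) for r in regions])
    return [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(len(query))]


def fuse(image_vec: Vector, question_vec: Vector) -> Vector:
    """Component (ii): element-wise (Hadamard) product, one simple fusion choice."""
    return [a * b for a, b in zip(image_vec, question_vec)]


def predict(fused: Vector, weights: List[Vector], answers: Sequence[str]) -> str:
    """Component (iii): answer 'generation' as classification over a fixed answer list."""
    logits = [sum(w_i * x_i for w_i, x_i in zip(row, fused)) for row in weights]
    probs = softmax(logits)
    return answers[max(range(len(answers)), key=probs.__getitem__)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed; weights are random, not trained
    dim = 8
    vocab = ["what", "color", "is", "the", "cat"]
    embeddings = {w: [rng.uniform(-1, 1) for _ in range(dim)] for w in vocab}
    # Hypothetical stand-in for precomputed image region features (component (i), image side).
    regions = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(4)]
    answers = ["black", "white", "yes", "no"]
    classifier = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in answers]

    q = encode_question("what color is the cat".split(), embeddings, dim)
    v = attend(regions, q)
    print(predict(fuse(v, q), classifier, answers))
```

Because the weights are untrained, the printed answer is arbitrary. The sketch only shows where each component sits. In a component analysis like the one described above, each function is a slot that can be swapped independently. For example, a recurrent question encoder could replace `encode_question`, or concatenation could replace the Hadamard product in `fuse`.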
Pages: 8