VQA: Visual Question Answering

被引:2360
|
作者
Antol, Stanislaw [1 ]
Agrawal, Aishwarya [1 ]
Lu, Jiasen [1 ]
Mitchell, Margaret [2 ]
Batra, Dhruv [1 ]
Zitnick, C. Lawrence [2 ]
Parikh, Devi [1 ]
机构
[1] Virginia Tech, Blacksburg, VA 24061 USA
[2] Microsoft Res, Cambridge, MA USA
基金
美国国家科学基金会;
关键词
D O I
10.1109/ICCV.2015.279
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing similar to 0.25M images, similar to 0.76M questions, and similar to 10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines for VQA are provided and compared with human performance.
引用
收藏
页码:2425 / 2433
页数:9
相关论文
共 50 条
  • [31] BOK-VQA: Bilingual outside Knowledge-Based Visual Question Answering via Graph Representation Pretraining
    Kim, MinJun
    Song, SeungWoo
    Lee, YouHan
    Jang, Haneol
    Lim, KyungTae
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 16, 2024, : 18381 - 18389
  • [32] VQA: Visual Question Answeringwww.visualqa.org
    Aishwarya Agrawal
    Jiasen Lu
    Stanislaw Antol
    Margaret Mitchell
    C. Lawrence Zitnick
    Devi Parikh
    Dhruv Batra
    [J]. International Journal of Computer Vision, 2017, 123 : 4 - 31
  • [33] Visual Question Answering A tutorial
    Teney, Damien
    Wu, Qi
    van den Hengel, Anton
    [J]. IEEE SIGNAL PROCESSING MAGAZINE, 2017, 34 (06) : 63 - 75
  • [34] Visual Question Generation as Dual Task of Visual Question Answering
    Li, Yikang
    Duan, Nan
    Zhou, Bolei
    Chu, Xiao
    Ouyang, Wanli
    Wang, Xiaogang
    Zhou, Ming
    [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 6116 - 6124
  • [35] Sequential Visual Reasoning for Visual Question Answering
    Liu, Jinlai
    Wu, Chenfei
    Wang, Xiaojie
    Dong, Xuan
    [J]. PROCEEDINGS OF 2018 5TH IEEE INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND INTELLIGENCE SYSTEMS (CCIS), 2018, : 410 - 415
  • [36] Robust Explanations for Visual Question Answering
    Patro, Badri N.
    Patel, Shivansh
    Namboodiri, Vinay P.
    [J]. 2020 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2020, : 1566 - 1575
  • [37] An Improved Attention for Visual Question Answering
    Rahman, Tanzila
    Chou, Shih-Han
    Sigal, Leonid
    Carenini, Giuseppe
    [J]. 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2021, 2021, : 1653 - 1662
  • [38] Visual Question Answering for Cultural Heritage
    Bongini, Pietro
    Becattini, Federico
    Bagdanov, Andrew D.
    Del Bimbo, Alberto
    [J]. INTERNATIONAL CONFERENCE FLORENCE HERI-TECH: THE FUTURE OF HERITAGE SCIENCE AND TECHNOLOGIES, 2020, 949
  • [39] Question action relevance and editing for visual question answering
    Andeep S. Toor
    Harry Wechsler
    Michele Nappi
    [J]. Multimedia Tools and Applications, 2019, 78 : 2921 - 2935
  • [40] Question -Led object attention for visual question answering
    Gao, Lianli
    Cao, Liangfu
    Xu, Xing
    Shao, Jie
    Song, Jingkuan
    [J]. NEUROCOMPUTING, 2020, 391 : 227 - 233