Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?

Cited by: 195
Authors
Das, Abhishek [1]
Agrawal, Harsh [2]
Zitnick, Larry [3]
Parikh, Devi [1,3]
Batra, Dhruv [1,3]
Affiliations
[1] Georgia Institute of Technology, Atlanta, GA 30332, USA
[2] Virginia Tech, Blacksburg, VA 24061, USA
[3] Facebook AI Research, Menlo Park, CA, USA
Funding
US National Science Foundation
Keywords
Visual Question Answering; Attention
DOI
10.1016/j.cviu.2017.10.001
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
We conduct large-scale studies on 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images. We design and test multiple game-inspired novel attention-annotation interfaces that require the subject to sharpen regions of a blurred image to answer a question. Thus, we introduce the VQA-HAT (Human ATtention) dataset. We evaluate attention maps generated by state-of-the-art VQA models against human attention both qualitatively (via visualizations) and quantitatively (via rank-order correlation). Our experiments show that current attention models in VQA do not seem to be looking at the same regions as humans. Finally, we train VQA models with explicit attention supervision, and find that it improves VQA performance.
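The quantitative comparison in the abstract rests on rank-order correlation between spatial attention maps. The snippet below is a minimal illustrative sketch of that idea, not the authors' released evaluation code: it resamples a model attention map to the human map's resolution and computes a Spearman rank correlation over the flattened maps. The `rank_correlation` helper, the map shapes, and the bilinear resampling choice are all assumptions made for illustration.

```python
import numpy as np
from scipy.ndimage import zoom
from scipy.stats import spearmanr


def rank_correlation(human_map, model_map):
    """Spearman rank correlation between two spatial attention maps."""
    # Resample the model map to the human map's resolution so the two
    # can be compared element-by-element (order=1 -> bilinear interpolation).
    scale = (human_map.shape[0] / model_map.shape[0],
             human_map.shape[1] / model_map.shape[1])
    model_resized = zoom(model_map, scale, order=1)
    # Flatten both maps and compare their rank orderings: 1.0 means the
    # two maps rank image regions identically, -1.0 means inverted ranks.
    rho, _ = spearmanr(human_map.ravel(), model_resized.ravel())
    return float(rho)


# Toy usage with random maps (hypothetical sizes: a 14x14 human annotation
# grid vs. a 7x7 model attention grid).
rng = np.random.default_rng(0)
human = rng.random((14, 14))
model = rng.random((7, 7))
print(f"rank-order correlation: {rank_correlation(human, model):.3f}")
```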
Pages: 90-100
Number of pages: 11
Related papers
50 records in total
  • [1] Deep Modular Co-Attention Networks for Visual Question Answering
    Yu, Zhou
    Yu, Jun
    Cui, Yuhao
    Tao, Dacheng
    Tian, Qi
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019: 6274-6283
  • [2] Where To Look: Focus Regions for Visual Question Answering
    Shih, Kevin J.
    Singh, Saurabh
    Hoiem, Derek
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016: 4613-4621
  • [3] Knowing Where to Look? Analysis on Attention of Visual Question Answering System
    Li, Wei
    Yuan, Zehuan
    Fang, Xiangzhong
    Wang, Changhu
Computer Vision - ECCV 2018 Workshops, Pt IV, 2019, 11132: 145-152
  • [4] Deep Attention Neural Tensor Network for Visual Question Answering
    Bai, Yalong
    Fu, Jianlong
    Zhao, Tiejun
    Mei, Tao
Computer Vision - ECCV 2018, Pt XII, 2018, 11216: 21-37
  • [5] Deep Modular Bilinear Attention Network for Visual Question Answering
    Yan, Feng
    Silamu, Wushouer
    Li, Yanbing
Sensors, 2022, 22 (3)
  • [6] Multi-level Attention Networks for Visual Question Answering
    Yu, Dongfei
    Fu, Jianlong
    Mei, Tao
    Rui, Yong
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017: 4187-4195
  • [7] Stacked Self-Attention Networks for Visual Question Answering
    Sun, Qiang
    Fu, Yanwei
ICMR'19: Proceedings of the 2019 ACM International Conference on Multimedia Retrieval, 2019: 207-211
  • [8] Regularizing Attention Networks for Anomaly Detection in Visual Question Answering
    Lee, Doyup
    Cheon, Yeongjae
    Han, Wook-Shin
Thirty-Fifth AAAI Conference on Artificial Intelligence, Thirty-Third Conference on Innovative Applications of Artificial Intelligence and the Eleventh Symposium on Educational Advances in Artificial Intelligence, 2021, 35: 1845-1853
  • [9] Multi-view Attention Networks for Visual Question Answering
    Li, Min
    Bai, Zongwen
    Deng, Jie
2024 6th International Conference on Natural Language Processing, ICNLP 2024, 2024: 788-794
  • [10] An Improved Attention for Visual Question Answering
    Rahman, Tanzila
    Chou, Shih-Han
    Sigal, Leonid
    Carenini, Giuseppe
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2021, 2021: 1653-1662