Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?

Cited: 195
Authors
Das, Abhishek [1]
Agrawal, Harsh [2]
Zitnick, Larry [3]
Parikh, Devi [1,3]
Batra, Dhruv [1,3]
Affiliations
[1] Georgia Inst Technol, Atlanta, GA 30332 USA
[2] Virginia Tech, Blacksburg, VA 24061 USA
[3] Facebook AI Res, Menlo Pk, CA USA
Funding
National Science Foundation (USA);
Keywords
Visual Question Answering; Attention;
DOI
10.1016/j.cviu.2017.10.001
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
We conduct large-scale studies of 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look when answering questions about images. We design and test multiple novel, game-inspired attention-annotation interfaces that require the subject to sharpen regions of a blurred image in order to answer a question, and from these annotations we introduce the VQA-HAT (Human ATtention) dataset. We evaluate attention maps generated by state-of-the-art VQA models against human attention, both qualitatively (via visualizations) and quantitatively (via rank-order correlation). Our experiments show that current attention models in VQA do not seem to be looking at the same regions as humans. Finally, we train VQA models with explicit attention supervision and find that it improves VQA performance.
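The quantitative comparison described in the abstract hinges on rank-order correlation between model and human attention maps. The sketch below shows one way such a score can be computed, assuming NumPy and SciPy; the function name, the common 14x14 pooling grid, and the average-pooling step are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

def rank_correlation(model_att, human_att, grid=(14, 14)):
    """Spearman rank correlation between two spatial attention maps.

    Both maps are average-pooled onto a common grid and flattened, so
    only the relative ordering of regions matters, not absolute values.
    """
    def pool(att):
        # Partition rows/columns into grid cells and average each cell.
        h, w = att.shape
        rows = np.array_split(np.arange(h), grid[0])
        cols = np.array_split(np.arange(w), grid[1])
        return np.array([[att[np.ix_(r, c)].mean() for c in cols] for r in rows])

    m = pool(np.asarray(model_att, dtype=float)).ravel()
    g = pool(np.asarray(human_att, dtype=float)).ravel()
    rho, _ = spearmanr(m, g)
    return rho

# Toy usage: two random 28x28 maps should correlate near zero.
rng = np.random.default_rng(0)
print(rank_correlation(rng.random((28, 28)), rng.random((28, 28))))
```

A score near 1 means the model ranks image regions in roughly the same order of importance as humans do; a score near 0 means the two orderings are unrelated.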
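The final sentence of the abstract refers to training with explicit attention supervision. Below is a minimal sketch of one way such a supervision term could be added to a VQA objective, assuming PyTorch; the KL-divergence formulation and the weight alpha are hypothetical choices for illustration, not the paper's exact loss.

```python
import torch.nn.functional as F

def vqa_loss_with_attention_supervision(answer_logits, answer_targets,
                                        model_attention, human_attention,
                                        alpha=0.5):
    """Answer-classification loss plus a term pulling the model's
    attention distribution toward the human attention map.

    model_attention, human_attention: (batch, num_regions), unnormalized.
    alpha: weight of the supervision term (assumed hyperparameter).
    """
    answer_loss = F.cross_entropy(answer_logits, answer_targets)
    # Normalize both maps into probability distributions over regions.
    log_p_model = F.log_softmax(model_attention, dim=1)
    p_human = F.softmax(human_attention, dim=1)
    attn_loss = F.kl_div(log_p_model, p_human, reduction="batchmean")
    return answer_loss + alpha * attn_loss
```

During training, the human map would come from VQA-HAT for questions that have annotations; for unannotated questions the supervision term can simply be dropped.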
Pages: 90 - 100
Number of Pages: 11
Related Papers
50 items in total
  • [31] Feature Fusion Attention Visual Question Answering
    Wang, Chunlin
    Sun, Jianyong
    Chen, Xiaolin
    ICMLC 2019: 2019 11TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND COMPUTING, 2019, : 412 - 416
  • [32] Dynamic Capsule Attention for Visual Question Answering
    Zhou, Yiyi
    Ji, Rongrong
    Su, Jinsong
    Sun, Xiaoshuai
    Chen, Weiqiu
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 9324 - 9331
  • [33] Multi-modal spatial relational attention networks for visual question answering
    Yao, Haibo
    Wang, Lipeng
    Cai, Chengtao
    Sun, Yuxin
    Zhang, Zhi
    Luo, Yongkang
    IMAGE AND VISION COMPUTING, 2023, 140
  • [34] Cross-modality co-attention networks for visual question answering
    Han, Dezhi
    Zhou, Shuli
    Li, Kuan Ching
    de Mello, Rodrigo Fernandes
    SOFT COMPUTING, 2021, 25 (07) : 5411 - 5421
  • [35] Multimodal Bi-direction Guided Attention Networks for Visual Question Answering
    Cai, Linqin
    Xu, Nuoying
    Tian, Hang
    Chen, Kejia
    Fan, Haodu
    NEURAL PROCESSING LETTERS, 2023, 55 (09) : 11921 - 11943
  • [37] Multi-Modal Explicit Sparse Attention Networks for Visual Question Answering
    Guo, Zihan
    Han, Dezhi
    SENSORS, 2020, 20 (23) : 1 - 15
  • [38] Sparse co-attention visual question answering networks based on thresholds
    Guo, Zihan
    Han, Dezhi
    APPLIED INTELLIGENCE, 2023, 53 (01) : 586 - 600
  • [39] A medical visual question answering approach based on co-attention networks
    Cui, W.
    Shi, W.
    Shao, H.
    Shengwu Yixue Gongchengxue Zazhi/Journal of Biomedical Engineering, 2024, 41 (03): 560 - 568