Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?

Cited: 195
Authors
Das, Abhishek [1]
Agrawal, Harsh [2]
Zitnick, Larry [3]
Parikh, Devi [1,3]
Batra, Dhruv [1,3]
Affiliations
[1] Georgia Inst Technol, Atlanta, GA 30332 USA
[2] Virginia Tech, Blacksburg, VA 24061 USA
[3] Facebook AI Res, Menlo Pk, CA USA
Funding
National Science Foundation (USA);
Keywords
Visual Question Answering; Attention;
DOI
10.1016/j.cviu.2017.10.001
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
We conduct large-scale studies of 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look when answering questions about images. We design and test multiple novel, game-inspired attention-annotation interfaces that require the subject to sharpen regions of a blurred image in order to answer a question, and from these annotations we introduce the VQA-HAT (Human ATtention) dataset. We evaluate attention maps generated by state-of-the-art VQA models against human attention, both qualitatively (via visualizations) and quantitatively (via rank-order correlation). Our experiments show that current attention models in VQA do not seem to be looking at the same regions as humans. Finally, we train VQA models with explicit attention supervision and find that it improves VQA performance.
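The quantitative comparison described in the abstract hinges on rank-order correlation between model and human attention maps. The sketch below shows one way such a score can be computed, assuming NumPy and SciPy; the function name, the common 14x14 pooling grid, and the average-pooling step are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

def rank_correlation(model_att, human_att, grid=(14, 14)):
    """Spearman rank correlation between two spatial attention maps.

    Both maps are average-pooled onto a common grid and flattened, so
    only the relative ordering of regions matters, not absolute values.
    """
    def pool(att):
        # Partition rows/columns into grid cells and average each cell.
        h, w = att.shape
        rows = np.array_split(np.arange(h), grid[0])
        cols = np.array_split(np.arange(w), grid[1])
        return np.array([[att[np.ix_(r, c)].mean() for c in cols] for r in rows])

    m = pool(np.asarray(model_att, dtype=float)).ravel()
    g = pool(np.asarray(human_att, dtype=float)).ravel()
    rho, _ = spearmanr(m, g)
    return rho

# Toy usage: two random 28x28 maps should correlate near zero.
rng = np.random.default_rng(0)
print(rank_correlation(rng.random((28, 28)), rng.random((28, 28))))
```

A score near 1 means the model ranks image regions in roughly the same order of importance as humans do; a score near 0 means the two orderings are unrelated.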
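The final sentence of the abstract refers to training with explicit attention supervision. Below is a minimal sketch of one way such a supervision term could be added to a VQA objective, assuming PyTorch; the KL-divergence formulation and the weight alpha are hypothetical choices for illustration, not the paper's exact loss.

```python
import torch.nn.functional as F

def vqa_loss_with_attention_supervision(answer_logits, answer_targets,
                                        model_attention, human_attention,
                                        alpha=0.5):
    """Answer-classification loss plus a term pulling the model's
    attention distribution toward the human attention map.

    model_attention, human_attention: (batch, num_regions), unnormalized.
    alpha: weight of the supervision term (assumed hyperparameter).
    """
    answer_loss = F.cross_entropy(answer_logits, answer_targets)
    # Normalize both maps into probability distributions over regions.
    log_p_model = F.log_softmax(model_attention, dim=1)
    p_human = F.softmax(human_attention, dim=1)
    attn_loss = F.kl_div(log_p_model, p_human, reduction="batchmean")
    return answer_loss + alpha * attn_loss
```

During training, the human map would come from VQA-HAT for questions that have annotations; for unannotated questions the supervision term can simply be dropped.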
Pages: 90 - 100
Number of Pages: 11
Related Papers
50 items in total
  • [31] Feature Fusion Attention Visual Question Answering
    Wang, Chunlin
    Sun, Jianyong
    Chen, Xiaolin
    ICMLC 2019: 2019 11TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND COMPUTING, 2019, : 412 - 416
  • [32] Dynamic Capsule Attention for Visual Question Answering
    Zhou, Yiyi
    Ji, Rongrong
    Su, Jinsong
    Sun, Xiaoshuai
    Chen, Weiqiu
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 9324 - 9331
  • [33] Multi-modal spatial relational attention networks for visual question answering
    Yao, Haibo
    Wang, Lipeng
    Cai, Chengtao
    Sun, Yuxin
    Zhang, Zhi
    Luo, Yongkang
    IMAGE AND VISION COMPUTING, 2023, 140
  • [34] Cross-modality co-attention networks for visual question answering
    Han, Dezhi
    Zhou, Shuli
    Li, Kuan Ching
    de Mello, Rodrigo Fernandes
    SOFT COMPUTING, 2021, 25 (07) : 5411 - 5421
  • [35] Multimodal Bi-direction Guided Attention Networks for Visual Question Answering
    Cai, Linqin
    Xu, Nuoying
    Tian, Hang
    Chen, Kejia
    Fan, Haodu
    NEURAL PROCESSING LETTERS, 2023, 55 (09) : 11921 - 11943
  • [37] Multi-Modal Explicit Sparse Attention Networks for Visual Question Answering
    Guo, Zihan
    Han, Dezhi
    SENSORS, 2020, 20 (23) : 1 - 15
  • [38] Sparse co-attention visual question answering networks based on thresholds
    Guo, Zihan
    Han, Dezhi
    APPLIED INTELLIGENCE, 2023, 53 (01) : 586 - 600
  • [39] A medical visual question answering approach based on co-attention networks
    Cui, W.
    Shi, W.
    Shao, H.
    Shengwu Yixue Gongchengxue Zazhi/Journal of Biomedical Engineering, 2024, 41 (03): 560 - 568