Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?

Cited by: 195
Authors
Das, Abhishek [1]
Agrawal, Harsh [2]
Zitnick, Larry [3]
Parikh, Devi [1,3]
Batra, Dhruv [1,3]
Affiliations
[1] Georgia Institute of Technology, Atlanta, GA 30332, USA
[2] Virginia Tech, Blacksburg, VA 24061, USA
[3] Facebook AI Research, Menlo Park, CA, USA
Funding
US National Science Foundation
Keywords
Visual Question Answering; Attention
DOI
10.1016/j.cviu.2017.10.001
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
We conduct large-scale studies on 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images. We design and test multiple game-inspired novel attention-annotation interfaces that require the subject to sharpen regions of a blurred image to answer a question. Thus, we introduce the VQA-HAT (Human ATtention) dataset. We evaluate attention maps generated by state-of-the-art VQA models against human attention both qualitatively (via visualizations) and quantitatively (via rank-order correlation). Our experiments show that current attention models in VQA do not seem to be looking at the same regions as humans. Finally, we train VQA models with explicit attention supervision, and find that it improves VQA performance.
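The quantitative comparison in the abstract rests on rank-order correlation between spatial attention maps. The snippet below is a minimal illustrative sketch of that idea, not the authors' released evaluation code: it resamples a model attention map to the human map's resolution and computes a Spearman rank correlation over the flattened maps. The `rank_correlation` helper, the map shapes, and the bilinear resampling choice are all assumptions made for illustration.

```python
import numpy as np
from scipy.ndimage import zoom
from scipy.stats import spearmanr


def rank_correlation(human_map, model_map):
    """Spearman rank correlation between two spatial attention maps."""
    # Resample the model map to the human map's resolution so the two
    # can be compared element-by-element (order=1 -> bilinear interpolation).
    scale = (human_map.shape[0] / model_map.shape[0],
             human_map.shape[1] / model_map.shape[1])
    model_resized = zoom(model_map, scale, order=1)
    # Flatten both maps and compare their rank orderings: 1.0 means the
    # two maps rank image regions identically, -1.0 means inverted ranks.
    rho, _ = spearmanr(human_map.ravel(), model_resized.ravel())
    return float(rho)


# Toy usage with random maps (hypothetical sizes: a 14x14 human annotation
# grid vs. a 7x7 model attention grid).
rng = np.random.default_rng(0)
human = rng.random((14, 14))
model = rng.random((7, 7))
print(f"rank-order correlation: {rank_correlation(human, model):.3f}")
```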
Pages: 90-100
Number of pages: 11
Related papers
50 records in total
  • [1] Deep Modular Co-Attention Networks for Visual Question Answering
    Yu, Zhou
    Yu, Jun
    Cui, Yuhao
    Tao, Dacheng
    Tian, Qi
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019: 6274-6283
  • [2] Where To Look: Focus Regions for Visual Question Answering
    Shih, Kevin J.
    Singh, Saurabh
    Hoiem, Derek
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016: 4613-4621
  • [3] Knowing Where to Look? Analysis on Attention of Visual Question Answering System
    Li, Wei
    Yuan, Zehuan
    Fang, Xiangzhong
    Wang, Changhu
Computer Vision - ECCV 2018 Workshops, Pt IV, 2019, 11132: 145-152
  • [4] Deep Attention Neural Tensor Network for Visual Question Answering
    Bai, Yalong
    Fu, Jianlong
    Zhao, Tiejun
    Mei, Tao
Computer Vision - ECCV 2018, Pt XII, 2018, 11216: 21-37
  • [5] Deep Modular Bilinear Attention Network for Visual Question Answering
    Yan, Feng
    Silamu, Wushouer
    Li, Yanbing
Sensors, 2022, 22 (3)
  • [6] Multi-level Attention Networks for Visual Question Answering
    Yu, Dongfei
    Fu, Jianlong
    Mei, Tao
    Rui, Yong
30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017: 4187-4195
  • [7] Stacked Self-Attention Networks for Visual Question Answering
    Sun, Qiang
    Fu, Yanwei
ICMR'19: Proceedings of the 2019 ACM International Conference on Multimedia Retrieval, 2019: 207-211
  • [8] Regularizing Attention Networks for Anomaly Detection in Visual Question Answering
    Lee, Doyup
    Cheon, Yeongjae
    Han, Wook-Shin
Thirty-Fifth AAAI Conference on Artificial Intelligence, Thirty-Third Conference on Innovative Applications of Artificial Intelligence and the Eleventh Symposium on Educational Advances in Artificial Intelligence, 2021, 35: 1845-1853
  • [9] Multi-view Attention Networks for Visual Question Answering
    Li, Min
    Bai, Zongwen
    Deng, Jie
2024 6th International Conference on Natural Language Processing, ICNLP 2024, 2024: 788-794
  • [10] An Improved Attention for Visual Question Answering
    Rahman, Tanzila
    Chou, Shih-Han
    Sigal, Leonid
    Carenini, Giuseppe
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2021, 2021: 1653-1662