Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering

Cited by: 415

Authors:

Xu, Huijuan [1]

Saenko, Kate [1]

Affiliation:

[1] Boston Univ, Comp Sci, Boston, MA 02215 USA

Source:

COMPUTER VISION - ECCV 2016, PT VII | 2016 / Vol. 9911

Keywords:

Visual question answering; Spatial attention; Memory network; Deep learning;

DOI:

10.1007/978-3-319-46478-7_28

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We address the problem of Visual Question Answering (VQA), which requires joint image and language understanding to answer a question about a given photograph. Recent approaches have applied deep image captioning methods based on convolutional-recurrent networks to this problem, but have failed to model spatial inference. To remedy this, we propose a model we call the Spatial Memory Network and apply it to the VQA task. Memory networks are recurrent neural networks with an explicit attention mechanism that selects certain parts of the information stored in memory. Our Spatial Memory Network stores neuron activations from different spatial regions of the image in its memory, and uses attention to choose regions relevant for computing the answer. We propose a novel question-guided spatial attention architecture that looks for regions relevant to either individual words or the entire question, repeating the process over multiple recurrent steps, or "hops". To better understand the inference process learned by the network, we design synthetic questions that specifically require spatial inference and visualize the network's attention. We evaluate our model on two available visual question answering datasets and obtain improved results.
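The attention mechanism summarized in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the projection matrices `W_att` and `W_evd`, the additive query update, and all dimensions are assumptions chosen for illustration, and only attention guided by the whole-question embedding is shown (the word-guided variant is omitted). Each hop scores every stored image region against the current query, normalizes the scores with a softmax, and passes the attention-weighted evidence on to the next hop.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_hop(regions, query, W_att, W_evd):
    """One hop of question-guided attention over image regions (illustrative).

    regions: (R, D) activations, one row per spatial region (the "memory").
    query:   (H,) question embedding, or the output of the previous hop.
    W_att, W_evd: (D, H) hypothetical projections into the attention and
                  evidence embedding spaces.
    """
    keys = regions @ W_att                   # (R, H) region keys
    weights = softmax(keys @ query)          # (R,) relevance of each region
    evidence = weights @ (regions @ W_evd)   # (H,) attention-weighted evidence
    # Assumption: the next query is the evidence plus the current query.
    return weights, evidence + query

def multi_hop(regions, question, params):
    # Repeat the attention step once per (W_att, W_evd) pair, i.e. per hop.
    query = question
    all_weights = []
    for W_att, W_evd in params:
        weights, query = attention_hop(regions, query, W_att, W_evd)
        all_weights.append(weights)
    return all_weights, query

rng = np.random.default_rng(0)
R, D, H = 49, 16, 8   # e.g. a 7x7 grid of regions; sizes are arbitrary
regions = rng.normal(size=(R, D))
question = rng.normal(size=H)
params = [(0.1 * rng.normal(size=(D, H)), 0.1 * rng.normal(size=(D, H)))
          for _ in range(2)]                 # two hops
all_weights, final_query = multi_hop(regions, question, params)
```

Reshaping each entry of `all_weights` to the region grid (here 7 x 7) would give a per-hop attention map of the kind the abstract describes visualizing; in a full model, `final_query` would feed an answer classifier.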

Pages: 451 - 466

Number of pages: 16
