Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering

Cited by: 415
Authors
Xu, Huijuan [1]
Saenko, Kate [1]
Affiliations
[1] Boston Univ, Comp Sci, Boston, MA 02215 USA
Keywords
Visual question answering; Spatial attention; Memory network; Deep learning
DOI
10.1007/978-3-319-46478-7_28
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We address the problem of Visual Question Answering (VQA), which requires joint image and language understanding to answer a question about a given photograph. Recent approaches have applied deep image captioning methods based on convolutional-recurrent networks to this problem, but have failed to model spatial inference. To remedy this, we propose a model we call the Spatial Memory Network and apply it to the VQA task. Memory networks are recurrent neural networks with an explicit attention mechanism that selects certain parts of the information stored in memory. Our Spatial Memory Network stores neuron activations from different spatial regions of the image in its memory, and uses attention to choose regions relevant for computing the answer. We propose a novel question-guided spatial attention architecture that looks for regions relevant to either individual words or the entire question, repeating the process over multiple recurrent steps, or "hops". To better understand the inference process learned by the network, we design synthetic questions that specifically require spatial inference and visualize the network's attention. We evaluate our model on two available visual question answering datasets and obtain improved results.
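The multi-hop, question-guided attention described in the abstract can be illustrated with a minimal sketch. This is not the paper's architecture: it assumes plain dot-product scoring between a question vector and each region vector, and an additive query update between hops. The names `attend` and `multi_hop_attention`, and the toy feature values, are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(regions: List[Vector], query: Vector) -> Tuple[Vector, Vector]:
    """Score every region against the query and return
    (attention weights, attention-weighted sum of region features)."""
    weights = softmax([dot(region, query) for region in regions])
    dim = len(query)
    summary = [
        sum(w * region[i] for w, region in zip(weights, regions))
        for i in range(dim)
    ]
    return weights, summary


def multi_hop_attention(
    regions: List[Vector], question: Vector, hops: int = 2
) -> Tuple[List[Vector], Vector]:
    """Repeat attention for several hops. Assumption: each hop adds the
    attended visual summary to the query (a simplification for this sketch)."""
    query = list(question)
    history = []
    for _ in range(hops):
        weights, summary = attend(regions, query)
        history.append(weights)
        query = [q + s for q, s in zip(query, summary)]
    return history, query


# Toy example: three image regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
question = [2.0, 0.0]
history, final_query = multi_hop_attention(regions, question, hops=2)
for hop, weights in enumerate(history, start=1):
    print(f"hop {hop}: {[round(w, 3) for w in weights]}")
```

In this toy setup the first region aligns best with the question vector, so it receives the largest weight on the first hop; the second hop re-scores the regions using the question combined with the evidence gathered so far.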
Pages: 451-466
Page count: 16
Related papers
50 items in total
  • [1] Question-Guided Hybrid Convolution for Visual Question Answering
    Gao, Peng
    Li, Hongsheng
    Li, Shuang
    Lu, Pan
    Li, Yikang
    Hoi, Steven C. H.
    Wang, Xiaogang
    [J]. COMPUTER VISION - ECCV 2018, PT I, 2018, 11205 : 485 - 501
  • [2] Question-guided feature pyramid network for medical visual question answering
    Yu, Yonglin
    Li, Haifeng
    Shi, Hanrong
    Li, Lin
    Xiao, Jun
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2023, 214
  • [3] Question Type Guided Attention in Visual Question Answering
    Shi, Yang
    Furlanello, Tommaso
    Zha, Sheng
    Anandkumar, Animashree
    [J]. COMPUTER VISION - ECCV 2018, PT IV, 2018, 11208 : 158 - 175
  • [4] Question-Guided Erasing-Based Spatiotemporal Attention Learning for Video Question Answering
    Liu, Fei
    Liu, Jing
    Hong, Richang
    Lu, Hanqing
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (03) : 1367 - 1379
  • [5] A question-guided multi-hop reasoning graph network for visual question answering
    Xu, Zhaoyang
    Gu, Jinguang
    Liu, Maofu
    Zhou, Guangyou
    Fu, Haidong
    Qiu, Chen
    [J]. INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (02)
  • [6] Learning neighbor-enhanced region representations and question-guided visual representations for visual question answering
    Gao, Ling
    Zhang, Hongda
    Sheng, Nan
    Shi, Lida
    Xu, Hao
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2024, 238
  • [7] Divide and Conquer: Question-Guided Spatio-Temporal Contextual Attention for Video Question Answering
    Jiang, Jianwen
    Chen, Ziqiang
    Lin, Haojie
    Zhao, Xibin
    Gao, Yue
    [J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 11101 - 11108
  • [8] Question-relationship guided graph attention network for visual question answer
    Liu, Rui
    Zhuang, Liansheng
    Yu, Zhou
    Jiang, Zhihao
    Bai, Tian
    [J]. MULTIMEDIA SYSTEMS, 2022, 28 (02) : 445 - 456
  • [9] Question-relationship guided graph attention network for visual question answer
    Rui Liu
    Liansheng Zhuang
    Zhou Yu
    Zhihao Jiang
    Tian Bai
    [J]. Multimedia Systems, 2022, 28 : 445 - 456
  • [10] Locate Before Answering: Answer Guided Question Localization for Video Question Answering
    Qian, Tianwen
    Cui, Ran
    Chen, Jingjing
    Peng, Pai
    Guo, Xiaowei
    Jiang, Yu-Gang
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 4554 - 4563