Stacked Attention Networks for Image Question Answering

Cited by: 1166

Authors:

Yang, Zichao [1]

He, Xiaodong [2]

Gao, Jianfeng [2]

Deng, Li [2]

Smola, Alex [1]

Affiliations:

[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA

[2] Microsoft Res, Redmond, WA 98052 USA

Source:

2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2016

DOI:

10.1109/CVPR.2016.10

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper presents stacked attention networks (SANs) that learn to answer natural language questions from images. SANs use the semantic representation of a question as a query to search for the regions in an image that are related to the answer. We argue that image question answering (QA) often requires multiple steps of reasoning. Thus, we develop a multiple-layer SAN in which we query an image multiple times to infer the answer progressively. Experiments conducted on four image QA data sets demonstrate that the proposed SANs significantly outperform previous state-of-the-art approaches. The visualization of the attention layers illustrates the process by which the SAN locates, layer by layer, the relevant visual clues that lead to the answer to the question.
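The multi-step querying described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not give the layer equations, so the tanh scoring, the softmax over regions, the additive query update, and every name and dimension here (`attention_layer`, `stacked_attention`, `w_img`, `w_qry`, `w_score`, the random features) are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_layer(regions, query, w_img, w_qry, w_score):
    """One attention hop (assumed form): score each region against the query, then pool."""
    hidden = np.tanh(regions @ w_img + query @ w_qry)  # (m, k), query broadcast over regions
    weights = softmax(hidden @ w_score)                # (m,), one weight per image region
    attended = weights @ regions                       # (d,), weighted sum of region features
    return attended + query, weights                   # refined query for the next hop

def stacked_attention(regions, question, params):
    """Query the image once per layer, refining the query each time."""
    query = question
    history = []
    for w_img, w_qry, w_score in params:
        query, weights = attention_layer(regions, query, w_img, w_qry, w_score)
        history.append(weights)
    return query, history

# Hypothetical toy sizes: 6 image regions, 8-dim features, 5-dim hidden layer, 2 hops.
rng = np.random.default_rng(0)
m, d, k = 6, 8, 5
regions = rng.normal(size=(m, d))
question = rng.normal(size=d)
params = [
    (rng.normal(size=(d, k)), rng.normal(size=(d, k)), rng.normal(size=k))
    for _ in range(2)
]
refined, history = stacked_attention(regions, question, params)
```

In a trained model the weights would be learned and `refined` would feed an answer classifier; `history` holds the per-layer attention distributions, which is the kind of quantity the abstract says is visualized.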

Pages: 21-29

Page count: 9

