From Pixels to Objects: Cubic Visual Attention for Visual Question Answering

Cited: 0
|
Authors
Song, Jingkuan
Zeng, Pengpeng
Gao, Lianli [1 ]
Shen, Heng Tao [1 ]
Affiliations
[1] Univ Elect Sci & Technol China, Ctr Future Media, Chengdu 611731, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
None available
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, attention-based Visual Question Answering (VQA) has achieved great success by utilizing the question to selectively target the visual areas related to the answer. Existing visual attention models are generally planar, i.e., different channels of the last conv-layer feature map of an image share the same weight. This conflicts with the attention mechanism, because CNN features are naturally both spatial and channel-wise. Moreover, visual attention is usually computed at the pixel level, which may cause a region-discontinuity problem. In this paper, we propose a Cubic Visual Attention (CVA) model that applies a novel channel and spatial attention over object regions to improve the VQA task. Specifically, instead of attending to pixels, we first take advantage of object proposal networks to generate a set of object candidates and extract their associated conv features. Then, we utilize the question to guide the channel attention and spatial attention computed over the conv-layer feature map. Finally, the attended visual features and the question are combined to infer the answer. We assess the performance of our proposed CVA on three public image QA datasets: COCO-QA, VQA and Visual7W. Experimental results show that our proposed method significantly outperforms the state-of-the-art methods.
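The channel-then-spatial weighting described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's exact formulation: the projection matrices `Wc` and `Ws`, the mean-pooling descriptors, and the way the question embedding is concatenated in are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cubic_attention(feat, q, Wc, Ws):
    """Question-guided channel + spatial attention over one object's
    conv feature map (a sketch of the idea, not the paper's model).

    feat: (C, S) conv features -- C channels, S spatial locations
    q:    (D,)   question embedding
    Wc:   (C, C + D) channel-attention projection (assumed form)
    Ws:   (S, S + D) spatial-attention projection (assumed form)
    """
    # Channel attention: pool over space, then score each channel
    # jointly with the question.
    chan_desc = feat.mean(axis=1)                          # (C,)
    alpha = softmax(Wc @ np.concatenate([chan_desc, q]))   # (C,)
    feat_c = feat * alpha[:, None]                         # reweight channels

    # Spatial attention: pool over channels of the reweighted map,
    # then score each spatial location jointly with the question.
    spat_desc = feat_c.mean(axis=0)                        # (S,)
    beta = softmax(Ws @ np.concatenate([spat_desc, q]))    # (S,)

    # Attended visual vector for this object region.
    attended = (feat_c * beta[None, :]).sum(axis=1)        # (C,)
    return attended, alpha, beta
```

In the full model this would run over the conv features of each object proposal rather than a single map, and the attended vectors would be fused with the question representation to predict the answer.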
Pages: 906 - 912
Page count: 7
Related papers
50 records in total
  • [21] Collaborative Attention Network to Enhance Visual Question Answering
    Gu, Rui
    BASIC & CLINICAL PHARMACOLOGY & TOXICOLOGY, 2019, 124 : 304 - 305
  • [22] Densely Connected Attention Flow for Visual Question Answering
    Liu, Fei
    Liu, Jing
    Fang, Zhiwei
    Hong, Richang
    Lu, Hanqing
    PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019, : 869 - 875
  • [23] Fair Attention Network for Robust Visual Question Answering
    Bi Y.
    Jiang H.
    Hu Y.
    Sun Y.
    Yin B.
    IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34 (09) : 1 - 1
  • [24] Adversarial Learning with Bidirectional Attention for Visual Question Answering
    Li, Qifeng
    Tang, Xinyi
    Jian, Yi
    SENSORS, 2021, 21 (21)
  • [25] Learning Visual Question Answering by Bootstrapping Hard Attention
    Malinowski, Mateusz
    Doersch, Carl
    Santoro, Adam
    Battaglia, Peter
    COMPUTER VISION - ECCV 2018, PT VI, 2018, 11210 : 3 - 20
  • [26] Dual Attention and Question Categorization-Based Visual Question Answering
    Mishra A.
    Anand A.
    Guha P.
    IEEE Transactions on Artificial Intelligence, 2023, 4 (01): : 81 - 91
  • [27] Co-Attention Network With Question Type for Visual Question Answering
    Yang, Chao
    Jiang, Mengqi
    Jiang, Bin
    Zhou, Weixin
    Li, Keqin
    IEEE ACCESS, 2019, 7 : 40771 - 40781
  • [28] QAlayout: Question Answering Layout Based on Multimodal Attention for Visual Question Answering on Corporate Document
    Mahamoud, Ibrahim Souleiman
    Coustaty, Mickael
    Joseph, Aurelie
    d'Andecy, Vincent Poulain
    Ogier, Jean-Marc
    DOCUMENT ANALYSIS SYSTEMS, DAS 2022, 2022, 13237 : 659 - 673
  • [29] Multimodal Encoders and Decoders with Gate Attention for Visual Question Answering
    Li, Haiyan
    Han, Dezhi
    COMPUTER SCIENCE AND INFORMATION SYSTEMS, 2021, 18 (03) : 1023 - 1040
  • [30] Local relation network with multilevel attention for visual question answering
    Sun, Bo
    Yao, Zeng
    Zhang, Yinghui
    Yu, Lejun
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2020, 73