From Pixels to Objects: Cubic Visual Attention for Visual Question Answering

Citations: 0

Authors:

Song, Jingkuan

Zeng, Pengpeng

Gao, Lianli [1]

Shen, Heng Tao [1]

Affiliations:

[1] Univ Elect Sci & Technol China, Ctr Future Media, Chengdu 611731, Peoples R China

Source:

PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE | 2018

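Funding:

National Natural Science Foundation of China

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract: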

Recently, attention-based Visual Question Answering (VQA) has achieved great success by using the question to selectively target the visual areas that are related to the answer. Existing visual attention models are generally planar, i.e., the different channels of the last conv-layer feature map of an image share the same weight. This conflicts with the attention mechanism because CNN features are naturally both spatial and channel-wise. In addition, visual attention is usually applied at the pixel level, which may cause region discontinuity problems. In this paper, we propose a Cubic Visual Attention (CVA) model that applies novel channel and spatial attention to object regions to improve the VQA task. Specifically, instead of attending to pixels, we first use object proposal networks to generate a set of object candidates and extract their associated conv features. Then, we use the question to guide the calculation of channel attention and spatial attention based on the conv-layer feature map. Finally, the attended visual features and the question are combined to infer the answer. We assess the performance of the proposed CVA on three public image QA datasets: COCO-QA, VQA, and Visual7W. Experimental results show that the proposed method significantly outperforms state-of-the-art methods.
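
The abstract does not give the attention equations, so the following is only a minimal sketch of the general idea it describes: question-guided channel attention over object-region features, followed by attention over the objects themselves. The function name `cubic_attention`, the projection matrix `Wq`, and the specific scoring rules (products with a projected question vector, then softmax) are illustrative assumptions, not the authors' formulation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cubic_attention(V, q, Wq):
    """Hypothetical sketch, NOT the paper's exact model.

    V  : K object-region features, each a list of C channel values.
    q  : question embedding of length D.
    Wq : assumed C x D projection mapping the question into channel space.
    """
    K, C = len(V), len(V[0])
    q_proj = matvec(Wq, q)  # question projected to C dimensions

    # Channel attention: one weight per channel, guided by the question.
    v_mean = [sum(V[k][c] for k in range(K)) / K for c in range(C)]
    beta = softmax([v_mean[c] * q_proj[c] for c in range(C)])
    V_ch = [[beta[c] * V[k][c] for c in range(C)] for k in range(K)]

    # Object (spatial) attention: one weight per object region.
    scores = [sum(V_ch[k][c] * q_proj[c] for c in range(C)) for k in range(K)]
    alpha = softmax(scores)

    # Attended visual feature: attention-weighted sum over objects.
    attended = [sum(alpha[k] * V_ch[k][c] for k in range(K)) for c in range(C)]
    return beta, alpha, attended

# Toy inputs: 2 object regions, 3 channels, 2-dimensional question embedding.
V = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]
q = [1.0, -1.0]
Wq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
beta, alpha, attended = cubic_attention(V, q, Wq)
```

Here `beta` weights channels and `alpha` weights object regions, so the two together cover both axes of the feature tensor. In a full model, `attended` would then be combined with the question representation to predict the answer, which this sketch leaves out.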

Pages: 906-912

Page count: 7
