From Pixels to Objects: Cubic Visual Attention for Visual Question Answering

Cited by: 0
Authors
Song, Jingkuan
Zeng, Pengpeng
Gao, Lianli [1]
Shen, Heng Tao [1]
Affiliations
[1] Univ Elect Sci & Technol China, Ctr Future Media, Chengdu 611731, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Recently, attention-based Visual Question Answering (VQA) has achieved great success by using the question to selectively attend to the visual regions that are relevant to the answer. Existing visual attention models are generally planar, i.e., different channels of the last conv-layer feature map of an image share the same weight. This conflicts with the attention mechanism, because CNN features are naturally both spatial and channel-wise. Moreover, visual attention is usually computed at the pixel level, which can cause region-discontinuity problems. In this paper, we propose a Cubic Visual Attention (CVA) model that applies novel channel and spatial attention to object regions to improve the VQA task. Specifically, instead of attending to pixels, we first use object proposal networks to generate a set of object candidates and extract their associated conv features. Then, we use the question to guide channel attention and spatial attention over the conv-layer feature map. Finally, the attended visual features and the question are combined to infer the answer. We assess the performance of our proposed CVA on three public image QA datasets: COCO-QA, VQA, and Visual7W. Experimental results show that our proposed method significantly outperforms the state-of-the-art methods.
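To make the pipeline described in the abstract concrete, the sketch below implements question-guided channel attention followed by object-level spatial attention in PyTorch. It is a minimal illustration only: the feature dimensions (2048-d region features, 1024-d question embedding), the layer names such as c_gate and s_score, and the sigmoid/softmax gating choices are assumptions for the sake of the example, not the authors' published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CubicVisualAttention(nn.Module):
    """Minimal sketch of question-guided channel + spatial (object-level)
    attention. Dimensions, layer names, and the sigmoid/softmax choices
    are illustrative assumptions, not the authors' released code."""

    def __init__(self, v_dim=2048, q_dim=1024, hidden=512):
        super().__init__()
        # channel attention: question + pooled visual summary -> one weight per channel
        self.q2c = nn.Linear(q_dim, hidden)
        self.v2c = nn.Linear(v_dim, hidden)
        self.c_gate = nn.Linear(hidden, v_dim)
        # spatial attention: question + per-object features -> one weight per object
        self.q2s = nn.Linear(q_dim, hidden)
        self.v2s = nn.Linear(v_dim, hidden)
        self.s_score = nn.Linear(hidden, 1)

    def forward(self, v, q):
        # v: (B, K, C) conv features for K object proposals
        # q: (B, Q) question embedding (e.g. the last LSTM state)
        v_mean = v.mean(dim=1)  # (B, C) pooled summary for channel attention
        # channel attention: a sigmoid gate per feature channel
        c_att = torch.sigmoid(self.c_gate(torch.tanh(self.q2c(q) + self.v2c(v_mean))))
        v = v * c_att.unsqueeze(1)                                  # reweight channels
        # spatial attention: softmax over the K object regions
        joint = torch.tanh(self.q2s(q).unsqueeze(1) + self.v2s(v))  # (B, K, hidden)
        s_att = F.softmax(self.s_score(joint).squeeze(-1), dim=1)   # (B, K)
        return (v * s_att.unsqueeze(-1)).sum(dim=1)                 # (B, C) attended feature

# usage: e.g. 36 region proposals with 2048-d features per image
v = torch.randn(8, 36, 2048)              # batch of 8 images
q = torch.randn(8, 1024)                  # matching question embeddings
attended = CubicVisualAttention()(v, q)   # (8, 2048); fused with q downstream
```

Here the channel gate reweights each of the C feature channels before a softmax distributes attention over the K object regions, reflecting the pixels-to-objects shift the abstract describes.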
Pages: 906-912
Page count: 7
Related Papers
50 records in total
  • [31] Focal Visual-Text Attention for Memex Question Answering
    Liang, Junwei
    Jiang, Lu
    Cao, Liangliang
    Kalantidis, Yannis
    Li, Li-Jia
    Hauptmann, Alexander G.
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2019, 41 (08) : 1893 - 1908
  • [32] Stacked Attention based Textbook Visual Question Answering with BERT
    Aishwarya, R.
    Sarath, P.
    Rahman, Shibil P.
    Sneha, U.
    Manmadhan, Sruthy
    2022 IEEE 19TH INDIA COUNCIL INTERNATIONAL CONFERENCE, INDICON, 2022
  • [33] Stacked Self-Attention Networks for Visual Question Answering
    Sun, Qiang
    Fu, Yanwei
    ICMR'19: PROCEEDINGS OF THE 2019 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2019 : 207 - 211
  • [34] Multimodal attention-driven visual question answering for Malayalam
    Kovath A.G.
    Nayyar A.
    Sikha O.K.
    Neural Computing and Applications, 2024, 36 (24) : 14691 - 14708
  • [35] Multi-stage Attention based Visual Question Answering
    Mishra, Aakansha
    Anand, Ashish
    Guha, Prithwijit
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021 : 9407 - 9414
  • [36] Local self-attention in transformer for visual question answering
    Shen, Xiang
    Han, Dezhi
    Guo, Zihan
    Chen, Chongqing
    Hua, Jie
    Luo, Gaofeng
    APPLIED INTELLIGENCE, 2023, 53 (13) : 16706 - 16723
  • [37] Multi-level Attention Networks for Visual Question Answering
    Yu, Dongfei
    Fu, Jianlong
    Mei, Tao
    Rui, Yong
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 4187 - 4195
  • [38] Erasing-based Attention Learning for Visual Question Answering
    Liu, Fei
    Liu, Jing
    Hong, Richang
    Lu, Hanqing
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 1175 - 1183
  • [39] Word-to-region attention network for visual question answering
    Liang Peng
    Yang Yang
    Yi Bin
    Ning Xie
    Fumin Shen
    Yanli Ji
    Xing Xu
    Multimedia Tools and Applications, 2019, 78 : 3843 - 3858