Guiding Visual Question Answering with Attention Priors

Cited by: 2
Authors
Le, Thao Minh [1]
Le, Vuong [1]
Gupta, Sunil [1]
Venkatesh, Svetha [1]
Tran, Truyen [1]
Affiliations
[1] Deakin Univ, Appl Artificial Intelligence Inst, Geelong, Vic, Australia
Keywords
HUMANS
DOI
10.1109/WACV56688.2023.00436
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The current success of modern visual reasoning systems is arguably attributed to cross-modality attention mechanisms. However, in deliberative reasoning such as in VQA, attention is unconstrained at each step, and thus may serve as a statistical pooling mechanism rather than a semantic operation intended to select information relevant to inference. This is because at training time, attention is only guided by a very sparse signal (i.e. the answer label) at the end of the inference chain. This causes the cross-modality attention weights to deviate from the desired visual-language bindings. To rectify this deviation, we propose to guide the attention mechanism using explicit linguistic-visual grounding. This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects. Here we learn the grounding from the pairing of questions and images alone, without the need for answer annotation or external grounding supervision. This grounding guides the attention mechanism inside VQA models through a duality of mechanisms: pre-training attention weight calculation and directly guiding the weights at inference time on a case-by-case basis. The resultant algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process. This scalable enhancement improves the performance of VQA models, fortifies their robustness to limited access to supervised data, and increases interpretability.
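The abstract names two ways a grounding prior can act on attention: shaping the weights during training and adjusting them directly at inference. The sketch below illustrates that general idea only. The abstract gives no formulas, so everything here is an assumption: the KL-divergence regularizer, the convex mixing rule, the mixing coefficient `gamma`, and the function names `kl_divergence` and `guide_attention` are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Turn raw attention scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def guide_attention(attn, prior, gamma=0.5):
    """Hypothetical inference-time guidance: convex mix of attention and prior.

    gamma = 0 leaves the model attention unchanged; gamma = 1 uses the prior.
    """
    mixed = [(1.0 - gamma) * a + gamma * p for a, p in zip(attn, prior)]
    total = sum(mixed)
    return [m / total for m in mixed]


# Model attention over three visual objects (illustrative scores).
attn = softmax([2.0, 0.5, 0.1])

# Assumed grounding prior: the queried concept most likely refers to object 1.
prior = [0.1, 0.8, 0.1]

# Training-time use (assumption): penalize attention that strays from the prior.
reg_loss = kl_divergence(prior, attn)

# Inference-time use (assumption): pull the attention toward the prior.
guided = guide_attention(attn, prior, gamma=0.5)
```

In this toy setup the model initially attends mostly to object 0, while the prior favors object 1; the regularizer is positive because the two distributions disagree, and the guided weights shift mass toward object 1 while still summing to one.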
Pages: 4370-4379
Number of pages: 10