INTERPRETABLE VISUAL QUESTION ANSWERING VIA REASONING SUPERVISION

Cited by: 2
Authors
Parelli, Maria [1 ,2 ]
Mallis, Dimitrios [1 ]
Diomataris, Markos [1 ,2 ]
Pitsikalis, Vassilis [1 ]
Affiliations
[1] DeepLab, Athens, Greece
[2] Swiss Fed Inst Technol, Zurich, Switzerland
Keywords
Visual Question Answering; Visual Grounding; Interpretability; Attention Similarity
DOI
10.1109/ICIP49359.2023.10223156
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline Classification Codes: 081104; 0812; 0835; 1405
Abstract
Transformer-based architectures have recently demonstrated remarkable performance on the Visual Question Answering (VQA) task. However, such models are likely to disregard crucial visual cues and often rely on multimodal shortcuts and inherent biases of the language modality to predict the correct answer, a phenomenon commonly referred to as lack of visual grounding. In this work, we alleviate this shortcoming through a novel architecture for visual question answering that leverages common-sense reasoning as a supervisory signal. Reasoning supervision takes the form of a textual justification of the correct answer; such annotations are already available in large-scale Visual Commonsense Reasoning (VCR) datasets. The model's visual attention is guided toward important elements of the scene through a similarity loss that aligns the learned attention distributions guided by the question and the correct reasoning. We demonstrate both quantitatively and qualitatively that the proposed approach boosts the model's visual perception capability and leads to a performance increase, without requiring training on explicit grounding annotations.
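To make the alignment idea in the abstract concrete, the following is a minimal PyTorch sketch of how an attention-similarity objective could be combined with the usual answer loss. The function name, the KL-divergence form of the similarity term, the stop-gradient on the reasoning-guided target, and the weighting factor `lam` are illustrative assumptions; the abstract states only that a similarity loss aligns the question-guided and reasoning-guided attention distributions, not its exact formulation.

```python
import torch
import torch.nn.functional as F


def attention_alignment_loss(question_attn, reasoning_attn,
                             answer_logits, answer_targets, lam=1.0):
    """Hypothetical sketch of reasoning-supervised attention alignment.

    question_attn  : (B, R) attention over R visual regions, conditioned on the question
    reasoning_attn : (B, R) attention over the same regions, conditioned on the textual
                     justification (the reasoning), used as the supervisory target
    Both are assumed to be valid probability distributions (post-softmax).
    The KL form and the weighting `lam` are assumptions, not the paper's exact loss.
    """
    # Standard answer supervision: cross-entropy over candidate answers
    answer_loss = F.cross_entropy(answer_logits, answer_targets)

    # Alignment term: push the question-guided attention toward the
    # reasoning-guided attention (stop-gradient on the target distribution)
    target = reasoning_attn.detach()
    align_loss = F.kl_div(
        torch.log(question_attn.clamp_min(1e-8)),  # kl_div expects log-probs as input
        target,
        reduction="batchmean",
    )
    return answer_loss + lam * align_loss
```

Under this reading, the textual justifications act purely as a training-time signal: at inference only the question-guided branch is presumably needed, which is consistent with the claim that no explicit grounding annotations are required.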
Pages: 2525-2529 (5 pages)