INTERPRETABLE VISUAL QUESTION ANSWERING VIA REASONING SUPERVISION

Cited by: 2
Authors
Parelli, Maria [1 ,2 ]
Mallis, Dimitrios [1 ]
Diomataris, Markos [1 ,2 ]
Pitsikalis, Vassilis [1 ]
Affiliations
[1] DeepLab, Athens, Greece
[2] Swiss Fed Inst Technol, Zurich, Switzerland
Keywords
Visual Question Answering; Visual Grounding; Interpretability; Attention Similarity
DOI
10.1109/ICIP49359.2023.10223156
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline Classification Codes: 081104; 0812; 0835; 1405
Abstract
Transformer-based architectures have recently demonstrated remarkable performance on the Visual Question Answering (VQA) task. However, such models are likely to disregard crucial visual cues and often rely on multimodal shortcuts and inherent biases of the language modality to predict the correct answer, a phenomenon commonly referred to as lack of visual grounding. In this work, we alleviate this shortcoming through a novel architecture for visual question answering that leverages common-sense reasoning as a supervisory signal. Reasoning supervision takes the form of a textual justification of the correct answer; such annotations are already available in large-scale Visual Commonsense Reasoning (VCR) datasets. The model's visual attention is guided toward important elements of the scene through a similarity loss that aligns the learned attention distributions guided by the question and the correct reasoning. We demonstrate both quantitatively and qualitatively that the proposed approach boosts the model's visual perception capability and leads to a performance increase, without requiring training on explicit grounding annotations.
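To make the alignment idea in the abstract concrete, the following is a minimal PyTorch sketch of how an attention-similarity objective could be combined with the usual answer loss. The function name, the KL-divergence form of the similarity term, the stop-gradient on the reasoning-guided target, and the weighting factor `lam` are illustrative assumptions; the abstract states only that a similarity loss aligns the question-guided and reasoning-guided attention distributions, not its exact formulation.

```python
import torch
import torch.nn.functional as F


def attention_alignment_loss(question_attn, reasoning_attn,
                             answer_logits, answer_targets, lam=1.0):
    """Hypothetical sketch of reasoning-supervised attention alignment.

    question_attn  : (B, R) attention over R visual regions, conditioned on the question
    reasoning_attn : (B, R) attention over the same regions, conditioned on the textual
                     justification (the reasoning), used as the supervisory target
    Both are assumed to be valid probability distributions (post-softmax).
    The KL form and the weighting `lam` are assumptions, not the paper's exact loss.
    """
    # Standard answer supervision: cross-entropy over candidate answers
    answer_loss = F.cross_entropy(answer_logits, answer_targets)

    # Alignment term: push the question-guided attention toward the
    # reasoning-guided attention (stop-gradient on the target distribution)
    target = reasoning_attn.detach()
    align_loss = F.kl_div(
        torch.log(question_attn.clamp_min(1e-8)),  # kl_div expects log-probs as input
        target,
        reduction="batchmean",
    )
    return answer_loss + lam * align_loss
```

Under this reading, the textual justifications act purely as a training-time signal: at inference only the question-guided branch is presumably needed, which is consistent with the claim that no explicit grounding annotations are required.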
Pages: 2525-2529 (5 pages)