Interpretable Visual Question Answering via Reasoning Supervision

Cited by: 2

Authors:

Parelli, Maria [1,2]

Mallis, Dimitrios [1]

Diomataris, Markos [1,2]

Pitsikalis, Vassilis [1]

Affiliations:

[1] DeepLab, Athens, Greece

[2] Swiss Federal Institute of Technology, Zurich, Switzerland

Source:

2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP | 2023

Keywords:

Visual Question Answering; Visual Grounding; Interpretability; Attention Similarity

DOI:

10.1109/ICIP49359.2023.10223156

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Transformer-based architectures have recently demonstrated remarkable performance in the Visual Question Answering (VQA) task. However, such models are likely to disregard crucial visual cues and often rely on multimodal shortcuts and inherent biases of the language modality to predict the correct answer, a phenomenon commonly referred to as lack of visual grounding. In this work, we alleviate this shortcoming through a novel architecture for visual question answering that leverages common sense reasoning as a supervisory signal. Reasoning supervision takes the form of a textual justification of the correct answer, with such annotations being already available on large-scale Visual Common Sense Reasoning (VCR) datasets. The model's visual attention is guided toward important elements of the scene through a similarity loss that aligns the learned attention distributions guided by the question and the correct reasoning. We demonstrate both quantitatively and qualitatively that the proposed approach can boost the model's visual perception capability and lead to performance increase, without requiring training on explicit grounding annotations.
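
The abstract describes a similarity loss that aligns two attention distributions over the scene: one guided by the question and one guided by the correct reasoning. The following is a minimal sketch of that idea, not the authors' implementation. The abstract does not state which similarity measure is used, so cosine similarity is an assumption here, and the names `softmax` and `attention_similarity_loss` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw attention scores into a probability distribution."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_similarity_loss(question_scores, reasoning_scores):
    """Hypothetical loss: 1 - cosine similarity of two attention distributions.

    Both inputs are raw attention scores over the same set of visual regions.
    The choice of cosine similarity is an assumption made for illustration.
    """
    if len(question_scores) != len(reasoning_scores):
        raise ValueError("both attention vectors must cover the same regions")
    q = softmax(question_scores)
    r = softmax(reasoning_scores)
    dot = sum(a * b for a, b in zip(q, r))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_r = math.sqrt(sum(b * b for b in r))
    return 1.0 - dot / (norm_q * norm_r)


# Toy example with three visual regions.
aligned = attention_similarity_loss([2.0, 0.1, 0.1], [2.0, 0.1, 0.1])
misaligned = attention_similarity_loss([2.0, 0.1, 0.1], [0.1, 0.1, 2.0])
```

In this toy setup the loss is close to zero when both attention vectors focus on the same region and grows when they focus on different regions, which is the behavior a term of this kind would need in order to pull question-guided attention toward reasoning-guided attention.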

Pages: 2525-2529

Page count: 5

50 items in total

[21] Coarse-to-Fine Reasoning for Visual Question Answering
Nguyen, Binh X.
Tuong Do
Huy Tran
Tjiputra, Erman
Tran, Quang D.
Anh Nguyen
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022, : 4557 - 4565
[22] Multimodal Knowledge Reasoning for Enhanced Visual Question Answering
Hussain, Afzaal
Maqsood, Ifrah
Shahzad, Muhammad
Fraz, Muhammad Moazam
2022 16TH INTERNATIONAL CONFERENCE ON SIGNAL-IMAGE TECHNOLOGY & INTERNET-BASED SYSTEMS, SITIS, 2022, : 224 - 230
[23] Relational reasoning and adaptive fusion for visual question answering
Shen, Xiang
Han, Dezhi
Zong, Liang
Guo, Zihan
Hua, Jie
APPLIED INTELLIGENCE, 2024, 54 (06) : 5062 - 5080
[24] MUREL: Multimodal Relational Reasoning for Visual Question Answering
Cadene, Remi
Ben-younes, Hedi
Cord, Matthieu
Thome, Nicolas
2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 1989 - 1998
[25] Maintaining Reasoning Consistency in Compositional Visual Question Answering
Jing, Chenchen
Jia, Yunde
Wu, Yuwei
Liu, Xinyu
Wu, Qi
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 5089 - 5098
[26] A Diagnostic Study of Visual Question Answering with Analogical Reasoning
Huang, Ziqi
Zhu, Hongyuan
Sun, Ying
Choi, Dongkyu
Tan, Cheston
Lim, Joo-Hwee
2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021, : 2463 - 2467
[27] Visual Question Answering Research on Joint Knowledge and Visual Information Reasoning
Su, Zhenqiang
Gou, Gang
Computer Engineering and Applications, 2024, 60 (05) : 95 - 102
[28] Interpretable Complex Question Answering
Chakrabarti, Soumen
WEB CONFERENCE 2020: PROCEEDINGS OF THE WORLD WIDE WEB CONFERENCE (WWW 2020), 2020, : 2455 - 2457
[29] Visual-Guided Reasoning Path Generation for Visual Question Answering
Liu, Xinyu
Jing, Chenchen
Zhang, Mingliang
Wu, Yuwei
Jia, Yunde
PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT 1, 2025, 15031 : 167 - 180
[30] Detection-Based Intermediate Supervision For Visual Question Answering
Liu, Yuhang
Peng, Daowan
Wei, Wei
Fu, Yuanyuan
Xie, Wenfeng
Chen, Dangyang
THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 12, 2024, : 14061 - 14068