Variational Causal Inference Network for Explanatory Visual Question Answering

Cited: 1
Authors
Xue, Dizhan [1 ,2 ]
Qian, Shengsheng [1 ,2 ]
Xu, Changsheng [1 ,2 ,3 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China
DOI
10.1109/ICCV51070.2023.00238
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Explanatory Visual Question Answering (EVQA) is a recently proposed multimodal reasoning task that requires answering visual questions and generating multimodal explanations for the reasoning processes. Unlike traditional Visual Question Answering (VQA), which focuses solely on answering, EVQA aims to provide user-friendly explanations that enhance the explainability and credibility of reasoning models. However, existing EVQA methods typically predict the answer and explanation separately, ignoring the causal correlation between them; they also neglect the complex relationships among question words, visual regions, and explanation tokens. To address these issues, we propose a Variational Causal Inference Network (VCIN) that establishes the causal correlation between predicted answers and explanations and captures cross-modal relationships to generate rational explanations. First, we utilize a vision-and-language pretrained model to extract visual and question features. Second, we propose a multimodal explanation gating transformer that constructs cross-modal relationships and generates rational explanations. Finally, we propose a variational causal inference method that establishes the target causal structure and predicts the answers. Comprehensive experiments demonstrate the superiority of VCIN over state-of-the-art EVQA methods.
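The abstract outlines a three-stage pipeline: pretrained vision-and-language feature extraction, a gating transformer that fuses question and visual context while decoding the explanation, and a variational answer head conditioned on the generated explanation. The PyTorch sketch below is a minimal, hypothetical rendering of those stages; every module name, dimension, the sigmoid gate, and the reparameterized latent are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the pipeline described in the abstract
# (illustrative assumptions throughout; not the paper's code).
import torch
import torch.nn as nn


class GatingTransformerSketch(nn.Module):
    """Explanation decoder: a per-position sigmoid gate blends visual and
    question context into the memory of a standard Transformer decoder layer."""

    def __init__(self, d=256, heads=4):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)
        self.decoder = nn.TransformerDecoderLayer(d_model=d, nhead=heads,
                                                  batch_first=True)

    def forward(self, expl_tokens, vis_feats, q_feats):
        # Gate in [0, 1] decides how much each position draws on visual
        # versus question context (an assumption about the gating mechanism).
        g = torch.sigmoid(self.gate(torch.cat([vis_feats, q_feats], dim=-1)))
        memory = g * vis_feats + (1.0 - g) * q_feats
        return self.decoder(expl_tokens, memory)


class VariationalAnswerHeadSketch(nn.Module):
    """Answer head: sample a latent z from a Gaussian posterior conditioned on
    the explanation summary (reparameterization trick), then classify."""

    def __init__(self, d=256, n_answers=1000):
        super().__init__()
        self.to_mu = nn.Linear(d, d)
        self.to_logvar = nn.Linear(d, d)
        self.classifier = nn.Linear(d, n_answers)

    def forward(self, expl_summary):
        mu, logvar = self.to_mu(expl_summary), self.to_logvar(expl_summary)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # KL(q(z | explanation) || N(0, I)) regularizer for an ELBO-style loss.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return self.classifier(z), kl


# Toy forward pass with random stand-ins for the pretrained VL features.
vis = torch.randn(2, 36, 256)   # e.g. 36 region features per image
q = torch.randn(2, 36, 256)     # question features aligned to the regions
expl = torch.randn(2, 20, 256)  # explanation token embeddings
expl_out = GatingTransformerSketch()(expl, vis, q)
logits, kl = VariationalAnswerHeadSketch()(expl_out.mean(dim=1))
print(logits.shape, kl.item())  # torch.Size([2, 1000]) and a scalar
```

Training such a sketch would plausibly combine the answer cross-entropy, an explanation generation loss, and the KL term into an ELBO-style objective; the paper's exact losses and causal-structure construction may differ.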
Pages: 2515-2525
Page count: 11
Related papers
50 records in total
  • [1] Affective Visual Question Answering Network
    Ruwa, Nelson
    Mao, Qirong
    Wang, Liangjun
    Dong, Ming
    [J]. IEEE 1ST CONFERENCE ON MULTIMEDIA INFORMATION PROCESSING AND RETRIEVAL (MIPR 2018), 2018, : 170 - 173
  • [2] Visual Causal Scene Refinement for Video Question Answering
    Wei, Yushen
    Liu, Yang
    Yan, Hong
    Li, Guanbin
    Lin, Liang
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 377 - 386
  • [3] Deconfounded Visual Question Generation with Causal Inference
    Chen, Jiali
    Guo, Zhenjun
    Xie, Jiayuan
    Cai, Yi
    Li, Qing
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 5132 - 5142
  • [4] An Answer FeedBack Network for Visual Question Answering
    Tian, Weidong
    Tian, Ruihua
    Zhao, Zhongqiu
    Ren, Quan
    [J]. 2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023
  • [5] Overcoming Language Priors with Counterfactual Inference for Visual Question Answering
    Ren, Zhibo
    Wang, Huizhen
    Zhu, Muhua
    Wang, Yichao
    Xiao, Tong
    Zhu, Jingbo
    [J]. CHINESE COMPUTATIONAL LINGUISTICS, CCL 2023, 2023, 14232 : 58 - 71
  • [6] VIBIKNet: Visual Bidirectional Kernelized Network for Visual Question Answering
    Bolanos, Marc
    Peris, Alvaro
    Casacuberta, Francisco
    Radeva, Petia
    [J]. PATTERN RECOGNITION AND IMAGE ANALYSIS (IBPRIA 2017), 2017, 10255 : 372 - 380
  • [7] Co-Attention Network With Question Type for Visual Question Answering
    Yang, Chao
    Jiang, Mengqi
    Jiang, Bin
    Zhou, Weixin
    Li, Keqin
    [J]. IEEE ACCESS, 2019, 7 : 40771 - 40781
  • [8] An Inference Mechanism for Question Answering
    Roger, S.
    Ferrandez, A.
    Peral, J.
    Ferrandez, S.
    Lopez-Moreno, P.
    [J]. JOURNAL OF COMPUTER SCIENCE & TECHNOLOGY, 2007, 7 (01): : 21 - 27
  • [9] Triple attention network for sentimental visual question answering
    Ruwa, Nelson
    Mao, Qirong
    Song, Heping
    Jia, Hongjie
    Dong, Ming
    [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2019, 189
  • [10] ADAPTIVE ATTENTION FUSION NETWORK FOR VISUAL QUESTION ANSWERING
    Gu, Geonmo
    Kim, Seong Tae
    Ro, Yong Man
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2017, : 997 - 1002