Variational Causal Inference Network for Explanatory Visual Question Answering

Cited by: 1

Authors:

Xue, Dizhan [1,2]

Qian, Shengsheng [1,2]

Xu, Changsheng [1,2,3]

Affiliations:

[1] Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence, Beijing, Peoples R China

[2] Univ Chinese Acad Sci, Beijing, Peoples R China

[3] Peng Cheng Lab, Shenzhen, Peoples R China

Source:

2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV | 2023

Funding:

Beijing Natural Science Foundation; National Natural Science Foundation of China

Keywords:

DOI:

10.1109/ICCV51070.2023.00238

Chinese Library Classification:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Explanatory Visual Question Answering (EVQA) is a recently proposed multimodal reasoning task that requires answering visual questions and generating multimodal explanations for the reasoning processes. Unlike traditional Visual Question Answering (VQA), which focuses solely on answering, EVQA aims to provide user-friendly explanations to enhance the explainability and credibility of reasoning models. However, existing EVQA methods typically predict the answer and explanation separately, which ignores the causal correlation between them. Moreover, they neglect the complex relationships among question words, visual regions, and explanation tokens. To address these issues, we propose a Variational Causal Inference Network (VCIN) that establishes the causal correlation between predicted answers and explanations, and captures cross-modal relationships to generate rational explanations. First, we utilize a vision-and-language pretrained model to extract visual features and question features. Second, we propose a multimodal explanation gating transformer that constructs cross-modal relationships and generates rational explanations. Finally, we propose a variational causal inference to establish the target causal structure and predict the answers. Comprehensive experiments demonstrate the superiority of VCIN over state-of-the-art EVQA methods.


Pages: 2515-2525

Page count: 11
