Variational Causal Inference Network for Explanatory Visual Question Answering

Cited by: 1
Authors
Xue, Dizhan [1, 2]
Qian, Shengsheng [1, 2]
Xu, Changsheng [1, 2, 3]
Affiliations
[1] Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China
DOI
10.1109/ICCV51070.2023.00238
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Explanatory Visual Question Answering (EVQA) is a recently proposed multimodal reasoning task that requires answering visual questions and generating multimodal explanations for the reasoning processes. Unlike traditional Visual Question Answering (VQA), which focuses solely on answering, EVQA aims to provide user-friendly explanations to enhance the explainability and credibility of reasoning models. However, existing EVQA methods typically predict the answer and explanation separately, which ignores the causal correlation between them. Moreover, they neglect the complex relationships among question words, visual regions, and explanation tokens. To address these issues, we propose a Variational Causal Inference Network (VCIN) that establishes the causal correlation between predicted answers and explanations, and captures cross-modal relationships to generate rational explanations. First, we utilize a vision-and-language pretrained model to extract visual features and question features. Second, we propose a multimodal explanation gating transformer that constructs cross-modal relationships and generates rational explanations. Finally, we propose a variational causal inference to establish the target causal structure and predict the answers. Comprehensive experiments demonstrate the superiority of VCIN over state-of-the-art EVQA methods.
Pages: 2515-2525
Page count: 11
Related papers
50 in total
  • [41] Multi-modality Latent Interaction Network for Visual Question Answering
    Gao, Peng
    You, Haoxuan
    Zhang, Zhanpeng
    Wang, Xiaogang
    Li, Hongsheng
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019: 5824-5834
  • [42] Language-Guided Visual Aggregation Network for Video Question Answering
    Liang, Xiao
    Wang, Di
    Wang, Quan
    Wan, Bo
    An, Lingling
    He, Lihuo
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023: 5195-5203
  • [43] Mutual Attention Inception Network for Remote Sensing Visual Question Answering
    Zheng, Xiangtao
    Wang, Binqiang
    Du, Xingqian
    Lu, Xiaoqiang
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2022, 60
  • [44] ARDN: Attention Re-distribution Network for Visual Question Answering
    Yi, Jinyang
    Han, Dezhi
    Chen, Chongqing
    Shen, Xiang
    Zong, Liang
    [J]. ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, 2024
  • [45] Auto-Parsing Network for Image Captioning and Visual Question Answering
    Yang, Xu
    Gao, Chongyang
    Zhang, Hanwang
    Cai, Jianfei
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021: 2177-2187
  • [46] A Spatial Hierarchical Reasoning Network for Remote Sensing Visual Question Answering
    Zhang, Zixiao
    Jiao, Licheng
    Li, Lingling
    Liu, Xu
    Chen, Puhua
    Liu, Fang
    Li, Yuxuan
    Guo, Zhicheng
    [J]. IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2023, 61
  • [47] Co-attention graph convolutional network for visual question answering
    Liu, Chuan
    Tan, Ying-Ying
    Xia, Tian-Tian
    Zhang, Jiajing
    Zhu, Ming
    [J]. MULTIMEDIA SYSTEMS, 2023, 29(05): 2527-2543
  • [48] More Than An Answer: Neural Pivot Network for Visual Question Answering
    Zhou, Yiyi
    Ji, Rongrong
    Su, Jinsong
    Wu, Yongjian
    Wu, Yunsheng
    [J]. PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017: 681-689
  • [49] Cascade Reasoning Network for Text-based Visual Question Answering
    Liu, Fen
    Xu, Guanghui
    Wu, Qi
    Du, Qing
    Jia, Wei
    Tan, Mingkui
    [J]. MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020: 4060-4069
  • [50] Cross-modal Relational Reasoning Network for Visual Question Answering
    Chen, Hongyu
    Liu, Ruifang
    Peng, Bo
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021: 3939-3948