Variational Causal Inference Network for Explanatory Visual Question Answering

Cited: 1
Authors
Xue, Dizhan [1 ,2 ]
Qian, Shengsheng [1 ,2 ]
Xu, Changsheng [1 ,2 ,3 ]
Affiliations
[1] Chinese Acad Sci, Inst Automat, State Key Lab Multimodal Artificial Intelligence, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China
DOI
10.1109/ICCV51070.2023.00238
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Explanatory Visual Question Answering (EVQA) is a recently proposed multimodal reasoning task that requires answering visual questions and generating multimodal explanations for the reasoning processes. Unlike traditional Visual Question Answering (VQA), which focuses solely on answering, EVQA aims to provide user-friendly explanations that enhance the explainability and credibility of reasoning models. However, existing EVQA methods typically predict the answer and explanation separately, ignoring the causal correlation between them; they also neglect the complex relationships among question words, visual regions, and explanation tokens. To address these issues, we propose a Variational Causal Inference Network (VCIN) that establishes the causal correlation between predicted answers and explanations and captures cross-modal relationships to generate rational explanations. First, we utilize a vision-and-language pretrained model to extract visual and question features. Second, we propose a multimodal explanation gating transformer that constructs cross-modal relationships and generates rational explanations. Finally, we propose a variational causal inference method that establishes the target causal structure and predicts the answers. Comprehensive experiments demonstrate the superiority of VCIN over state-of-the-art EVQA methods.
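The abstract outlines a three-stage pipeline: pretrained vision-and-language feature extraction, a gating transformer that fuses question and visual context while decoding the explanation, and a variational answer head conditioned on the generated explanation. The PyTorch sketch below is a minimal, hypothetical rendering of those stages; every module name, dimension, the sigmoid gate, and the reparameterized latent are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of the pipeline described in the abstract
# (illustrative assumptions throughout; not the paper's code).
import torch
import torch.nn as nn


class GatingTransformerSketch(nn.Module):
    """Explanation decoder: a per-position sigmoid gate blends visual and
    question context into the memory of a standard Transformer decoder layer."""

    def __init__(self, d=256, heads=4):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)
        self.decoder = nn.TransformerDecoderLayer(d_model=d, nhead=heads,
                                                  batch_first=True)

    def forward(self, expl_tokens, vis_feats, q_feats):
        # Gate in [0, 1] decides how much each position draws on visual
        # versus question context (an assumption about the gating mechanism).
        g = torch.sigmoid(self.gate(torch.cat([vis_feats, q_feats], dim=-1)))
        memory = g * vis_feats + (1.0 - g) * q_feats
        return self.decoder(expl_tokens, memory)


class VariationalAnswerHeadSketch(nn.Module):
    """Answer head: sample a latent z from a Gaussian posterior conditioned on
    the explanation summary (reparameterization trick), then classify."""

    def __init__(self, d=256, n_answers=1000):
        super().__init__()
        self.to_mu = nn.Linear(d, d)
        self.to_logvar = nn.Linear(d, d)
        self.classifier = nn.Linear(d, n_answers)

    def forward(self, expl_summary):
        mu, logvar = self.to_mu(expl_summary), self.to_logvar(expl_summary)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # KL(q(z | explanation) || N(0, I)) regularizer for an ELBO-style loss.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return self.classifier(z), kl


# Toy forward pass with random stand-ins for the pretrained VL features.
vis = torch.randn(2, 36, 256)   # e.g. 36 region features per image
q = torch.randn(2, 36, 256)     # question features aligned to the regions
expl = torch.randn(2, 20, 256)  # explanation token embeddings
expl_out = GatingTransformerSketch()(expl, vis, q)
logits, kl = VariationalAnswerHeadSketch()(expl_out.mean(dim=1))
print(logits.shape, kl.item())  # torch.Size([2, 1000]) and a scalar
```

Training such a sketch would plausibly combine the answer cross-entropy, an explanation generation loss, and the KL term into an ELBO-style objective; the paper's exact losses and causal-structure construction may differ.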
Pages: 2515-2525
Page count: 11
Related papers
50 records in total
  • [1] Affective Visual Question Answering Network
    Ruwa, Nelson
    Mao, Qirong
    Wang, Liangjun
    Dong, Ming
    [J]. IEEE 1ST CONFERENCE ON MULTIMEDIA INFORMATION PROCESSING AND RETRIEVAL (MIPR 2018), 2018, : 170 - 173
  • [2] Visual Causal Scene Refinement for Video Question Answering
    Wei, Yushen
    Liu, Yang
    Yan, Hong
    Li, Guanbin
    Lin, Liang
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 377 - 386
  • [3] Deconfounded Visual Question Generation with Causal Inference
    Chen, Jiali
    Guo, Zhenjun
    Xie, Jiayuan
    Cai, Yi
    Li, Qing
    [J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 5132 - 5142
  • [4] An Answer FeedBack Network for Visual Question Answering
    Tian, Weidong
    Tian, Ruihua
    Zhao, Zhongqiu
    Ren, Quan
    [J]. 2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023
  • [5] Overcoming Language Priors with Counterfactual Inference for Visual Question Answering
    Ren, Zhibo
    Wang, Huizhen
    Zhu, Muhua
    Wang, Yichao
    Xiao, Tong
    Zhu, Jingbo
    [J]. CHINESE COMPUTATIONAL LINGUISTICS, CCL 2023, 2023, 14232 : 58 - 71
  • [6] VIBIKNet: Visual Bidirectional Kernelized Network for Visual Question Answering
    Bolanos, Marc
    Peris, Alvaro
    Casacuberta, Francisco
    Radeva, Petia
    [J]. PATTERN RECOGNITION AND IMAGE ANALYSIS (IBPRIA 2017), 2017, 10255 : 372 - 380
  • [7] Co-Attention Network With Question Type for Visual Question Answering
    Yang, Chao
    Jiang, Mengqi
    Jiang, Bin
    Zhou, Weixin
    Li, Keqin
    [J]. IEEE ACCESS, 2019, 7 : 40771 - 40781
  • [8] An Inference Mechanism for Question Answering
    Roger, S.
    Ferrandez, A.
    Peral, J.
    Ferrandez, S.
    Lopez-Moreno, P.
    [J]. JOURNAL OF COMPUTER SCIENCE & TECHNOLOGY, 2007, 7 (01): : 21 - 27
  • [9] Triple attention network for sentimental visual question answering
    Ruwa, Nelson
    Mao, Qirong
    Song, Heping
    Jia, Hongjie
    Dong, Ming
    [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2019, 189
  • [10] ADAPTIVE ATTENTION FUSION NETWORK FOR VISUAL QUESTION ANSWERING
    Gu, Geonmo
    Kim, Seong Tae
    Ro, Yong Man
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2017, : 997 - 1002