VCD: Visual Causality Discovery for Cross-Modal Question Reasoning

Cited by: 0
Authors
Liu, Yang [1 ]
Tan, Ying [1 ]
Luo, Jingzhou [1 ]
Chen, Weixing [1 ]
Institutions
[1] Sun Yat Sen Univ, Guangzhou, Peoples R China
Keywords
Visual Question Answering; Visual-linguistic; Causal Inference;
DOI
10.1007/978-981-99-8540-1_25
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Existing visual question reasoning methods usually fail to explicitly discover the inherent causal mechanism, and they ignore jointly modeling cross-modal event temporality and causality. In this paper, we propose a visual question reasoning framework named Cross-Modal Question Reasoning (CMQR) to discover temporal causal structure and mitigate visual spurious correlation through causal intervention. To explicitly discover visual causal structure, we propose the Visual Causality Discovery (VCD) architecture, which temporally localizes the question-critical scenes and disentangles visual spurious correlations via an attention-based front-door causal intervention module named the Local-Global Causal Attention Module (LGCAM). To align the fine-grained interactions between linguistic semantics and spatial-temporal representations, we construct an Interactive Visual-Linguistic Transformer (IVLT) that models the multi-modal co-occurrence interactions between visual and linguistic content. Extensive experiments on four datasets demonstrate the superiority of CMQR in discovering visual causal structures and achieving robust question reasoning. The supplementary file is available at https://github.com/YangLiu9208/VCD/blob/main/0793_supp.pdf.
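The abstract describes LGCAM as an attention module that fuses local (question-critical) visual features with global (whole-video) context. A minimal illustrative sketch of such a local-global attention fusion is given below; the function name, feature shapes, and residual fusion are assumptions for illustration, not the authors' actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_global_causal_attention(local_feats, global_feats):
    """Hypothetical LGCAM-style fusion: each local (clip-level) feature
    attends over global (video-level) features; the attended global
    context is fused back via a residual connection.

    local_feats:  (num_local, dim)
    global_feats: (num_global, dim)
    returns:      (num_local, dim)
    """
    dim = local_feats.shape[-1]
    # scaled dot-product attention scores: local queries vs. global keys
    scores = local_feats @ global_feats.T / np.sqrt(dim)
    attn = softmax(scores, axis=-1)          # rows sum to 1
    attended = attn @ global_feats           # global context per local query
    return local_feats + attended            # residual fusion
```

This sketch only shows the local-global attention pooling; the paper's actual module additionally realizes a front-door causal intervention, for which the reader should consult the paper and supplementary file.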
Pages: 309 - 322
Page count: 14
Related Papers
50 total
  • [31] Cross-modal exogenous visual selective attention
    Zhao, C
    Yang, H
    Zhang, K
    INTERNATIONAL JOURNAL OF PSYCHOLOGY, 2000, 35 (3-4) : 100 - 100
  • [32] Informative Visual Storytelling with Cross-modal Rules
    Li, Jiacheng
    Shi, Haizhou
    Tang, Siliang
    Wu, Fei
    Zhuang, Yueting
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 2314 - 2322
  • [33] TACTUAL AND VISUAL INTERPOLATION - A CROSS-MODAL COMPARISON
    CHURCHILL, AV
    CANADIAN JOURNAL OF PSYCHOLOGY, 1960, 14 (03): : 183 - 190
  • [34] Unified Transformer with Cross-Modal Mixture Experts for Remote-Sensing Visual Question Answering
    Liu, Gang
    He, Jinlong
    Li, Pengfei
    Zhong, Shenjun
    Li, Hongyang
    He, Genrong
    REMOTE SENSING, 2023, 15 (19)
  • [35] Asymmetric cross-modal attention network with multimodal augmented mixup for medical visual question answering
    Li, Yong
    Yang, Qihao
    Wang, Fu Lee
    Lee, Lap-Kei
    Qu, Yingying
    Hao, Tianyong
    ARTIFICIAL INTELLIGENCE IN MEDICINE, 2023, 144
  • [36] Causality-Invariant Interactive Mining for Cross-Modal Similarity Learning
    Yan, Jiexi
    Deng, Cheng
    Huang, Heng
    Liu, Wei
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (09) : 6216 - 6230
  • [37] Lightweight recurrent cross-modal encoder for video question answering
    Immanuel, Steve Andreas
    Jeong, Cheol
    KNOWLEDGE-BASED SYSTEMS, 2023, 276
  • [38] Cross-modal integration of natural visual and auditory stimuli
    Tichacek, K
    Onat, S
    König, P
    JOURNAL OF COGNITIVE NEUROSCIENCE, 2005, : 82 - 82
  • [39] The role of visual stimuli in cross-modal Stroop interference
    Lutfi-Proctor, Danielle A.
    Elliott, Emily M.
    Cowan, Nelson
    PSYCH JOURNAL, 2014, 3 (01) : 17 - 29
  • [40] The role of visual experience in the emergence of cross-modal correspondences
    Hamilton-Fletcher, Giles
    Pisanski, Katarzyna
    Reby, David
    Stefanczyk, Michal
    Ward, Jamie
    Sorokowska, Agnieszka
    COGNITION, 2018, 175 : 114 - 121