VCD: Visual Causality Discovery for Cross-Modal Question Reasoning

Cited: 0
Authors
Liu, Yang [1 ]
Tan, Ying [1 ]
Luo, Jingzhou [1 ]
Chen, Weixing [1 ]
Affiliations
[1] Sun Yat Sen Univ, Guangzhou, Peoples R China
Keywords
Visual Question Answering; Visual-Linguistic; Causal Inference
DOI
10.1007/978-981-99-8540-1_25
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Existing visual question reasoning methods usually fail to explicitly discover the inherent causal mechanisms and ignore the joint modeling of cross-modal event temporality and causality. In this paper, we propose a visual question reasoning framework named Cross-Modal Question Reasoning (CMQR) to discover temporal causal structure and mitigate visual spurious correlations by causal intervention. To explicitly discover visual causal structure, a Visual Causality Discovery (VCD) architecture is proposed to temporally locate question-critical scenes and disentangle visual spurious correlations via an attention-based front-door causal intervention module named the Local-Global Causal Attention Module (LGCAM). To align the fine-grained interactions between linguistic semantics and spatial-temporal representations, we build an Interactive Visual-Linguistic Transformer (IVLT) that models the multi-modal co-occurrence interactions between visual and linguistic content. Extensive experiments on four datasets demonstrate the superiority of CMQR in discovering visual causal structures and achieving robust question reasoning. The supplementary file is available at https://github.com/YangLiu9208/VCD/blob/main/0793_supp.pdf.
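The abstract describes an attention-based front-door intervention that combines in-sample (local) attention over the current clip with cross-sample (global) attention over features drawn from the whole training set. A minimal NumPy sketch of that idea is given below; it is an illustration based only on the abstract, not the authors' implementation, and the function names, the global feature dictionary, and the concatenation-based fusion are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, keys, values):
    # standard scaled dot-product attention
    scores = query @ keys.T / np.sqrt(keys.shape[-1])
    return softmax(scores, axis=-1) @ values

def local_global_causal_attention(local_feats, global_dict, question):
    """Illustrative front-door-style attention (assumed design):
    the local branch attends over the current video's own frame
    features, while the global branch attends over a dictionary of
    features pooled across the training set, approximating the
    intervention by averaging out clip-specific confounders."""
    f_local = attention(question, local_feats, local_feats)   # in-sample
    f_global = attention(question, global_dict, global_dict)  # cross-sample
    # fuse the two estimates; concatenation is one simple choice
    return np.concatenate([f_local, f_global], axis=-1)

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 64))         # question embedding
local = rng.normal(size=(16, 64))    # frame features of the current video
glob = rng.normal(size=(128, 64))    # global dictionary (e.g. cluster centroids)
out = local_global_causal_attention(local, glob, q)
print(out.shape)  # (1, 128)
```

In practice the global dictionary would be built offline (for example by clustering visual features over the training set), and the fused representation would feed the downstream cross-modal transformer.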
Pages: 309-322
Page count: 14
Related Papers
50 in total
  • [21] CROSS-MODAL CONGRUITY - VISUAL AND OLFACTORY
    HENION, KE
    JOURNAL OF SOCIAL PSYCHOLOGY, 1970, 81(1): 15-&
  • [22] Visual Contextual Semantic Reasoning for Cross-Modal Drone Image-Text Retrieval
    Huang, Jinghao
    Chen, Yaxiong
    Xiong, Shengwu
    Lu, Xiaoqiang
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62
  • [23] Cross-Modal Feature Distribution Calibration for Few-Shot Visual Question Answering
    Zhang, Jing
    Liu, Xiaoqiang
    Chen, Mingzhe
    Wang, Zhe
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38, NO 7, 2024: 7151-7159
  • [24] Cross-Modal Multistep Fusion Network With Co-Attention for Visual Question Answering
    Lao, Mingrui
    Guo, Yanming
    Wang, Hui
    Zhang, Xin
    IEEE ACCESS, 2018, 6: 31516-31524
  • [25] Contrastive Cross-Modal Representation Learning Based Active Learning for Visual Question Answer
    Zhang B.-C.
    Li L.
    Zha Z.-J.
    Huang Q.-M.
    Jisuanji Xuebao/Chinese Journal of Computers, 2022, 45(8): 1730-1745
  • [26] Cross-modal interactions in auditory and visual discrimination
    Marks, LE
    Ben-Artzi, E
    Lakatos, S
    INTERNATIONAL JOURNAL OF PSYCHOPHYSIOLOGY, 2003, 50(1-2): 125-145
  • [27] Auditory, visual, and cross-modal negative priming
    Axel Buchner
    Anouk Zabal
    Susanne Mayr
    Psychonomic Bulletin & Review, 2003, 10: 917-923
  • [28] A Survey of Cross-Modal Visual Content Generation
    Nazarieh F.
    Feng Z.
    Awais M.
    Wang W.
    Kittler J.
    IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(8): 1-1
  • [29] Learning Visual Locomotion with Cross-Modal Supervision
    Loquercio, Antonio
    Kumar, Ashish
    Malik, Jitendra
    2023 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2023), 2023: 7295-7302
  • [30] Auditory, visual, and cross-modal negative priming
    Buchner, A
    Zabal, A
    Mayr, S
    PSYCHONOMIC BULLETIN & REVIEW, 2003, 10(4): 917-923