VCD: Visual Causality Discovery for Cross-Modal Question Reasoning

Cited by: 0
Authors
Liu, Yang [1 ]
Tan, Ying [1 ]
Luo, Jingzhou [1 ]
Chen, Weixing [1 ]
Affiliation
[1] Sun Yat-sen University, Guangzhou, People's Republic of China
Keywords
Visual Question Answering; Visual-Linguistic; Causal Inference
DOI
10.1007/978-981-99-8540-1_25
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Existing visual question reasoning methods usually fail to explicitly discover the inherent causal mechanisms and do not jointly model cross-modal event temporality and causality. In this paper, we propose a visual question reasoning framework named Cross-Modal Question Reasoning (CMQR) that discovers temporal causal structure and mitigates visual spurious correlations through causal intervention. To explicitly discover the visual causal structure, the Visual Causality Discovery (VCD) architecture is proposed to temporally locate question-critical scenes and disentangle visual spurious correlations with an attention-based front-door causal intervention module, the Local-Global Causal Attention Module (LGCAM). To align the fine-grained interactions between linguistic semantics and spatial-temporal representations, we build an Interactive Visual-Linguistic Transformer (IVLT) that models the multi-modal co-occurrence interactions between visual and linguistic content. Extensive experiments on four datasets demonstrate the superiority of CMQR in discovering visual causal structures and achieving robust question reasoning. The supplementary file is available at https://github.com/YangLiu9208/VCD/blob/main/0793_supp.pdf.
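Note: the abstract's central computational idea is an attention-based approximation of the front-door adjustment, P(Y|do(X)) = Σ_m P(m|X) Σ_{x'} P(Y|x', m) P(x'), realized by fusing question-critical local features with global video context. The PyTorch sketch below illustrates one plausible local-global attention fusion of this kind; the class name, dimensions, and fusion scheme are illustrative assumptions, not the authors' released LGCAM implementation.

```python
# Minimal, hypothetical sketch of a local-global attention fusion block
# in the spirit of the LGCAM described in the abstract: local
# question-critical features attend over global video-level features,
# and the attended context is fused back into the local representation.
import torch
import torch.nn as nn

class LocalGlobalCausalAttention(nn.Module):
    """Illustrative local-global attention block; dimensions are assumptions."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.query = nn.Linear(dim, dim)    # projects local (question-critical) features
        self.key = nn.Linear(dim, dim)      # projects global (video-level) features
        self.value = nn.Linear(dim, dim)
        self.out = nn.Linear(2 * dim, dim)  # fuses local features with attended global context

    def forward(self, local_feat: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        # local_feat:  (B, T_l, D) question-critical segment features
        # global_feat: (B, T_g, D) whole-video features
        q = self.query(local_feat)                         # (B, T_l, D)
        k = self.key(global_feat)                          # (B, T_g, D)
        v = self.value(global_feat)                        # (B, T_g, D)
        # scaled dot-product attention of local queries over global keys
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        context = attn @ v                                 # (B, T_l, D) attended global context
        return self.out(torch.cat([local_feat, context], dim=-1))

# Usage with toy shapes: batch of 2, 8 local frames, 32 global frames, dim 512
lgcam = LocalGlobalCausalAttention(dim=512)
fused = lgcam(torch.randn(2, 8, 512), torch.randn(2, 32, 512))
print(fused.shape)  # torch.Size([2, 8, 512])
```

In this reading, the attention weights loosely play the role of P(m|X) and the attended global context stands in for the outer expectation over visual confounders; the actual LGCAM may differ in structure and training details.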
Pages: 309-322
Number of pages: 14
Related Papers
50 records in total
  • [1] Cross-modal Relational Reasoning Network for Visual Question Answering
    Chen, Hongyu
    Liu, Ruifang
    Peng, Bo
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 3939 - 3948
  • [2] Reasoning on the Relation: Enhancing Visual Representation for Visual Question Answering and Cross-Modal Retrieval
    Yu, Jing
    Zhang, Weifeng
    Lu, Yuhang
    Qin, Zengchang
    Hu, Yue
    Tan, Jianlong
    Wu, Qi
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2020, 22 (12) : 3196 - 3209
  • [3] Cross-modal knowledge reasoning for knowledge-based visual question answering
    Yu, Jing
    Zhu, Zihao
    Wang, Yujing
    Zhang, Weifeng
    Hu, Yue
    Tan, Jianlong
    [J]. PATTERN RECOGNITION, 2020, 108
  • [4] Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering
    Liu, Yang
    Li, Guanbin
    Lin, Liang
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (10) : 11624 - 11641
  • [5] Causality and Cross-Modal Integration
    Schutz, Michael
    Kubovy, Michael
    [J]. JOURNAL OF EXPERIMENTAL PSYCHOLOGY-HUMAN PERCEPTION AND PERFORMANCE, 2009, 35 (06) : 1791 - 1810
  • [6] Explicit Cross-Modal Representation Learning for Visual Commonsense Reasoning
    Zhang, Xi
    Zhang, Feifei
    Xu, Changsheng
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 2986 - 2997
  • [7] Cross-Modal Visual Question Answering for Remote Sensing Data
    Felix, Rafael
    Repasky, Boris
    Hodge, Samuel
    Zolfaghari, Reza
    Abbasnejad, Ehsan
    Sherrah, Jamie
    [J]. 2021 INTERNATIONAL CONFERENCE ON DIGITAL IMAGE COMPUTING: TECHNIQUES AND APPLICATIONS (DICTA 2021), 2021, : 57 - 65
  • [8] Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering
    Zhu, Zihao
    Yu, Jing
    Wang, Yujing
    Sun, Yajing
    Hu, Yue
    Wu, Qi
    [J]. PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, : 1097 - 1103
  • [9] Cross-modal attention guided visual reasoning for referring image segmentation
    Zhang, Wenjing
    Hu, Mengnan
    Tan, Quange
    Zhou, Qianli
    Wang, Rong
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (19) : 28853 - 28872