Visual Causal Scene Refinement for Video Question Answering

Cited by: 5

Authors:

Wei, Yushen [1]

Liu, Yang [1]

Yan, Hong [1]

Li, Guanbin [1]

Lin, Liang [1]

Affiliation:

[1] Sun Yat-sen University, Guangzhou, People's Republic of China

Source:

PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023

Funding:

National Key Research and Development Program of China; National Natural Science Foundation of China;

Keywords:

Video Question Answering; Causal Reasoning; Cross-Modal;

DOI:

10.1145/3581783.3611873

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Existing methods for video question answering (VideoQA) often suffer from spurious correlations between different modalities, leading to a failure in identifying the dominant visual evidence and the intended question. Moreover, these methods function as black boxes, making it difficult to interpret the visual scene during the QA process. In this paper, to discover critical video segments and frames that serve as the visual causal scene for generating reliable answers, we present a causal analysis of VideoQA and propose a framework for cross-modal causal relational reasoning, named Visual Causal Scene Refinement (VCSR). In particular, a set of causal front-door intervention operations is introduced to explicitly find the visual causal scenes at both the segment and frame levels. Our VCSR involves two essential modules: i) the Question-Guided Refiner (QGR) module, which refines consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention; ii) the Causal Scene Separator (CSS) module, which discovers a collection of visual causal and non-causal scenes based on the visual-linguistic causal relevance and estimates the causal effect of the scene-separating intervention in a contrastive learning manner. Extensive experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of our VCSR in discovering visual causal scenes and achieving robust video question answering. The code is available at https://github.com/YangLiu9208/VCSR.
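
For readers unfamiliar with the front-door intervention mentioned in the abstract, the general front-door adjustment formula from causal inference is sketched below. This is the textbook form, not the paper's specific formulation. The symbols are assumptions introduced here for illustration: X stands for the input (for example, the video-question pair), M for a mediator through which X affects the answer (for example, the refined scene features), and Y for the answer.

```latex
% Textbook front-door adjustment (illustrative; X, M, Y are assumed symbols,
% not notation taken from the paper).
P\bigl(Y = y \mid do(X = x)\bigr)
  = \sum_{m} P(M = m \mid X = x)
    \sum_{x'} P(Y = y \mid M = m, X = x')\, P(X = x')
```

The inner sum averages over all input values x', which is what removes the influence of an unobserved confounder between X and Y, provided the mediator M satisfies the front-door criterion.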


Pages: 377-386

Number of pages: 10
