Visual Causal Scene Refinement for Video Question Answering

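Cited by: 5
Authors
Wei, Yushen [1]
Liu, Yang [1]
Yan, Hong [1]
Li, Guanbin [1]
Lin, Liang [1]
Affiliations
[1] Sun Yat-sen University, Guangzhou, People's Republic of China
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China;
Keywords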
Video Question Answering; Causal Reasoning; Cross-Modal;
DOI
10.1145/3581783.3611873
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Existing methods for video question answering (VideoQA) often suffer from spurious correlations between different modalities, leading to a failure in identifying the dominant visual evidence and the intended question. Moreover, these methods function as black boxes, making it difficult to interpret the visual scene during the QA process. In this paper, to discover critical video segments and frames that serve as the visual causal scene for generating reliable answers, we present a causal analysis of VideoQA and propose a framework for cross-modal causal relational reasoning, named Visual Causal Scene Refinement (VCSR). In particular, a set of causal front-door intervention operations is introduced to explicitly find the visual causal scenes at both the segment and frame levels. Our VCSR involves two essential modules: i) the Question-Guided Refiner (QGR) module, which refines consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention; ii) the Causal Scene Separator (CSS) module, which discovers a collection of visual causal and non-causal scenes based on the visual-linguistic causal relevance and estimates the causal effect of the scene-separating intervention in a contrastive learning manner. Extensive experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of our VCSR in discovering visual causal scenes and achieving robust video question answering. The code is available at https://github.com/YangLiu9208/VCSR.
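Note on the front-door intervention mentioned in the abstract: the standard front-door adjustment from causal inference can be written as below. This is the textbook identity only, not the paper's exact formulation. The symbols are illustrative assumptions: X stands for the video-question input, M for a mediator (for example, a refined visual scene representation), and A for the answer.

```latex
% Textbook front-door adjustment (symbols X, M, A are illustrative
% assumptions, not notation taken from the paper).
% The mediator M carries the effect of X on A; x' ranges over
% all possible values of the input.
P\bigl(A \mid \mathrm{do}(X = x)\bigr)
  = \sum_{m} P(M = m \mid X = x)
    \sum_{x'} P(A \mid M = m, X = x')\, P(X = x')
```

The inner sum averages over all inputs x', which blocks the path through an unobserved confounder of X and A. This is why an intervention of this kind is used to counter spurious cross-modal correlations.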
Pages: 377-386
Number of pages: 10