Visual Causal Scene Refinement for Video Question Answering

Cited by: 5
Authors
Wei, Yushen [1 ]
Liu, Yang [1 ]
Yan, Hong [1 ]
Li, Guanbin [1 ]
Lin, Liang [1 ]
Affiliations
[1] Sun Yat-sen University, Guangzhou, People's Republic of China
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China;
Keywords
Video Question Answering; Causal Reasoning; Cross-Modal;
DOI
10.1145/3581783.3611873
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Existing methods for video question answering (VideoQA) often suffer from spurious correlations between different modalities, leading to a failure in identifying the dominant visual evidence and the intended question. Moreover, these methods function as black boxes, making it difficult to interpret the visual scene during the QA process. In this paper, to discover critical video segments and frames that serve as the visual causal scene for generating reliable answers, we present a causal analysis of VideoQA and propose a framework for cross-modal causal relational reasoning, named Visual Causal Scene Refinement (VCSR). In particular, a set of causal front-door intervention operations is introduced to explicitly find the visual causal scenes at both segment and frame levels. Our VCSR involves two essential modules: i) the Question-Guided Refiner (QGR) module, which refines consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention; ii) the Causal Scene Separator (CSS) module, which discovers a collection of visual causal and non-causal scenes based on the visual-linguistic causal relevance and estimates the causal effect of the scene-separating intervention in a contrastive learning manner. Extensive experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of our VCSR in discovering visual causal scenes and achieving robust video question answering. The code is available at https://github.com/YangLiu9208/VCSR.
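To make the scene-separation idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation (their code is at the URL above). It assumes, purely for illustration, that segment-question relevance is cosine similarity, that the "causal" scene is the top-k most relevant segments, and that the contrastive objective is an InfoNCE-style loss. The names `separate_scenes` and `info_nce` and all parameters are hypothetical.

```python
import numpy as np

def separate_scenes(segment_feats, question_feat, k):
    """Split segments into assumed 'causal' (top-k by relevance) and 'non-causal' sets."""
    # Assumption: relevance is cosine similarity between each segment and the question.
    seg = segment_feats / np.linalg.norm(segment_feats, axis=1, keepdims=True)
    q = question_feat / np.linalg.norm(question_feat)
    relevance = seg @ q
    order = np.argsort(-relevance)          # indices sorted by descending relevance
    causal_idx = np.sort(order[:k])         # keep temporal order inside each set
    non_causal_idx = np.sort(order[k:])
    return causal_idx, non_causal_idx, relevance

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: pull the anchor toward the positive, away from negatives."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives])
    logits = logits / temperature
    logits = logits - logits.max()          # subtract the max for numerical stability
    log_prob = logits[0] - np.log(np.exp(logits).sum())
    return -log_prob

# Toy data: 6 segments with 8-dim features; segment 2 is built to match the question.
rng = np.random.default_rng(0)
question = rng.normal(size=8)
segments = rng.normal(size=(6, 8))
segments[2] = question + 0.01 * rng.normal(size=8)

causal_idx, non_causal_idx, relevance = separate_scenes(segments, question, k=2)
causal_feat = segments[causal_idx].mean(axis=0)
non_causal_feat = segments[non_causal_idx].mean(axis=0)

# The question embedding is the anchor; the causal scene is the positive sample
# and the non-causal scene is the negative sample.
loss = info_nce(question, causal_feat, [non_causal_feat])
print(causal_idx, non_causal_idx, loss)
```

In this toy setting the segment constructed to resemble the question always lands in the causal set. A real system would learn the relevance scores jointly with the answer decoder rather than use a fixed similarity.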
Pages: 377 - 386
Number of pages: 10
Related papers
50 in total
  • [31] SegEQA: Video Segmentation Based Visual Attention for Embodied Question Answering
    Luo, Haonan
    Lin, Guosheng
    Liu, Zichuan
    Liu, Fayao
    Tang, Zhenmin
    Yao, Yazhou
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 9666 - 9675
  • [32] Language-aware Visual Semantic Distillation for Video Question Answering
    Zou, Bo
    Yang, Chao
    Qiao, Yu
    Quan, Chengbin
    Zhao, Youjian
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 27103 - 27113
  • [33] A Causal Approach to Mitigate Modality Preference Bias in Medical Visual Question Answering
    Ye, Shuchang
    Naseem, Usman
    Meng, Mingyuan
    Feng, Dagan
    Kim, Jinman
    PROCEEDINGS OF THE FIRST INTERNATIONAL WORKSHOP ON VISION-LANGUAGE MODELS FOR BIOMEDICAL APPLICATIONS, VLM4BIO 2024, 2024, : 13 - 17
  • [34] Counterfactual Causal-Effect Intervention for Interpretable Medical Visual Question Answering
    Cai, Linqin
    Fang, Haodu
    Xu, Nuoying
    Ren, Bo
    IEEE TRANSACTIONS ON MEDICAL IMAGING, 2024, 43 (12) : 4430 - 4441
  • [35] Depth and Video Segmentation Based Visual Attention for Embodied Question Answering
    Luo, Haonan
    Lin, Guosheng
    Yao, Yazhou
    Liu, Fayao
    Liu, Zichuan
    Tang, Zhenmin
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (06) : 6807 - 6819
  • [36] Question-aware dynamic scene graph of local semantic representation learning for visual question answering
    Wu, Jinmeng
    Ge, Fulin
    Hong, Hanyu
    Shi, Yu
    Hao, Yanbin
    Ma, Lei
    PATTERN RECOGNITION LETTERS, 2023, 170 : 93 - 99
  • [37] Affective question answering on video
    Ruwa, Nelson
    Mao, Qirong
    Wang, Liangjun
    Gou, Jianping
    NEUROCOMPUTING, 2019, 363 : 125 - 139
  • [38] Video Graph Transformer for Video Question Answering
    Xiao, Junbin
    Zhou, Pan
    Chua, Tat-Seng
    Yan, Shuicheng
    COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696 : 39 - 58
  • [39] Visual explainable artificial intelligence for graph-based visual question answering and scene graph curation
    Sebastian Künzel
    Tanja Munz-Körner
    Pascal Tilli
    Noel Schäfer
    Sandeep Vidyapu
    Ngoc Thang Vu
    Daniel Weiskopf
    Visual Computing for Industry, Biomedicine, and Art, 8 (1)
  • [40] Video Reference: A Video Question Answering Engine
    Gao, Lei
    Li, Guangda
    Zheng, Yan-Tao
    Hong, Richang
    Chua, Tat-Seng
    ADVANCES IN MULTIMEDIA MODELING, PROCEEDINGS, 2010, 5916 : 799 - +