Visual Causal Scene Refinement for Video Question Answering

Cited by: 5
Authors
Wei, Yushen [1 ]
Liu, Yang [1 ]
Yan, Hong [1 ]
Li, Guanbin [1 ]
Lin, Liang [1 ]
Affiliations
[1] Sun Yat-sen University, Guangzhou, People's Republic of China
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China
Keywords
Video Question Answering; Causal Reasoning; Cross-Modal;
DOI
10.1145/3581783.3611873
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Existing methods for video question answering (VideoQA) often suffer from spurious correlations between different modalities, leading to a failure in identifying the dominant visual evidence and the intended question. Moreover, these methods function as black boxes, making it difficult to interpret the visual scene during the QA process. In this paper, to discover critical video segments and frames that serve as the visual causal scene for generating reliable answers, we present a causal analysis of VideoQA and propose a framework for cross-modal causal relational reasoning, named Visual Causal Scene Refinement (VCSR). In particular, a set of causal front-door intervention operations is introduced to explicitly find the visual causal scenes at both the segment and frame levels. Our VCSR involves two essential modules: i) the Question-Guided Refiner (QGR) module, which refines consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention; ii) the Causal Scene Separator (CSS) module, which discovers a collection of visual causal and non-causal scenes based on the visual-linguistic causal relevance and estimates the causal effect of the scene-separating intervention in a contrastive learning manner. Extensive experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of our VCSR in discovering visual causal scenes and achieving robust video question answering. The code is available at https://github.com/YangLiu9208/VCSR.
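The scene-separating idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it is a toy example under stated assumptions: frames and the question are already encoded as vectors, "causal relevance" is approximated by cosine similarity, the causal scene is the top-k most relevant frames, and the contrastive objective is a plain InfoNCE loss with the question as the anchor. The names split_scenes and info_nce are hypothetical.

```python
import numpy as np

def split_scenes(frame_feats, question_feat, k):
    """Split frame indices into a 'causal' set (top-k by relevance) and the rest.

    Assumption: relevance is the cosine similarity between frame and question.
    """
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    q = question_feat / np.linalg.norm(question_feat)
    relevance = f @ q                      # one score per frame
    order = np.argsort(-relevance)         # most relevant first
    causal_idx = np.sort(order[:k])
    non_causal_idx = np.sort(order[k:])
    return causal_idx, non_causal_idx

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: pull the anchor toward the positive, away from negatives."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = [cos(anchor, positive)] + [cos(anchor, n) for n in negatives]
    logits = np.array(sims) / temperature
    logits -= logits.max()                 # numerical stability
    return float(-(logits[0] - np.log(np.exp(logits).sum())))

# Toy usage with random features (hypothetical shapes: 8 frames, 16 dims).
rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 16))
question = rng.normal(size=16)

causal_idx, non_causal_idx = split_scenes(frames, question, k=3)
causal_scene = frames[causal_idx].mean(axis=0)          # pooled causal scene
non_causal_scene = frames[non_causal_idx].mean(axis=0)  # pooled non-causal scene
loss = info_nce(question, causal_scene, [non_causal_scene])
```

In this sketch the loss is small when the question aligns with the causal scene more strongly than with the non-causal one, which is the intuition behind training a separator contrastively. The actual VCSR modules operate on learned segment and frame features and include the front-door intervention, which this example does not model.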
Pages: 377-386
Page count: 10