Visual Causal Scene Refinement for Video Question Answering

Cited by: 5
Authors
Wei, Yushen [1]
Liu, Yang [1]
Yan, Hong [1]
Li, Guanbin [1]
Lin, Liang [1]
Affiliations
[1] Sun Yat-sen University, Guangzhou, People's Republic of China
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China
Keywords
Video Question Answering; Causal Reasoning; Cross-Modal;
DOI
10.1145/3581783.3611873
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Existing methods for video question answering (VideoQA) often suffer from spurious correlations between different modalities, leading to a failure in identifying the dominant visual evidence and the intended question. Moreover, these methods function as black boxes, making it difficult to interpret the visual scene during the QA process. In this paper, to discover critical video segments and frames that serve as the visual causal scene for generating reliable answers, we present a causal analysis of VideoQA and propose a framework for cross-modal causal relational reasoning, named Visual Causal Scene Refinement (VCSR). In particular, a set of causal front-door intervention operations is introduced to explicitly find the visual causal scenes at both segment and frame levels. Our VCSR involves two essential modules: i) the Question-Guided Refiner (QGR) module, which refines consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention; ii) the Causal Scene Separator (CSS) module, which discovers a collection of visual causal and non-causal scenes based on the visual-linguistic causal relevance and estimates the causal effect of the scene-separating intervention in a contrastive learning manner. Extensive experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of our VCSR in discovering visual causal scenes and achieving robust video question answering. The code is available at https://github.com/YangLiu9208/VCSR.
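The abstract refers to causal front-door intervention. As a generic illustration only, and not the VCSR implementation (which operates on learned video and question features), the following minimal sketch applies the standard front-door adjustment formula, P(y | do(x)) = Σ_m P(m | x) Σ_x' P(y | m, x') P(x'), to a small discrete table. The function names `front_door_effect` and `toy_joint`, the variable roles (X as input, M as mediator, Y as answer), and all probabilities are hypothetical choices made for this example.

```python
def front_door_effect(joint, x, y):
    """Return P(Y=y | do(X=x)) from a discrete joint table {(x, m, y): prob}.

    Assumes M is a valid front-door mediator between X and Y and that
    P(X=x) > 0. Illustrative only; not taken from the VCSR code base.
    """
    xs = {key[0] for key in joint}
    ms = {key[1] for key in joint}

    def prob(**fixed):
        # Marginal probability of the variable assignments given in `fixed`.
        return sum(
            p
            for (xv, mv, yv), p in joint.items()
            if all({"x": xv, "m": mv, "y": yv}[name] == val
                   for name, val in fixed.items())
        )

    effect = 0.0
    for m in ms:
        p_m_given_x = prob(x=x, m=m) / prob(x=x)
        inner = 0.0
        for x_prime in xs:
            p_xm = prob(x=x_prime, m=m)
            if p_xm > 0:
                # P(y | m, x') * P(x')
                inner += prob(x=x_prime, m=m, y=y) / p_xm * prob(x=x_prime)
        effect += p_m_given_x * inner
    return effect


def toy_joint():
    """Hypothetical binary chain X -> M -> Y with made-up probabilities."""
    p_x = {0: 0.5, 1: 0.5}
    p_m1_given_x = {0: 0.1, 1: 0.9}
    p_y1_given_m = {0: 0.2, 1: 0.8}
    joint = {}
    for x in (0, 1):
        for m in (0, 1):
            for y in (0, 1):
                pm = p_m1_given_x[x] if m == 1 else 1 - p_m1_given_x[x]
                py = p_y1_given_m[m] if y == 1 else 1 - p_y1_given_m[m]
                joint[(x, m, y)] = p_x[x] * pm * py
    return joint
```

In this toy chain there is no hidden confounder, so `front_door_effect(toy_joint(), x=1, y=1)` should agree with the direct calculation 0.9 × 0.8 + 0.1 × 0.2 up to floating-point error. The value of the adjustment shows up when a confounder of X and Y is unobserved but the mediator is still measured.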
Pages: 377-386
Number of pages: 10
Related papers
50 in total
  • [1] Scene Graph Refinement Network for Visual Question Answering
    Qian, Tianwen
    Chen, Jingjing
    Chen, Shaoxiang
    Wu, Bo
    Jiang, Yu-Gang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 3950 - 3961
  • [2] Scene Text Visual Question Answering
    Biten, Ali Furkan
    Tito, Ruben
    Mafla, Andres
    Gomez, Lluis
    Rusinol, Marcal
    Valveny, Ernest
    Jawahar, C. V.
    Karatzas, Dimosthenis
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019 : 4290 - 4300
  • [3] Multichannel Attention Refinement for Video Question Answering
    Zhuang, Yueting
    Xu, Dejing
    Yan, Xin
    Cheng, Wenzhuo
    Zhao, Zhou
    Pu, Shiliang
    Xiao, Jun
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2020, 16 (01)
  • [4] Removing Bias of Video Question Answering by Causal Theory
    Huang, Yue
    Gu, Xiaodong
    2024 IEEE 19TH CONFERENCE ON INDUSTRIAL ELECTRONICS AND APPLICATIONS, ICIEA 2024, 2024
  • [5] A Multilingual Approach to Scene Text Visual Question Answering
    Brugues i Pujolras, Josep
    Gomez i Bigorda, Lluis
    Karatzas, Dimosthenis
    DOCUMENT ANALYSIS SYSTEMS, DAS 2022, 2022, 13237 : 65 - 79
  • [6] Lightweight Visual Question Answering using Scene Graphs
    Nuthalapati, Sai Vidyaranya
    Chandradevan, Ramraj
    Giunchiglia, Eleonora
    Li, Bowen
    Kayser, Maxime
    Lukasiewicz, Thomas
    Yang, Carl
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021, 2021 : 3353 - 3357
  • [7] Dynamic Multistep Reasoning based on Video Scene Graph for Video Question Answering
    Mao, Jianguo
    Jiang, Wenbin
    Wang, Xiangdong
    Feng, Zhifan
    Lyu, Yajuan
    Liu, Hong
    Zhu, Yong
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022 : 3894 - 3904
  • [8] Towards Reasoning Ability in Scene Text Visual Question Answering
    Wang, Qingqing
    Xiao, Liqiang
    Lu, Yue
    Jin, Yaohui
    He, Hao
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021 : 2281 - 2289
  • [9] Scene text visual question answering by using YOLO and STN
    Nourali K.
    Dolkhani E.
    International Journal of Speech Technology, 2024, 27 (01) : 69 - 76
  • [10] Scene Understanding for Autonomous Driving Using Visual Question Answering
    Wantiez, Adrien
    Qiu, Tianming
    Matthes, Stefan
    Shen, Hao
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023