Visual Causal Scene Refinement for Video Question Answering

Cited by: 5

Authors:

Wei, Yushen [1]

Liu, Yang [1]

Yan, Hong [1]

Li, Guanbin [1]

Lin, Liang [1]

Affiliations:

[1] Sun Yat-sen University, Guangzhou, People's Republic of China

Source:

PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023

Funding:

National Key Research and Development Program of China; National Natural Science Foundation of China;

Keywords:

Video Question Answering; Causal Reasoning; Cross-Modal;

DOI:

10.1145/3581783.3611873

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Existing methods for video question answering (VideoQA) often suffer from spurious correlations between different modalities, leading to a failure in identifying the dominant visual evidence and the intended question. Moreover, these methods function as black boxes, making it difficult to interpret the visual scene during the QA process. In this paper, to discover critical video segments and frames that serve as the visual causal scene for generating reliable answers, we present a causal analysis of VideoQA and propose a framework for cross-modal causal relational reasoning, named Visual Causal Scene Refinement (VCSR). In particular, a set of causal front-door intervention operations is introduced to explicitly find the visual causal scenes at both the segment and frame levels. Our VCSR involves two essential modules: i) the Question-Guided Refiner (QGR) module, which refines consecutive video frames guided by the question semantics to obtain more representative segment features for causal front-door intervention; ii) the Causal Scene Separator (CSS) module, which discovers a collection of visual causal and non-causal scenes based on the visual-linguistic causal relevance and estimates the causal effect of the scene-separating intervention in a contrastive learning manner. Extensive experiments on the NExT-QA, Causal-VidQA, and MSRVTT-QA datasets demonstrate the superiority of our VCSR in discovering visual causal scenes and achieving robust video question answering. The code is available at https://github.com/YangLiu9208/VCSR.
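
For readers unfamiliar with the term, the generic front-door adjustment from causal inference, which "front-door intervention" refers to, can be written as follows. This is the textbook formula only, not the paper's specific formulation; the symbols X (input), M (mediator), and Y (answer) are illustrative assumptions, since the abstract does not name its variables.

```latex
% Generic front-door adjustment (textbook form).
% X: treatment/input, M: mediator, Y: outcome; all symbols are assumed for illustration.
P\bigl(Y \mid \mathrm{do}(X = x)\bigr)
  = \sum_{m} P(M = m \mid X = x)
    \sum_{x'} P\bigl(Y \mid M = m, X = x'\bigr)\, P(X = x')
```

The formula estimates the effect of X on Y through the mediator M without observing the confounder directly, which is why it suits settings where spurious cross-modal correlations cannot be measured explicitly.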


Pages: 377-386

Number of pages: 10

50 items in total

[1] Scene Graph Refinement Network for Visual Question Answering
Qian, Tianwen
Chen, Jingjing
Chen, Shaoxiang
Wu, Bo
Jiang, Yu-Gang
IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 3950 - 3961
[2] Scene Text Visual Question Answering
Biten, Ali Furkan
Tito, Ruben
Mafla, Andres
Gomez, Lluis
Rusinol, Marcal
Valveny, Ernest
Jawahar, C. V.
Karatzas, Dimosthenis
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 4290 - 4300
[3] Multichannel Attention Refinement for Video Question Answering
Zhuang, Yueting
Xu, Dejing
Yan, Xin
Cheng, Wenzhuo
Zhao, Zhou
Pu, Shiliang
Xiao, Jun
ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2020, 16 (01)
[4] Removing Bias of Video Question Answering by Causal Theory
Huang, Yue
Gu, Xiaodong
2024 IEEE 19TH CONFERENCE ON INDUSTRIAL ELECTRONICS AND APPLICATIONS, ICIEA 2024, 2024,
[5] A Multilingual Approach to Scene Text Visual Question Answering
Brugues i Pujolras, Josep
Gomez i Bigorda, Lluis
Karatzas, Dimosthenis
DOCUMENT ANALYSIS SYSTEMS, DAS 2022, 2022, 13237 : 65 - 79
[6] Lightweight Visual Question Answering using Scene Graphs
Nuthalapati, Sai Vidyaranya
Chandradevan, Ramraj
Giunchiglia, Eleonora
Li, Bowen
Kayser, Maxime
Lukasiewicz, Thomas
Yang, Carl
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021, 2021, : 3353 - 3357
[7] Dynamic Multistep Reasoning based on Video Scene Graph for Video Question Answering
Mao, Jianguo
Jiang, Wenbin
Wang, Xiangdong
Feng, Zhifan
Lyu, Yajuan
Liu, Hong
Zhu, Yong
NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 3894 - 3904
[8] Towards Reasoning Ability in Scene Text Visual Question Answering
Wang, Qingqing
Xiao, Liqiang
Lu, Yue
Jin, Yaohui
He, Hao
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 2281 - 2289
[9] Scene text visual question answering by using YOLO and STN
Nourali K.
Dolkhani E.
International Journal of Speech Technology, 2024, 27 (01) : 69 - 76
[10] Scene Understanding for Autonomous Driving Using Visual Question Answering
Wantiez, Adrien
Qiu, Tianming
Matthes, Stefan
Shen, Hao
2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
