Discovering Spatio-Temporal Rationales for Video Question Answering

Cited by: 0

Authors:

Li, Yicong [1]

Xiao, Junbin [1]

Feng, Chun [2]

Wang, Xiang [2]

Chua, Tat-Seng [1]

Affiliations:

[1] Natl Univ Singapore, Singapore, Singapore

[2] Univ Sci & Technol China, Hefei, Peoples R China

Source:

2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023) | 2023

Keywords:

DOI:

10.1109/ICCV51070.2023.01275

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper addresses complex video question answering (VideoQA), which features long videos containing multiple objects and events at different times. To tackle the challenge, we highlight the importance of identifying question-critical temporal moments and spatial objects from the vast amount of video content. To this end, we propose Spatio-Temporal Rationalization (STR), a differentiable selection module that adaptively collects question-critical moments and objects using cross-modal interaction. The discovered video moments and objects then serve as grounded rationales to support answer reasoning. Based on STR, we further propose TranSTR, a Transformer-style neural network architecture that takes STR as its core and additionally incorporates a novel answer interaction mechanism to coordinate STR for answer decoding. Experiments on four datasets show that TranSTR achieves new state-of-the-art (SoTA) results. In particular, on NExT-QA and Causal-VidQA, which feature complex VideoQA, it significantly surpasses the previous SoTA by 5.8% and 6.8%, respectively. We then conduct extensive studies to verify the importance of STR as well as the proposed answer interaction mechanism. With the success of TranSTR and our comprehensive analysis, we hope this work can spark more future efforts in complex VideoQA. Code will be released at https://github.com/yl3800/TranSTR.

Pages: 13823 - 13832

Page count: 10

Related literature (50 items in total):

[1] Video Question Answering with Spatio-Temporal Reasoning
Jang, Yunseok
Song, Yale
Kim, Chris Dongjoo
Yu, Youngjae
Kim, Youngjin
Kim, Gunhee
[J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2019, 127 (10) : 1385 - 1412
[2] Video Question Answering with Spatio-Temporal Reasoning
Yunseok Jang
Yale Song
Chris Dongjoo Kim
Youngjae Yu
Youngjin Kim
Gunhee Kim
[J]. International Journal of Computer Vision, 2019, 127 : 1385 - 1412
[3] Spatio-Temporal Context Networks for Video Question Answering
Gao, Kun
Han, Yahong
[J]. ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2017, PT II, 2018, 10736 : 108 - 118
[4] Dynamic Spatio-Temporal Modular Network for Video Question Answering
Qian, Zi
Wang, Xin
Duan, Xuguang
Chen, Hong
Zhu, Wenwu
[J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4466 - 4477
[5] Spatio-Temporal Two-stage Fusion for video question answering
Xu, Feifei
Zhu, Yitao
Wang, Chun
Cao, Yangze
Zhong, Zheng
Li, Xiongmin
[J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2023, 237
[6] Video Question Answering via Hierarchical Spatio-Temporal Attention Networks
Zhao, Zhou
Yang, Qifan
Cai, Deng
He, Xiaofei
Zhuang, Yueting
[J]. PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 3518 - 3524
[7] Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering
Dang, Long Hoang
Le, Thao Minh
Le, Vuong
Tran, Truyen
[J]. PROCEEDINGS OF THE THIRTIETH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2021, 2021, : 636 - 642
[8] Divide and Conquer: Question-Guided Spatio-Temporal Contextual Attention for Video Question Answering
Jiang, Jianwen
Chen, Ziqiang
Lin, Haojie
Zhao, Xibin
Gao, Yue
[J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 11101 - 11108
[9] Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering
Cheng, Yi
Fan, Hehe
Lin, Dongyun
Sun, Ying
Kankanhalli, Mohan
Lim, Joo-Hwee
[J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 6131 - 6141
[10] (2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering
Cherian, Anoop
Hori, Chiori
Marks, Tim K.
Le Roux, Jonathan
[J]. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELVETH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 444 - 453