Video Graph Transformer for Video Question Answering

Cited by: 23

Authors:

Xiao, Junbin [1,2,3]

Zhou, Pan [1]

Chua, Tat-Seng [2,3]

Yan, Shuicheng [1]

Affiliations:

[1] Sea AI Lab, Singapore, Singapore

[2] Sea NExT Joint Lab, Singapore, Singapore

[3] National University of Singapore, Department of Computer Science, Singapore, Singapore

Source:

COMPUTER VISION, ECCV 2022, PT XXXVI | 2022 / Volume 13696

Keywords:

Dynamic visual graph; Transformer; VideoQA

DOI:

10.1007/978-3-031-20059-5_3

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper proposes a Video Graph Transformer (VGT) model for Video Question Answering (VideoQA). VGT's uniqueness is two-fold: 1) it designs a dynamic graph transformer module which encodes video by explicitly capturing the visual objects, their relations, and dynamics for complex spatio-temporal reasoning; and 2) it exploits disentangled video and text Transformers for relevance comparison between the video and text to perform QA, instead of an entangled cross-modal Transformer for answer classification. Vision-text communication is done by additional cross-modal interaction modules. With more reasonable video encoding and QA solution, we show that VGT can achieve much better performance than prior art on VideoQA tasks that challenge dynamic relation reasoning in the pretraining-free scenario. Its performance even surpasses that of models pretrained on millions of external data. We further show that VGT can also benefit a lot from self-supervised cross-modal pretraining, yet with orders of magnitude less data. These results clearly demonstrate the effectiveness and superiority of VGT, and reveal its potential for more data-efficient pretraining. With comprehensive analyses and some heuristic observations, we hope that VGT can promote VQA research beyond coarse recognition/description towards fine-grained relation reasoning in realistic videos. Our code is available at https://github.com/sail-sg/VGT.
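
The abstract's second point, answering by relevance comparison between separately encoded video and text rather than by classification, can be illustrated with a toy sketch. This is not the authors' implementation: the function names `dot` and `pick_answer`, the dot-product similarity, and the hand-written 3-dimensional vectors are all assumptions made for illustration; in the real model the embeddings would come from the video and text Transformers.

```python
# Toy sketch of similarity-based answer selection (assumed, not from the paper's code).
# Each candidate answer is scored by its similarity to the video representation,
# and the highest-scoring candidate is returned.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def pick_answer(video_vec, answer_vecs):
    """Return (index of best candidate, list of similarity scores)."""
    scores = [dot(video_vec, a) for a in answer_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Hypothetical embeddings: one video vector and three candidate-answer vectors.
video_vec = [0.9, 0.1, 0.0]
answer_vecs = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]

best, scores = pick_answer(video_vec, answer_vecs)
print(best)  # index of the candidate most similar to the video
```

Because the candidates are scored rather than classified into a fixed label set, the same procedure applies unchanged to any number of candidate answers.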

Pages: 39-58

Page count: 20

50 records in total

[1] Contrastive Video Question Answering via Video Graph Transformer
Xiao, Junbin
Zhou, Pan
Yao, Angela
Li, Yicong
Hong, Richang
Yan, Shuicheng
Chua, Tat-Seng
[J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (11) : 13265 - 13280
[2] Video-Context Aligned Transformer for Video Question Answering
Zong, Linlin
Wan, Jiahui
Zhang, Xianchao
Liu, Xinyue
Liang, Wenxin
Xu, Bo
[J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 17, 2024, : 19795 - 19803
[3] Redundancy-aware Transformer for Video Question Answering
Li, Yicong
Yang, Xun
Zhang, An
Feng, Chun
Wang, Xiang
Chua, Tat-Seng
[J]. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 3172 - 3180
[4] Multimodal Graph Reasoning and Fusion for Video Question Answering
Zhang, Shuai
Wang, Xingfu
Hawbani, Ammar
Zhao, Liang
Alsamhi, Saeed Hamood
[J]. 2022 IEEE INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, 2022, : 1410 - 1415
[5] Progressive Graph Attention Network for Video Question Answering
Peng, Liang
Yang, Shuangji
Bin, Yi
Wang, Guoqing
[J]. PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 2871 - 2879
[6] Reasoning with Heterogeneous Graph Alignment for Video Question Answering
Jiang, Pin
Han, Yahong
[J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 11109 - 11116
[7] Dynamic Multistep Reasoning based on Video Scene Graph for Video Question Answering
Mao, Jianguo
Jiang, Wenbin
Wang, Xiangdong
Feng, Zhifan
Lyu, Yajuan
Liu, Hong
Zhu, Yong
[J]. NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 3894 - 3904
[8] Video Reference: A Video Question Answering Engine
Gao, Lei
Li, Guangda
Zheng, Yan-Tao
Hong, Richang
Chua, Tat-Seng
[J]. ADVANCES IN MULTIMEDIA MODELING, PROCEEDINGS, 2010, 5916 : 799 - +
[9] Action-Centric Relation Transformer Network for Video Question Answering
Zhang, Jipeng
Shao, Jie
Cao, Rui
Gao, Lianli
Xu, Xing
Shen, Heng Tao
[J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (01) : 63 - 74
[10] Affective question answering on video
Ruwa, Nelson
Mao, Qirong
Wang, Liangjun
Gou, Jianping
[J]. NEUROCOMPUTING, 2019, 363 : 125 - 139