Video Graph Transformer for Video Question Answering

Cited by: 23
Authors
Xiao, Junbin [1 ,2 ,3 ]
Zhou, Pan [1 ]
Chua, Tat-Seng [2 ,3 ]
Yan, Shuicheng [1 ]
Affiliations
[1] Sea AI Lab, Singapore, Singapore
[2] Sea NExT Joint Lab, Singapore, Singapore
[3] Natl Univ Singapore, Dept Comp Sci, Singapore, Singapore
Keywords
Dynamic visual graph; Transformer; VideoQA
DOI
10.1007/978-3-031-20059-5_3
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper proposes a Video Graph Transformer (VGT) model for Video Question Answering (VideoQA). VGT's uniqueness is two-fold: 1) it designs a dynamic graph transformer module that encodes video by explicitly capturing the visual objects, their relations, and their dynamics for complex spatio-temporal reasoning; and 2) it exploits disentangled video and text Transformers that compare the relevance between video and text to perform QA, instead of an entangled cross-modal Transformer for answer classification. Vision-text communication is done by additional cross-modal interaction modules. With this more reasonable video encoding and QA solution, we show that VGT achieves much better performance than prior art on VideoQA tasks that challenge dynamic relation reasoning, in the pretraining-free scenario. Its performance even surpasses that of models pretrained with millions of external examples. We further show that VGT also benefits substantially from self-supervised cross-modal pretraining, yet with orders of magnitude less data. These results clearly demonstrate the effectiveness and superiority of VGT, and reveal its potential for more data-efficient pretraining. With comprehensive analyses and some heuristic observations, we hope that VGT can promote VQA research beyond coarse recognition/description towards fine-grained relation reasoning in realistic videos. Our code is available at https://github.com/sail-sg/VGT.
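The abstract's second point, answering by relevance comparison between disentangled video and text embeddings rather than by cross-modal answer classification, can be illustrated with a minimal sketch. This is a conceptual toy, not the paper's implementation: the embeddings here are hand-made vectors standing in for the outputs of the (assumed) video and text Transformers, and `answer_by_relevance` is a hypothetical helper name.

```python
import numpy as np

def answer_by_relevance(video_emb, answer_embs):
    """Pick the candidate answer whose embedding is most similar
    (cosine similarity) to the video embedding, i.e. QA by
    relevance comparison instead of answer classification."""
    v = video_emb / np.linalg.norm(video_emb)
    a = answer_embs / np.linalg.norm(answer_embs, axis=1, keepdims=True)
    scores = a @ v                  # one similarity score per candidate
    return int(np.argmax(scores)), scores

# Toy example: three candidate-answer embeddings in a 4-d joint space.
video = np.array([1.0, 0.0, 1.0, 0.0])
answers = np.array([
    [0.9, 0.1, 0.8, 0.0],   # well aligned with the video
    [0.0, 1.0, 0.0, 1.0],   # orthogonal to the video
    [0.5, 0.5, 0.5, 0.5],   # partially aligned
])
best, scores = answer_by_relevance(video, answers)
# best == 0: the first candidate is the most relevant
```

In the real model the two encoders would be trained so that correct video-answer pairs score higher than incorrect ones; the scoring step itself reduces to the similarity comparison sketched here.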
Pages: 39-58
Page count: 20