Spatio-Temporal Graph Convolution Transformer for Video Question Answering

Cited: 0
Authors
Tang, Jiahao [1 ]
Hu, Jianguo [1 ,2 ]
Huang, Wenjun [1 ]
Shen, Shengzhi [1 ]
Pan, Jiakai [1 ]
Wang, Deming [3 ]
Ding, Yanyu [4 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Microelect Sci & Technol, Zhuhai 519082, Peoples R China
[2] Sun Yat Sen Univ, Shenzhen Res Inst, Shenzhen 510275, Peoples R China
[3] South China Normal Univ, Sch Elect & Informat Engn, Foshan 528225, Peoples R China
[4] Dongguan Univ Technol, Dongguan 523820, Peoples R China
Source
IEEE ACCESS | 2024, Vol. 12
Keywords
Visualization; Transformers; Feature extraction; Convolution; Computational modeling; Question answering (information retrieval); Data models; Video question answering (VideoQA); video reasoning and description; spatial-temporal graph; dynamic graph Transformer; graph attention; computer vision; natural language processing
DOI
10.1109/ACCESS.2024.3445636
CLC number: TP [Automation & Computer Technology]
Discipline code: 0812
Abstract
Current video question answering (VideoQA) algorithms built on video-text pretraining models rely on intricate unimodal encoders and multimodal fusion Transformers, which often reduces efficiency on tasks such as visual reasoning. Conversely, VideoQA algorithms based on graph neural networks often underperform on video description and reasoning owing to their simplistic graph construction and cross-modal interaction designs, and require additional pretraining data to close the gap. In this work, we introduce the Spatio-temporal Graph Convolution Transformer (STCT) model for VideoQA. By leveraging Spatio-temporal Graph Convolution (STGC) and dynamic graph Transformers, the model explicitly captures the spatio-temporal relationships among visual objects, facilitating dynamic interactions and strengthening visual reasoning. Moreover, the model introduces a novel cross-modal interaction approach that uses dynamic graph attention to adjust the attention weights of visual objects according to the posed question, thereby improving multimodal cooperative perception. By addressing, through carefully designed graph structures and cross-modal interaction mechanisms, the dependence of graph-based algorithms on pretraining for performance gains, the model achieves superior performance on visual description and reasoning tasks with simpler unimodal encoders and multimodal fusion modules. Comprehensive analyses and comparisons across multiple datasets, including NExT-QA, MSVD-QA, and MSRVTT-QA, confirm its robust video reasoning and description capabilities.
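The paper's full implementation is not reproduced in this record, but the idea of question-conditioned dynamic graph attention described above can be illustrated in miniature: object features attend to one another, with attention logits biased by each object's relevance to the question embedding. This is a minimal numpy sketch of the general mechanism, not the authors' STCT implementation; the function name and the specific biasing scheme are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def question_conditioned_graph_attention(obj_feats, question):
    """One question-conditioned graph-attention step over visual objects.

    obj_feats: (N, d) array of N detected-object features from the clip.
    question:  (d,)   pooled question embedding.
    Returns updated object features (N, d) and the attention matrix (N, N).
    """
    N, d = obj_feats.shape
    # Pairwise object-object attention logits (scaled dot product).
    logits = obj_feats @ obj_feats.T / np.sqrt(d)
    # Dynamic modulation: bias each column by that object's relevance
    # to the question, so question-relevant objects attract more attention.
    relevance = obj_feats @ question / np.sqrt(d)   # (N,)
    logits = logits + relevance[None, :]
    attn = softmax(logits, axis=-1)                 # rows sum to 1
    return attn @ obj_feats, attn
```

In a full model this step would be stacked with spatio-temporal graph convolutions over per-frame object graphs and temporal links between frames; here only the question-conditioned attention update is shown.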
Pages: 131664-131680 (17 pages)