Spatio-Temporal Two-Stage Fusion for Video Question Answering

Cited: 1
Authors
Xu, Feifei [1 ]
Zhu, Yitao [1 ]
Wang, Chun [1 ]
Cao, Yangze [1 ]
Zhong, Zheng [1 ]
Li, Xiongmin [2 ]
Affiliations
[1] Shanghai Univ Elect Power, 1851 Hucheng Ring Rd, Shanghai 201306, Peoples R China
[2] Cognizant Technol Solut US Corp, 211 Qual Circle, College Stn, TX 77845 USA
Keywords
Video question answering; Vision transformer; Spatio-temporal two-stage fusion; NETWORK;
DOI
10.1016/j.cviu.2023.103821
CLC Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video question answering (VideoQA) has attracted much interest from scholars in recent years as one of the most representative multimodal tasks. The task requires the model to interact with and reason over both the video and the question. Most known approaches use pre-trained networks to extract complex embeddings of videos and questions independently before performing multimodal fusion. However, they overlook two factors: (1) these feature extractors are pre-trained for image or video classification without taking the question into consideration, and may therefore be unsuitable for the VideoQA task; (2) using multiple feature extractors to extract features at different levels introduces more irrelevant information, making the task more difficult. For these reasons, we propose a new model named Spatio-Temporal Two-Stage Fusion, which ties together multiple levels of feature extraction and divides them into two distinct stages: spatial fusion and temporal fusion. Specifically, in the spatial fusion stage, we use a Vision Transformer to integrate intra-frame information and generate frame-level features. At the same time, we design a multimodal temporal fusion module that enables the video to fuse textual information and assigns a different level of attention to each frame. The obtained frame-level features are then used to generate global video features with another Vision Transformer. To generate modal interaction information efficiently, we design a video-text symmetric fusion module that retains the most relevant information through mutual guidance between the two modalities. Our method is evaluated on three benchmark datasets, MSVD-QA, MSRVTT-QA and TGIF-QA, and achieves competitive results.
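The two-stage decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: mean pooling stands in for the spatial Vision Transformer, and simple question-guided dot-product attention stands in for the multimodal temporal fusion module; all function names and shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_fuse(frame_patches):
    """Stage 1: fuse intra-frame patch features into one frame-level vector.

    frame_patches: (num_frames, num_patches, dim).
    Mean pooling is a stand-in for the paper's first Vision Transformer.
    """
    return frame_patches.mean(axis=1)          # (num_frames, dim)

def temporal_fuse(frame_feats, question_feat):
    """Stage 2: question-guided attention over frames, then global pooling.

    Each frame receives a different attention weight depending on the
    question, before frames are aggregated into one global video feature.
    """
    scores = frame_feats @ question_feat / np.sqrt(frame_feats.shape[-1])
    weights = softmax(scores)                  # (num_frames,), sums to 1
    return weights @ frame_feats, weights      # (dim,), per-frame weights

rng = np.random.default_rng(0)
patches = rng.normal(size=(8, 16, 32))         # 8 frames, 16 patches, dim 32
question = rng.normal(size=32)                 # pooled question embedding
frames = spatial_fuse(patches)                 # (8, 32) frame-level features
video, attn = temporal_fuse(frames, question)  # (32,) global video feature
```

In the actual model, both pooling steps are Vision Transformers and the fusion is bidirectional (video-text symmetric); the sketch only shows how spatial fusion collapses patches per frame before temporal fusion collapses frames per video.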
Pages: 10