Spatio-Temporal Two-stage Fusion for video question answering

Cited by: 1
Authors
Xu, Feifei [1]
Zhu, Yitao [1]
Wang, Chun [1]
Cao, Yangze [1]
Zhong, Zheng [1]
Li, Xiongmin [2]
Affiliations
[1] Shanghai Univ Elect Power, 1851 Hucheng Ring Rd, Shanghai 201306, Peoples R China
[2] Cognizant Technol Solut US Corp, 211 Qual Circle, College Stn, TX 77845 USA
Keywords
Video question answering; Vision transformer; Spatio-temporal two-stage fusion; Network
DOI
10.1016/j.cviu.2023.103821
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video question answering (VideoQA) has attracted much interest from scholars in recent years as one of the most representative multimodal tasks. The task requires the model to interact and reason between the video and the question. Most known approaches use pre-trained networks to extract complex embeddings of videos and questions independently before performing multimodal fusion. However, they overlook two factors: (1) These feature extractors are pre-trained for image or video classification without taking the question into consideration, and therefore may not be suitable for the VideoQA task. (2) Using multiple feature extractors to extract features at different levels introduces more irrelevant information to some extent, making the task more difficult. For these reasons, we propose a new model named Spatio-Temporal Two-Stage Fusion, which ties together multiple levels of feature extraction and divides them into two distinct stages: spatial fusion and temporal fusion. Specifically, in the spatial fusion stage, we use a Vision Transformer to integrate intra-frame information and generate frame-level features. At the same time, we design a multimodal temporal fusion module that enables the video to fuse textual information and assign different levels of attention to each frame. The resulting frame-level features are then used by another Vision Transformer to generate global video features. To generate modal interaction information efficiently, we design a video-text symmetric fusion module that retains the most relevant information through mutual guidance between the two modalities. Our method is evaluated on three benchmark datasets, MSVD-QA, MSRVTT-QA and TGIF-QA, and achieves competitive results.
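The abstract describes a temporal fusion step in which the question assigns different levels of attention to each frame. The record does not give the module's actual design, so the following is only a minimal sketch of one generic way to do this: scaled dot-product attention of a question vector over frame-level features. The function names, the toy vectors, and the choice of dot-product scoring are all assumptions made for illustration, not the authors' method.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def question_guided_frame_attention(frame_feats, question_feat):
    """Hypothetical helper: weight frames by their similarity to the question.

    frame_feats:   list of T frame-level feature vectors, each of length d
    question_feat: one question feature vector of length d
    Returns (weights, pooled): T attention weights and a length-d pooled vector.
    """
    d = len(question_feat)
    # Scaled dot-product score between each frame and the question (assumed scoring rule).
    scores = [
        sum(f * q for f, q in zip(frame, question_feat)) / math.sqrt(d)
        for frame in frame_feats
    ]
    weights = softmax(scores)
    # Attention-weighted sum of frame features gives a question-aware video vector.
    pooled = [
        sum(w * frame[j] for w, frame in zip(weights, frame_feats))
        for j in range(d)
    ]
    return weights, pooled


# Toy data (made up): three frames with 2-dimensional features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question = [1.0, 0.0]
weights, pooled = question_guided_frame_attention(frames, question)
```

In this toy input, the second frame is orthogonal to the question vector and so receives the smallest weight. A full model would presumably use learned projections and multi-head attention inside a Transformer rather than raw dot products.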
Pages: 10