Spatio-Temporal Two-stage Fusion for video question answering

Cited by: 1

Authors:

Xu, Feifei [1]
Zhu, Yitao [1]
Wang, Chun [1]
Cao, Yangze [1]
Zhong, Zheng [1]
Li, Xiongmin [2]

Affiliations:

[1] Shanghai Univ Elect Power, 1851 Hucheng Ring Rd, Shanghai 201306, Peoples R China

[2] Cognizant Technol Solut US Corp, 211 Qual Circle, College Stn, TX 77845 USA

Source:

COMPUTER VISION AND IMAGE UNDERSTANDING | 2023 / Vol. 237

Keywords:

Video question answering; Vision transformer; Spatio-temporal two-stage fusion; NETWORK;

DOI:

10.1016/j.cviu.2023.103821

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video question answering (VideoQA) has attracted much interest from scholars as one of the most representative multimodal tasks in recent years. The task requires the model to interact and reason between the video and the question. Most known approaches use pre-trained networks to extract complex embeddings of videos and questions independently before performing multimodal fusion. However, they overlook two factors: (1) these feature extractors are pre-trained for image or video classification without taking the question into consideration, and therefore may not be suitable for the VideoQA task; (2) using multiple feature extractors to extract features at different levels introduces more irrelevant information to some extent, thus making the task more difficult. For these reasons, we propose a new model named Spatio-Temporal Two-Stage Fusion, which ties together multiple levels of feature extraction and divides them into two distinct stages: spatial fusion and temporal fusion. Specifically, in the spatial fusion stage, we use a Vision Transformer to integrate the intra-frame information and generate frame-level features. At the same time, we design a multimodal temporal fusion module that enables the video to fuse textual information and assign different levels of attention to each frame. The obtained frame-level features are then used to generate global video features by another Vision Transformer. To generate modal interaction information efficiently, we design a video-text symmetric fusion module that retains the most relevant information through mutual guidance between the two modalities. Our method is evaluated on three benchmark datasets, MSVD-QA, MSRVTT-QA and TGIF-QA, and achieves competitive results.
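The idea of letting the question assign a different level of attention to each frame can be illustrated with a minimal sketch. This is not the authors' temporal fusion module, which the abstract does not specify in detail; it assumes plain scaled dot-product attention between a single question vector and per-frame feature vectors. The function names (`question_guided_frame_weights`, `fuse_frames`) and the toy feature values are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def question_guided_frame_weights(frame_feats, question_feat):
    """Attention weight of each frame with respect to the question.

    Assumption: scaled dot-product scoring stands in for the paper's
    (unspecified) multimodal temporal fusion module.
    """
    scale = math.sqrt(len(question_feat))
    scores = [dot(f, question_feat) / scale for f in frame_feats]
    return softmax(scores)


def fuse_frames(frame_feats, question_feat):
    """Return the frame weights and the weighted sum of frame features."""
    weights = question_guided_frame_weights(frame_feats, question_feat)
    dim = len(frame_feats[0])
    fused = [
        sum(w * f[i] for w, f in zip(weights, frame_feats))
        for i in range(dim)
    ]
    return weights, fused


# Toy example: three 2-dimensional frame features and one question vector.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question = [2.0, 0.0]
weights, fused = fuse_frames(frames, question)
```

In this toy input, the frames whose features align with the question vector receive larger weights than the frame that does not, which is the behaviour the abstract describes at a high level.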


Pages: 10
