Spatio-Temporal Two-Stage Fusion for Video Question Answering

Cited: 1
Authors
Xu, Feifei [1 ]
Zhu, Yitao [1 ]
Wang, Chun [1 ]
Cao, Yangze [1 ]
Zhong, Zheng [1 ]
Li, Xiongmin [2 ]
Affiliations
[1] Shanghai Univ Elect Power, 1851 Hucheng Ring Rd, Shanghai 201306, Peoples R China
[2] Cognizant Technol Solut US Corp, 211 Qual Circle, College Stn, TX 77845 USA
Keywords
Video question answering; Vision transformer; Spatio-temporal two-stage fusion; NETWORK;
DOI
10.1016/j.cviu.2023.103821
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video question answering (VideoQA) has attracted much interest from scholars in recent years as one of the most representative multimodal tasks. The task requires a model to interact with and reason over both the video and the question. Most existing approaches use pre-trained networks to extract complex embeddings of videos and questions independently before performing multimodal fusion. However, they overlook two factors: (1) these feature extractors are pre-trained for image or video classification without taking the question into account, and therefore may not be well suited to the VideoQA task; (2) using multiple feature extractors to extract features at different levels introduces irrelevant information, making the task more difficult. For these reasons, we propose a new model named Spatio-Temporal Two-Stage Fusion, which ties together multiple levels of feature extraction and divides them into two distinct stages: spatial fusion and temporal fusion. Specifically, in the spatial fusion stage, we use a Vision Transformer to integrate intra-frame information and generate frame-level features. At the same time, we design a multimodal temporal fusion module that enables the video to incorporate textual information and assigns a different level of attention to each frame. The resulting frame-level features are then passed to another Vision Transformer to generate global video features. To generate modal interaction information efficiently, we design a video-text symmetric fusion module that retains the most relevant information through mutual guidance between the two modalities. Our method is evaluated on three benchmark datasets, MSVD-QA, MSRVTT-QA and TGIF-QA, and achieves competitive results.
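The two-stage pipeline in the abstract can be sketched as follows. This is a minimal illustrative reconstruction, not the paper's implementation: the scaled dot-product attention, the mean-patch query in the spatial stage, and all shapes (`T` frames, `P` patches, dimension `d`) are assumptions made for the sketch, and the symmetric video-text fusion module is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def attend(query, keys, values):
    # Scaled dot-product attention: query (d,), keys/values (n, d) -> (d,)
    scores = keys @ query / np.sqrt(query.shape[-1])
    return softmax(scores) @ values


def spatial_fusion(frame_patches):
    # Stage 1: fuse intra-frame patch tokens into one frame-level feature.
    # A learned [CLS]-style query is approximated here by the mean patch.
    q = frame_patches.mean(axis=0)
    return attend(q, frame_patches, frame_patches)


def temporal_fusion(frame_feats, question_feat):
    # Stage 2: the question guides attention over frames, so each frame
    # receives a different weight before fusion into a global video feature.
    return attend(question_feat, frame_feats, frame_feats)


rng = np.random.default_rng(0)
T, P, d = 8, 16, 32                      # frames, patches per frame, feature dim
video = rng.standard_normal((T, P, d))   # patch tokens per frame
question = rng.standard_normal(d)        # pooled question embedding

frame_feats = np.stack([spatial_fusion(f) for f in video])  # (T, d)
video_feat = temporal_fusion(frame_feats, question)         # (d,)
print(frame_feats.shape, video_feat.shape)
```

In the paper both stages are Vision Transformers; the single attention step per stage above only illustrates how spatial fusion (patches to frame features) and question-guided temporal fusion (frame features to a global video feature) compose.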
Pages: 10
Related Papers
50 records total
  • [1] Video Question Answering with Spatio-Temporal Reasoning
    Jang, Yunseok
    Song, Yale
    Kim, Chris Dongjoo
    Yu, Youngjae
    Kim, Youngjin
    Kim, Gunhee
    [J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2019, 127 (10) : 1385 - 1412
  • [3] Spatio-Temporal Context Networks for Video Question Answering
    Gao, Kun
    Han, Yahong
    [J]. ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2017, PT II, 2018, 10736 : 108 - 118
  • [4] Discovering Spatio-Temporal Rationales for Video Question Answering
    Li, Yicong
    Xiao, Junbin
    Feng, Chun
    Wang, Xiang
    Chua, Tat-Seng
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 13823 - 13832
  • [5] Dynamic Spatio-Temporal Modular Network for Video Question Answering
    Qian, Zi
    Wang, Xin
    Duan, Xuguang
    Chen, Hong
    Zhu, Wenwu
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4466 - 4477
  • [6] Video Question Answering via Hierarchical Spatio-Temporal Attention Networks
    Zhao, Zhou
    Yang, Qifan
    Cai, Deng
    He, Xiaofei
    Zhuang, Yueting
    [J]. PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 3518 - 3524
  • [7] Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering
    Dang, Long Hoang
    Le, Thao Minh
    Le, Vuong
    Tran, Truyen
    [J]. PROCEEDINGS OF THE THIRTIETH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2021, 2021, : 636 - 642
  • [8] Divide and Conquer: Question-Guided Spatio-Temporal Contextual Attention for Video Question Answering
    Jiang, Jianwen
    Chen, Ziqiang
    Lin, Haojie
    Zhao, Xibin
    Gao, Yue
    [J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 11101 - 11108
  • [9] Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering
    Cheng, Yi
    Fan, Hehe
    Lin, Dongyun
    Sun, Ying
    Kankanhalli, Mohan
    Lim, Joo-Hwee
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 6131 - 6141
  • [10] (2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering
    Cherian, Anoop
    Hori, Chiori
    Marks, Tim K.
    Le Roux, Jonathan
    [J]. THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2022, : 444 - 453