Harnessing Representative Spatial-Temporal Information for Video Question Answering

Cited: 0
Authors
Wang, Yuanyuan [1 ]
Liu, Meng [2 ]
Song, Xuemeng [1 ]
Nie, Liqiang [3 ]
Affiliations
[1] Shandong Univ, Qingdao, Peoples R China
[2] Shandong Jianzhu Univ, Jinan, Peoples R China
[3] Harbin Inst Technol, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video question answering; uncertainty estimation; expectation-maximization attention
DOI
10.1145/3675399
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Video question answering, which aims to answer a natural language question about a given video, has become prevalent in the past few years. Although remarkable improvements have been achieved, the task still suffers from insufficient comprehension of video content. To this end, we propose a spatial-temporal representative visual exploitation network for video question answering, which enhances video understanding by summarizing only representative visual information. To explore representative object information, we introduce an adaptive attention mechanism based on uncertainty estimation. At the same time, to capture representative frame-level and clip-level visual information, we iteratively construct a much more compact set of representations in an expectation-maximization manner to discard noisy information. Both the quantitative and qualitative results on the NExT-QA, TGIF-QA, MSRVTT-QA, and MSVD-QA datasets demonstrate the superiority of our model over several state-of-the-art approaches.
Pages: 20
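The abstract describes two mechanisms: an uncertainty-based adaptive attention over object information, and an expectation-maximization (EM) style iteration that condenses frame- and clip-level features into a compact set of representatives. As a rough illustration of the latter idea only, below is a minimal NumPy sketch of an EM-attention-style summarization step; the function name, basis count, iteration count, and temperature are illustrative assumptions and do not reproduce the authors' implementation.

```python
import numpy as np

def em_attention_summarize(features, num_bases=8, num_iters=3, temperature=1.0, seed=0):
    """Iteratively compress frame/clip features into a compact set of bases.

    E-step: soft-assign each feature to the current bases (attention weights).
    M-step: re-estimate each basis as the responsibility-weighted feature mean.
    All names and hyperparameters here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n, d = features.shape

    # Initialize the bases randomly and L2-normalize them.
    bases = rng.standard_normal((num_bases, d))
    bases /= np.linalg.norm(bases, axis=1, keepdims=True) + 1e-8

    for _ in range(num_iters):
        # E-step: responsibilities of each feature with respect to each basis.
        logits = temperature * features @ bases.T                # (n, num_bases)
        resp = np.exp(logits - logits.max(axis=1, keepdims=True))
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: update each basis as the weighted average of the features.
        bases = resp.T @ features                                 # (num_bases, d)
        bases /= resp.sum(axis=0, keepdims=True).T + 1e-8
        bases /= np.linalg.norm(bases, axis=1, keepdims=True) + 1e-8

    return bases, resp

# Toy usage: summarize 64 noisy frame-level features of dimension 512.
frames = np.random.randn(64, 512).astype(np.float32)
compact, weights = em_attention_summarize(frames, num_bases=8, num_iters=3)
print(compact.shape, weights.shape)   # (8, 512) (64, 8)
```

In a scheme of this kind, a few EM iterations leave the bases acting as a small, denoised summary of the original frame or clip features, which is the intuition behind using a compact representative set in place of all raw visual tokens.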
Related Papers
50 records in total
  • [31] Large Language Models are Temporal and Causal Reasoners for Video Question Answering
    Ko, Dohwan
    Lee, Ji Soo
    Kang, Wooyoung
    Roh, Byungseok
    Kim, Hyunwoo J.
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 4300 - 4316
  • [32] Application of temporal information extraction techniques to question answering systems
    Teresa Vicente-Diez, Maria
    Martinez, Paloma
    Martinez-Gonzalez, Angel
    Luis Martinez-Fernandez, Jose
    PROCESAMIENTO DEL LENGUAJE NATURAL, 2009, (42): : 25 - 30
  • [33] Spatio-Temporal Graph Convolution Transformer for Video Question Answering
    Tang, Jiahao
    Hu, Jianguo
    Huang, Wenjun
    Shen, Shengzhi
    Pan, Jiakai
    Wang, Deming
    Ding, Yanyu
    IEEE ACCESS, 2024, 12 : 131664 - 131680
  • [34] Dynamic Spatio-Temporal Modular Network for Video Question Answering
    Qian, Zi
    Wang, Xin
    Duan, Xuguang
    Chen, Hong
    Zhu, Wenwu
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4466 - 4477
  • [35] Affective question answering on video
    Ruwa, Nelson
    Mao, Qirong
    Wang, Liangjun
    Gou, Jianping
    NEUROCOMPUTING, 2019, 363 : 125 - 139
  • [36] Video Captioning Based on the Spatial-Temporal Saliency Tracing
    Zhou, Yuanen
    Hu, Zhenzhen
    Liu, Xueliang
    Wang, Meng
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING, PT I, 2018, 11164 : 59 - 70
  • [37] Deep Video Harmonization by Improving Spatial-temporal Consistency
    Chen, Xiuwen
    Fang, Li
    Ye, Long
    Zhang, Qin
    Machine Intelligence Research, 2024, 21 : 46 - 54
  • [38] Spatial-Temporal Separable Attention for Video Action Recognition
    Guo, Xi
    Hu, Yikun
    Chen, Fang
    Jin, Yuhui
    Qiao, Jian
    Huang, Jian
    Yang, Qin
    2022 INTERNATIONAL CONFERENCE ON FRONTIERS OF ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING, FAIML, 2022, : 224 - 228
  • [39] Spatial-Temporal Transformer for Video Snapshot Compressive Imaging
    Wang, Lishun
    Cao, Miao
    Zhong, Yong
    Yuan, Xin
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (07) : 9072 - 9089
  • [40] ShiftFormer: Spatial-Temporal Shift Operation in Video Transformer
    Yang, Beiying
    Zhu, Guibo
    Ge, Guojing
    Luo, Jinzhao
    Wang, Jinqiao
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023, : 1895 - 1900