Harnessing Representative Spatial-Temporal Information for Video Question Answering

Authors
Wang, Yuanyuan [1]
Liu, Meng [2]
Song, Xuemeng [1]
Nie, Liqiang [3]
Affiliations
[1] Shandong Univ, Qingdao, Peoples R China
[2] Shandong Jianzhu Univ, Jinan, Peoples R China
[3] Harbin Inst Technol, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Video question answering; uncertainty estimation; expectation-maximization attention
DOI
10.1145/3675399
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Video question answering, which aims to answer a natural language question about a given video, has become prevalent in the past few years. Although remarkable improvements have been obtained, the task still suffers from insufficient comprehension of video content. To this end, we propose a spatial-temporal representative visual exploitation network for video question answering, which enhances the understanding of the video by summarizing only representative visual information. To explore representative object information, we propose an adaptive attention mechanism based on uncertainty estimation. At the same time, to capture representative frame-level and clip-level visual information, we iteratively construct a much more compact set of representations in an expectation-maximization manner to suppress noisy information. Both the quantitative and qualitative results on the NExT-QA, TGIF-QA, MSRVTT-QA, and MSVD-QA datasets demonstrate the superiority of our model over several state-of-the-art approaches.
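The expectation-maximization step mentioned in the abstract can be illustrated with a minimal sketch of generic EM-style attention, which summarizes many feature vectors with a few bases. This is not the authors' implementation: the function name `em_attention`, the evenly spaced initialization, the dot-product similarity, and the default iteration count are all assumptions chosen for illustration.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def _l2_normalize(v, eps=1e-12):
    norm = math.sqrt(_dot(v, v))
    return [x / (norm + eps) for x in v]


def em_attention(features, num_bases=2, num_iters=3):
    """Summarize N feature vectors with num_bases bases (illustrative only)."""
    n, d = len(features), len(features[0])
    # Assumed initialization: evenly spaced input features serve as starting bases.
    step = max(n // num_bases, 1)
    bases = [_l2_normalize(features[min(k * step, n - 1)])
             for k in range(num_bases)]
    resp = []
    for _ in range(num_iters):
        # E-step: softly assign every feature to the current bases.
        resp = [_softmax([_dot(x, mu) for mu in bases]) for x in features]
        # M-step: each basis becomes the responsibility-weighted mean feature.
        new_bases = []
        for k in range(num_bases):
            weight = sum(r[k] for r in resp)
            mean = [sum(r[k] * x[j] for r, x in zip(resp, features)) / weight
                    for j in range(d)]
            new_bases.append(_l2_normalize(mean))
        bases = new_bases
    # Re-estimate every feature as a mixture of the compact bases.
    recon = [[sum(r[k] * bases[k][j] for k in range(num_bases))
              for j in range(d)] for r in resp]
    return bases, resp, recon


# Four toy "frame" features forming two loose groups.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
bases, resp, recon = em_attention(frames, num_bases=2, num_iters=3)
```

In this toy setting the four frame features are represented by two bases, and each re-estimated feature is a mixture of those bases, so components not shared by a group of frames contribute less to the summary.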
Pages: 20