Harnessing Representative Spatial-Temporal Information for Video Question Answering

Cited by: 0
Authors
Wang, Yuanyuan [1 ]
Liu, Meng [2 ]
Song, Xuemeng [1 ]
Nie, Liqiang [3 ]
Affiliations
[1] Shandong Univ, Qingdao, Peoples R China
[2] Shandong Jianzhu Univ, Jinan, Peoples R China
[3] Harbin Inst Technol, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Video question answering; uncertainty estimation; expectation-maximization attention
DOI
10.1145/3675399
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Video question answering, which aims to answer a natural language question about a given video, has become prevalent in the past few years. Despite remarkable improvements, the task still suffers from insufficient comprehension of video content. To this end, we propose a spatial-temporal representative visual exploitation network for video question answering, which enhances video understanding by summarizing only the representative visual information. To explore representative object information, we advance adaptive attention based on uncertainty estimation. Meanwhile, to capture representative frame-level and clip-level visual information, we iteratively construct a much more compact set of representations in an expectation-maximization manner to suppress noisy information. Both quantitative and qualitative results on the NExT-QA, TGIF-QA, MSRVTT-QA, and MSVD-QA datasets demonstrate the superiority of our model over several state-of-the-art approaches.
Pages: 20