Hierarchical Temporal Fusion of Multi-grained Attention Features for Video Question Answering

Cited by: 0
Authors
Shaoning Xiao
Yimeng Li
Yunan Ye
Long Chen
Shiliang Pu
Zhou Zhao
Jian Shao
Jun Xiao
Affiliations
[1] Zhejiang University,
Source
Neural Processing Letters | 2020 / Vol. 52
Keywords
Video question answering; Multi-grained representation; Temporal co-attention;
DOI
Not available
Abstract
This work addresses the problem of video question answering (VideoQA) with a novel model and a new open-ended VideoQA dataset. VideoQA is a challenging task in visual information retrieval that aims to generate an answer from the video content and the question. Ultimately, VideoQA is a video understanding task, and efficiently combining multi-grained representations is the key to understanding a video. Existing works mostly rely on overall frame-level visual understanding, which neglects finer-grained and temporal information inside the video, or they combine the multi-grained representations simply by concatenation or addition. We therefore propose a multi-granularity temporal attention network that searches for the specific frames in a video that are holistically and locally related to the answer. We first learn mutual attention representations of the multi-grained visual content and the question. The mutually attended features are then combined hierarchically using a two-layer LSTM to generate the answer. Furthermore, we compare several multi-grained fusion configurations to demonstrate the advantage of this hierarchical architecture. The effectiveness of our model is demonstrated on a large-scale video question answering dataset built on the ActivityNet dataset.
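The abstract does not give the model's equations, so the following is only a minimal sketch of the general idea of question-guided temporal attention applied at two granularities. The dot-product scoring, the function name `temporal_attention`, and the toy feature values are all assumptions made for illustration; they are not the authors' formulation, and the hierarchical LSTM fusion step is not shown.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frames, question):
    """Attend over time steps using the question as the query.

    frames:   list of T feature vectors (one per frame), each of length d
    question: one feature vector of length d
    Returns the attention weights (length T) and the attended feature (length d).
    Dot-product scoring is an assumption made for this sketch.
    """
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, question)) for f in frames]
    weights = softmax(scores)
    dim = len(question)
    attended = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return weights, attended


# Toy, made-up features: one coarse (frame-level) and one fine (object-level) stream.
question = [1.0, 0.0]
frame_level = [[0.1, 0.9], [2.0, 0.1], [0.2, 0.3]]
object_level = [[0.0, 1.0], [0.5, 0.5], [3.0, 0.0]]

w_frame, v_frame = temporal_attention(frame_level, question)
w_obj, v_obj = temporal_attention(object_level, question)
```

In this toy setup each granularity can highlight a different time step for the same question. A model of the kind the abstract describes would then pass both attended streams to a fusion stage instead of simply concatenating or adding them.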
Pages: 993 - 1003
Page count: 10
Related papers
50 in total
  • [21] Multi-Grained Temporal Segmentation Attention Modeling for Skeleton-Based Action Recognition
    Lv, Jinrong
    Gong, Xun
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2023, 30 : 927 - 931
  • [22] Video Dialog via Multi-Grained Convolutional Self-Attention Context Networks
    Jin, Weike
    Zhao, Zhou
    Gu, Mao
    Yu, Jun
    Xiao, Jun
    Zhuang, Yueting
    [J]. PROCEEDINGS OF THE 42ND INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '19), 2019, : 465 - 474
  • [23] Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation
    Liu, Nian
    Nan, Kepan
    Zhao, Wangbo
    Liu, Yuanwei
    Yao, Xiwen
    Khan, Salman
    Cholakkal, Hisham
    Anwer, Rao Muhammad
    Han, Junwei
    Khan, Fahad Shahbaz
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 18816 - 18825
  • [24] Spatio-Temporal Two-stage Fusion for video question answering
    Xu, Feifei
    Zhu, Yitao
    Wang, Chun
    Cao, Yangze
    Zhong, Zheng
    Li, Xiongmin
    [J]. COMPUTER VISION AND IMAGE UNDERSTANDING, 2023, 237
  • [25] Video Question Answering via Hierarchical Dual-Level Attention Network Learning
    Zhao, Zhou
    Lin, Jinghao
    Jiang, Xinghua
    Cai, Deng
    He, Xiaofei
    Zhuang, Yueting
    [J]. PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 1050 - 1058
  • [26] Multichannel Attention Refinement for Video Question Answering
    Zhuang, Yueting
    Xu, Dejing
    Yan, Xin
    Cheng, Wenzhuo
    Zhao, Zhou
    Pu, Shiliang
    Xiao, Jun
    [J]. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2020, 16 (01)
  • [27] A Multi-grained Video Encryption Method Based on Spark
    Zhou, Yang
    Cheng, Yingye
    [J]. PROCEEDINGS OF THE 2016 6TH INTERNATIONAL CONFERENCE ON MACHINERY, MATERIALS, ENVIRONMENT, BIOTECHNOLOGY AND COMPUTER (MMEBC), 2016, 88 : 1091 - 1095
  • [28] Feature Fusion Attention Visual Question Answering
    Wang, Chunlin
    Sun, Jianyong
    Chen, Xiaolin
    [J]. ICMLC 2019: 2019 11TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND COMPUTING, 2019, : 412 - 416
  • [29] Hierarchical synchronization with structured multi-granularity interaction for video question answering
    Qi, Shanshan
    Yang, Luxi
    Li, Chunguo
    [J]. NEUROCOMPUTING, 2024, 582
  • [30] Divide and Conquer: Question-Guided Spatio-Temporal Contextual Attention for Video Question Answering
    Jiang, Jianwen
    Chen, Ziqiang
    Lin, Haojie
    Zhao, Xibin
    Gao, Yue
    [J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 11101 - 11108