Relation-aware Hierarchical Attention Framework for Video Question Answering

Cited by: 7
Authors
Li, Fangtao [1]
Liu, Zihe [1]
Bai, Ting [1]
Yan, Chenghao [1]
Cao, Chenyu [1]
Wu, Bin [1]
Affiliations
[1] Beijing University of Posts and Telecommunications, Beijing, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Video Question Answering; Hierarchical Attention; Multimodal Fusion; Relation Understanding
DOI
10.1145/3460426.3463635
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video Question Answering (VideoQA) is a challenging video understanding task because it requires a deep understanding of both the question and the video. Previous studies mainly focus on extracting sophisticated visual and language embeddings and fusing them with carefully hand-crafted networks. However, the relevance of different frames, objects, and modalities to the question varies over time, which most existing methods ignore. This lack of understanding of the dynamic relationships and interactions among objects poses a great challenge for the VideoQA task. To address this problem, we propose a novel Relation-aware Hierarchical Attention (RHA) framework to learn both the static and dynamic relations of the objects in videos. In particular, videos and questions are first embedded by pre-trained models to obtain the visual and textual features. Then a graph-based relation encoder is used to extract the static relationships between visual objects. To capture the dynamic changes of multimodal objects across video frames, we consider the temporal, spatial, and semantic relations, and fuse the multimodal features through a hierarchical attention mechanism to predict the answer. We conduct extensive experiments on a large-scale VideoQA dataset, and the experimental results demonstrate that our RHA outperforms the state-of-the-art methods.
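To make the idea of question-guided hierarchical attention concrete, here is a minimal sketch in plain Python. It is not the authors' RHA implementation: the record gives no architectural details, so the two-level scheme (attend over objects within each frame, then over frames), the scaled dot-product scoring, and every name (`attend`, `hierarchical_attention`, the toy vectors) are illustrative assumptions. The graph-based relation encoder and all learned parameters are omitted.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, items):
    # Assumed scoring: scaled dot product between the query and each item.
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, it) / scale for it in items])
    dim = len(items[0])
    # Attention-weighted sum of the item vectors.
    pooled = [sum(w * it[d] for w, it in zip(weights, items)) for d in range(dim)]
    return pooled, weights

def hierarchical_attention(question, video):
    # video: list of frames; each frame: list of object feature vectors.
    # Level 1: pool the objects inside each frame, guided by the question.
    frame_vectors = [attend(question, objects)[0] for objects in video]
    # Level 2: pool the frame vectors, again guided by the question.
    return attend(question, frame_vectors)

# Toy input (hypothetical): a 2-d question vector and two frames of two objects.
question = [1.0, 0.0]
video = [
    [[1.0, 0.0], [0.0, 1.0]],  # frame 0 contains an object aligned with the question
    [[0.0, 1.0], [0.0, 1.0]],  # frame 1 contains none
]
video_vec, frame_weights = hierarchical_attention(question, video)
```

In this toy case frame 0 receives the larger frame-level weight because it holds the only object aligned with the question vector. A real model would use learned projections and pre-trained features rather than raw dot products.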
Pages: 164-172
Number of pages: 9