Relation-aware Hierarchical Attention Framework for Video Question Answering

Cited by: 7
Authors
Li, Fangtao [1 ]
Liu, Zihe [1 ]
Bai, Ting [1 ]
Yan, Chenghao [1 ]
Cao, Chenyu [1 ]
Wu, Bin [1 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video Question Answering; Hierarchical Attention; Multimodal Fusion; Relation Understanding;
DOI
10.1145/3460426.3463635
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video Question Answering (VideoQA) is a challenging video understanding task, since it requires a deep understanding of both the question and the video. Previous studies mainly focus on extracting sophisticated visual and language embeddings and fusing them with delicate hand-crafted networks. However, the relevance of different frames, objects, and modalities to the question varies over time, which is ignored by most existing methods. This lack of understanding of the dynamic relationships and interactions among objects poses a great challenge for the VideoQA task. To address this problem, we propose a novel Relation-aware Hierarchical Attention (RHA) framework to learn both the static and dynamic relations of the objects in videos. In particular, videos and questions are first embedded by pre-trained models to obtain the visual and textual features. Then a graph-based relation encoder is used to extract the static relationships between visual objects. To capture the dynamic changes of multimodal objects across video frames, we consider the temporal, spatial, and semantic relations, and fuse the multimodal features with a hierarchical attention mechanism to predict the answer. We conduct extensive experiments on a large-scale VideoQA dataset, and the results demonstrate that RHA outperforms state-of-the-art methods.
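The hierarchical fusion described in the abstract can be illustrated with a minimal sketch: question-guided attention is applied first over frame-level features within each modality, and then over the resulting per-modality summaries. This is an illustrative toy example, not the authors' implementation; the dimensions, modality count, and dot-product scoring function are assumptions made here for concreteness.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys):
    """Question-guided attention: score each key by scaled dot-product
    similarity to the query, then return the softmax-weighted sum."""
    scores = keys @ query / np.sqrt(query.shape[-1])  # one score per key
    weights = softmax(scores)
    return weights @ keys  # weighted combination of keys

# Hypothetical sizes: 4 frames, 3 modalities, feature dimension 8.
rng = np.random.default_rng(0)
d = 8
question = rng.normal(size=d)  # stand-in for the question embedding

# Level 1: within each modality, attend over its frame-level features.
modalities = [rng.normal(size=(4, d)) for _ in range(3)]
modality_summaries = np.stack([attend(question, m) for m in modalities])

# Level 2: attend over the modality summaries to produce a fused
# representation, which a classifier head would map to an answer.
fused = attend(question, modality_summaries)
print(fused.shape)  # (8,)
```

In the actual RHA framework the frame-level inputs would additionally pass through the graph-based relation encoder before attention, so that static object relations inform the features being fused.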
Pages: 164-172 (9 pages)