Hierarchical Recurrent Contextual Attention Network for Video Question Answering

Authors
Zhou, Fei [1, 2]
Han, Yahong [1]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China
[2] Tianjin Univ, Tianjin Int Engn Inst, Tianjin, Peoples R China
Source
Keywords
Video question answering; Video understanding; Multi-modal fusion and inference;
DOI
10.1007/978-3-031-20500-2_23
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video question answering (VideoQA) is the task of answering a natural language question about the content of a video. Existing methods that utilize fine-grained object information have achieved significant improvements; however, they rely on costly external object detectors or fail to explore the rich structure of videos. In this work, we propose to understand video along two dimensions: temporal and semantic. In semantic space, videos are organized in a hierarchical structure (pixels, objects, activities, events). In temporal space, a video can be viewed as a sequence of events, each of which contains multiple objects and activities. Based on this insight, we propose a reusable neural unit called recurrent contextual attention (RCA). RCA receives a 2D grid feature and conditional features as input, and computes multiple high-order compositional semantic representations. We then stack these units to build our hierarchy and utilize recurrent attention to generate diverse representations for different views of each subsequence. Without bells and whistles, our model achieves excellent performance on three VideoQA datasets (TGIF-QA, MSVD-QA, and MSRVTT-QA) using only grid features. Visualization results further validate the effectiveness of our method.
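The abstract describes RCA only at a high level, so the following is a loose illustrative sketch rather than the authors' implementation. It assumes a flattened 2D grid of feature vectors, a single conditioning vector (for example, a question embedding), scaled dot-product scoring, and a recurrence in which each step's attended output is added to the condition to form the next query. The function names, the additive update, and the toy values are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(grid, query):
    """Scaled dot-product attention of one query over flattened grid cells.

    Returns the attention-weighted sum of the cells and the weights.
    """
    d = len(query)
    scores = [sum(g * q for g, q in zip(cell, query)) / math.sqrt(d)
              for cell in grid]
    weights = softmax(scores)
    pooled = [sum(w * cell[j] for w, cell in zip(weights, grid))
              for j in range(d)]
    return pooled, weights

def recurrent_contextual_attention(grid, condition, steps=3):
    """Hypothetical recurrent attention: one attended 'view' per step.

    Assumption (not from the paper): the next query is the condition
    plus the previous step's attended context.
    """
    outputs = []
    context = [0.0] * len(condition)
    for _ in range(steps):
        query = [c + h for c, h in zip(condition, context)]
        context, _ = attend(grid, query)
        outputs.append(context)
    return outputs

# Toy 2x2 grid flattened to 4 cells of dimension 2 (made-up values).
grid = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
question = [1.0, 0.0]
views = recurrent_contextual_attention(grid, question, steps=3)
```

Each entry of `views` is one attended representation of the grid. Because the query changes from step to step, successive entries can emphasize different cells, which is the general intuition behind generating several views of the same input.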
Pages: 280-290
Number of pages: 11