Video Question Answering via Hierarchical Dual-Level Attention Network Learning

Cited by: 27
Authors
Zhao, Zhou [1 ]
Lin, Jinghao [1 ]
Jiang, Xinghua [1 ]
Cai, Deng [2 ]
He, Xiaofei [2 ]
Zhuang, Yueting [1 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci, Hangzhou, Zhejiang, Peoples R China
[2] Zhejiang Univ, State Key Lab CAD&CG, Hangzhou, Zhejiang, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Video Question Answering; Hierarchical Attention Network;
DOI
10.1145/3123266.3123364
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Video question answering is a challenging task in visual information retrieval: given a question, the system must provide an accurate answer drawn from the referenced video content. However, existing visual question answering approaches mainly tackle static image question answering and are ineffective when applied directly to video, because they do not adequately model the temporal dynamics of video. In this paper, we study the problem of video question answering from the viewpoint of hierarchical dual-level attention network learning. We obtain object appearance and movement information from the video using both frame-level and segment-level feature representation methods. We then develop hierarchical dual-level attention networks to learn question-aware video representations with word-level and question-level attention mechanisms. We next devise a question-level fusion attention mechanism for the proposed networks to learn a question-aware joint video representation for video question answering. We construct two large-scale video question answering datasets. Extensive experiments validate the effectiveness of our method.
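The attention-and-fusion idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the scoring function, so plain dot-product attention is assumed here, the word-level attention stage is omitted, and the names attend and dual_level_representation are hypothetical. Feature vectors are small hand-written lists rather than real frame or segment features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(features, query):
    """Soft attention (assumed dot-product scoring, not from the paper).

    Scores each feature vector against the query, normalizes the scores
    with softmax, and returns the weighted sum plus the weights.
    """
    weights = softmax([dot(f, query) for f in features])
    dim = len(features[0])
    pooled = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return pooled, weights


def dual_level_representation(frame_feats, segment_feats, question_vec):
    """Hypothetical helper: attend at each level, then fuse the two summaries.

    frame_feats stands in for appearance features, segment_feats for
    movement features; both are attended with the same question vector.
    """
    frame_repr, _ = attend(frame_feats, question_vec)
    segment_repr, _ = attend(segment_feats, question_vec)
    # Fusion step: attention over the two level summaries, again guided
    # by the question vector.
    joint, fusion_weights = attend([frame_repr, segment_repr], question_vec)
    return joint, fusion_weights


# Toy inputs: two frame vectors, two segment vectors, one question vector.
frames = [[1.0, 0.0], [0.0, 1.0]]
segments = [[0.5, 0.5], [0.0, 1.0]]
question = [2.0, 0.0]

pooled, weights = attend(frames, question)
joint, fusion_weights = dual_level_representation(frames, segments, question)
```

In this toy setup the first frame aligns with the question vector, so it receives the larger attention weight. A trained model would learn the scoring parameters instead of using raw dot products.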
Pages: 1050-1058
Page count: 9
Related papers
50 in total
  • [1] Hierarchical Recurrent Contextual Attention Network for Video Question Answering
    Zhou, Fei
    Han, Yahong
    [J]. ARTIFICIAL INTELLIGENCE, CICAI 2022, PT II, 2022, 13605 : 280 - 290
  • [2] Video question answering via grounded cross-attention network learning
    Ye, Yunan
    Zhang, Shifeng
    Li, Yimeng
    Qian, Xufeng
    Tang, Siliang
    Pu, Shiliang
    Xiao, Jun
    [J]. INFORMATION PROCESSING & MANAGEMENT, 2020, 57 (04)
  • [3] Video Question Answering via Attribute-Augmented Attention Network Learning
    Ye, Yunan
    Zhao, Zhou
    Li, Yimeng
    Chen, Long
    Xiao, Jun
    Zhuang, Yueting
    [J]. SIGIR'17: PROCEEDINGS OF THE 40TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2017, : 829 - 832
  • [4] HIERARCHICAL RELATIONAL ATTENTION FOR VIDEO QUESTION ANSWERING
    Chowdhury, Muhammad Iqbal Hasan
    Kien Nguyen
    Sridharan, Sridha
    Fookes, Clinton
    [J]. 2018 25TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2018, : 599 - 603
  • [5] Link Prediction via Ranking Metric Dual-Level Attention Network Learning
    Zhao, Zhou
    Gao, Ben
    Zheng, Vincent W.
    Cai, Deng
    He, Xiaofei
    Zhuang, Yueting
    [J]. PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 3525 - 3531
  • [6] Video Question Answering via Hierarchical Spatio-Temporal Attention Networks
    Zhao, Zhou
    Yang, Qifan
    Cai, Deng
    He, Xiaofei
    Zhuang, Yueting
    [J]. PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 3518 - 3524
  • [7] Multi-Turn Video Question Answering via Multi-Stream Hierarchical Attention Context Network
    Zhao, Zhou
    Jiang, Xinghua
    Cai, Deng
    Xiao, Jun
    He, Xiaofei
    Pu, Shiliang
    [J]. PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2018, : 3690 - 3696
  • [8] Progressive Graph Attention Network for Video Question Answering
    Peng, Liang
    Yang, Shuangji
    Bin, Yi
    Wang, Guoqing
    [J]. PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 2871 - 2879
  • [9] Multi-Turn Video Question Answering via Hierarchical Attention Context Reinforced Networks
    Zhao, Zhou
    Zhang, Zhu
    Jiang, Xinghua
    Cai, Deng
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2019, 28 (08) : 3860 - 3872
  • [10] Multimodal Dual Attention Memory for Video Story Question Answering
    Kim, Kyung-Min
    Choi, Seong-Ho
    Kim, Jin-Hwa
    Zhang, Byoung-Tak
    [J]. COMPUTER VISION - ECCV 2018, PT 15, 2018, 11219 : 698 - 713