Unifying the Video and Question Attentions for Open-Ended Video Question Answering

Cited by: 47
Authors
Xue, Hongyang [1]
Zhao, Zhou [2]
Cai, Deng [1]
Affiliations
[1] Zhejiang Univ, State Key Lab CAD&CG, Hangzhou 310027, Zhejiang, Peoples R China
[2] Zhejiang Univ, Coll Comp Sci, Hangzhou 310027, Zhejiang, Peoples R China
Keywords
Video question answering; attention model; scene understanding;
DOI
10.1109/TIP.2017.2746267
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Video question answering is an important task toward scene understanding and visual data retrieval. However, current visual question answering works mainly focus on a single static image, which is distinct from the dynamic and sequential visual data in the real world. Their approaches cannot utilize the temporal information in videos. In this paper, we introduce the task of free-form open-ended video question answering. The open-ended answers enable wider applications compared with the common multiple-choice tasks in Visual-QA. We first propose a data set for open-ended Video-QA using automatic question generation approaches. Then, we propose our sequential video attention and temporal question attention models. These two models apply the attention mechanism on videos and questions, while preserving the sequential and temporal structures of the guides. The two models are integrated into the model of unified attention. After the video and the question are encoded, the answers are generated word by word from our models by a decoder. In the end, we evaluate our models on the proposed data set. The experimental results demonstrate the effectiveness of our proposed model.
Pages: 5656 - 5666
Number of pages: 11
Related papers
50 in total
  • [41] On the hidden treasure of dialog in video question answering
    Engin, Deniz
    Schnitzler, Francois
    Duong, Ngoc Q. K.
    Avrithis, Yannis
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 2044 - 2053
  • [42] Question answering on large news video archive
    Chua, TS
    ISPA 2003: PROCEEDINGS OF THE 3RD INTERNATIONAL SYMPOSIUM ON IMAGE AND SIGNAL PROCESSING AND ANALYSIS, PTS 1 AND 2, 2003, : 289 - 294
  • [43] Uncovering the Temporal Context for Video Question Answering
    Zhu, Linchao
    Xu, Zhongwen
    Yang, Yi
    Hauptmann, Alexander G.
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2017, 124 (03) : 409 - 421
  • [44] Video Question Answering With Semantic Disentanglement and Reasoning
    Liu, Jin
    Wang, Guoxiang
    Xie, Jialong
    Zhou, Fengyu
    Xu, Huijuan
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (05) : 3663 - 3673
  • [45] Embedding VLAD in Transformer for Video Question Answering
    Guo D.
    Yao S.-T.
    Wang H.
    Wang M.
    Jisuanji Xuebao/Chinese Journal of Computers, 2023, 46 (04) : 671 - 689
  • [46] Video Question Answering: a Survey of Models and Datasets
    Sun, Guanglu
    Liang, Lili
    Li, Tianlin
    Yu, Bo
    Wu, Meng
    Zhang, Bolun
    MOBILE NETWORKS & APPLICATIONS, 2021, 26 (05) : 1904 - 1937
  • [47] Complementary spatiotemporal network for video question answering
    Xinrui Li
    Aming Wu
    Yahong Han
    Multimedia Systems, 2022, 28 : 161 - 169
  • [50] Video question answering via traffic knowledge database and question classification
    Sun, Xiaoyong
    Dai, Yu
    Wang, Yuchen
    Ma, Weifeng
    Lin, Xuefen
    MULTIMEDIA SYSTEMS, 2024, 30 (01)