Uncovering the Temporal Context for Video Question Answering

Cited by: 85
Authors
Zhu, Linchao [1]
Xu, Zhongwen [1]
Yang, Yi [1]
Hauptmann, Alexander G. [2]
Affiliations
[1] Univ Technol Sydney, CAI, Sydney, NSW, Australia
[2] Carnegie Mellon Univ, SCS, Pittsburgh, PA 15213 USA
Keywords
Video sequence modeling; Video question answering; Video prediction; Cross-media
DOI
10.1007/s11263-017-1033-7
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this work, we introduce Video Question Answering in the temporal domain to infer the past, describe the present and predict the future. We present an encoder-decoder approach using Recurrent Neural Networks to learn the temporal structures of videos and introduce a dual-channel ranking loss to answer multiple-choice questions. We explore approaches for finer understanding of video content using the question form of "fill-in-the-blank", and collect our Video Context QA dataset consisting of 109,895 video clips with a total duration of more than 1000 h from existing TACoS, MPII-MD and MEDTest 14 datasets. In addition, 390,744 corresponding questions are generated from annotations. Extensive experiments demonstrate that our approach significantly outperforms the compared baselines.
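The abstract names a dual-channel ranking loss for multiple-choice answering but does not define it. The following is only a minimal sketch of a generic margin-based ranking loss, assuming each candidate answer receives one score per channel and that the two channel losses are simply summed. The function names, the margin value, and the summation are illustrative assumptions, not the authors' formulation.

```python
def multiple_choice_ranking_loss(scores, correct_index, margin=1.0):
    """Hinge ranking loss over candidate answers (hypothetical helper).

    Penalizes every wrong candidate whose score comes within `margin`
    of the correct candidate's score.
    """
    correct_score = scores[correct_index]
    loss = 0.0
    for i, score in enumerate(scores):
        if i == correct_index:
            continue
        loss += max(0.0, margin - correct_score + score)
    return loss


def dual_channel_loss(channel_a_scores, channel_b_scores, correct_index, margin=1.0):
    """Assumed combination: sum the ranking loss of two score channels."""
    return (
        multiple_choice_ranking_loss(channel_a_scores, correct_index, margin)
        + multiple_choice_ranking_loss(channel_b_scores, correct_index, margin)
    )


# Example: three candidate answers, candidate 0 is correct.
channel_a = [2.0, 0.5, 1.5]  # candidate 2 is within the margin
channel_b = [3.0, 0.0, 0.0]  # all wrong candidates are well separated
total = dual_channel_loss(channel_a, channel_b, correct_index=0)
```

At test time, a model trained this way would pick the candidate with the highest combined score; how the paper actually defines and fuses the two channels should be taken from the full text.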
Pages: 409-421
Page count: 13
Related papers
  • [1] Uncovering the Temporal Context for Video Question Answering
    Linchao Zhu
    Zhongwen Xu
    Yi Yang
    Alexander G. Hauptmann
    [J]. International Journal of Computer Vision, 2017, 124 : 409 - 421
  • [2] Spatio-Temporal Context Networks for Video Question Answering
    Gao, Kun
    Han, Yahong
    [J]. ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2017, PT II, 2018, 10736 : 108 - 118
  • [3] Video-Context Aligned Transformer for Video Question Answering
    Zong, Linlin
    Wan, Jiahui
    Zhang, Xianchao
    Liu, Xinyue
    Liang, Wenxin
    Xu, Bo
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 17, 2024, : 19795 - 19803
  • [4] Video Question Answering with Spatio-Temporal Reasoning
    Jang, Yunseok
    Song, Yale
    Kim, Chris Dongjoo
    Yu, Youngjae
    Kim, Youngjin
    Kim, Gunhee
    [J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2019, 127 (10) : 1385 - 1412
  • [5] Video Question Answering with Spatio-Temporal Reasoning
    Yunseok Jang
    Yale Song
    Chris Dongjoo Kim
    Youngjae Yu
    Youngjin Kim
    Gunhee Kim
    [J]. International Journal of Computer Vision, 2019, 127 : 1385 - 1412
  • [6] Discovering Spatio-Temporal Rationales for Video Question Answering
    Li, Yicong
    Xiao, Junbin
    Feng, Chun
    Wang, Xiang
    Chua, Tat-Seng
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 13823 - 13832
  • [7] Harnessing Representative Spatial-Temporal Information for Video Question Answering
    Wang, Yuanyuan
    Liu, Meng
    Song, Xuemeng
    Nie, Liqiang
    [J]. ACM Transactions on Multimedia Computing, Communications and Applications, 2024, 20 (10)
  • [8] Dynamic Spatio-Temporal Modular Network for Video Question Answering
    Qian, Zi
    Wang, Xin
    Duan, Xuguang
    Chen, Hong
    Zhu, Wenwu
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4466 - 4477
  • [9] Spatio-Temporal Graph Convolution Transformer for Video Question Answering
    Tang, Jiahao
    Hu, Jianguo
    Huang, Wenjun
    Shen, Shengzhi
    Pan, Jiakai
    Wang, Deming
    Ding, Yanyu
    [J]. IEEE Access, 2024, 12 : 131664 - 131680
  • [10] Affective question answering on video
    Ruwa, Nelson
    Mao, Qirong
    Wang, Liangjun
    Gou, Jianping
    [J]. NEUROCOMPUTING, 2019, 363 : 125 - 139