Uncovering the Temporal Context for Video Question Answering

Cited by: 1
Authors
Linchao Zhu
Zhongwen Xu
Yi Yang
Alexander G. Hauptmann
Affiliations
[1] University of Technology Sydney, CAI
[2] Carnegie Mellon University, SCS
Source
International Journal of Computer Vision, 2017, 124 (03)
Keywords
Video sequence modeling; Video question answering; Video prediction; Cross-media
DOI
Not available
Abstract
In this work, we introduce Video Question Answering in the temporal domain to infer the past, describe the present and predict the future. We present an encoder–decoder approach using Recurrent Neural Networks to learn the temporal structures of videos and introduce a dual-channel ranking loss to answer multiple-choice questions. We explore approaches for finer understanding of video content using the question form of “fill-in-the-blank”, and collect our Video Context QA dataset consisting of 109,895 video clips with a total duration of more than 1000 h from existing TACoS, MPII-MD and MEDTest 14 datasets. In addition, 390,744 corresponding questions are generated from annotations. Extensive experiments demonstrate that our approach significantly outperforms the compared baselines.
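The abstract names a dual-channel ranking loss for multiple-choice answering but does not give its formula. As a rough illustration only, the sketch below applies a generic margin-based (hinge) ranking loss to two hypothetical score channels and sums them. The function names, the margin value, the choice of summing the channels, and what each channel represents are all assumptions made for this example, not the authors' formulation.

```python
from typing import Sequence


def hinge_ranking_loss(scores: Sequence[float], correct: int, margin: float = 1.0) -> float:
    """Generic hinge ranking loss: penalize every wrong candidate whose score
    comes within `margin` of (or exceeds) the correct candidate's score."""
    s_pos = scores[correct]
    return sum(
        max(0.0, margin - s_pos + s_neg)
        for i, s_neg in enumerate(scores)
        if i != correct
    )


def dual_channel_ranking_loss(
    channel_a: Sequence[float],
    channel_b: Sequence[float],
    correct: int,
    margin: float = 1.0,
) -> float:
    """ASSUMPTION: combine the two channels by summing their ranking losses.
    The abstract does not specify how the channels are combined."""
    return hinge_ranking_loss(channel_a, correct, margin) + hinge_ranking_loss(
        channel_b, correct, margin
    )


def predict(channel_a: Sequence[float], channel_b: Sequence[float]) -> int:
    """Pick the candidate with the highest summed score (also an assumption)."""
    combined = [a + b for a, b in zip(channel_a, channel_b)]
    return max(range(len(combined)), key=combined.__getitem__)


# Hypothetical scores for four answer candidates; candidate 0 is correct.
scores_a = [2.0, 0.5, 1.5, 0.0]
scores_b = [1.0, 1.0, 0.0, 0.0]
loss = dual_channel_ranking_loss(scores_a, scores_b, correct=0)
choice = predict(scores_a, scores_b)
```

In this toy case the loss is non-zero because one wrong candidate in each channel sits inside the margin, even though the summed scores already rank the correct candidate first.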
Pages: 409 - 421
Page count: 12
Related papers
50 in total
  • [1] Uncovering the Temporal Context for Video Question Answering
    Zhu, Linchao
    Xu, Zhongwen
    Yang, Yi
    Hauptmann, Alexander G.
    [J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2017, 124 (03) : 409 - 421
  • [2] Spatio-Temporal Context Networks for Video Question Answering
    Gao, Kun
    Han, Yahong
    [J]. ADVANCES IN MULTIMEDIA INFORMATION PROCESSING - PCM 2017, PT II, 2018, 10736 : 108 - 118
  • [3] Video-Context Aligned Transformer for Video Question Answering
    Zong, Linlin
    Wan, Jiahui
    Zhang, Xianchao
    Liu, Xinyue
    Liang, Wenxin
    Xu, Bo
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 17, 2024, : 19795 - 19803
  • [4] Video Question Answering with Spatio-Temporal Reasoning
    Jang, Yunseok
    Song, Yale
    Kim, Chris Dongjoo
    Yu, Youngjae
    Kim, Youngjin
    Kim, Gunhee
    [J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2019, 127 (10) : 1385 - 1412
  • [5] Video Question Answering with Spatio-Temporal Reasoning
    Yunseok Jang
    Yale Song
    Chris Dongjoo Kim
    Youngjae Yu
    Youngjin Kim
    Gunhee Kim
    [J]. International Journal of Computer Vision, 2019, 127 : 1385 - 1412
  • [6] Discovering Spatio-Temporal Rationales for Video Question Answering
    Li, Yicong
    Xiao, Junbin
    Feng, Chun
    Wang, Xiang
    Chua, Tat-Seng
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 13823 - 13832
  • [7] Spatio-Temporal Graph Convolution Transformer for Video Question Answering
    Tang, Jiahao
    Hu, Jianguo
    Huang, Wenjun
    Shen, Shengzhi
    Pan, Jiakai
    Wang, Deming
    Ding, Yanyu
    [J]. IEEE Access, 2024, 12 : 131664 - 131680
  • [8] Dynamic Spatio-Temporal Modular Network for Video Question Answering
    Qian, Zi
    Wang, Xin
    Duan, Xuguang
    Chen, Hong
    Zhu, Wenwu
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4466 - 4477
  • [9] Harnessing Representative Spatial-Temporal Information for Video Question Answering
    Wang, Yuanyuan
    Liu, Meng
    Song, Xuemeng
    Nie, Liqiang
    [J]. ACM Transactions on Multimedia Computing, Communications and Applications, 2024, 20 (10)
  • [10] Affective question answering on video
    Ruwa, Nelson
    Mao, Qirong
    Wang, Liangjun
    Gou, Jianping
    [J]. NEUROCOMPUTING, 2019, 363 : 125 - 139