Video Question Answering with Spatio-Temporal Reasoning

Authors
Yunseok Jang
Yale Song
Chris Dongjoo Kim
Youngjae Yu
Youngjin Kim
Gunhee Kim
Affiliations
[1] Seoul National University
[2] Microsoft AI & Research
Keywords
VQA; Spatio-temporal reasoning; Large-scale video QA dataset; Spatial and temporal attention
DOI
Not available
Abstract
Vision and language understanding has emerged as a subject of intense study in artificial intelligence. Among the many tasks in this line of research, visual question answering (VQA) has been one of the most successful; the goal is to learn a model that understands visual content at region-level detail and finds its associations with question-answer pairs expressed in natural language. Despite rapid progress in the past few years, most existing work in VQA has focused primarily on images. In this paper, we extend VQA to the video domain and contribute to the literature in three important ways. First, we propose three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly. Next, we introduce a new large-scale dataset for video VQA named TGIF-QA that extends existing VQA work with our new tasks. Finally, we propose a dual-LSTM based approach with both spatial and temporal attention and show its effectiveness over conventional VQA techniques through empirical evaluations.
Pages: 1385-1412
Page count: 27