Video Question Answering with Procedural Programs

Cited by: 0
Authors
Choudhury, Rohan [1 ]
Niinuma, Koichiro [2 ]
Kitani, Kris M. [1 ]
Jeni, Laszlo A. [1 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Fujitsu Res Amer, Santa Clara, CA USA
Source
COMPUTER VISION-ECCV 2024, PT XXXVIII | 2025 / Vol. 15096
DOI
10.1007/978-3-031-72920-1_18
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We propose to answer questions about videos by generating short procedural programs that solve visual subtasks to obtain a final answer. We present Procedural Video Querying (ProViQ), which uses a large language model to generate such programs from an input question and an API of visual modules in the prompt, then executes them to obtain the output. Recent similar procedural approaches have proven successful for image question answering, but cannot effectively or efficiently answer questions about videos due to their image-centric modules and lack of temporal reasoning ability. We address this by providing ProViQ with novel modules intended for video understanding, allowing it to generalize to a wide variety of videos with no additional training. As a result, ProViQ can efficiently find relevant moments in long videos, do causal and temporal reasoning, and summarize videos over long time horizons in order to answer complex questions. This code generation framework additionally enables ProViQ to perform other video tasks beyond question answering, such as multi-object tracking or basic video editing. ProViQ achieves state-of-the-art results on a diverse range of benchmarks, with improvements of up to 25% on short, long, open-ended, multiple-choice and multimodal video question-answering datasets. Our project page is at https://rccchoudhury.github.io/proviq2023/.
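The generate-then-execute loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the module names `find_frames` and `describe`, the helper `run_program`, and the caption-based "video" are all hypothetical, and `generate_program` returns a fixed string where ProViQ would call a large language model.

```python
from typing import Callable, Dict, List

# Assumption: a toy "video" where each frame is represented by a text caption,
# so the sketch runs without any vision model.
Video = List[str]

def find_frames(video: Video, keyword: str) -> List[int]:
    """Hypothetical retrieval module: indices of frames whose caption mentions keyword."""
    return [i for i, caption in enumerate(video) if keyword in caption]

def describe(video: Video, index: int) -> str:
    """Hypothetical captioning module: return the caption of one frame."""
    return video[index]

# The "API of visual modules" exposed to the generated program (hypothetical names).
API: Dict[str, Callable] = {"find_frames": find_frames, "describe": describe}

def generate_program(question: str) -> str:
    """Stand-in for the LLM call: returns a fixed program for the demo question."""
    return (
        "def answer(video):\n"
        "    hits = find_frames(video, 'door')\n"
        "    if not hits or hits[0] + 1 >= len(video):\n"
        "        return 'unknown'\n"
        "    return describe(video, hits[0] + 1)\n"
    )

def run_program(source: str, video: Video) -> str:
    """Execute the generated source with only the module API in scope, then call answer()."""
    namespace = dict(API)
    exec(source, namespace)  # defines answer() inside namespace
    return namespace["answer"](video)

video = ["a man walks to a door", "the man opens the door", "a dog runs outside"]
program = generate_program("What happens after the man reaches the door?")
result = run_program(program, video)
print(result)
```

In this sketch the temporal step is the `hits[0] + 1` lookup: the program locates the first relevant frame and then inspects the frame that follows it. Running untrusted generated code through `exec` is unsafe outside a sandbox; the sketch only shows the control flow.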
Pages: 315-332
Page count: 18
Related papers
50 items in total
  • [41] Multimodal Graph Reasoning and Fusion for Video Question Answering
    Zhang, Shuai
    Wang, Xingfu
    Hawbani, Ammar
    Zhao, Liang
    Alsamhi, Saeed Hamood
    2022 IEEE INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, 2022, : 1410 - 1415
  • [42] EgoVQA - An Egocentric Video Question Answering Benchmark Dataset
    Fan, Chenyou
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW), 2019, : 4359 - 4366
  • [43] Conditional Cross Correlation Network for Video Question Answering
    Ouenniche, Kaouther
    Tapu, Ruxandra
    Zaharia, Titus
    2023 IEEE 17TH INTERNATIONAL CONFERENCE ON SEMANTIC COMPUTING, ICSC, 2023, : 25 - 32
  • [44] Video Question Answering with Phrases via Semantic Roles
    Sadhu, Arka
    Chen, Kan
    Nevatia, Ram
    2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021, : 2460 - 2478
  • [45] Pairwise VLAD Interaction Network for Video Question Answering
    Wang, Hui
    Guo, Dan
    Hua, Xian-Sheng
    Wang, Meng
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 5119 - 5127
  • [46] Instance-sequence reasoning for video question answering
    Liu, Rui
    Han, Yahong
    FRONTIERS OF COMPUTER SCIENCE, 2022, 16 (06)
  • [47] A comparative study of language transformers for video question answering
    Yang, Zekun
    Garcia, Noa
    Chu, Chenhui
    Otani, Mayu
    Nakashima, Yuta
    Takemura, Haruo
    NEUROCOMPUTING, 2021, 445 : 121 - 133
  • [48] The forgettable-watcher model for video question answering
    Chu, Wenqing
    Xue, Hongyang
    Zhao, Zhou
    Cai, Deng
    Yao, Chengwei
    NEUROCOMPUTING, 2018, 314 : 386 - 393
  • [49] A dataset for medical instructional video classification and question answering
    Gupta, Deepak
    Attal, Kush
    Demner-Fushman, Dina
    SCIENTIFIC DATA, 2023, 10 (01)
  • [50] A Video Question Answering Model Based on Knowledge Distillation
    Shao, Zhuang
    Wan, Jiahui
    Zong, Linlin
    INFORMATION, 2023, 14 (06)