Video Question Answering with Procedural Programs

Cited by: 0

Authors:

Choudhury, Rohan [1]

Niinuma, Koichiro [2]

Kitani, Kris M. [1]

Jeni, Laszlo A. [1]

Affiliations:

[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA

[2] Fujitsu Res Amer, Santa Clara, CA USA

Source:

COMPUTER VISION-ECCV 2024, PT XXXVIII | 2025 / Volume 15096

DOI:

10.1007/978-3-031-72920-1_18

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We propose to answer questions about videos by generating short procedural programs that solve visual subtasks to obtain a final answer. We present Procedural Video Querying (ProViQ), which uses a large language model to generate such programs from an input question and an API of visual modules in the prompt, then executes them to obtain the output. Recent similar procedural approaches have proven successful for image question answering, but cannot effectively or efficiently answer questions about videos due to their image-centric modules and lack of temporal reasoning ability. We address this by providing ProViQ with novel modules intended for video understanding, allowing it to generalize to a wide variety of videos with no additional training. As a result, ProViQ can efficiently find relevant moments in long videos, do causal and temporal reasoning, and summarize videos over long time horizons in order to answer complex questions. This code generation framework additionally enables ProViQ to perform other video tasks beyond question answering, such as multi-object tracking or basic video editing. ProViQ achieves state-of-the-art results on a diverse range of benchmarks, with improvements of up to 25% on short, long, open-ended, multiple-choice and multimodal video question-answering datasets. Our project page is at https://rccchoudhury.github.io/proviq2023/.
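The generate-then-execute loop described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the modules `find` and `caption`, the `Video` stand-in, and the `fake_llm` placeholder are hypothetical and are not ProViQ's actual API or prompt. The sketch only shows the general pattern of placing a module API in a prompt, receiving a short program, and running it against the video.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical API description that would be placed in the LLM prompt.
API_DOC = """find(video, query) -> list of frame indices relevant to the query
caption(video, frames) -> short text description of the given frames"""


@dataclass
class Video:
    # Toy stand-in for a video: each frame is only a text label.
    frames: List[str]


def find(video: Video, query: str) -> List[int]:
    """Toy retrieval module: indices of frames whose label mentions the query."""
    return [i for i, label in enumerate(video.frames) if query in label]


def caption(video: Video, frames: List[int]) -> str:
    """Toy captioning module: joins the labels of the selected frames."""
    return "; ".join(video.frames[i] for i in frames)


def build_prompt(question: str) -> str:
    """Combine the module API and the question into one prompt string."""
    return f"API:\n{API_DOC}\n\nQuestion: {question}\nWrite a function answer(video)."


def fake_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (assumption): returns a fixed program."""
    return (
        "def answer(video):\n"
        "    frames = find(video, 'dog')\n"
        "    return caption(video, frames[:1])\n"
    )


def run_program(source: str, video: Video) -> str:
    """Execute the generated program with only the visual modules in scope.

    exec on model-generated code is unsafe outside a sandbox; this is a sketch.
    """
    namespace = {"find": find, "caption": caption}
    exec(source, namespace)
    return namespace["answer"](video)


video = Video(frames=["a man walks", "a dog runs", "a dog sleeps"])
program = fake_llm(build_prompt("What does the dog do first?"))
result = run_program(program, video)
print(result)
```

In a real system the stub modules would presumably wrap pretrained vision models and `fake_llm` would call an actual language model; the sketch keeps them trivial so that the control flow is easy to follow.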

Citation

Pages: 315-332

Page count: 18

