An Image Grid Can Be Worth a Video: Zero-Shot Video Question Answering Using a VLM

Cited by: 0
Authors
Kim, Wonkyun [1 ]
Choi, Changin [2 ]
Lee, Wonseok [2 ]
Rhee, Wonjong [1 ,2 ,3 ]
Affiliations
[1] Seoul Natl Univ, Dept Intelligence & Informat, Seoul 08826, South Korea
[2] Seoul Natl Univ, Interdisciplinary Program Artificial Intelligence, Seoul 08826, South Korea
[3] Seoul Natl Univ, AI Inst, Seoul 08826, South Korea
Source
IEEE ACCESS | 2024, Vol. 12
Funding
National Research Foundation of Singapore
Keywords
Benchmark testing; Training; Question answering (information retrieval); Cognition; Visualization; Data models; Focusing; Computational modeling; Tuning; Streaming media; Image grid; video question answering; video representation; vision language model;
DOI
10.1109/ACCESS.2024.3517625
Chinese Library Classification
TP [Automation technology; computer technology]
Discipline Code
0812
Abstract
Stimulated by the sophisticated reasoning capabilities of recent Large Language Models (LLMs), a variety of strategies for bridging the video modality have been devised. A prominent strategy involves Video Language Models (VideoLMs), which train a learnable interface with video data to connect advanced vision encoders with LLMs. Recently, an alternative strategy has surfaced, employing readily available foundation models, such as VideoLMs and LLMs, across multiple stages for modality bridging. In this study, we introduce a simple yet novel strategy where only a single Vision Language Model (VLM) is utilized. Our starting point is the plain insight that a video comprises a series of images, or frames, interwoven with temporal information. The essence of video comprehension lies in adeptly managing the temporal aspects along with the spatial details of each frame. Initially, we transform a video into a single composite image by arranging multiple frames in our proposed grid layout. The resulting single image is termed an image grid. This format, while maintaining the appearance of a solitary image, effectively retains temporal information within the grid structure at the pixel level. Therefore, the image grid approach enables direct application of a single high-performance VLM without necessitating any video-data training. Our extensive experimental analysis across ten zero-shot Video Question Answering (VQA) benchmarks, including five open-ended and five multiple-choice benchmarks, reveals that the proposed Image Grid Vision Language Model (IG-VLM) surpasses the existing methods in nine out of ten benchmarks. We also discuss how IG-VLM can be extended for long videos and provide an extension method that consistently and reliably improves the performance. Our code is available at: https://github.com/imagegridworth/IG-VLM
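The core idea described in the abstract, composing uniformly sampled video frames into one grid-layout image so that temporal order is preserved left-to-right, top-to-bottom at the pixel level, can be sketched as follows. This is a minimal illustration, not the authors' released code: the function names, the uniform sampling strategy, and the 2x3 layout are assumptions for demonstration.

```python
import numpy as np

def sample_frames(video, num):
    """Uniformly sample `num` frames from a video given as a T x H x W x C array."""
    idx = np.linspace(0, len(video) - 1, num).round().astype(int)
    return [video[i] for i in idx]

def make_image_grid(frames, rows, cols):
    """Compose sampled frames into a single composite image.

    Frames are placed left-to-right, top-to-bottom, so the grid
    retains the temporal order of the video within a single image.
    """
    assert len(frames) == rows * cols, "need exactly rows*cols frames"
    h, w, c = frames[0].shape
    grid = np.zeros((rows * h, cols * w, c), dtype=frames[0].dtype)
    for i, frame in enumerate(frames):
        r, col = divmod(i, cols)
        grid[r * h:(r + 1) * h, col * w:(col + 1) * w] = frame
    return grid

# Example: a dummy 30-frame "video" of 64x64 RGB frames, arranged as a 2x3 grid.
video = np.random.randint(0, 255, size=(30, 64, 64, 3), dtype=np.uint8)
frames = sample_frames(video, 6)
grid = make_image_grid(frames, rows=2, cols=3)
print(grid.shape)  # (128, 192, 3)
```

The resulting single image could then be passed, together with the question, to any off-the-shelf image VLM, which is what lets the approach skip video-data training entirely.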
Pages: 193057-193075
Page count: 19