An Image Grid Can Be Worth a Video: Zero-Shot Video Question Answering Using a VLM

Cited by: 0
Authors
Kim, Wonkyun [1 ]
Choi, Changin [2 ]
Lee, Wonseok [2 ]
Rhee, Wonjong [1 ,2 ,3 ]
Affiliations
[1] Seoul Natl Univ, Dept Intelligence & Informat, Seoul 08826, South Korea
[2] Seoul Natl Univ, Interdisciplinary Program Artificial Intelligence, Seoul 08826, South Korea
[3] Seoul Natl Univ, AI Inst, Seoul 08826, South Korea
Source
IEEE ACCESS | 2024, Vol. 12
Funding
National Research Foundation of Singapore
Keywords
Benchmark testing; Training; Question answering (information retrieval); Cognition; Visualization; Data models; Focusing; Computational modeling; Tuning; Streaming media; Image grid; video question answering; video representation; vision language model;
DOI
10.1109/ACCESS.2024.3517625
Chinese Library Classification
TP [Automation technology; computer technology]
Discipline Code
0812
Abstract
Stimulated by the sophisticated reasoning capabilities of recent Large Language Models (LLMs), a variety of strategies for bridging the video modality have been devised. A prominent strategy involves Video Language Models (VideoLMs), which train a learnable interface with video data to connect advanced vision encoders with LLMs. Recently, an alternative strategy has surfaced, employing readily available foundation models, such as VideoLMs and LLMs, across multiple stages for modality bridging. In this study, we introduce a simple yet novel strategy where only a single Vision Language Model (VLM) is utilized. Our starting point is the plain insight that a video comprises a series of images, or frames, interwoven with temporal information. The essence of video comprehension lies in adeptly managing the temporal aspects along with the spatial details of each frame. Initially, we transform a video into a single composite image by arranging multiple frames in our proposed grid layout. The resulting single image is termed an image grid. This format, while maintaining the appearance of a solitary image, effectively retains temporal information within the grid structure at the pixel level. Therefore, the image grid approach enables direct application of a single high-performance VLM without necessitating any video-data training. Our extensive experimental analysis across ten zero-shot Video Question Answering (VQA) benchmarks, including five open-ended and five multiple-choice benchmarks, reveals that the proposed Image Grid Vision Language Model (IG-VLM) surpasses the existing methods in nine out of ten benchmarks. We also discuss how IG-VLM can be extended for long videos and provide an extension method that consistently and reliably improves the performance. Our code is available at: https://github.com/imagegridworth/IG-VLM
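The core idea described in the abstract, composing uniformly sampled video frames into one grid-layout image so that temporal order is preserved left-to-right, top-to-bottom at the pixel level, can be sketched as follows. This is a minimal illustration, not the authors' released code: the function names, the uniform sampling strategy, and the 2x3 layout are assumptions for demonstration.

```python
import numpy as np

def sample_frames(video, num):
    """Uniformly sample `num` frames from a video given as a T x H x W x C array."""
    idx = np.linspace(0, len(video) - 1, num).round().astype(int)
    return [video[i] for i in idx]

def make_image_grid(frames, rows, cols):
    """Compose sampled frames into a single composite image.

    Frames are placed left-to-right, top-to-bottom, so the grid
    retains the temporal order of the video within a single image.
    """
    assert len(frames) == rows * cols, "need exactly rows*cols frames"
    h, w, c = frames[0].shape
    grid = np.zeros((rows * h, cols * w, c), dtype=frames[0].dtype)
    for i, frame in enumerate(frames):
        r, col = divmod(i, cols)
        grid[r * h:(r + 1) * h, col * w:(col + 1) * w] = frame
    return grid

# Example: a dummy 30-frame "video" of 64x64 RGB frames, arranged as a 2x3 grid.
video = np.random.randint(0, 255, size=(30, 64, 64, 3), dtype=np.uint8)
frames = sample_frames(video, 6)
grid = make_image_grid(frames, rows=2, cols=3)
print(grid.shape)  # (128, 192, 3)
```

The resulting single image could then be passed, together with the question, to any off-the-shelf image VLM, which is what lets the approach skip video-data training entirely.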
Pages: 193057-193075
Page count: 19