An Image Grid Can Be Worth a Video: Zero-Shot Video Question Answering Using a VLM

Cited by: 0
Authors
Kim, Wonkyun [1 ]
Choi, Changin [2 ]
Lee, Wonseok [2 ]
Rhee, Wonjong [1 ,2 ,3 ]
Affiliations
[1] Seoul Natl Univ, Dept Intelligence & Informat, Seoul 08826, South Korea
[2] Seoul Natl Univ, Interdisciplinary Program Artificial Intelligence, Seoul 08826, South Korea
[3] Seoul Natl Univ, AI Inst, Seoul 08826, South Korea
Source
IEEE ACCESS | 2024 / Volume 12
Funding
National Research Foundation, Singapore;
Keywords
Benchmark testing; Training; Question answering (information retrieval); Cognition; Visualization; Data models; Focusing; Computational modeling; Tuning; Streaming media; Image grid; video question answering; video representation; vision language model;
DOI
10.1109/ACCESS.2024.3517625
CLC Number
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812 ;
Abstract
Stimulated by the sophisticated reasoning capabilities of recent Large Language Models (LLMs), a variety of strategies for bridging the video modality have been devised. A prominent strategy involves Video Language Models (VideoLMs), which train a learnable interface with video data to connect advanced vision encoders with LLMs. Recently, an alternative strategy has surfaced, employing readily available foundation models, such as VideoLMs and LLMs, across multiple stages for modality bridging. In this study, we introduce a simple yet novel strategy where only a single Vision Language Model (VLM) is utilized. Our starting point is the plain insight that a video comprises a series of images, or frames, interwoven with temporal information. The essence of video comprehension lies in adeptly managing the temporal aspects along with the spatial details of each frame. Initially, we transform a video into a single composite image by arranging multiple frames in our proposed grid layout. The resulting single image is termed an image grid. This format, while maintaining the appearance of a solitary image, effectively retains temporal information within the grid structure at the pixel level. Therefore, the image grid approach enables direct application of a single high-performance VLM without necessitating any video-data training. Our extensive experimental analysis across ten zero-shot VQA benchmarks, including five open-ended and five multiple-choice benchmarks, reveals that the proposed Image Grid Vision Language Model (IG-VLM) surpasses the existing methods in nine out of ten benchmarks. We also discuss how IG-VLM can be extended for long videos and provide an extension method that consistently and reliably improves the performance. Our code is available at: https://github.com/imagegridworth/IG-VLM
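The core operation described in the abstract, sampling frames from a video and tiling them into one composite image, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the uniform-sampling choice, and the 2x3 grid shape are all assumptions made for the example.

```python
import numpy as np

def sample_frames(video, num):
    """Uniformly sample `num` frames from a video given as a T x H x W x C array."""
    idx = np.linspace(0, len(video) - 1, num).round().astype(int)
    return [video[i] for i in idx]

def make_image_grid(frames, rows, cols):
    """Tile frames (equal-sized H x W x C arrays) row-major into one composite image."""
    assert len(frames) == rows * cols, "need exactly rows * cols frames"
    h, w, c = frames[0].shape
    grid = np.zeros((rows * h, cols * w, c), dtype=frames[0].dtype)
    for i, frame in enumerate(frames):
        r, col = divmod(i, cols)  # row-major placement preserves temporal order
        grid[r * h:(r + 1) * h, col * w:(col + 1) * w] = frame
    return grid

# Example: a dummy 30-frame "video" of 64x64 RGB frames, tiled into a 2x3 grid.
video = np.random.randint(0, 256, (30, 64, 64, 3), dtype=np.uint8)
frames = sample_frames(video, 6)
grid = make_image_grid(frames, rows=2, cols=3)
print(grid.shape)  # (128, 192, 3)
```

The resulting `grid` is a single image that a VLM can consume directly, with temporal order encoded by the row-major layout of the cells.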
Pages: 193057 - 193075
Number of pages: 19
Related Papers
50 items total
  • [31] Unleashing the Power of Contrastive Learning for Zero-Shot Video Summarization
    Pang, Zongshang
    Nakashima, Yuta
    Otani, Mayu
    Nagahara, Hajime
    JOURNAL OF IMAGING, 2024, 10 (09)
  • [32] SKETCHQL Demonstration: Zero-shot Video Moment Querying with Sketches
    Wu, Renzhi
    Chunduri, Pramod
    Shah, Dristi J.
    Aravind, Ashmitha Julius
    Payani, Ali
    Chu, Xu
    Arulraj, Joy
    Rong, Kexin
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2024, 17 (12): : 4429 - 4432
  • [33] Zero-Shot Video Grounding With Pseudo Query Lookup and Verification
    Lu, Yu
    Quan, Ruijie
    Zhu, Linchao
    Yang, Yi
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 1643 - 1654
  • [34] Language-free Training for Zero-shot Video Grounding
    Kim, Dahye
    Park, Jungin
    Lee, Jiyoung
    Park, Seongheon
    Sohn, Kwanghoon
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 2538 - 2547
  • [35] Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation
    Yuan, Yichen
    Wang, Yifan
    Wang, Lijun
    Zhao, Xiaoqi
    Lu, Huchuan
    Wang, Yu
    Su, Weibo
    Zhang, Lei
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 966 - 976
  • [36] Semantic-Guided Zero-Shot Learning for Low-Light Image/Video Enhancement
    Zheng, Shen
    Gupta, Gaurav
    2022 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WORKSHOPS (WACVW 2022), 2022, : 581 - 590
  • [37] Zero-Shot Question Classification Using Synthetic Samples
    Fu, Hao
    Yuan, Caixia
    Wang, Xiaojie
    Sang, Zhijie
    Hu, Shuo
    Shi, Yuanyuan
    PROCEEDINGS OF 2018 5TH IEEE INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND INTELLIGENCE SYSTEMS (CCIS), 2018, : 714 - 718
  • [38] Zero-Shot Learning for IMU-Based Activity Recognition Using Video Embeddings
    Tong, Catherine
    Ge, Jinchen
    Lane, Nicholas D.
    PROCEEDINGS OF THE ACM ON INTERACTIVE MOBILE WEARABLE AND UBIQUITOUS TECHNOLOGIES-IMWUT, 2021, 5 (04):
  • [39] Zero-Shot Rationalization by Multi-Task Transfer Learning from Question Answering
    Kung, Po-Nien
    Yang, Tse-Hsuan
    Chen, Yi-Cheng
    Yin, Sheng-Siang
    Chen, Yun-Nung
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020, : 2187 - 2197
  • [40] Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts
    Lan, Yunshi
    Li, Xiang
    Liu, Xin
    Li, Yang
    Qin, Wei
    Qian, Weining
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4389 - 4400