An Image Grid Can Be Worth a Video: Zero-Shot Video Question Answering Using a VLM

Cited: 0
Authors
Kim, Wonkyun [1 ]
Choi, Changin [2 ]
Lee, Wonseok [2 ]
Rhee, Wonjong [1 ,2 ,3 ]
Affiliations
[1] Seoul Natl Univ, Dept Intelligence & Informat, Seoul 08826, South Korea
[2] Seoul Natl Univ, Interdisciplinary Program Artificial Intelligence, Seoul 08826, South Korea
[3] Seoul Natl Univ, AI Inst, Seoul 08826, South Korea
Source
IEEE ACCESS | 2024, Vol. 12
Funding
National Research Foundation of Singapore;
Keywords
Benchmark testing; Training; Question answering (information retrieval); Cognition; Visualization; Data models; Focusing; Computational modeling; Tuning; Streaming media; Image grid; video question answering; video representation; vision language model;
DOI
10.1109/ACCESS.2024.3517625
CLC classification
TP [automation technology; computer technology];
Subject classification code
0812 ;
Abstract
Stimulated by the sophisticated reasoning capabilities of recent Large Language Models (LLMs), a variety of strategies for bridging the video modality have been devised. A prominent strategy involves Video Language Models (VideoLMs), which train a learnable interface with video data to connect advanced vision encoders with LLMs. Recently, an alternative strategy has surfaced, employing readily available foundation models, such as VideoLMs and LLMs, across multiple stages for modality bridging. In this study, we introduce a simple yet novel strategy in which only a single Vision Language Model (VLM) is utilized. Our starting point is the plain insight that a video comprises a series of images, or frames, interwoven with temporal information. The essence of video comprehension lies in adeptly managing the temporal aspects along with the spatial details of each frame. Initially, we transform a video into a single composite image by arranging multiple frames in our proposed grid layout. The resulting single image is termed an image grid. This format, while maintaining the appearance of a solitary image, effectively retains temporal information within the grid structure at the pixel level. Therefore, the image grid approach enables direct application of a single high-performance VLM without necessitating any video-data training. Our extensive experimental analysis across ten zero-shot VQA benchmarks, including five open-ended and five multiple-choice benchmarks, reveals that the proposed Image Grid Vision Language Model (IG-VLM) surpasses the existing methods in nine out of ten benchmarks. We also discuss how IG-VLM can be extended for long videos and provide an extension method that consistently and reliably improves the performance. Our code is available at: https://github.com/imagegridworth/IG-VLM
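The core operation the abstract describes, i.e. uniformly sampling frames from a video and tiling them into one composite "image grid", can be sketched as follows. This is a minimal illustration, not the authors' released implementation; the function name, grid dimensions, and use of raw NumPy frame arrays are assumptions, and IG-VLM's actual layout and prompting details are given in the paper and repository.

```python
import numpy as np

def make_image_grid(frames, rows, cols):
    """Tile uniformly sampled video frames into one composite image.

    frames: sequence of HxWxC uint8 arrays (decoded video frames).
    rows, cols: grid layout; rows * cols frames are sampled
    uniformly across the whole video, preserving temporal order
    left-to-right, top-to-bottom.
    """
    n = rows * cols
    # Uniformly spaced frame indices over the full video length.
    idx = np.linspace(0, len(frames) - 1, n).round().astype(int)
    sampled = [frames[i] for i in idx]

    h, w, c = sampled[0].shape
    grid = np.zeros((rows * h, cols * w, c), dtype=sampled[0].dtype)
    for k, frame in enumerate(sampled):
        r, col = divmod(k, cols)  # row-major placement keeps time order
        grid[r * h:(r + 1) * h, col * w:(col + 1) * w] = frame
    return grid
```

The resulting array looks to a VLM like a single image, yet the row-major ordering encodes the video's temporal progression at the pixel level, which is what lets an image-only model be applied zero-shot.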
Pages: 193057-193075
Page count: 19
Related papers (50 total)
  • [21] Zero-Shot Video Moment Retrieval Using BLIP-Based Models
    Wattasseril, Jobin Idiculla
    Shekhar, Sumit
    Doellner, Juergen
    Trapp, Matthias
    ADVANCES IN VISUAL COMPUTING, ISVC 2023, PT I, 2023, 14361 : 160 - 171
  • [22] CAR: Conceptualization-Augmented Reasoner for Zero-Shot Commonsense Question Answering
    Wang, Weiqi
    Fang, Tianqing
    Ding, Wenxuan
    Xu, Baixuan
    Li, Xin
    Song, Yangqiu
    Bosselut, Antoine
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 13520 - 13545
  • [23] Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering
    Riabi, Arij
    Scialom, Thomas
    Keraron, Rachel
    Sagot, Benoit
    Seddah, Djame
    Staiano, Jacopo
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 7016 - 7030
  • [24] MuHeQA: Zero-shot question answering over multiple and heterogeneous knowledge bases
    Badenes-Olmedo, Carlos
    Corcho, Oscar
    SEMANTIC WEB, 2024, 15 (05) : 1547 - 1561
  • [25] Chart question answering with multimodal graph representation learning and zero-shot classification
    Farahani, Ali Mazraeh
    Adibi, Peyman
    Ehsani, Mohammad Saeed
    Hutter, Hans-Peter
    Darvishy, Alireza
    EXPERT SYSTEMS WITH APPLICATIONS, 2025, 270
  • [26] Self-Supervised Knowledge Triplet Learning for Zero-Shot Question Answering
    Banerjee, Pratyay
    Baral, Chitta
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 151 - 162
  • [27] Zero-Shot End-To-End Spoken Question Answering In Medical Domain
    Labrak, Yanis
    Moumeni, Adel
    Dufour, Richard
    Rouvier, Mickael
    INTERSPEECH 2024, 2024, : 2020 - 2024
  • [28] Zero-shot Generalization in Dialog State Tracking through Generative Question Answering
    Li, Shuyang
    Cao, Jin
    Sridhar, Mukund
    Zhu, Henghui
    Li, Shang-Wen
    Hamza, Wael
    McAuley, Julian
    16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021, : 1063 - 1074
  • [29] Efficient and consistent zero-shot video generation with diffusion models
    Frakes, Ethan
    Khalid, Umar
    Chen, Chen
    REAL-TIME IMAGE PROCESSING AND DEEP LEARNING 2024, 2024, 13034
  • [30] Prompt-based Zero-shot Video Moment Retrieval
    Wang, Guolong
    Wu, Xun
    Liu, Zhaoyuan
    Yan, Junchi
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022