LongVLM: Efficient Long Video Understanding via Large Language Models

Cited by: 1
Authors
Weng, Yuetian [1]
Han, Mingfei [3,4]
He, Haoyu [1]
Chang, Xiaojun [2,3]
Zhuang, Bohan [1]
Affiliations
[1] Monash Univ, ZIP Lab, Melbourne, Vic, Australia
[2] Univ Sci & Technol China, Sch Informat Sci & Technol, Hefei, Peoples R China
[3] MBZUAI, Dept Comp Vis, Abu Dhabi, U Arab Emirates
[4] UTS, AAII, ReLER, Sydney, NSW, Australia
DOI
10.1007/978-3-031-73414-4_26
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Empowered by Large Language Models (LLMs), recent advancements in Video-based LLMs (VideoLLMs) have driven progress in various video understanding tasks. These models encode video representations through pooling or query aggregation over a vast number of visual tokens, making computational and memory costs affordable. Despite successfully providing an overall comprehension of video content, existing VideoLLMs still struggle with detailed understanding because they overlook local information in long-term videos. To tackle this challenge, we introduce LongVLM, a simple yet powerful VideoLLM for long video understanding, building upon the observation that long videos often consist of sequential key events, complex actions, and camera movements. Our approach decomposes long videos into multiple short-term segments and encodes local features for each segment via a hierarchical token merging module. These features are concatenated in temporal order to maintain the storyline across sequential short-term segments. Additionally, we integrate global semantics into each local feature to enhance context understanding. In this way, we encode video representations that incorporate both local and global information, enabling the LLM to generate comprehensive responses for long-term videos. Experimental results on the VideoChatGPT benchmark and zero-shot video question-answering datasets demonstrate that our model outperforms previous state-of-the-art methods. Qualitative examples show that our model produces more precise responses for long video understanding. Code is available at https://github.com/ziplab/LongVLM.
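The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the pair-averaging merge, and the mean-pooled global token below are simplifying assumptions standing in for the paper's hierarchical token merging module and global-semantics integration. It only shows the shape of the idea: compress each short-term segment locally, concatenate segments in temporal order, and inject global context into every local feature.

# Minimal sketch (illustrative only, not the LongVLM implementation) of
# segment-wise token merging plus a global context token for long videos.
import torch


def merge_pairs(tokens: torch.Tensor) -> torch.Tensor:
    """Halve the token count by averaging adjacent token pairs (one merge level)."""
    n, d = tokens.shape
    if n % 2 == 1:                      # pad by repeating the last token if odd
        tokens = torch.cat([tokens, tokens[-1:]], dim=0)
    return tokens.reshape(-1, 2, d).mean(dim=1)


def encode_long_video(frame_tokens: torch.Tensor,
                      num_segments: int = 8,
                      merge_levels: int = 2) -> torch.Tensor:
    """frame_tokens: (T, D) visual tokens for T frames from a frozen image encoder.

    Returns a compact (N, D) sequence that combines local segment features,
    kept in temporal order, with a global context token added to each of them.
    """
    segments = torch.chunk(frame_tokens, num_segments, dim=0)

    local_feats = []
    for seg in segments:
        merged = seg
        for _ in range(merge_levels):   # hierarchical merging within the segment
            merged = merge_pairs(merged)
        local_feats.append(merged)

    local = torch.cat(local_feats, dim=0)                   # preserve the storyline order
    global_token = frame_tokens.mean(dim=0, keepdim=True)   # crude stand-in for global semantics
    return local + global_token                             # inject global context per token


if __name__ == "__main__":
    video_tokens = torch.randn(128, 768)    # e.g. 128 frames, 768-dim tokens
    compact = encode_long_video(video_tokens)
    print(compact.shape)                    # far fewer tokens than the input

Averaging adjacent tokens is only a placeholder for the paper's hierarchical token merging; the essential points are that each segment is compressed independently, temporal concatenation keeps the event order, and every local feature carries global context before being passed to the LLM.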
Pages: 453-470
Page count: 18
Related Papers
50 records in total
  • [1] Prompting Visual-Language Models for Efficient Video Understanding
    Ju, Chen
    Han, Tengda
    Zheng, Kunhao
    Zhang, Ya
    Xie, Weidi
    COMPUTER VISION - ECCV 2022, PT XXXV, 2022, 13695 : 105 - 124
  • [2] VISA: Reasoning Video Object Segmentation via Large Language Models
    Yan, Cilin
    Wang, Haochen
    Yan, Shilin
    Jiang, Xiaolong
    Hu, Yao
    Kang, Guoliang
    Xie, Weidi
    Gavves, Efstratios
    COMPUTER VISION - ECCV 2024, PT XV, 2025, 15073 : 98 - 115
  • [3] Towards Language-Driven Video Inpainting via Multimodal Large Language Models
    Wu, Jianzong
    Li, Xiangtai
    Si, Chenyang
    Zhou, Shangchen
    Yang, Jingkang
    Zhang, Jiangning
    Li, Yining
    Chen, Kai
    Tong, Yunhai
    Liu, Ziwei
    Loy, Chen Change
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 12501 - 12511
  • [4] Surveillance Video-and-Language Understanding: From Small to Large Multimodal Models
    Yuan, Tongtong
    Zhang, Xuange
    Liu, Bo
    Liu, Kun
    Jin, Jian
    Jiao, Zhenzhen
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2025, 35 (01) : 300 - 314
  • [5] VideoAgent: Long-Form Video Understanding with Large Language Model as Agent
    Wang, Xiaohan
    Zhang, Yuhui
    Zohar, Orr
    Yeung-Levy, Serena
    COMPUTER VISION - ECCV 2024, PT LXXX, 2025, 15138 : 58 - 76
  • [6] The Importance of Understanding Language in Large Language Models
    Youssef, Alaa
    Stein, Samantha
    Clapp, Justin
    Magnus, David
    AMERICAN JOURNAL OF BIOETHICS, 2023, 23 (10): : 6 - 7
  • [7] Meaning and understanding in large language models
    Havlik, Vladimir
    SYNTHESE, 2024, 205 (01)
  • [8] Understanding Telecom Language Through Large Language Models
    Bariah, Lina
    Zou, Hang
    Zhao, Qiyang
    Mouhouche, Belkacem
    Bader, Faouzi
    Debbah, Merouane
    IEEE CONFERENCE ON GLOBAL COMMUNICATIONS, GLOBECOM, 2023, : 6542 - 6547
  • [9] Understanding HTML with Large Language Models
    Gur, Izzeddin
    Nachum, Ofir
    Miao, Yingjie
    Safdari, Mustafa
    Huang, Austin
    Chowdhery, Aakanksha
    Narang, Sharan
    Fiedel, Noah
    Faust, Aleksandra
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 2803 - 2821
  • [10] Shortcut Learning of Large Language Models in Natural Language Understanding
    Du, Mengnan
    He, Fengxiang
    Zou, Na
    Tao, Dacheng
    Hu, Xia
    COMMUNICATIONS OF THE ACM, 2024, 67 (01) : 110 - 120