TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

Authors
Ren, Shuhuai [1]
Chen, Sishuo [2]
Li, Shicheng [1]
Sun, Xu [1]
Hou, Lu [3]
Affiliations
[1] Peking Univ, Sch Comp Sci, Natl Key Lab Multimedia Informat Proc, Beijing, Peoples R China
[2] Peking Univ, Ctr Data Sci, Beijing, Peoples R China
[3] Huawei Noahs Ark Lab, Beijing, Peoples R China
Source
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023 | 2023
Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Large-scale video-language pre-training has made remarkable strides in advancing video-language understanding tasks. However, the heavy computational burden of video encoding remains a formidable efficiency bottleneck, particularly for long-form videos. These videos contain massive visual tokens due to their inherent 3D properties and spatiotemporal redundancy, making it challenging to capture complex temporal and spatial relationships. To tackle this issue, we propose an efficient method called TEmporal-Spatial Token Aggregation (TESTA). TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame. TESTA can reduce the number of visual tokens by 75% and thus accelerate video encoding. Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video encoder block. We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks. Experimental results show that TESTA improves computing efficiency by 1.7 times, and achieves significant performance gains from its scalability in processing longer input frames, e.g., +13.7 R@1 on QuerYD and +6.5 R@1 on Condensed Movie.
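
The aggregation idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: it is a simplified stand-in that assumes tokens are plain feature vectors, measures similarity with cosine similarity, and greedily merges only adjacent tokens into their size-weighted mean. The names `cosine`, `aggregate_tokens`, and `keep` are hypothetical and chosen for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def aggregate_tokens(tokens, keep):
    """Greedily merge the most similar adjacent pair until `keep` tokens remain.

    Assumption for illustration: only neighbouring tokens may merge, and a
    merge replaces the pair with their size-weighted mean.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    merged = [list(t) for t in tokens]
    sizes = [1] * len(merged)  # number of original tokens behind each entry
    while len(merged) > keep:
        # Index i of the adjacent pair (i, i + 1) with the highest similarity.
        i = max(range(len(merged) - 1),
                key=lambda k: cosine(merged[k], merged[k + 1]))
        total = sizes[i] + sizes[i + 1]
        merged[i] = [(sizes[i] * x + sizes[i + 1] * y) / total
                     for x, y in zip(merged[i], merged[i + 1])]
        sizes[i] = total
        del merged[i + 1], sizes[i + 1]
    return merged, sizes


# Eight toy "frame" features; keeping 2 of 8 corresponds to a 75% reduction.
frames = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.95, 0.0],
          [0.0, 1.0], [0.1, 0.9], [0.0, 0.95], [0.1, 1.0]]
reduced, sizes = aggregate_tokens(frames, keep=2)
```

Under these assumptions, the same function could be applied once along the frame axis and once along the patches within a frame, which loosely mirrors the "divided space-time" aggregation the abstract mentions.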
Pages: 932-947
Page count: 16