TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

Cited by: 0

Authors:

Ren, Shuhuai [1]

Chen, Sishuo [2]

Li, Shicheng [1]

Sun, Xu [1]

Hou, Lu [3]

Affiliations:

[1] Peking Univ, Sch Comp Sci, Natl Key Lab Multimedia Informat Proc, Beijing, Peoples R China

[2] Peking Univ, Ctr Data Sci, Beijing, Peoples R China

[3] Huawei Noah's Ark Lab, Beijing, Peoples R China

Source:

FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023 | 2023

Funding:

National Natural Science Foundation of China

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Large-scale video-language pre-training has made remarkable strides in advancing video-language understanding tasks. However, the heavy computational burden of video encoding remains a formidable efficiency bottleneck, particularly for long-form videos. These videos contain massive visual tokens due to their inherent 3D properties and spatiotemporal redundancy, making it challenging to capture complex temporal and spatial relationships. To tackle this issue, we propose an efficient method called TEmporal-Spatial Token Aggregation (TESTA). TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame. TESTA can reduce the number of visual tokens by 75% and thus accelerate video encoding. Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video encoder block. We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks. Experimental results show that TESTA improves computing efficiency by 1.7 times, and achieves significant performance gains from its scalability in processing longer input frames, e.g., +13.7 R@1 on QuerYD and +6.5 R@1 on Condensed Movie.
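
The aggregation idea in the abstract (merging similar frames so that fewer visual tokens remain) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how similar tokens are matched, so the greedy adjacent-pair merge, the cosine similarity measure, the size-weighted averaging, and the names `cosine` and `aggregate` are all assumptions made for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def aggregate(tokens, keep):
    """Greedily merge the most similar adjacent pair until `keep` tokens remain.

    Hypothetical helper, not the paper's procedure. Each merged token is the
    size-weighted mean of the original tokens it has absorbed.
    """
    if not 1 <= keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")
    merged = [(list(t), 1) for t in tokens]  # (vector, tokens absorbed so far)
    while len(merged) > keep:
        # Index of the adjacent pair with the highest similarity.
        i = max(range(len(merged) - 1),
                key=lambda k: cosine(merged[k][0], merged[k + 1][0]))
        (a, na), (b, nb) = merged[i], merged[i + 1]
        mean = [(x * na + y * nb) / (na + nb) for x, y in zip(a, b)]
        merged[i:i + 2] = [(mean, na + nb)]
    return [vec for vec, _ in merged]


# Toy "video": 8 frame-level feature vectors forming two near-static scenes.
frames = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.95, 0.0],
          [0.0, 1.0], [0.1, 0.9], [0.0, 0.95], [0.1, 1.0]]
condensed = aggregate(frames, keep=2)
reduction = 1 - len(condensed) / len(frames)  # fraction of tokens removed
```

In this toy input, keeping 2 of 8 tokens matches the 75% reduction figure quoted in the abstract, and each surviving token is the mean of one scene. The same helper could in principle be applied a second time to the patch tokens within each frame, which is one possible reading of "divided space-time" aggregation; that reading is likewise an assumption.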

Pages: 932-947

Page count: 16
