Enhancing Video Summarization via Vision-Language Embedding

Cited by: 46

Authors:

Plummer, Bryan A. [1]

Brown, Matthew [2]

Lazebnik, Svetlana [1]

Affiliations:

[1] Univ Illinois, Urbana, IL 61801 USA

[2] Google Res, Mountain View, CA USA

Source:

30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017) | 2017

Funding:

National Science Foundation (USA)

Keywords:

DOI:

10.1109/CVPR.2017.118

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

This paper addresses video summarization, or the problem of distilling a raw video into a shorter form while still capturing the original story. We show that visual representations supervised by freeform language make a good fit for this application by extending a recent submodular summarization approach [9] with representativeness and interestingness objectives computed on features from a joint vision-language embedding space. We perform an evaluation on two diverse datasets, UT Egocentric [18] and TV Episodes [45], and show that our new objectives give improved summarization ability compared to standard visual features alone. Our experiments also show that the vision-language embedding need not be trained on domain-specific data, but can be learned from standard still image vision-language datasets and transferred to video. A further benefit of our model is the ability to guide a summary using freeform text input at test time, allowing user customization.
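
The general idea of submodular summarization with a representativeness term and optional text guidance can be illustrated with a minimal sketch. This is not the authors' implementation or their exact objectives: the facility-location form of the representativeness score, the cosine-similarity text term, the weights, and all function names (`representativeness`, `text_relevance`, `greedy_summary`) are assumptions chosen for illustration, and toy 2-D vectors stand in for features from a joint vision-language embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def representativeness(selected, feats):
    """Assumed facility-location score: every segment is credited with its
    best (non-negative) similarity to any segment in the summary."""
    if not selected:
        return 0.0
    return sum(max(max(cosine(f, feats[j]), 0.0) for j in selected) for f in feats)


def text_relevance(selected, feats, query):
    """Assumed modular score: summed non-negative similarity between the
    selected segments and a text query embedded in the same space."""
    return sum(max(cosine(feats[j], query), 0.0) for j in selected)


def greedy_summary(feats, budget, query=None, w_rep=1.0, w_text=1.0):
    """Greedily pick up to `budget` segment indices that maximize the
    weighted sum of the objectives above (hypothetical weights)."""

    def score(sel):
        total = w_rep * representativeness(sel, feats)
        if query is not None:
            total += w_text * text_relevance(sel, feats, query)
        return total

    selected = []
    current = score(selected)
    while len(selected) < budget:
        best_gain, best_idx = 0.0, None
        for i in range(len(feats)):
            if i in selected:
                continue
            gain = score(selected + [i]) - current
            if gain > best_gain:
                best_gain, best_idx = gain, i
        if best_idx is None:  # no remaining segment improves the score
            break
        selected.append(best_idx)
        current += best_gain
    return sorted(selected)  # return indices in temporal order


# Toy features: segments 0-1 form one visual cluster, segments 2-3 another.
segments = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
summary = greedy_summary(segments, budget=2)
guided = greedy_summary(segments, budget=1, query=[0.0, 1.0])
```

With these toy features, a budget of two covers both clusters, and passing a query vector shifts a one-segment summary toward the cluster that the query resembles, which mirrors the text-guided customization described in the abstract.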

Citation

Pages: 1052-1060

Page count: 9

