Enhancing Video Summarization via Vision-Language Embedding

Cited by: 46

Authors:

Plummer, Bryan A. [1]

Brown, Matthew [2]

Lazebnik, Svetlana [1]

Affiliations:

[1] Univ Illinois, Urbana, IL 61801 USA

[2] Google Res, Mountain View, CA USA

Source:

30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017) | 2017

Funding:

U.S. National Science Foundation

DOI:

10.1109/CVPR.2017.118

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

This paper addresses video summarization, or the problem of distilling a raw video into a shorter form while still capturing the original story. We show that visual representations supervised by freeform language make a good fit for this application by extending a recent submodular summarization approach [9] with representativeness and interestingness objectives computed on features from a joint vision-language embedding space. We perform an evaluation on two diverse datasets, UT Egocentric [18] and TV Episodes [45], and show that our new objectives give improved summarization ability compared to standard visual features alone. Our experiments also show that the vision-language embedding need not be trained on domain-specific data, but can be learned from standard still image vision-language datasets and transferred to video. A further benefit of our model is the ability to guide a summary using freeform text input at test time, allowing user customization.
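
The abstract does not spell out the objective functions, so the following is only a rough sketch of the general idea rather than the authors' method. It assumes each video segment is already represented by a vector in a joint vision-language embedding space (`feats`) and that a freeform text query has been embedded into the same space (`query`). Representativeness is approximated here with a standard facility-location coverage term and interestingness with a simple query-similarity sum; both choices, and all function names and weights, are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def representativeness(selected, feats):
    """Assumed facility-location term: every segment is credited with its
    best (non-negative) similarity to any segment in the summary."""
    if not selected:
        return 0.0
    return sum(
        max(max(cosine(f, feats[j]), 0.0) for j in selected) for f in feats
    )


def interestingness(selected, feats, query):
    """Assumed modular term: summed non-negative similarity of the selected
    segments to the embedded text query."""
    return sum(max(cosine(feats[j], query), 0.0) for j in selected)


def greedy_summary(feats, query, budget, w_rep=1.0, w_int=1.0):
    """Greedily pick up to `budget` segments maximizing the weighted objective."""

    def score(subset):
        return (w_rep * representativeness(subset, feats)
                + w_int * interestingness(subset, feats, query))

    selected = []
    while len(selected) < budget:
        current = score(selected)
        best_gain, best_idx = 0.0, None
        for i in range(len(feats)):
            if i in selected:
                continue
            gain = score(selected + [i]) - current
            if gain > best_gain:
                best_gain, best_idx = gain, i
        if best_idx is None:  # no remaining segment improves the objective
            break
        selected.append(best_idx)
    return sorted(selected)  # return indices in temporal order


# Toy usage: four hypothetical 2-D segment embeddings and a query embedding.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
query = [0.0, 1.0]
summary = greedy_summary(feats, query, budget=2)
```

Because both terms are non-negative and the coverage term is monotone submodular, greedy selection is a natural fit for this kind of objective. Changing `query` or raising `w_int` shifts the summary toward segments that match the text, which is one simple way to picture the text-guided customization the abstract mentions.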

Pages: 1052-1060

Page count: 9
