Enhancing Video Summarization via Vision-Language Embedding

Cited by: 46
Authors
Plummer, Bryan A. [1]
Brown, Matthew [2]
Lazebnik, Svetlana [1]
Affiliations
[1] Univ Illinois, Urbana, IL 61801 USA
[2] Google Res, Mountain View, CA USA
Source
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017) | 2017
Funding
National Science Foundation (USA)
DOI
10.1109/CVPR.2017.118
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper addresses video summarization, the problem of distilling a raw video into a shorter form while still capturing the original story. We show that visual representations supervised by freeform language are a good fit for this application by extending a recent submodular summarization approach [9] with representativeness and interestingness objectives computed on features from a joint vision-language embedding space. We evaluate on two diverse datasets, UT Egocentric [18] and TV Episodes [45], and show that our new objectives improve summarization compared to standard visual features alone. Our experiments also show that the vision-language embedding need not be trained on domain-specific data, but can be learned from standard still-image vision-language datasets and transferred to video. A further benefit of our model is the ability to guide a summary using freeform text input at test time, allowing user customization.
Pages: 1052-1060
Page count: 9
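Illustrative sketch
The abstract describes selecting segments by combining representativeness and interestingness objectives computed on embedding features, with optional text guidance. The following is a minimal sketch of that general idea, not the authors' formulation: it assumes each video segment already has an embedding vector, uses a facility-location coverage term as a stand-in for representativeness, uses similarity to a text-query embedding as a stand-in for interestingness, and picks segments greedily under a fixed budget. All function names, the weights `w_rep` and `w_int`, and the toy vectors are hypothetical.

```python
import math


def clipped_cosine(u, v):
    """Cosine similarity clipped at zero, so the coverage term stays monotone."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return max(0.0, dot / (norm_u * norm_v))


def representativeness(selected, feats):
    """Facility-location coverage: how well the selected segments cover all segments.

    Assumption: this is a common submodular coverage term, used here as a stand-in.
    """
    if not selected:
        return 0.0
    return sum(max(clipped_cosine(f, feats[j]) for j in selected) for f in feats)


def interestingness(selected, feats, query):
    """Modular term: total similarity of the selected segments to a query embedding."""
    return sum(clipped_cosine(feats[j], query) for j in selected)


def greedy_summary(feats, query, budget, w_rep=1.0, w_int=1.0):
    """Greedily pick up to `budget` segment indices maximizing the weighted objective."""

    def score(subset):
        return (w_rep * representativeness(subset, feats)
                + w_int * interestingness(subset, feats, query))

    selected = []
    while len(selected) < min(budget, len(feats)):
        current = score(selected)
        best_gain, best_j = None, None
        for j in range(len(feats)):
            if j in selected:
                continue
            gain = score(selected + [j]) - current  # marginal gain of adding segment j
            if best_gain is None or gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
    return sorted(selected)  # return indices in temporal order


# Toy example: four segment embeddings and a query embedding (all made up).
segments = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
text_query = [0.0, 1.0]
summary = greedy_summary(segments, text_query, budget=2)
print(summary)
```

In this sketch, changing `text_query` or the weights shifts which segments are chosen, which loosely mirrors the text-guided customization mentioned in the abstract. In practice the embeddings would come from a trained vision-language model rather than hand-written vectors.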