Enhancing Video Summarization via Vision-Language Embedding

Cited by: 46
Authors
Plummer, Bryan A. [1]
Brown, Matthew [2]
Lazebnik, Svetlana [1]
Affiliations
[1] Univ Illinois, Urbana, IL 61801 USA
[2] Google Res, Mountain View, CA USA
Source
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017) | 2017
Funding
National Science Foundation (USA)
DOI
10.1109/CVPR.2017.118
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper addresses video summarization, or the problem of distilling a raw video into a shorter form while still capturing the original story. We show that visual representations supervised by freeform language make a good fit for this application by extending a recent submodular summarization approach [9] with representativeness and interestingness objectives computed on features from a joint vision-language embedding space. We perform an evaluation on two diverse datasets, UT Egocentric [18] and TV Episodes [45], and show that our new objectives give improved summarization ability compared to standard visual features alone. Our experiments also show that the vision-language embedding need not be trained on domain-specific data, but can be learned from standard still image vision-language datasets and transferred to video. A further benefit of our model is the ability to guide a summary using freeform text input at test time, allowing user customization.
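The abstract does not give the objective functions, so the following is only a minimal sketch of the general idea of greedy submodular summarization. It assumes that each video segment already has a feature vector in some joint embedding space and a precomputed interestingness score. The function names (`cosine`, `representativeness`, `greedy_summary`), the facility-location form of the representativeness term, and the weights `w_rep` and `w_int` are all illustrative assumptions, not the formulation used in the paper or in [9].

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def representativeness(selected, feats):
    """Facility-location coverage (an assumed form): every segment in the video
    is credited with its best non-negative similarity to a selected segment."""
    if not selected:
        return 0.0
    return sum(
        max(max(cosine(f, feats[j]), 0.0) for j in selected)
        for f in feats
    )


def greedy_summary(feats, interest, budget, w_rep=1.0, w_int=1.0):
    """Greedily pick up to `budget` segment indices that maximize a weighted sum
    of representativeness and (modular) interestingness. Weights are hypothetical."""

    def score(subset):
        return (w_rep * representativeness(subset, feats)
                + w_int * sum(interest[j] for j in subset))

    selected = []
    while len(selected) < budget:
        base = score(selected)
        best, best_gain = None, 0.0
        for j in range(len(feats)):
            if j in selected:
                continue
            gain = score(selected + [j]) - base  # marginal gain of adding j
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:  # no segment improves the objective
            break
        selected.append(best)
    return sorted(selected)  # return indices in temporal order


# Toy data: four segments forming two visually similar pairs.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
interest = [0.1, 0.0, 0.5, 0.0]
summary = greedy_summary(feats, interest, budget=2)
print(summary)
```

On this toy input the sketch picks one segment from each similar pair. Under the same assumptions, text guidance at test time could be approximated by setting each `interest` score to the similarity between a segment's feature and the embedding of a user query, although the abstract does not say how the paper implements it.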
Pages: 1052-1060
Page count: 9