Weakly-Supervised Video Summarization Using Variational Encoder-Decoder and Web Prior

Cited by: 36
Authors
Cai, Sijia [1, 2]
Zuo, Wangmeng [3]
Davis, Larry S. [4]
Zhang, Lei [1]
Affiliations
[1] Hong Kong Polytech Univ, Dept Comp, Kowloon, Hong Kong, Peoples R China
[2] DAMO Acad, Alibaba Grp, Hangzhou, Peoples R China
[3] Harbin Inst Technol, Sch Comp Sci & Technol, Harbin, Peoples R China
[4] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA
Source
Keywords
Video summarization; Variational autoencoder;
DOI
10.1007/978-3-030-01264-9_12
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video summarization is a challenging under-constrained problem because the underlying summary of a single video strongly depends on users' subjective understanding. Data-driven approaches, such as deep neural networks, can deal with the ambiguity inherent in this task to some extent, but it is extremely expensive to acquire temporal annotations for a large-scale video dataset. To leverage plentiful web-crawled videos to improve the performance of video summarization, we present a generative modelling framework that learns latent semantic video representations to bridge the benchmark data and web data. Specifically, our framework couples two important components: a variational autoencoder for learning the latent semantics from web videos, and an encoder-attention-decoder for saliency estimation of raw video and summary generation. We introduce a loss term that learns the semantic matching between the generated summaries and web videos, and further formulate the overall framework as a unified conditional variational encoder-decoder, called variational encoder-summarizer-decoder (VESD). Experiments conducted on the challenging datasets CoSum and TVSum demonstrate that the proposed VESD outperforms existing state-of-the-art methods. The source code of this work can be found at https://github.com/cssjcai/vesd.
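The two building blocks named in the abstract, a variational autoencoder and an attention step for saliency estimation, can be illustrated with a minimal pure-Python sketch. This is not the authors' VESD implementation (see the repository linked above for that); it only shows the standard Gaussian KL term and reparameterized sampling used by variational autoencoders in general, plus a generic dot-product softmax attention over frame features. All function names, the toy feature vectors, and the use of the latent sample as the attention query are assumptions made for illustration.

```python
import math
import random


def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def attention_weights(frame_feats, query):
    """Softmax over dot-product scores between each frame feature and a query."""
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in frame_feats]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_summary(frame_feats, weights):
    """Attention-weighted average of the frame features."""
    dim = len(frame_feats[0])
    return [sum(w * feat[d] for w, feat in zip(weights, frame_feats)) for d in range(dim)]


# Toy usage: three hypothetical 2-D frame features and a 2-D latent sample.
rng = random.Random(0)
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z = reparameterize([0.5, -0.5], [0.0, 0.0], rng)  # assumed latent "web prior" sample
w = attention_weights(frames, z)                   # per-frame saliency weights
summary = weighted_summary(frames, w)              # pooled summary feature
kl = gaussian_kl([0.5, -0.5], [0.0, 0.0])          # regularizer toward N(0, I)
```

In a trained model the means, log-variances, and frame features would come from learned encoders rather than fixed lists; the sketch only makes the shape of the computation concrete.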
Pages: 193-210
Page count: 18