Weakly-Supervised Video Summarization Using Variational Encoder-Decoder and Web Prior

Cited by: 36
Authors
Cai, Sijia [1, 2]
Zuo, Wangmeng [3 ]
Davis, Larry S. [4 ]
Zhang, Lei [1 ]
Affiliations
[1] Hong Kong Polytech Univ, Dept Comp, Kowloon, Hong Kong, Peoples R China
[2] DAMO Acad, Alibaba Grp, Hangzhou, Peoples R China
[3] Harbin Inst Technol, Sch Comp Sci & Technol, Harbin, Peoples R China
[4] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA
Keywords
Video summarization; Variational autoencoder;
DOI
10.1007/978-3-030-01264-9_12
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video summarization is a challenging under-constrained problem because the appropriate summary of a single video depends strongly on each user's subjective understanding. Data-driven approaches, such as deep neural networks, can deal with the ambiguity inherent in this task to some extent, but acquiring temporal annotations for a large-scale video dataset is extremely expensive. To leverage plentiful web-crawled videos to improve video summarization, we present a generative modelling framework that learns latent semantic video representations to bridge the benchmark data and the web data. Specifically, our framework couples two components: a variational autoencoder that learns the latent semantics from web videos, and an encoder-attention-decoder that estimates the saliency of the raw video and generates the summary. We introduce a loss term that learns the semantic matching between the generated summaries and web videos, and we further formulate the overall framework as a unified conditional variational encoder-decoder, called the variational encoder-summarizer-decoder (VESD). Experiments on the challenging CoSum and TVSum datasets show that the proposed VESD outperforms existing state-of-the-art methods. The source code of this work can be found at https://github.com/cssjcai/vesd.
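The two components named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the repository linked above for that); it only shows, under simplifying assumptions, the generic building blocks such a model typically relies on: softmax attention over per-frame scores to pick salient frames, the Gaussian reparameterization step of a variational autoencoder, and the KL term against a standard normal prior. All function names, the dot-product scoring, and the top-k selection rule are hypothetical choices made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_summary(frame_features, query, k):
    """Hypothetical saliency step: score each frame by its dot product
    with a query vector, normalise with softmax, and keep the k frames
    with the largest weights, returned in temporal order."""
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frame_features]
    weights = softmax(scores)
    top = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    return sorted(top), weights


def reparameterize(mu, log_var, rng):
    """Standard VAE reparameterization: z = mu + sigma * eps, eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for a diagonal Gaussian."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


# Toy data (made up): four frames with 2-D features and a 2-D query.
frames = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1], [1.5, 1.2]]
query = [1.0, 1.0]
selected, weights = attention_summary(frames, query, k=2)

rng = random.Random(0)
z = reparameterize([0.0, 0.0], [0.0, 0.0], rng)
kl = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
```

In this toy setting the selected frame indices play the role of the summary, and the latent sample `z` stands in for the semantic code that a real model would learn from web videos; how the paper actually conditions the summarizer on that code is not specified in the abstract and is not reproduced here.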
Pages: 193-210
Number of pages: 18