Weakly-Supervised Video Summarization Using Variational Encoder-Decoder and Web Prior

Cited by: 36

Authors:

Cai, Sijia [1,2]

Zuo, Wangmeng [3]

Davis, Larry S. [4]

Zhang, Lei [1]

Affiliations:

[1] Hong Kong Polytech Univ, Dept Comp, Kowloon, Hong Kong, Peoples R China

[2] DAMO Acad, Alibaba Grp, Hangzhou, Peoples R China

[3] Harbin Inst Technol, Sch Comp Sci & Technol, Harbin, Peoples R China

[4] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA

Source:

COMPUTER VISION - ECCV 2018, PT XIV | 2018 / Vol. 11218

Keywords:

Video summarization; Variational autoencoder;

DOI:

10.1007/978-3-030-01264-9_12

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video summarization is a challenging under-constrained problem because the underlying summary of a single video strongly depends on users' subjective understandings. Data-driven approaches, such as deep neural networks, can deal with the ambiguity inherent in this task to some extent, but it is extremely expensive to acquire the temporal annotations of a large-scale video dataset. To leverage the plentiful web-crawled videos to improve the performance of video summarization, we present a generative modelling framework to learn the latent semantic video representations to bridge the benchmark data and web data. Specifically, our framework couples two important components: a variational autoencoder for learning the latent semantics from web videos, and an encoder-attention-decoder for saliency estimation of raw video and summary generation. A loss term to learn the semantic matching between the generated summaries and web videos is presented, and the overall framework is further formulated into a unified conditional variational encoder-decoder, called variational encoder-summarizer-decoder (VESD). Experiments conducted on the challenging datasets CoSum and TVSum demonstrate the superior performance of the proposed VESD to existing state-of-the-art methods. The source code of this work can be found at https://github.com/cssjcai/vesd.
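
Illustrative note: the abstract names two generic building blocks, a variational autoencoder and an attention mechanism over video segments. The following is a minimal sketch of those textbook components only, assuming diagonal-Gaussian latents and softmax attention; it is not the VESD implementation, and every function name in it is hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), one value per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def softmax(scores):
    """Numerically stable softmax, used here as attention weights."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(segment_features, scores):
    """Attention-weighted average of per-segment feature vectors."""
    weights = softmax(scores)
    dim = len(segment_features[0])
    return [sum(w * f[d] for w, f in zip(weights, segment_features))
            for d in range(dim)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    mu, log_var = [0.5, -0.5], [0.0, 0.0]  # hypothetical encoder outputs
    z = reparameterize(mu, log_var, rng)
    print("latent sample:", z)
    print("KL term:", kl_to_standard_normal(mu, log_var))
    # Two hypothetical video segments with made-up saliency scores.
    print("attended feature:", attend([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0]))
```

In a VAE-style objective, the KL term above is added to a reconstruction loss; how the paper combines it with its semantic matching loss is described in the paper itself and in the linked repository, not here.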


Pages: 193-210

Page count: 18

