Progressive Video Summarization via Multimodal Self-supervised Learning

Cited by: 16
Authors
Li, Haopeng [1]
Ke, Qiuhong [3]
Gong, Mingming [2]
Drummond, Tom [1]
Affiliations
[1] Univ Melbourne, Sch Comp & Informat Syst, Melbourne, Vic, Australia
[2] Univ Melbourne, Sch Math & Stat, Melbourne, Vic, Australia
[3] Monash Univ, Dept Data Sci & AI, Clayton, Vic, Australia
Keywords
DOI
10.1109/WACV56688.2023.00554
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Modern video summarization methods are based on deep neural networks that require a large amount of annotated data for training. However, existing datasets for video summarization are small-scale, which easily leads to over-fitting of the deep models. Considering that annotating large-scale datasets is time-consuming, we propose a multimodal self-supervised learning framework to obtain semantic representations of videos, which benefits the video summarization task. Specifically, the self-supervised learning is conducted by exploring the semantic consistency between videos and text in both coarse-grained and fine-grained fashions, as well as by recovering masked frames in the videos. The multimodal framework is trained on a newly collected dataset that consists of video-text pairs. Additionally, we introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries. Extensive experiments demonstrate the effectiveness and superiority of our method in terms of rank correlation coefficients and F-score.
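The abstract reports results in rank correlation coefficients and F-score but does not say which coefficients are used or how summaries are matched. The following is a minimal sketch of how such metrics could be computed, assuming Kendall's tau and Spearman's rho between predicted and reference frame-importance scores, and an F-score over binary frame-selection masks. The function names and the toy inputs are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def kendall_tau(pred, ref):
    """Kendall's tau-a: (concordant - discordant) / total pairs. Tied pairs count as neither."""
    if len(pred) != len(ref) or len(pred) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(pred)), 2):
        sign = (pred[i] - pred[j]) * (ref[i] - ref[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(pred) * (len(pred) - 1) // 2
    return (concordant - discordant) / n_pairs


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in order[start:end + 1]:
            ranks[k] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(pred, ref):
    """Spearman's rho, computed as the Pearson correlation of the two rank vectors."""
    if len(pred) != len(ref) or len(pred) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rp, rr = average_ranks(pred), average_ranks(ref)
    n = len(rp)
    mean_p, mean_r = sum(rp) / n, sum(rr) / n
    cov = sum((a - mean_p) * (b - mean_r) for a, b in zip(rp, rr))
    var_p = sum((a - mean_p) ** 2 for a in rp)
    var_r = sum((b - mean_r) ** 2 for b in rr)
    if var_p == 0 or var_r == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_p * var_r) ** 0.5


def f_score(pred_mask, ref_mask):
    """Harmonic mean of precision and recall over selected frames (1 = selected)."""
    overlap = sum(1 for p, r in zip(pred_mask, ref_mask) if p and r)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(1 for p in pred_mask if p)
    recall = overlap / sum(1 for r in ref_mask if r)
    return 2 * precision * recall / (precision + recall)


# Toy example (hypothetical scores, not data from the paper).
predicted_scores = [0.1, 0.4, 0.35, 0.8]
reference_scores = [0.2, 0.3, 0.5, 0.9]
tau = kendall_tau(predicted_scores, reference_scores)
rho = spearman_rho(predicted_scores, reference_scores)
f1 = f_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Rank correlation compares the ordering of importance scores directly, while the F-score only compares which frames end up selected, so the two can disagree on the same prediction.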
Pages: 5573-5582
Number of pages: 10