Video summarization with u-shaped transformer

Cited by: 8
|
Authors
Chen, Yaosen [1 ,3 ]
Guo, Bing [1 ]
Shen, Yan [2 ]
Zhou, Renshuang [1 ,3 ]
Lu, Weichen [3 ]
Wang, Wei [1 ,3 ,4 ]
Wen, Xuming [3 ,4 ]
Suo, Xinhua [1 ]
Affiliations
[1] Sichuan Univ, Coll Comp Sci, Chengdu 610065, Sichuan, Peoples R China
[2] Chengdu Univ Informat Technol, Sch Comp Sci, Chengdu 610225, Sichuan, Peoples R China
[3] ChengDu Sobey Digital Technol Co Ltd, Media Intelligence Lab, Chengdu 610041, Sichuan, Peoples R China
[4] Peng Cheng Lab, Shenzhen 518055, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video summarization; Transformer; Multi-scale;
DOI
10.1007/s10489-022-03451-1
CLC number
TP18 [Theory of Artificial Intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In recent years, supervised video summarization has made tremendous progress by treating it as a sequence-to-sequence learning task. However, traditional recurrent neural networks (RNNs) have limitations in modeling long sequences, and using a transformer for sequence modeling requires a large number of parameters. To address this issue, we propose an efficient U-shaped transformer for video summarization, which we call "Uformer". Precisely, Uformer consists of three key components: embedding, Uformer block, and prediction head. First, the image feature sequence is extracted by a pre-trained deep convolutional network and projected by a linear embedding. The frame-to-frame feature differences are projected by another linear embedding, and the two are concatenated to form a two-stream embedding feature in the embedding component. Second, we stack multiple transformer layers into a U-shaped block to integrate the representations learned in the previous layers. The multi-scale Uformer can not only capture longer sequence information but also reduce the number of parameters and computations. Finally, the prediction head regresses the locations of the keyframes and learns the corresponding classification scores. Uformer is combined with non-maximum suppression (NMS) as post-processing to obtain the final video summary. We improved the F-score from 50.2% to 53.9% (by 3.7%) on the SumMe dataset and from 62.1% to 63.0% (by 0.9%) on the TVSum dataset. Our proposed model has 0.85M parameters, only 32.32% of DR-DSN's.
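The NMS post-processing step mentioned in the abstract can be sketched as greedy suppression over scored temporal segments. This is a minimal illustrative sketch, not the paper's implementation: the function names, the segment representation `(start, end)`, and the IoU threshold of 0.5 are all assumptions for illustration.

```python
# Hypothetical sketch of 1D temporal NMS: predicted keyframe segments with
# classification scores are pruned so that among highly overlapping segments
# only the highest-scoring one survives.

def temporal_iou(a, b):
    """Intersection-over-union of two 1D segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(segments, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept segments, best score first."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining segment that overlaps the kept one too much.
        order = [i for i in order
                 if temporal_iou(segments[best], segments[i]) < iou_threshold]
    return keep

segments = [(0, 10), (2, 12), (30, 40)]
scores = [0.9, 0.8, 0.7]
print(temporal_nms(segments, scores))  # -> [0, 2]: (2, 12) is suppressed
```

The kept segments would then be assembled into the final summary, e.g. by taking frames inside the surviving segments up to a length budget.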
Pages: 17864 - 17880
Page count: 17