Video summarization with u-shaped transformer

Cited by: 8
Authors
Chen, Yaosen [1 ,3 ]
Guo, Bing [1 ]
Shen, Yan [2 ]
Zhou, Renshuang [1 ,3 ]
Lu, Weichen [3 ]
Wang, Wei [1 ,3 ,4 ]
Wen, Xuming [3 ,4 ]
Suo, Xinhua [1 ]
Affiliations
[1] Sichuan Univ, Coll Comp Sci, Chengdu 610065, Sichuan, Peoples R China
[2] Chengdu Univ Informat Technol, Sch Comp Sci, Chengdu 610225, Sichuan, Peoples R China
[3] ChengDu Sobey Digital Technol Co Ltd, Media Intelligence Lab, Chengdu 610041, Sichuan, Peoples R China
[4] Peng Cheng Lab, Shenzhen 518055, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video summarization; Transformer; Multi-scale;
DOI
10.1007/s10489-022-03451-1
CLC (Chinese Library Classification) number
TP18 [Theory of Artificial Intelligence];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In recent years, supervised video summarization has made tremendous progress by treating it as a sequence-to-sequence learning task. However, traditional recurrent neural networks (RNNs) have limitations in modeling long sequences, and using a transformer for sequence modeling requires a large number of parameters. To address this issue, we propose an efficient U-shaped transformer for video summarization, which we call "Uformer". Specifically, Uformer consists of three key components: embedding, the Uformer block, and a prediction head. First, the image feature sequence is extracted by a pre-trained deep convolutional network and projected by a linear embedding; the differences between consecutive image features are projected by another linear embedding, and the two streams are concatenated to form a two-stream embedding feature in the embedding component. Second, we stack multiple transformer layers into a U-shaped block to integrate the representations learned from the previous layers. The multi-scale Uformer can not only capture longer sequence information but also reduce the number of parameters and computations. Finally, the prediction head regresses the locations of keyframes and learns the corresponding classification scores. Uformer is combined with non-maximum suppression (NMS) as post-processing to obtain the final video summary. We improve the F-score from 50.2% to 53.9% (a gain of 3.7%) on the SumMe dataset and from 62.1% to 63.0% (a gain of 0.9%) on the TVSum dataset. Our proposed model has only 0.85M parameters, which is 32.32% of DR-DSN's parameter count.
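As a rough illustration of the pipeline described in the abstract, the sketch below shows one way the two-stream embedding, a U-shaped transformer block, and the prediction head could fit together in PyTorch. It is not the authors' implementation: the feature dimension (1024), embedding width, layer depths, the average-pool/repeat down-and-up-sampling scheme, and the (center, length) regression targets are all assumptions for illustration, and the NMS post-processing step is omitted.

```python
# Minimal sketch of the described pipeline (assumed PyTorch); dimensions, depths,
# and the down/up-sampling scheme are illustrative guesses, not the authors' code.
import torch
import torch.nn as nn


class TwoStreamEmbedding(nn.Module):
    """Linear embeddings of frame features and of frame-to-frame differences, concatenated."""
    def __init__(self, feat_dim=1024, embed_dim=128):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, embed_dim)
        self.diff_proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, feats):  # feats: (B, T, feat_dim) from a pre-trained CNN
        diffs = torch.cat([torch.zeros_like(feats[:, :1]),
                           feats[:, 1:] - feats[:, :-1]], dim=1)
        return torch.cat([self.feat_proj(feats), self.diff_proj(diffs)], dim=-1)


class UformerBlock(nn.Module):
    """U-shaped stack: encode, pool to a coarser temporal scale, then upsample and fuse."""
    def __init__(self, dim=256, heads=4, depth=2):
        super().__init__()
        make = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, dim * 2, batch_first=True), depth)
        self.enc, self.mid, self.dec = make(), make(), make()
        self.pool = nn.AvgPool1d(kernel_size=2)
        self.fuse = nn.Linear(dim * 2, dim)

    def forward(self, x):  # x: (B, T, dim), T assumed even for simplicity
        e = self.enc(x)
        m = self.pool(e.transpose(1, 2)).transpose(1, 2)  # halve the temporal length
        m = self.mid(m)
        m = torch.repeat_interleave(m, 2, dim=1)          # restore the original length
        return self.dec(self.fuse(torch.cat([e, m], dim=-1)))  # skip-connection fusion


class Uformer(nn.Module):
    """Two-stream embedding -> U-shaped transformer -> per-frame score and segment regression."""
    def __init__(self, feat_dim=1024, embed_dim=128):
        super().__init__()
        dim = 2 * embed_dim
        self.embed = TwoStreamEmbedding(feat_dim, embed_dim)
        self.block = UformerBlock(dim=dim)
        self.cls = nn.Linear(dim, 1)   # keyframe classification score
        self.reg = nn.Linear(dim, 2)   # assumed (center offset, length) of a key segment

    def forward(self, feats):
        h = self.block(self.embed(feats))
        return torch.sigmoid(self.cls(h)).squeeze(-1), self.reg(h)


scores, boxes = Uformer()(torch.randn(1, 64, 1024))  # 64 frames of 1024-d CNN features
print(scores.shape, boxes.shape)  # torch.Size([1, 64]) torch.Size([1, 64, 2])
```

The U-shape here is minimal (a single downsampling level); the multi-scale design in the paper stacks more levels so that the coarser scales extend the temporal receptive field while keeping the parameter count low.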
Pages: 17864-17880
Number of pages: 17