Video Summarization With Spatiotemporal Vision Transformer

Cited by: 8
Authors
Hsu, Tzu-Chun [1 ]
Liao, Yi-Sheng [1 ]
Huang, Chun-Rong [1 ,2 ,3 ]
Affiliations
[1] Natl Chung Hsing Univ, Dept Comp Sci & Engn, Taichung 402, Taiwan
[2] Natl Cheng Kung Univ, Cross Coll Elite Program, Tainan 701, Taiwan
[3] Natl Cheng Kung Univ, Acad Innovat Semicond & Sustainable Mfg, Tainan 701, Taiwan
Keywords
Correlation; Transformers; Spatiotemporal phenomena; Indexes; Generative adversarial networks; Feature extraction; Task analysis; Video summarization; transformer; vision transformer; multi-head self-attention; temporal inter-frame correlation; spatial intra-frame attention; multi-frame loss; SHOT; LOCALIZATION;
DOI
10.1109/TIP.2023.3275069
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video summarization aims to generate a compact summary of the original video for efficient video browsing. To provide video summaries that are consistent with human perception and contain important content, supervised learning-based video summarization methods have been proposed. These methods learn important content from the continuous frame information of human-created summaries. However, recent methods rarely consider, at the same time, both the inter-frame correlations among non-adjacent frames and the intra-frame attention that attracts human viewers when representing frame importance. To address these issues, we propose a novel transformer-based method, the spatiotemporal vision transformer (STVT), for video summarization. The STVT is composed of three main components: an embedded sequence module, a temporal inter-frame attention (TIA) encoder, and a spatial intra-frame attention (SIA) encoder. The embedded sequence module represents each frame by fusing the frame embedding, index embedding, and segment class embedding. The TIA encoder learns the temporal inter-frame correlations among non-adjacent frames with a multi-head self-attention scheme, and the SIA encoder then learns the spatial intra-frame attention of each frame. Finally, a multi-frame loss is computed to drive the learning of the network in an end-to-end trainable manner. By simultaneously using both inter-frame and intra-frame information, our method outperforms state-of-the-art methods on both the SumMe and TVSum datasets. The source code of the spatiotemporal vision transformer will be available at https://github.com/nchucvml/STVT.
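The abstract's pipeline can be illustrated with a minimal NumPy sketch: frame, index, and segment class embeddings are summed into an embedded sequence, and a multi-head self-attention step then relates every frame to every other (including non-adjacent ones), as in the TIA encoder. This is an illustrative approximation, not the authors' implementation; all function names, the sinusoidal index embedding, and the random weights are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax over the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def embed_sequence(frame_emb, seg_ids):
    """Fuse frame, index, and segment class embeddings (shapes: (T, d), (T,))."""
    T, d = frame_emb.shape
    # Sinusoidal index (position) embedding over frame indices.
    pos = np.arange(T)[:, None] / (10000.0 ** (np.arange(d)[None, :] / d))
    index_emb = np.where(np.arange(d) % 2 == 0, np.sin(pos), np.cos(pos))
    # Learnable segment class embedding table (random here for illustration).
    seg_table = np.random.default_rng(0).standard_normal((seg_ids.max() + 1, d)) * 0.02
    return frame_emb + index_emb + seg_table[seg_ids]

def multi_head_self_attention(x, n_heads):
    """Relate every frame to every other frame, including non-adjacent ones."""
    T, d = x.shape
    dh = d // n_heads
    rng = np.random.default_rng(1)
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    heads = []
    for h in range(n_heads):
        sl = slice(h * dh, (h + 1) * dh)
        # Scaled dot-product attention: a (T, T) matrix of inter-frame weights.
        att = softmax(q[:, sl] @ k[:, sl].T / np.sqrt(dh))
        heads.append(att @ v[:, sl])
    return np.concatenate(heads, axis=1) @ Wo

# Toy run: 16 frames with 8-dim features, all in one segment, 2 attention heads.
frames = np.random.default_rng(2).standard_normal((16, 8))
seq = embed_sequence(frames, np.zeros(16, dtype=int))
out = multi_head_self_attention(seq, n_heads=2)
print(out.shape)  # (16, 8): one refined feature per frame
```

Because the attention matrix spans all frame pairs, frame 0 can attend directly to frame 15, which is the property the abstract emphasizes for non-adjacent inter-frame correlation; the SIA encoder would apply the same mechanism within each frame's spatial patches instead.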
Pages: 3013-3026
Page count: 14
Related Papers
50 records in total
  • [1] Video Summarization With Frame Index Vision Transformer
    Hsu, Tzu-Chun
    Liao, Yi-Sheng
    Huang, Chun-Rong
    [J]. PROCEEDINGS OF 17TH INTERNATIONAL CONFERENCE ON MACHINE VISION APPLICATIONS (MVA 2021), 2021,
  • [2] Efficient Transformer for Video Summarization
    Kolmakova, Tatiana
    Makarov, Ilya
    [J]. ADVANCES IN COMPUTATIONAL INTELLIGENCE, IWANN 2023, PT II, 2023, 14135 : 52 - 65
  • [3] Spatiotemporal Feature Fusion for Video Summarization
    Kashid, Shamal
    Awasthi, Lalit K.
    Berwal, Krishan
    Saini, Parul
    [J]. IEEE MULTIMEDIA, 2024, 31 (03) : 88 - 97
  • [4] A spatiotemporal motion model for video summarization
    Vasconcelos, N
    Lippman, A
    [J]. 1998 IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, PROCEEDINGS, 1998, : 361 - 366
  • [5] Video summarization with u-shaped transformer
    Chen, Yaosen
    Guo, Bing
    Shen, Yan
    Zhou, Renshuang
    Lu, Weichen
    Wang, Wei
    Wen, Xuming
    Suo, Xinhua
    [J]. APPLIED INTELLIGENCE, 2022, 52 (15) : 17864 - 17880
  • [7] ViViT: A Video Vision Transformer
    Arnab, Anurag
    Dehghani, Mostafa
    Heigold, Georg
    Sun, Chen
    Lucic, Mario
    Schmid, Cordelia
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 6816 - 6826
  • [8] Multi-Level Spatiotemporal Network for Video Summarization
    Yao, Ming
    Bai, Yu
    Du, Wei
    Zhang, Xuejun
    Quan, Heng
    Cai, Fuli
    Kang, Hongwei
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022,
  • [9] Spatiotemporal Modeling and Label Distribution Learning for Video Summarization
    Chu, Wei-Ta
    Liu, Yu-Hsin
    [J]. 2019 IEEE 21ST INTERNATIONAL WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING (MMSP 2019), 2019,
  • [10] Deepfake Video Detection with Spatiotemporal Dropout Transformer
    Zhang, Daichi
    Lin, Fanzhao
    Hua, Yingying
    Wang, Pengju
    Zeng, Dan
    Ge, Shiming
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 5833 - 5841