Video summarization with u-shaped transformer

Cited by: 8
|
Authors
Chen, Yaosen [1 ,3 ]
Guo, Bing [1 ]
Shen, Yan [2 ]
Zhou, Renshuang [1 ,3 ]
Lu, Weichen [3 ]
Wang, Wei [1 ,3 ,4 ]
Wen, Xuming [3 ,4 ]
Suo, Xinhua [1 ]
Affiliations
[1] Sichuan Univ, Coll Comp Sci, Chengdu 610065, Sichuan, Peoples R China
[2] Chengdu Univ Informat Technol, Sch Comp Sci, Chengdu 610225, Sichuan, Peoples R China
[3] ChengDu Sobey Digital Technol Co Ltd, Media Intelligence Lab, Chengdu 610041, Sichuan, Peoples R China
[4] Peng Cheng Lab, Shenzhen 518055, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video summarization; Transformer; Multi-scale;
DOI
10.1007/s10489-022-03451-1
CLC number
TP18 [Theory of Artificial Intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In recent years, supervised video summarization has made tremendous progress by treating it as a sequence-to-sequence learning task. However, traditional recurrent neural networks (RNNs) have limitations in modeling long sequences, and using a transformer for sequence modeling requires a large number of parameters. To address this issue, we propose an efficient U-shaped transformer for video summarization, which we call "Uformer". Precisely, Uformer consists of three key components: embedding, Uformer block, and prediction head. First, the image feature sequence is extracted by a pre-trained deep convolutional network and projected by a linear embedding. The frame-to-frame feature differences are projected by another linear embedding, and the two are concatenated to form a two-stream embedding feature in the embedding component. Second, we stack multiple transformer layers into a U-shaped block to integrate the representations learned in the previous layers. The multi-scale Uformer can not only capture longer sequence information but also reduce the number of parameters and computations. Finally, the prediction head regresses the locations of the keyframes and learns the corresponding classification scores. Uformer is combined with non-maximum suppression (NMS) as post-processing to obtain the final video summary. We improved the F-score from 50.2% to 53.9% (by 3.7%) on the SumMe dataset and from 62.1% to 63.0% (by 0.9%) on the TVSum dataset. Our proposed model has 0.85M parameters, only 32.32% of DR-DSN's.
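The NMS post-processing step mentioned in the abstract can be sketched as greedy suppression over scored temporal segments. This is a minimal illustrative sketch, not the paper's implementation: the function names, the segment representation `(start, end)`, and the IoU threshold of 0.5 are all assumptions for illustration.

```python
# Hypothetical sketch of 1D temporal NMS: predicted keyframe segments with
# classification scores are pruned so that among highly overlapping segments
# only the highest-scoring one survives.

def temporal_iou(a, b):
    """Intersection-over-union of two 1D segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(segments, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept segments, best score first."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining segment that overlaps the kept one too much.
        order = [i for i in order
                 if temporal_iou(segments[best], segments[i]) < iou_threshold]
    return keep

segments = [(0, 10), (2, 12), (30, 40)]
scores = [0.9, 0.8, 0.7]
print(temporal_nms(segments, scores))  # -> [0, 2]: (2, 12) is suppressed
```

The kept segments would then be assembled into the final summary, e.g. by taking frames inside the surviving segments up to a length budget.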
Pages: 17864 - 17880
Page count: 17