Video summarization with u-shaped transformer

Cited by: 8
Authors
Chen, Yaosen [1 ,3 ]
Guo, Bing [1 ]
Shen, Yan [2 ]
Zhou, Renshuang [1 ,3 ]
Lu, Weichen [3 ]
Wang, Wei [1 ,3 ,4 ]
Wen, Xuming [3 ,4 ]
Suo, Xinhua [1 ]
Affiliations
[1] Sichuan Univ, Coll Comp Sci, Chengdu 610065, Sichuan, Peoples R China
[2] Chengdu Univ Informat Technol, Sch Comp Sci, Chengdu 610225, Sichuan, Peoples R China
[3] ChengDu Sobey Digital Technol Co Ltd, Media Intelligence Lab, Chengdu 610041, Sichuan, Peoples R China
[4] Peng Cheng Lab, Shenzhen 518055, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video summarization; Transformer; Multi-scale;
DOI
10.1007/s10489-022-03451-1
CLC (Chinese Library Classification) number
TP18 [Theory of Artificial Intelligence];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In recent years, supervised video summarization has made tremendous progress by treating it as a sequence-to-sequence learning task. However, traditional recurrent neural networks (RNNs) have limitations in modeling long sequences, and using a transformer for sequence modeling requires a large number of parameters. To address this issue, we propose an efficient U-shaped transformer for video summarization, which we call "Uformer". Specifically, Uformer consists of three key components: embedding, the Uformer block, and a prediction head. First, the image feature sequence is extracted by a pre-trained deep convolutional network and projected by a linear embedding; the differences between consecutive image features are projected by another linear embedding, and the two streams are concatenated to form a two-stream embedding feature in the embedding component. Second, we stack multiple transformer layers into a U-shaped block to integrate the representations learned from the previous layers. The multi-scale Uformer can not only capture longer sequence information but also reduce the number of parameters and computations. Finally, the prediction head regresses the locations of keyframes and learns the corresponding classification scores. Uformer is combined with non-maximum suppression (NMS) as post-processing to obtain the final video summary. We improve the F-score from 50.2% to 53.9% (a gain of 3.7%) on the SumMe dataset and from 62.1% to 63.0% (a gain of 0.9%) on the TVSum dataset. Our proposed model has only 0.85M parameters, which is 32.32% of DR-DSN's parameter count.
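As a rough illustration of the pipeline described in the abstract, the sketch below shows one way the two-stream embedding, a U-shaped transformer block, and the prediction head could fit together in PyTorch. It is not the authors' implementation: the feature dimension (1024), embedding width, layer depths, the average-pool/repeat down-and-up-sampling scheme, and the (center, length) regression targets are all assumptions for illustration, and the NMS post-processing step is omitted.

```python
# Minimal sketch of the described pipeline (assumed PyTorch); dimensions, depths,
# and the down/up-sampling scheme are illustrative guesses, not the authors' code.
import torch
import torch.nn as nn


class TwoStreamEmbedding(nn.Module):
    """Linear embeddings of frame features and of frame-to-frame differences, concatenated."""
    def __init__(self, feat_dim=1024, embed_dim=128):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, embed_dim)
        self.diff_proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, feats):  # feats: (B, T, feat_dim) from a pre-trained CNN
        diffs = torch.cat([torch.zeros_like(feats[:, :1]),
                           feats[:, 1:] - feats[:, :-1]], dim=1)
        return torch.cat([self.feat_proj(feats), self.diff_proj(diffs)], dim=-1)


class UformerBlock(nn.Module):
    """U-shaped stack: encode, pool to a coarser temporal scale, then upsample and fuse."""
    def __init__(self, dim=256, heads=4, depth=2):
        super().__init__()
        make = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, dim * 2, batch_first=True), depth)
        self.enc, self.mid, self.dec = make(), make(), make()
        self.pool = nn.AvgPool1d(kernel_size=2)
        self.fuse = nn.Linear(dim * 2, dim)

    def forward(self, x):  # x: (B, T, dim), T assumed even for simplicity
        e = self.enc(x)
        m = self.pool(e.transpose(1, 2)).transpose(1, 2)  # halve the temporal length
        m = self.mid(m)
        m = torch.repeat_interleave(m, 2, dim=1)          # restore the original length
        return self.dec(self.fuse(torch.cat([e, m], dim=-1)))  # skip-connection fusion


class Uformer(nn.Module):
    """Two-stream embedding -> U-shaped transformer -> per-frame score and segment regression."""
    def __init__(self, feat_dim=1024, embed_dim=128):
        super().__init__()
        dim = 2 * embed_dim
        self.embed = TwoStreamEmbedding(feat_dim, embed_dim)
        self.block = UformerBlock(dim=dim)
        self.cls = nn.Linear(dim, 1)   # keyframe classification score
        self.reg = nn.Linear(dim, 2)   # assumed (center offset, length) of a key segment

    def forward(self, feats):
        h = self.block(self.embed(feats))
        return torch.sigmoid(self.cls(h)).squeeze(-1), self.reg(h)


scores, boxes = Uformer()(torch.randn(1, 64, 1024))  # 64 frames of 1024-d CNN features
print(scores.shape, boxes.shape)  # torch.Size([1, 64]) torch.Size([1, 64, 2])
```

The U-shape here is minimal (a single downsampling level); the multi-scale design in the paper stacks more levels so that the coarser scales extend the temporal receptive field while keeping the parameter count low.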
Pages: 17864-17880
Number of pages: 17