Temporal Cue Guided Video Highlight Detection with Low-Rank Audio-Visual Fusion

Cited by: 11

Authors:

Ye, Qinghao [1,2]

Shen, Xiyue [3]

Gao, Yuan [4]

Wang, Zirui [1]

Bi, Qi [5]

Li, Ping [1]

Yang, Guang [6]

Affiliations:

[1] Hangzhou Dianzi Univ, Hangzhou, Peoples R China

[2] Univ Calif San Diego, San Diego, CA 92103 USA

[3] East China Normal Univ, Shanghai, Peoples R China

[4] Univ Oxford, Oxford, England

[5] Wuhan Univ, Wuhan, Peoples R China

[6] Imperial Coll London, London, England

Source:

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) | 2021

Funding:

National Natural Science Foundation of China

DOI:

10.1109/ICCV48922.2021.00785

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Video highlight detection plays an increasingly important role in social media content filtering; however, developing automated video highlight detection methods remains highly challenging because temporal annotations (i.e., where the highlight moments are in long videos) are lacking for supervised learning. In this paper, we propose a novel weakly supervised method that learns to detect highlights by mining video characteristics with video-level annotations (topic tags) only. In particular, we exploit audio-visual features to enhance the video representation and take temporal cues into account to improve detection performance. Our contributions are threefold: 1) we propose an audio-visual tensor fusion mechanism that efficiently models the complex association between the two modalities while reducing the heterogeneity gap between them; 2) we introduce a novel hierarchical temporal context encoder to embed local temporal cues between neighboring segments; 3) we theoretically alleviate the gradient vanishing problem during model optimization with attention-gated instance aggregation. Extensive experiments on two benchmark datasets (YouTube Highlights and TVSum) demonstrate that our method outperforms other state-of-the-art methods by a remarkable margin.
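The abstract names two mechanisms without giving their formulas: low-rank audio-visual fusion and attention-gated aggregation of segment-level instances. The sketch below is only an illustration of those general ideas, not the authors' implementation. It assumes the generic rank-R factorization of bilinear (outer-product) fusion and the gated attention pooling commonly used in multiple-instance learning. Every name, dimension, and the random weights (`low_rank_fusion`, `gated_attention_pool`, `RANK`, and so on) are hypothetical, and the hierarchical temporal context encoder is omitted.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def low_rank_fusion(audio, visual, audio_factors, visual_factors):
    """Fuse two feature vectors with rank-R factor matrices (assumed form).

    Each rank contributes (A_r @ [audio; 1]) * (V_r @ [visual; 1]) elementwise.
    Summing over ranks approximates a full bilinear tensor fusion without
    materialising the audio-visual outer product.
    """
    a = audio + [1.0]   # the appended 1 keeps unimodal terms in the product
    v = visual + [1.0]
    out_dim = len(audio_factors[0])
    fused = [0.0] * out_dim
    for A_r, V_r in zip(audio_factors, visual_factors):
        pa, pv = matvec(A_r, a), matvec(V_r, v)
        fused = [f + x * y for f, x, y in zip(fused, pa, pv)]
    return fused


def gated_attention_pool(segments, V, U, w):
    """Aggregate segment features into one video-level feature (assumed form).

    score_k = w . (tanh(V h_k) * sigmoid(U h_k)); weights = softmax(scores).
    """
    scores = []
    for h in segments:
        t = [math.tanh(x) for x in matvec(V, h)]
        g = [1.0 / (1.0 + math.exp(-x)) for x in matvec(U, h)]
        scores.append(sum(wi * ti * gi for wi, ti, gi in zip(w, t, g)))
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(segments[0])
    pooled = [sum(wk * h[i] for wk, h in zip(weights, segments)) for i in range(dim)]
    return pooled, weights


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
AUDIO_DIM, VISUAL_DIM, FUSED_DIM, RANK, HIDDEN, NUM_SEGMENTS = 3, 4, 5, 2, 4, 6

audio_factors = [random_matrix(FUSED_DIM, AUDIO_DIM + 1, rng) for _ in range(RANK)]
visual_factors = [random_matrix(FUSED_DIM, VISUAL_DIM + 1, rng) for _ in range(RANK)]
V = random_matrix(HIDDEN, FUSED_DIM, rng)
U = random_matrix(HIDDEN, FUSED_DIM, rng)
w = [rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)]

# One fused feature per video segment, built from toy audio and visual inputs.
segments = []
for _ in range(NUM_SEGMENTS):
    audio = [rng.uniform(-1, 1) for _ in range(AUDIO_DIM)]
    visual = [rng.uniform(-1, 1) for _ in range(VISUAL_DIM)]
    segments.append(low_rank_fusion(audio, visual, audio_factors, visual_factors))

video_feature, attention = gated_attention_pool(segments, V, U, w)
print(len(video_feature), round(sum(attention), 6))
```

In a weakly supervised setting of the kind the abstract describes, the pooled video-level feature would feed a classifier trained on topic tags, and the per-segment attention weights could then serve as candidate highlight scores. That reading is an assumption about how such a pipeline is typically wired, not a statement of the paper's design.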

Pages: 7930-7939

Page count: 10
