Temporal Cue Guided Video Highlight Detection with Low-Rank Audio-Visual Fusion

Cited by: 11
Authors
Ye, Qinghao [1, 2]
Shen, Xiyue [3]
Gao, Yuan [4]
Wang, Zirui [1]
Bi, Qi [5]
Li, Ping [1]
Yang, Guang [6]
Affiliations
[1] Hangzhou Dianzi Univ, Hangzhou, Peoples R China
[2] Univ Calif San Diego, San Diego, CA 92103 USA
[3] East China Normal Univ, Shanghai, Peoples R China
[4] Univ Oxford, Oxford, England
[5] Wuhan Univ, Wuhan, Peoples R China
[6] Imperial Coll London, London, England
Funding
National Natural Science Foundation of China
DOI
10.1109/ICCV48922.2021.00785
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video highlight detection plays an increasingly important role in social media content filtering. However, developing automated video highlight detection methods remains highly challenging because temporal annotations (i.e., where the highlight moments are in long videos) are lacking for supervised learning. In this paper, we propose a novel weakly supervised method that learns to detect highlights by mining video characteristics with video-level annotations (topic tags) only. In particular, we exploit audio-visual features to enhance the video representation and take temporal cues into account to improve detection performance. Our contributions are threefold: 1) we propose an audio-visual tensor fusion mechanism that efficiently models the complex association between the two modalities while reducing the heterogeneity gap between them; 2) we introduce a novel hierarchical temporal context encoder to embed local temporal clues between neighboring segments; 3) we show theoretically that attention-gated instance aggregation alleviates the gradient vanishing problem during model optimization. Extensive experiments on two benchmark datasets (YouTube Highlights and TVSum) demonstrate that our method outperforms other state-of-the-art methods with remarkable improvements.
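The abstract names a low-rank audio-visual tensor fusion and an attention-gated instance aggregation but gives no formulas. As a rough illustration only, the sketch below shows one generic way such components are often built: a rank-R factorized bilinear fusion of an audio and a visual vector, followed by gated-attention pooling over segment features. It is not the authors' implementation; the function names, tensor shapes, rank, and random weights are all assumptions chosen for the example.

```python
import numpy as np


def low_rank_fusion(audio, visual, Wa, Wv):
    """Fuse one audio and one visual vector with rank-R factors.

    Assumed (hypothetical) shapes:
      audio: (da,), visual: (dv,)
      Wa: (R, da + 1, h), Wv: (R, dv + 1, h)
    Returns a fused vector of shape (h,).
    """
    a = np.append(audio, 1.0)   # appended 1 keeps unimodal terms
    v = np.append(visual, 1.0)
    pa = np.einsum("i,rih->rh", a, Wa)  # project audio per rank: (R, h)
    pv = np.einsum("j,rjh->rh", v, Wv)  # project visual per rank: (R, h)
    # Summing elementwise products over ranks equals contracting the
    # outer product of a and v with a rank-R weight tensor.
    return (pa * pv).sum(axis=0)


def gated_attention_pool(H, V, U, w):
    """Aggregate K segment features H (K, h) into one video vector.

    V, U: (h, d) and w: (d,) are hypothetical attention parameters.
    Returns (pooled vector of shape (h,), attention weights of shape (K,)).
    """
    sigmoid = 0.5 * (1.0 + np.tanh(0.5 * (H @ U)))  # stable sigmoid gate
    scores = (np.tanh(H @ V) * sigmoid) @ w         # one score per segment
    scores = scores - scores.max()                  # stabilize softmax
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ H, alpha


# Toy usage with random stand-ins for real features and learned weights.
rng = np.random.default_rng(0)
K, da, dv, h, d, R = 6, 8, 12, 16, 4, 3
Wa = rng.normal(size=(R, da + 1, h))
Wv = rng.normal(size=(R, dv + 1, h))
segments = np.stack([
    low_rank_fusion(rng.normal(size=da), rng.normal(size=dv), Wa, Wv)
    for _ in range(K)
])
video_vec, alpha = gated_attention_pool(
    segments,
    rng.normal(size=(h, d)),
    rng.normal(size=(h, d)),
    rng.normal(size=d),
)
print(video_vec.shape, alpha.round(3))
```

In a weakly supervised setting like the one described, the attention weights `alpha` could be read as per-segment importance learned from video-level labels alone, but how the paper actually scores highlights is not stated in this record.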
Pages: 7930-7939
Number of pages: 10