Attention-Based Audio-Visual Fusion for Video Summarization

被引:0
|
作者
Fang, Yinghong [1 ]
Zhang, Junpeng [1 ]
Lu, Cewu [1 ]
机构
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Shanghai, Peoples R China
关键词
Video summarization; Audio-visual fusion; Self-attention; CHALLENGES;
D O I
10.1007/978-3-030-36711-4_28
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Video summarization compresses videos while preserving the most meaningful content for users. Many image-based works focus on how to effectively utilize video visual cues to choose keyframes. However, apart from visual content, videos also contain useful audio information. In this paper, we propose a novel attention-based audio-visual fusion framework which integrates the audio information with visual information. Our framework is composed of two key components: asymmetrical self-attention mechanism, and odd-even attention. The asymmetrical self-attention mechanism addresses the problem that visual information is more strongly related to video summarization than audio information. The odd-even attention focuses on alleviating the memory requirements. Besides, we create ViAu-SumMe, an audio-visual dataset, which is based on SumMe dataset. Experimental results on the dataset show that our proposed method outperforms the state-of-the-art methods.
引用
收藏
页码:328 / 340
页数:13
相关论文
共 50 条
  • [1] Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition
    Sterpu, George
    Saam, Christian
    Harte, Naomi
    [J]. ICMI'18: PROCEEDINGS OF THE 20TH ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, 2018, : 111 - 115
  • [2] Attention-based Visual-Audio Fusion for Video Caption Generation
    Guo, Ningning
    Liu, Huaping
    Jiang, Linhua
    [J]. 2019 IEEE 4TH INTERNATIONAL CONFERENCE ON ADVANCED ROBOTICS AND MECHATRONICS (ICARM 2019), 2019, : 839 - 844
  • [3] Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams
    Hou, Yuanbo
    Yu, Zhesong
    Liang, Xia
    Du, Xingjian
    Zhu, Bilei
    Ma, Zejun
    Botteldooren, Dick
    [J]. INTERSPEECH 2021, 2021, : 321 - 325
  • [4] VIDEO CODING BASED ON AUDIO-VISUAL ATTENTION
    Lee, Jong-Seok
    De Simone, Francesca
    Ebrahimi, Touradj
    [J]. ICME: 2009 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOLS 1-3, 2009, : 57 - 60
  • [5] A audio-visual model for efficient video summarization
    El-Nagar, Gamal
    El-Sawy, Ahmed
    Rashad, Metwally
    [J]. JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2024, 100
  • [6] Efficient video coding based on audio-visual focus of attention
    Lee, Jong-Seok
    De Simone, Francesca
    Ebrahimi, Touradj
    [J]. JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2011, 22 (08) : 704 - 711
  • [7] Audio-Visual Fusion Based on Interactive Attention for Person Verification
    Jing, Xuebin
    He, Liang
    Song, Zhida
    Wang, Shaolei
    [J]. SENSORS, 2023, 23 (24)
  • [8] NOISE-TOLERANT AUDIO-VISUAL ONLINE PERSON VERIFICATION USING AN ATTENTION-BASED NEURAL NETWORK FUSION
    Shon, Suwon
    Oh, Tae-Hyun
    Glass, James
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 3995 - 3999
  • [9] Multi-Attention Audio-Visual Fusion Network for Audio Spatialization
    Zhang, Wen
    Shao, Jie
    [J]. PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 394 - 401
  • [10] The effect of using video title in attention-based video summarization
    Li, Changwei
    Yeh, Zhi-Ting
    Gunuganti, Jeshmitha
    Chang, Jia-Bin
    Norouzi, Mehdi
    [J]. 2024 2ND ASIA CONFERENCE ON COMPUTER VISION, IMAGE PROCESSING AND PATTERN RECOGNITION, CVIPPR 2024, 2024,