Attention-Based Audio-Visual Fusion for Video Summarization

Cited by: 0

Authors:

Fang, Yinghong [1]

Zhang, Junpeng [1]

Lu, Cewu [1]

Affiliations:

[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Shanghai, Peoples R China

Source:

NEURAL INFORMATION PROCESSING (ICONIP 2019), PT II | 2019 / Vol. 11954

Keywords:

Video summarization; Audio-visual fusion; Self-attention; CHALLENGES

DOI:

10.1007/978-3-030-36711-4_28

Chinese Library Classification:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Video summarization compresses videos while preserving the most meaningful content for users. Many image-based methods focus on how to use visual cues effectively to choose keyframes. However, apart from visual content, videos also contain useful audio information. In this paper, we propose a novel attention-based audio-visual fusion framework that integrates audio information with visual information. Our framework is composed of two key components: an asymmetrical self-attention mechanism and odd-even attention. The asymmetrical self-attention mechanism addresses the problem that visual information is more strongly related to video summarization than audio information. The odd-even attention is designed to reduce memory requirements. In addition, we create ViAu-SumMe, an audio-visual dataset based on the SumMe dataset. Experimental results on this dataset show that our proposed method outperforms state-of-the-art methods.

Pages: 328-340

Page count: 13

50 items in total

[1] Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition
Sterpu, George
Saam, Christian
Harte, Naomi
[J]. ICMI'18: PROCEEDINGS OF THE 20TH ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, 2018, : 111 - 115
[2] Attention-based Visual-Audio Fusion for Video Caption Generation
Guo, Ningning
Liu, Huaping
Jiang, Linhua
[J]. 2019 IEEE 4TH INTERNATIONAL CONFERENCE ON ADVANCED ROBOTICS AND MECHATRONICS (ICARM 2019), 2019, : 839 - 844
[3] Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams
Hou, Yuanbo
Yu, Zhesong
Liang, Xia
Du, Xingjian
Zhu, Bilei
Ma, Zejun
Botteldooren, Dick
[J]. INTERSPEECH 2021, 2021, : 321 - 325
[4] VIDEO CODING BASED ON AUDIO-VISUAL ATTENTION
Lee, Jong-Seok
De Simone, Francesca
Ebrahimi, Touradj
[J]. ICME: 2009 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOLS 1-3, 2009, : 57 - 60
[5] A audio-visual model for efficient video summarization
El-Nagar, Gamal
El-Sawy, Ahmed
Rashad, Metwally
[J]. JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2024, 100
[6] Efficient video coding based on audio-visual focus of attention
Lee, Jong-Seok
De Simone, Francesca
Ebrahimi, Touradj
[J]. JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2011, 22 (08) : 704 - 711
[7] Audio-Visual Fusion Based on Interactive Attention for Person Verification
Jing, Xuebin
He, Liang
Song, Zhida
Wang, Shaolei
[J]. SENSORS, 2023, 23 (24)
[8] NOISE-TOLERANT AUDIO-VISUAL ONLINE PERSON VERIFICATION USING AN ATTENTION-BASED NEURAL NETWORK FUSION
Shon, Suwon
Oh, Tae-Hyun
Glass, James
[J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 3995 - 3999
[9] Multi-Attention Audio-Visual Fusion Network for Audio Spatialization
Zhang, Wen
Shao, Jie
[J]. PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 394 - 401
[10] The effect of using video title in attention-based video summarization
Li, Changwei
Yeh, Zhi-Ting
Gunuganti, Jeshmitha
Chang, Jia-Bin
Norouzi, Mehdi
[J]. 2024 2ND ASIA CONFERENCE ON COMPUTER VISION, IMAGE PROCESSING AND PATTERN RECOGNITION, CVIPPR 2024, 2024,