Attention-Based Audio-Visual Fusion for Video Summarization

Cited by: 0

Authors:

Fang, Yinghong [1]

Zhang, Junpeng [1]

Lu, Cewu [1]

Affiliations:

[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Shanghai, Peoples R China

Source:

NEURAL INFORMATION PROCESSING (ICONIP 2019), PT II | 2019 / Vol. 11954

Keywords:

Video summarization; Audio-visual fusion; Self-attention; CHALLENGES;

DOI:

10.1007/978-3-030-36711-4_28

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Video summarization compresses videos while preserving the content most meaningful to users. Many image-based approaches focus on using visual cues effectively to select keyframes. However, videos also carry useful audio information in addition to their visual content. In this paper, we propose a novel attention-based audio-visual fusion framework that integrates audio information with visual information. The framework has two key components: an asymmetrical self-attention mechanism and odd-even attention. The asymmetrical self-attention mechanism accounts for the fact that visual information is more strongly related to video summarization than audio information is. The odd-even attention reduces memory requirements. In addition, we create ViAu-SumMe, an audio-visual dataset based on the SumMe dataset. Experimental results on this dataset show that the proposed method outperforms state-of-the-art methods.

Pages: 328-340

Page count: 13
