Attention-Based Audio-Visual Fusion for Video Summarization

Authors
Fang, Yinghong [1]
Zhang, Junpeng [1]
Lu, Cewu [1]
Affiliation
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Shanghai, Peoples R China
Keywords
Video summarization; Audio-visual fusion; Self-attention; Challenges
DOI
10.1007/978-3-030-36711-4_28
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video summarization compresses a video while preserving its most meaningful content for users. Many image-based methods focus on using visual cues effectively to select keyframes. However, videos also carry useful audio information in addition to visual content. In this paper, we propose a novel attention-based audio-visual fusion framework that integrates audio information with visual information. The framework has two key components: an asymmetrical self-attention mechanism and odd-even attention. The asymmetrical self-attention mechanism accounts for the fact that visual information is more strongly related to video summarization than audio information. The odd-even attention reduces memory requirements. In addition, we create ViAu-SumMe, an audio-visual dataset based on the SumMe dataset. Experimental results on this dataset show that the proposed method outperforms state-of-the-art methods.
Pages: 328-340
Page count: 13