Multi-Attention Audio-Visual Fusion Network for Audio Spatialization

Cited by: 1
Authors
Zhang, Wen [1]
Shao, Jie [2]
Affiliations
[1] Univ Elect Sci & Technol China, Chengdu, Sichuan, Peoples R China
[2] Sichuan Artificial Intelligence Res Inst, Yibin, Sichuan, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
deep learning; joint audio-visual learning; audio spatialization; SOUND
DOI
10.1145/3460426.3463624
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In our daily life, we are exposed to a large number of video files. Compared with video containing only mono audio, video with stereo audio provides a better audio-visual experience. However, many ordinary users do not have the professional equipment needed to record videos with high-quality stereo. To make it more convenient for users to obtain videos with stereo, we propose an effective method to convert the mono audio in a video into stereo. One key to this task is how to effectively inject visual information extracted from video frames into the audio signal. We design a novel multi-attention fusion network (MAFNet), based on the self-attention mechanism, to extract the spatial features related to the sound source in the video frames and fuse them effectively into the audio features. Furthermore, to obtain higher-quality stereo, we design an additional iterative structure that refines the generated stereo sound over several iterations. Our proposed approach is validated on two challenging video datasets (FAIR-Play and YT-MUSIC) and achieves new state-of-the-art performance.
Pages: 394-401
Page count: 8