Visually Guided Binaural Audio Generation Method Based on Hierarchical Feature Encoding and Decoding

Authors:
Wang R.-Q. [1 ]
Cheng H.-N. [2 ]
Ye L. [2 ]
Affiliations:
[1] Key Laboratory of Media Audio & Video, Communication University of China, Ministry of Education, Beijing
[2] State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing
Source: Ruan Jian Xue Bao/Journal of Software, 2024, Vol. 35, No. 5
Keywords:
binaural audio; hierarchical feature encoding and decoding; multimodal learning; skip connection; visually guided audio generation;
DOI:
10.13328/j.cnki.jos.007027
Abstract:
Visually guided binaural audio generation is an important multimodal learning task with broad application value. The goal is to generate binaural audio that maintains audiovisual consistency with given visual information and a mono audio signal. Existing visually guided binaural audio generation methods yield unsatisfactory results because they make insufficient use of audiovisual information in the encoding stage and neglect shallow features in the decoding stage. To solve these problems, this study proposes a visually guided binaural audio generation method based on hierarchical feature encoding and decoding, which effectively improves the quality of the generated binaural audio. To narrow the heterogeneity gap that hinders the association and fusion of audiovisual data, an encoder structure that hierarchically encodes and fuses audiovisual features is proposed, raising the efficiency with which audiovisual data are exploited in the encoding stage. To make effective use of shallow structural features during decoding, a decoder structure with deep-to-shallow skip connections between feature layers of different depths is constructed, so that both the shallow detail features and the deep features of the audiovisual modalities are fully used. Benefiting from this efficient use of audiovisual information and the hierarchical combination of deep and shallow features, the proposed method handles binaural audio generation effectively even in complex visual scenes. Compared with existing methods, it improves generation performance by over 6% in terms of realism. © 2024 Chinese Academy of Sciences. All rights reserved.
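The abstract outlines two architectural ideas: an encoder that fuses visual features into every audio encoding level, and a decoder whose deep-to-shallow skip connections carry shallow detail features to the output. The PyTorch sketch below is only an illustration of these two ideas, not the authors' implementation; the layer widths, the sigmoid-gated fusion, and all module names are assumptions made for the example.

```python
import torch
import torch.nn as nn

class HierarchicalAVEncoder(nn.Module):
    """Hypothetical encoder: fuses a visual feature vector into every
    audio encoding level rather than only at the bottleneck."""
    def __init__(self, vis_dim=512, base=32, levels=4):
        super().__init__()
        chans = [base * 2 ** i for i in range(levels)]        # e.g. [32, 64, 128, 256]
        self.convs = nn.ModuleList(
            nn.Conv2d(2 if i == 0 else chans[i - 1], chans[i], 4, 2, 1)
            for i in range(levels))
        # One visual projection per level (an assumed fusion mechanism)
        self.vis_proj = nn.ModuleList(nn.Linear(vis_dim, c) for c in chans)

    def forward(self, spec, vis_feat):
        # spec: (B, 2, F, T) mono spectrogram (real/imag); vis_feat: (B, vis_dim)
        feats, x = [], spec
        for conv, proj in zip(self.convs, self.vis_proj):
            x = torch.relu(conv(x))
            gate = torch.sigmoid(proj(vis_feat))[:, :, None, None]
            x = x * gate                       # channel-wise audiovisual gating (assumption)
            feats.append(x)
        return feats                           # shallow-to-deep feature pyramid

class SkipDecoder(nn.Module):
    """Hypothetical decoder: upsamples from the deepest feature and
    concatenates the matching encoder level at every stage, so shallow
    detail features reach the output directly."""
    def __init__(self, base=32, levels=4, out_ch=2):
        super().__init__()
        chans = [base * 2 ** i for i in range(levels)]
        ins = [chans[-1]] + [2 * c for c in reversed(chans[:-1])]   # widths doubled by skips
        outs = list(reversed(chans[:-1])) + [out_ch]
        self.deconvs = nn.ModuleList(
            nn.ConvTranspose2d(ci, co, 4, 2, 1) for ci, co in zip(ins, outs))

    def forward(self, feats):
        x = torch.relu(self.deconvs[0](feats[-1]))
        for i, deconv in enumerate(self.deconvs[1:], start=1):
            x = deconv(torch.cat([x, feats[-1 - i]], dim=1))    # deep-to-shallow skip
            if i < len(self.deconvs) - 1:
                x = torch.relu(x)
        return x                               # e.g. a predicted difference spectrogram

# Minimal smoke test with arbitrary shapes
spec = torch.randn(1, 2, 256, 64)
vis = torch.randn(1, 512)
out = SkipDecoder()(HierarchicalAVEncoder()(spec, vis))
print(out.shape)                               # torch.Size([1, 2, 256, 64])
```

In pipelines of this kind, the decoder output is typically interpreted as a predicted left-right difference spectrogram, which is combined with the input mono spectrogram to recover the two binaural channels.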
Pages: 2165-2175 (10 pages)