Visually Guided Binaural Audio Generation Method Based on Hierarchical Feature Encoding and Decoding

Citations: 0
Authors
Wang R.-Q. [1 ]
Cheng H.-N. [2 ]
Ye L. [2 ]
Affiliations
[1] Key Laboratory of Media Audio & Video, Communication University of China, Ministry of Education, Beijing
[2] State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing
Source
Ruan Jian Xue Bao/Journal of Software | 2024, Vol. 35, No. 5
Keywords
binaural audio; hierarchical feature encoding and decoding; multimodal learning; skip connection; visually guided audio generation;
DOI
10.13328/j.cnki.jos.007027
Abstract
Visually guided binaural audio generation is an important multimodal learning task with wide application value. Its goal is to generate binaural audio that is audiovisually consistent with the given visual information and mono audio. Existing visually guided methods produce unsatisfactory binaural audio because they under-utilize audiovisual information in the encoding stage and neglect shallow features in the decoding stage. To address these problems, this study proposes a visually guided binaural audio generation method based on hierarchical feature encoding and decoding, which effectively improves the quality of the generated binaural audio. To narrow the heterogeneous gap that hinders the association and fusion of audiovisual modal data, an encoder structure that hierarchically encodes and fuses audiovisual features is proposed, improving the utilization of audiovisual data in the encoding stage. To exploit shallow structural features during decoding, a decoder structure with skip connections between feature layers of different depths, from deep to shallow, is constructed, making full use of both the shallow detail features and the deep features of the audiovisual modalities. Benefiting from the efficient use of audiovisual information and the hierarchical combination of deep and shallow structural features, the proposed method can effectively handle binaural audio generation in complex visual scenes. Compared with existing methods, its generation performance is improved by over 6% in terms of realism. © 2024 Chinese Academy of Sciences. All rights reserved.
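The abstract describes a U-Net-style generator with two key ingredients: visual features fused into the audio encoder at every level (hierarchical encoding and fusion), and deep-to-shallow skip connections in the decoder. The sketch below is a minimal illustration of that idea only; the class name, layer widths, and additive fusion scheme are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class HierarchicalAVUNet(nn.Module):
    """Illustrative audio-visual U-Net: the mono spectrogram is encoded
    level by level, a global visual feature is fused at each level, and
    the decoder reuses encoder features via skip connections."""

    def __init__(self, audio_ch=2, vis_dim=512):
        super().__init__()
        chs = [audio_ch, 32, 64, 128]
        # Encoder: each level halves the spectrogram resolution.
        self.enc = nn.ModuleList(
            nn.Conv2d(chs[i], chs[i + 1], 4, stride=2, padding=1)
            for i in range(3))
        # Project the visual feature to each encoder level's channel width,
        # so it can be fused hierarchically (additive fusion assumed here).
        self.vis_proj = nn.ModuleList(
            nn.Linear(vis_dim, chs[i + 1]) for i in range(3))
        # Decoder: upsample; from the second level on, the input channels
        # double because the matching encoder feature is concatenated.
        self.dec = nn.ModuleList([
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),
            nn.ConvTranspose2d(64 * 2, 32, 4, stride=2, padding=1),
            nn.ConvTranspose2d(32 * 2, audio_ch, 4, stride=2, padding=1)])

    def forward(self, spec, vis_feat):
        skips = []
        x = spec
        for conv, proj in zip(self.enc, self.vis_proj):
            x = torch.relu(conv(x))
            # Hierarchical fusion: inject the visual feature at this depth.
            x = x + proj(vis_feat)[:, :, None, None]
            skips.append(x)
        y = skips[-1]
        for i, deconv in enumerate(self.dec):
            if i > 0:
                # Skip connection (deep to shallow): reuse the shallow
                # encoder detail features at the matching resolution.
                y = torch.cat([y, skips[-1 - i]], dim=1)
            y = torch.relu(deconv(y)) if i < len(self.dec) - 1 else deconv(y)
        # Output, e.g. the predicted left/right difference spectrogram.
        return y
```

With a (batch, 2, 256, 64) complex-spectrogram tensor and a 512-dimensional visual feature, the output has the same shape as the input spectrogram, as required for predicting a channel-difference signal.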
Pages: 2165-2175 (10 pages)