Cross-modal Audiovisual Separation Based on U-Net Network Combining Optical Flow Algorithm and Attention Mechanism

Cited by: 0
Authors
Lan C. [1 ]
Jiang P. [1 ]
Chen H. [2 ]
Han C. [1 ]
Guo X. [1 ]
Affiliations
[1] School of Measurement and Control Technology and Communication Engineering, Harbin University of Science and Technology, Harbin
[2] China Ship Design and Research Center, Wuhan
Funding
National Natural Science Foundation of China
Keywords
Audio-visual integration; Audio-Visual Speech Separation (AVSS); Cross-modal attention; Optical flow algorithm;
DOI
10.11999/JEIT221500
Abstract
Most current audiovisual separation models simply concatenate video and audio features without fully modeling the interrelationship between the modalities, so visual information is underutilized. To address this issue, a cross-modal fusion optical Flow-Audio Visual Speech Separation (Flow-AVSS) model is proposed, which combines the Farneback optical flow algorithm and a U-Net network through a multi-head attention mechanism. Motion features and lip features are extracted by the Farneback algorithm and the lightweight network ShuffleNet v2, respectively; the motion features are combined with the lip features by an affine transformation, and the result is passed through a Temporal CoNvolution module (TCN) to obtain the visual features. To exploit the visual information fully, a multi-head attention mechanism is applied in the feature-fusion stage to fuse the visual and audio features across modalities. Finally, the fused audio-visual features are fed into a U-Net separation network to obtain the separated speech. Experiments are conducted on the AVspeech dataset using the Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Source-to-Distortion Ratio (SDR) metrics. The results show that the proposed method improves performance by 2.23 dB and 1.68 dB, respectively, compared with an audio-only speech separation network and an audio-visual separation network based on feature concatenation, indicating that cross-modal attention-based feature fusion makes fuller use of the correlations between modalities. In addition, the added lip-motion features effectively improve the robustness of the video features and the separation performance. © 2023 Science Press. All rights reserved.
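The pipeline described in the abstract can be illustrated with a minimal sketch. The Python fragment below shows two of the stages named there: Farneback dense optical flow for lip-motion features and multi-head cross-modal attention for audio-visual fusion. Module names, feature dimensions, hyperparameters, and the choice of which modality serves as the attention query are assumptions made for illustration; this is not the authors' released implementation.

```python
# Minimal sketch (not the authors' code): illustrates two stages named in the
# abstract, under assumed shapes and hyperparameters.
import cv2
import numpy as np
import torch
import torch.nn as nn


def farneback_motion_features(gray_frames):
    """Dense optical flow between consecutive grayscale lip-region frames.

    gray_frames: list of (H, W) uint8 arrays; returns (T-1, H, W, 2) flow fields.
    """
    flows = []
    for prev, nxt in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        flows.append(flow)
    return np.stack(flows)


class CrossModalAttentionFusion(nn.Module):
    """Fuse audio and visual streams with multi-head attention (hypothetical layout).

    Here the audio embedding queries the visual embedding; the paper may use a
    different query/key assignment.
    """

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feat, visual_feat):
        # audio_feat:  (B, Ta, dim) mixture-spectrogram embedding
        # visual_feat: (B, Tv, dim) lip + motion embedding after the TCN
        fused, _ = self.attn(query=audio_feat, key=visual_feat, value=visual_feat)
        # Residual audio stream conditioned on the visual stream.
        return self.norm(audio_feat + fused)


if __name__ == "__main__":
    fusion = CrossModalAttentionFusion()
    audio = torch.randn(2, 200, 256)   # dummy audio embedding
    video = torch.randn(2, 75, 256)    # dummy visual embedding
    print(fusion(audio, video).shape)  # torch.Size([2, 200, 256])
```

In this sketch the fused features keep the audio time resolution, which is what a U-Net separation network operating on spectrogram frames would expect; the affine combination of motion and lip features and the TCN encoder are assumed to happen before `visual_feat` is formed.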
Pages: 3538-3546
Number of pages: 8