Audio-Visual Speech Separation and Dereverberation With a Two-Stage Multimodal Network

Cited by: 29
Authors:
Tan, Ke [1]
Xu, Yong [2]
Zhang, Shi-Xiong [2]
Yu, Meng [2]
Yu, Dong [2]
Affiliations:
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[2] Tencent AI Lab, Bellevue, WA 98004 USA
Keywords:
Noise measurement; Visualization; Speech processing; Reverberation; Microphone arrays; Training; Audio-visual; multimodal; speech separation and dereverberation; far-field; two-stage; deep learning; ENHANCEMENT; MASKING;
DOI
10.1109/JSTSP.2020.2987209
Chinese Library Classification (CLC): TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline Classification Codes: 0808; 0809
Abstract:
Background noise, interfering speech and room reverberation frequently distort target speech in real listening environments. In this study, we address joint speech separation and dereverberation, which aims to separate target speech from background noise, interfering speech and room reverberation. To tackle this fundamentally difficult problem, we propose a novel multimodal network that exploits both audio and visual signals. The proposed network architecture adopts a two-stage strategy, where a separation module is employed to attenuate background noise and interfering speech in the first stage, and a dereverberation module to suppress room reverberation in the second stage. The two modules are first trained separately, and then integrated for joint training based on a new multi-objective loss function. Our experimental results show that the proposed multimodal network yields consistently better objective intelligibility and perceptual quality than several one-stage and two-stage baselines. We find that our network achieves a 21.10% improvement in ESTOI and a 0.79 improvement in PESQ over the unprocessed mixtures. Moreover, our network architecture does not require knowledge of the number of speakers.
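The abstract outlines a two-stage design: a separation module followed by a dereverberation module, first trained separately and then jointly with a multi-objective loss. The following is a minimal PyTorch sketch of that idea, assuming magnitude-spectrogram inputs, per-frame visual embeddings, simple GRU-based masking modules and an equally weighted MSE loss; the module choices, feature dimensions and loss weights are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class TwoStageAVNet(nn.Module):
    """Sketch of a two-stage audio-visual pipeline (assumed layer choices)."""
    def __init__(self, audio_dim=257, visual_dim=512, hidden=256):
        super().__init__()
        # Stage 1: attenuate background noise and interfering speech,
        # conditioned on visual features of the target speaker.
        self.separation = nn.GRU(audio_dim + visual_dim, hidden, batch_first=True)
        self.sep_mask = nn.Linear(hidden, audio_dim)
        # Stage 2: suppress room reverberation in the separated signal.
        self.dereverb = nn.GRU(audio_dim, hidden, batch_first=True)
        self.drv_mask = nn.Linear(hidden, audio_dim)

    def forward(self, noisy_mag, visual_feat):
        # noisy_mag: (batch, frames, audio_dim) magnitude spectrogram
        # visual_feat: (batch, frames, visual_dim) lip/face embeddings
        x = torch.cat([noisy_mag, visual_feat], dim=-1)
        h1, _ = self.separation(x)
        separated = torch.sigmoid(self.sep_mask(h1)) * noisy_mag      # stage-1 output
        h2, _ = self.dereverb(separated)
        dereverbed = torch.sigmoid(self.drv_mask(h2)) * separated     # stage-2 output
        return separated, dereverbed

def multi_objective_loss(separated, dereverbed, reverb_clean, anechoic_clean, alpha=0.5):
    # Weighted sum of stage-wise losses; the equal weighting (alpha=0.5) is an assumption.
    mse = nn.functional.mse_loss
    return alpha * mse(separated, reverb_clean) + (1 - alpha) * mse(dereverbed, anechoic_clean)

# Example usage with dummy tensors (shapes are assumptions):
# model = TwoStageAVNet()
# sep, drv = model(torch.rand(2, 100, 257), torch.rand(2, 100, 512))

In this sketch, each stage is trained against its own target (reverberant clean speech for stage 1, anechoic clean speech for stage 2) before joint fine-tuning on the combined loss, mirroring the separate-then-joint training strategy described in the abstract.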
Pages: 542-553
Number of pages: 12