Audio-Visual Speech Separation and Dereverberation With a Two-Stage Multimodal Network

Cited by: 29
Authors
Tan, Ke [1 ]
Xu, Yong [2 ]
Zhang, Shi-Xiong [2 ]
Yu, Meng [2 ]
Yu, Dong [2 ]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[2] Tencent AI Lab, Bellevue, WA 98004 USA
Keywords
Noise measurement; Visualization; Speech processing; Reverberation; Microphone arrays; Training; Audio-visual; multimodal; speech separation and dereverberation; far-field; two-stage; deep learning; ENHANCEMENT; MASKING;
DOI
10.1109/JSTSP.2020.2987209
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Subject Classification Codes
0808; 0809
Abstract
Background noise, interfering speech and room reverberation frequently distort target speech in real listening environments. In this study, we address joint speech separation and dereverberation, which aims to separate target speech from background noise, interfering speech and room reverberation. To tackle this fundamentally difficult problem, we propose a novel multimodal network that exploits both audio and visual signals. The proposed network architecture adopts a two-stage strategy, where a separation module attenuates background noise and interfering speech in the first stage, and a dereverberation module suppresses room reverberation in the second stage. The two modules are first trained separately and then integrated for joint training, which is based on a new multi-objective loss function. Our experimental results show that the proposed multimodal network yields consistently better objective intelligibility and perceptual quality than several one-stage and two-stage baselines. Specifically, our network achieves a 21.10% improvement in ESTOI and a 0.79 improvement in PESQ over the unprocessed mixtures. Moreover, our network architecture does not require knowledge of the number of speakers.
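To make the two-stage strategy described above concrete, the following is a minimal PyTorch sketch: a separation stage consumes the mixture spectrogram together with a visual embedding, a dereverberation stage refines the separated output, and a weighted multi-objective loss supervises both stages. All module names, layer sizes, feature dimensions, and the loss weighting here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a two-stage audio-visual separation/dereverberation network.
# Everything below (module names, dimensions, loss weighting) is an assumption
# made for illustration; the paper's actual architecture differs in detail.
import torch
import torch.nn as nn


class SeparationStage(nn.Module):
    """Stage 1: attenuate noise and interfering speech using audio + visual cues."""
    def __init__(self, audio_dim=257, visual_dim=512, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim + visual_dim, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, audio_dim), nn.Sigmoid())

    def forward(self, mix_mag, visual_emb):
        # mix_mag: (B, T, F) mixture magnitude; visual_emb: (B, T, visual_dim)
        h, _ = self.rnn(torch.cat([mix_mag, visual_emb], dim=-1))
        return self.mask(h) * mix_mag  # separated but still reverberant estimate


class DereverbStage(nn.Module):
    """Stage 2: suppress room reverberation in the separated estimate."""
    def __init__(self, audio_dim=257, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, audio_dim)

    def forward(self, sep_mag):
        h, _ = self.rnn(sep_mag)
        return torch.relu(self.out(h))  # estimate of the anechoic target speech


class TwoStageAVNet(nn.Module):
    """Separation followed by dereverberation; both outputs are returned for the joint loss."""
    def __init__(self):
        super().__init__()
        self.separate = SeparationStage()
        self.dereverb = DereverbStage()

    def forward(self, mix_mag, visual_emb):
        sep = self.separate(mix_mag, visual_emb)
        return sep, self.dereverb(sep)


def joint_loss(sep, derev, reverb_target, anechoic_target, alpha=0.5):
    # Multi-objective loss: supervise stage 1 against the reverberant target speech
    # and stage 2 against the anechoic target speech, combined with weight alpha
    # (the weighting and targets used in the paper may differ).
    mse = nn.functional.mse_loss
    return alpha * mse(sep, reverb_target) + (1 - alpha) * mse(derev, anechoic_target)
```

In this reading, the two stages would first be trained separately with their respective terms, then fine-tuned jointly with the combined loss, mirroring the training procedure summarized in the abstract.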
Pages: 542-553
Page count: 12