TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation

Cited by: 29
Authors
Wang Z.-Q. [1]
Cornell S. [2]
Choi S. [3]
Lee Y. [3]
Kim B.-Y. [3]
Watanabe S. [1]
Affiliations
[1] Carnegie Mellon University, The Language Technologies Institute, Pittsburgh, 15213, PA
[2] Università Politecnica delle Marche, The Department of Information Engineering, Ancona
[3] The Hyundai Motor Group and 42dot Inc., Seoul
Keywords
Acoustic beamforming; complex spectral mapping; full- and sub-band integration; speech separation
DOI
10.1109/TASLP.2023.3304482
Abstract
We propose TF-GridNet for speech separation. The model is a novel deep neural network (DNN) integrating full- and sub-band modeling in the time-frequency (T-F) domain. It stacks several blocks, each consisting of an intra-frame full-band module, a sub-band temporal module, and a cross-frame self-attention module. It is trained to perform complex spectral mapping, where the real and imaginary (RI) components of input signals are stacked as features to predict target RI components. We first evaluate it on monaural anechoic speaker separation. Without using data augmentation and dynamic mixing, it obtains a state-of-the-art 23.5 dB improvement in scale-invariant signal-to-distortion ratio (SI-SDR) on WSJ0-2mix, a standard dataset for two-speaker separation. To show its robustness to noise and reverberation, we evaluate it on monaural reverberant speaker separation using the SMS-WSJ dataset and on noisy-reverberant speaker separation using WHAMR!, and obtain state-of-the-art performance on both datasets. We then extend TF-GridNet to multi-microphone conditions through multi-microphone complex spectral mapping, and integrate it into a two-DNN system with a beamformer in between (named as MISO-BF-MISO in earlier studies), where the beamformer proposed in this article is a novel multi-frame Wiener filter computed based on the outputs of the first DNN. State-of-the-art performance is obtained on the multi-channel tasks of SMS-WSJ and WHAMR!. Besides speaker separation, we apply the proposed algorithms to speech dereverberation and noisy-reverberant speech enhancement. State-of-the-art performance is obtained on a dereverberation dataset and on the dataset of the recent L3DAS22 multi-channel speech enhancement challenge. © 2014 IEEE.
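The abstract reports results as an improvement in scale-invariant signal-to-distortion ratio (SI-SDR). As a rough illustration of what that metric measures, here is a minimal NumPy sketch of the commonly used SI-SDR definition. It is not taken from the paper: the function names `si_sdr` and `si_sdr_improvement`, the zero-mean normalization, and the `eps` stabilizer are assumptions made for this example.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """SI-SDR in dB between a 1-D estimate and a 1-D target (illustrative sketch).

    Assumption: both signals are made zero-mean first, a common convention.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    estimate = estimate - estimate.mean()
    target = target - target.mean()

    # Optimal scaling of the target, which makes the metric invariant to gain.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target      # part of the estimate explained by the target
    residual = estimate - projection  # everything else counts as distortion

    ratio = (np.dot(projection, projection) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_improvement(estimate, mixture, target):
    """SI-SDR of the estimate minus SI-SDR of the unprocessed mixture, in dB."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.standard_normal(16000)       # stand-in for a clean source
    interference = rng.standard_normal(16000)  # stand-in for the other speaker
    mixture = target + interference
    estimate = target + 0.1 * interference     # hypothetical separator output
    print(f"SI-SDR improvement: {si_sdr_improvement(estimate, mixture, target):.2f} dB")
```

In this toy setup the interference in the estimate is attenuated by a factor of 10 in amplitude, so the improvement comes out close to 20 dB. Rescaling the estimate by any nonzero gain leaves the score essentially unchanged, which is the "scale-invariant" part of the name.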
Pages: 3221-3236
Number of pages: 15
Related papers
50 in total
  • [1] FSI-Net: A dual-stage full- and sub-band integration network for full-band speech enhancement
    Yu, Guochen
    Wang, Hui
    Li, Andong
    Liu, Wenzhe
    Zhang, Yuan
    Wang, Yutian
    Zheng, Chengshi
    APPLIED ACOUSTICS, 2023, 211
  • [2] TfCleanformer: A streaming, array-agnostic, full- and sub-band modeling front-end for robust ASR
    Heitkaemper, Jens
    Caroselli, Joe
    Narayanan, Arun
    Howard, Nathan
    INTERSPEECH 2024, 2024, : 4473 - 4477
  • [3] ADAPTIVE-FSN: INTEGRATING FULL-BAND EXTRACTION AND ADAPTIVE SUB-BAND ENCODING FOR MONAURAL SPEECH ENHANCEMENT
    Tsao, Yu-Sheng
    Ho, Kuan-Hsun
    Hung, Jeih-Weih
    Chen, Berlin
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 458 - 464
  • [4] Sub-band speech recognition
    Primor, D
    Furst-Yust, M
    22ND CONVENTION OF ELECTRICAL AND ELECTRONICS ENGINEERS IN ISRAEL, PROCEEDINGS, 2002, : 10 - 12
  • [5] Sub-band weighted projection measure for sub-band speech recognition in noise
    Nasersharif, B.
    Akbari, A.
    ELECTRONICS LETTERS, 2006, 42 (14) : 829 - 831
  • [6] Modeling sub-band correlation for noise-robust speech recognition
    McAuley, J
    Ming, J
    Hanna, P
    Stewart, D
    2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS: SPEECH PROCESSING, 2004, : 1017 - 1020
  • [7] Lightweight Full-band and Sub-band Fusion Network for Real Time Speech Enhancement
    Chen, Zhuangqi
    Zhang, Pingjian
    INTERSPEECH 2022, 2022, : 921 - 925
  • [8] Microphone array sub-band speech recognition
    McCowan, IA
    Sridharan, S
    2001 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-VI, PROCEEDINGS: VOL I: SPEECH PROCESSING 1; VOL II: SPEECH PROCESSING 2 IND TECHNOL TRACK DESIGN & IMPLEMENTATION OF SIGNAL PROCESSING SYSTEMS NEURALNETWORKS FOR SIGNAL PROCESSING; VOL III: IMAGE & MULTIDIMENSIONAL SIGNAL PROCESSING MULTIMEDIA SIGNAL PROCESSING, 2001, : 185 - 188
  • [9] Sub-band based recognition of noisy speech
    Tibrewala, S
    Hermansky, H
    1997 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I - V: VOL I: PLENARY, EXPERT SUMMARIES, SPECIAL, AUDIO, UNDERWATER ACOUSTICS, VLSI; VOL II: SPEECH PROCESSING; VOL III: SPEECH PROCESSING, DIGITAL SIGNAL PROCESSING; VOL IV: MULTIDIMENSIONAL SIGNAL PROCESSING, NEURAL NETWORKS - VOL V: STATISTICAL SIGNAL AND ARRAY PROCESSING, APPLICATIONS, 1997, : 1255 - 1258
  • [10] DETECTING ADHD FROM SPEECH USING FULL-BAND AND SUB-BAND CONVOLUTION FUSION NETWORK
    Li, Shuanglin
    Nair, Rajesh
    Naqvi, Syed Mohsen
    2023 IEEE SENSORS, 2023