TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation

Cited by: 29
Authors
Wang Z.-Q. [1]
Cornell S. [2]
Choi S. [3]
Lee Y. [3]
Kim B.-Y. [3]
Watanabe S. [1]
Affiliations
[1] Carnegie Mellon University, The Language Technologies Institute, Pittsburgh, 15213, PA
[2] Università Politecnica delle Marche, The Department of Information Engineering, Ancona
[3] The Hyundai Motor Group and 42dot Inc., Seoul
Keywords
Acoustic beamforming; complex spectral mapping; full- and sub-band integration; speech separation
DOI
10.1109/TASLP.2023.3304482
Abstract
We propose TF-GridNet for speech separation. The model is a novel deep neural network (DNN) integrating full- and sub-band modeling in the time-frequency (T-F) domain. It stacks several blocks, each consisting of an intra-frame full-band module, a sub-band temporal module, and a cross-frame self-attention module. It is trained to perform complex spectral mapping, where the real and imaginary (RI) components of input signals are stacked as features to predict target RI components. We first evaluate it on monaural anechoic speaker separation. Without using data augmentation and dynamic mixing, it obtains a state-of-the-art 23.5 dB improvement in scale-invariant signal-to-distortion ratio (SI-SDR) on WSJ0-2mix, a standard dataset for two-speaker separation. To show its robustness to noise and reverberation, we evaluate it on monaural reverberant speaker separation using the SMS-WSJ dataset and on noisy-reverberant speaker separation using WHAMR!, and obtain state-of-the-art performance on both datasets. We then extend TF-GridNet to multi-microphone conditions through multi-microphone complex spectral mapping, and integrate it into a two-DNN system with a beamformer in between (named as MISO-BF-MISO in earlier studies), where the beamformer proposed in this article is a novel multi-frame Wiener filter computed based on the outputs of the first DNN. State-of-the-art performance is obtained on the multi-channel tasks of SMS-WSJ and WHAMR!. Besides speaker separation, we apply the proposed algorithms to speech dereverberation and noisy-reverberant speech enhancement. State-of-the-art performance is obtained on a dereverberation dataset and on the dataset of the recent L3DAS22 multi-channel speech enhancement challenge. © 2014 IEEE.
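The abstract reports results as an improvement in scale-invariant signal-to-distortion ratio (SI-SDR). As a rough illustration of what that metric measures, the following is a minimal sketch in plain Python. The function names `si_sdr` and `si_sdr_improvement` are hypothetical, and the mean removal step is an assumed common convention; this is not the evaluation code used in the paper.

```python
import math


def si_sdr(estimate, target):
    """Scale-invariant SDR in dB for two equal-length sample sequences.

    Illustrative sketch only; zero-mean normalization is an assumption.
    """
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")

    # Remove the mean so a constant offset does not affect the score.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    energy_t = sum(b * b for b in t)
    if energy_t == 0.0:
        raise ValueError("target has zero energy")

    # Project the estimate onto the target: this is the optimal rescaling
    # of the target, which makes the metric invariant to the estimate's gain.
    alpha = sum(a * b for a, b in zip(e, t)) / energy_t
    scaled = [alpha * b for b in t]
    residual = [a - s for a, s in zip(e, scaled)]

    energy_s = sum(s * s for s in scaled)
    energy_r = sum(r * r for r in residual)
    if energy_r == 0.0:
        return math.inf  # estimate is an exact rescaling of the target
    if energy_s == 0.0:
        return -math.inf  # estimate is orthogonal to the target
    return 10.0 * math.log10(energy_s / energy_r)


def si_sdr_improvement(estimate, mixture, target):
    """SI-SDR gain of the separated estimate over the unprocessed mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)
```

Under this definition, a figure such as the 23.5 dB reported for WSJ0-2mix would correspond to `si_sdr_improvement` averaged over the test utterances and speakers, with the input mixture as the baseline.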
Pages: 3221-3236
Page count: 15
Related papers
50 in total
  • [41] Modeling and System-Level Performance Evaluation of Sub-Band Full Duplexing for 5G-Advanced
    Mokhtari, Masoumeh
    Pocovi, Guillermo
    Maldonado, Roberto
    Pedersen, Klaus I.
    IEEE ACCESS, 2023, 11 : 71503 - 71516
  • [42] Enhancement of Noisy Speech using Sub-band Harmonic Regeneration and Speech Presence Uncertainty Estimator
    Kumar, Ravi
    Subbaiah, P. V.
    2016 IEEE INTERNATIONAL CONFERENCE ON RECENT TRENDS IN ELECTRONICS, INFORMATION & COMMUNICATION TECHNOLOGY (RTEICT), 2016, : 456 - 460
  • [43] Noise Aware Sub-band Locality Preserving Projection for Robust Speech Recognition
    Karevan, Zahra
    Akbari, Ahmad
    Nasersharif, Babak
    ARTIFICIAL INTELLIGENCE AND SIGNAL PROCESSING, AISP 2013, 2014, 427 : 203 - +
  • [44] Speech Enhancement using Sub-band Wiener Filter with Pitch Synchronous Analysis
    Sunnydayal, V.
    Kumar, T. Kishore
    2013 INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING, COMMUNICATIONS AND INFORMATICS (ICACCI), 2013, : 20 - 25
  • [45] Broadcast speech language recognition based on sub-band GMM-UBM
    Li, Siyi
    Dai, Beiqian
    Wang, Haixiang
    Shuju Caiji Yu Chuli/Journal of Data Acquisition and Processing, 2007, 22 (01): : 14 - 18
  • [46] Speech rate estimation via temporal correlation and selected sub-band correlation
    Narayanan, S
    Wang, D
    2005 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1-5: SPEECH PROCESSING, 2005, : 413 - 416
  • [47] The effect of modified filter distribution on an adaptive, sub-band speech enhancement method
    Darlington, DJ
    Campbell, DR
    1996 IEEE DIGITAL SIGNAL PROCESSING WORKSHOP, PROCEEDINGS, 1996, : 153 - 156
  • [48] Overlapped sub-band modulation spectrum normalization techniques for robust speech recognition
    Fan, Hao-teng
    Yeh, Wei-jeih
    Hung, Jeih-weih
    2013 10TH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY (FSKD), 2013, : 1035 - 1039
  • [49] Binaural sub-band adaptive speech enhancement using artificial neural networks
    Hussain, A
    Campbell, DR
    SPEECH COMMUNICATION, 1998, 25 (1-3) : 177 - 186
  • [50] Sub-Band Unvoiced/Voiced Parameter Extraction and Efficient Quantization for Speech Signal
    Chen, Liang
    Chen, Liang
    Zhang, Yi-peng
    Pang, Liang
    2013 3RD INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND NETWORK TECHNOLOGY (ICCSNT), 2013, : 1203 - 1207