TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation

Cited by: 29

Authors:

Wang Z.-Q. [1]

Cornell S. [2]

Choi S. [3]

Lee Y. [3]

Kim B.-Y. [3]

Watanabe S. [1]

Affiliations:

[1] Carnegie Mellon University, The Language Technologies Institute, Pittsburgh, 15213, PA

[2] Università Politecnica delle Marche, The Department of Information Engineering, Ancona

[3] The Hyundai Motor Group and 42dot Inc., Seoul

Source:

IEEE/ACM Transactions on Audio Speech and Language Processing | 2023, Vol. 31

Keywords:

Acoustic beamforming; complex spectral mapping; full- and sub-band integration; speech separation

DOI:

10.1109/TASLP.2023.3304482

Abstract:

We propose TF-GridNet for speech separation. The model is a novel deep neural network (DNN) integrating full- and sub-band modeling in the time-frequency (T-F) domain. It stacks several blocks, each consisting of an intra-frame full-band module, a sub-band temporal module, and a cross-frame self-attention module. It is trained to perform complex spectral mapping, where the real and imaginary (RI) components of input signals are stacked as features to predict target RI components. We first evaluate it on monaural anechoic speaker separation. Without using data augmentation or dynamic mixing, it obtains a state-of-the-art 23.5 dB improvement in scale-invariant signal-to-distortion ratio (SI-SDR) on WSJ0-2mix, a standard dataset for two-speaker separation. To show its robustness to noise and reverberation, we evaluate it on monaural reverberant speaker separation using the SMS-WSJ dataset and on noisy-reverberant speaker separation using WHAMR!, and obtain state-of-the-art performance on both datasets. We then extend TF-GridNet to multi-microphone conditions through multi-microphone complex spectral mapping, and integrate it into a two-DNN system with a beamformer in between (referred to as MISO-BF-MISO in earlier studies), where the beamformer proposed in this article is a novel multi-frame Wiener filter computed from the outputs of the first DNN. State-of-the-art performance is obtained on the multi-channel tasks of SMS-WSJ and WHAMR!. Besides speaker separation, we apply the proposed algorithms to speech dereverberation and noisy-reverberant speech enhancement. State-of-the-art performance is obtained on a dereverberation dataset and on the dataset of the recent L3DAS22 multi-channel speech enhancement challenge. © 2014 IEEE.
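To make two terms from the abstract concrete, here is a minimal NumPy sketch of (a) stacking the real and imaginary (RI) components of a complex spectrogram into a real-valued feature array, and (b) the SI-SDR metric together with the improvement over the unprocessed mixture. The function names (`stack_ri`, `si_sdr`, `si_sdr_improvement`), the `(2, frames, freqs)` layout, the mean removal, and the `eps` stabilizer are assumptions made for illustration; they are not taken from the paper and do not reproduce the authors' implementation.

```python
import numpy as np


def stack_ri(spec):
    """Stack real and imaginary parts of a complex spectrogram.

    spec: complex array of shape (frames, freqs).
    Returns a real array of shape (2, frames, freqs).
    The channel-first layout is an assumption for this sketch.
    """
    return np.stack([spec.real, spec.imag], axis=0)


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D time-domain signals.

    Both signals are mean-normalized first (a common convention,
    assumed here). eps only guards against division by zero.
    """
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to find the optimal scaling.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target
    residual = estimate - projection
    ratio = (np.sum(projection ** 2) + eps) / (np.sum(residual ** 2) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_improvement(estimate, mixture, target):
    """SI-SDR of the estimate minus SI-SDR of the unprocessed mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)
```

Because the estimate is projected onto the target before the energy ratio is taken, rescaling the estimate by any nonzero constant leaves the score unchanged, which is what "scale-invariant" refers to.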

Pages: 3221-3236

Page count: 15

50 items in total

[1] FSI-Net: A dual-stage full- and sub-band integration network for full-band speech enhancement
Yu, Guochen
Wang, Hui
Li, Andong
Liu, Wenzhe
Zhang, Yuan
Wang, Yutian
Zheng, Chengshi
APPLIED ACOUSTICS, 2023, 211
[2] TfCleanformer: A streaming, array-agnostic, full- and sub-band modeling front-end for robust ASR
Heitkaemper, Jens
Caroselli, Joe
Narayanan, Arun
Howard, Nathan
INTERSPEECH 2024, 2024, : 4473 - 4477
[3] ADAPTIVE-FSN: INTEGRATING FULL-BAND EXTRACTION AND ADAPTIVE SUB-BAND ENCODING FOR MONAURAL SPEECH ENHANCEMENT
Tsao, Yu-Sheng
Ho, Kuan-Hsun
Hung, Jeih-Weih
Chen, Berlin
2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 458 - 464
[4] Sub-band speech recognition
Primor, D
Furst-Yust, M
22ND CONVENTION OF ELECTRICAL AND ELECTRONICS ENGINEERS IN ISRAEL, PROCEEDINGS, 2002, : 10 - 12
[5] Sub-band weighted projection measure for sub-band speech recognition in noise
Nasersharif, B.
Akbari, A.
ELECTRONICS LETTERS, 2006, 42 (14) : 829 - 831
[6] Modeling sub-band correlation for noise-robust speech recognition
McAuley, J
Ming, J
Hanna, P
Stewart, D
2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS: SPEECH PROCESSING, 2004, : 1017 - 1020
[7] Lightweight Full-band and Sub-band Fusion Network for Real Time Speech Enhancement
Chen, Zhuangqi
Zhang, Pingjian
INTERSPEECH 2022, 2022, : 921 - 925
[8] Microphone array sub-band speech recognition
McCowan, IA
Sridharan, S
2001 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-VI, PROCEEDINGS: VOL I: SPEECH PROCESSING 1; VOL II: SPEECH PROCESSING 2 IND TECHNOL TRACK DESIGN & IMPLEMENTATION OF SIGNAL PROCESSING SYSTEMS NEURALNETWORKS FOR SIGNAL PROCESSING; VOL III: IMAGE & MULTIDIMENSIONAL SIGNAL PROCESSING MULTIMEDIA SIGNAL PROCESSING, 2001, : 185 - 188
[9] Sub-band based recognition of noisy speech
Tibrewala, S
Hermansky, H
1997 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I - V: VOL I: PLENARY, EXPERT SUMMARIES, SPECIAL, AUDIO, UNDERWATER ACOUSTICS, VLSI; VOL II: SPEECH PROCESSING; VOL III: SPEECH PROCESSING, DIGITAL SIGNAL PROCESSING; VOL IV: MULTIDIMENSIONAL SIGNAL PROCESSING, NEURAL NETWORKS - VOL V: STATISTICAL SIGNAL AND ARRAY PROCESSING, APPLICATIONS, 1997, : 1255 - 1258
[10] DETECTING ADHD FROM SPEECH USING FULL-BAND AND SUB-BAND CONVOLUTION FUSION NETWORK
Li, Shuanglin
Nair, Rajesh
Naqvi, Syed Mohsen
2023 IEEE SENSORS, 2023
