TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation

Cited by: 29

Authors:

Wang Z.-Q. [1]

Cornell S. [2]

Choi S. [3]

Lee Y. [3]

Kim B.-Y. [3]

Watanabe S. [1]

Affiliations:

[1] Carnegie Mellon University, The Language Technologies Institute, Pittsburgh, 15213, PA

[2] Università Politecnica delle Marche, The Department of Information Engineering, Ancona

[3] The Hyundai Motor Group and 42dot Inc., Seoul

Source:

IEEE/ACM Transactions on Audio, Speech, and Language Processing | 2023 / Vol. 31

Keywords:

Acoustic beamforming; complex spectral mapping; full- and sub-band integration; speech separation

DOI:

10.1109/TASLP.2023.3304482

Abstract:

We propose TF-GridNet for speech separation. The model is a novel deep neural network (DNN) integrating full- and sub-band modeling in the time-frequency (T-F) domain. It stacks several blocks, each consisting of an intra-frame full-band module, a sub-band temporal module, and a cross-frame self-attention module. It is trained to perform complex spectral mapping, where the real and imaginary (RI) components of input signals are stacked as features to predict target RI components. We first evaluate it on monaural anechoic speaker separation. Without using data augmentation or dynamic mixing, it obtains a state-of-the-art 23.5 dB improvement in scale-invariant signal-to-distortion ratio (SI-SDR) on WSJ0-2mix, a standard dataset for two-speaker separation. To show its robustness to noise and reverberation, we evaluate it on monaural reverberant speaker separation using the SMS-WSJ dataset and on noisy-reverberant speaker separation using WHAMR!, and obtain state-of-the-art performance on both datasets. We then extend TF-GridNet to multi-microphone conditions through multi-microphone complex spectral mapping, and integrate it into a two-DNN system with a beamformer in between (named MISO-BF-MISO in earlier studies), where the beamformer proposed in this article is a novel multi-frame Wiener filter computed based on the outputs of the first DNN. State-of-the-art performance is obtained on the multi-channel tasks of SMS-WSJ and WHAMR!. Besides speaker separation, we apply the proposed algorithms to speech dereverberation and noisy-reverberant speech enhancement. State-of-the-art performance is obtained on a dereverberation dataset and on the dataset of the recent L3DAS22 multi-channel speech enhancement challenge. © 2014 IEEE.
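The abstract relies on two ideas that are easy to state in code: stacking real and imaginary (RI) components as input features, and scoring a separated signal with SI-SDR. The following is a minimal pure-Python sketch of both, using the commonly used SI-SDR definition. The function names `stack_ri` and `si_sdr` are hypothetical, chosen for illustration; this is not the authors' implementation, and it ignores edge cases such as a silent reference or a perfect estimate (zero residual energy).

```python
import math


def stack_ri(spectrogram):
    """Stack real and imaginary parts of a complex T-F grid.

    `spectrogram` is a list of frames, each a list of complex bins.
    Returns [real_plane, imag_plane], i.e. a 2-channel feature map
    (hypothetical layout, assumed here for illustration).
    """
    real = [[c.real for c in frame] for frame in spectrogram]
    imag = [[c.imag for c in frame] for frame in spectrogram]
    return [real, imag]


def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB.

    Projects the estimate onto the reference to obtain the scaled
    target, then compares target energy with residual energy.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / ref_energy  # optimal scaling of the reference
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10(target_energy / residual_energy)


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDR gain of the estimate over the unprocessed mixture, in dB."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```

Because the estimate is projected onto the reference before the energies are compared, rescaling the estimate by any nonzero constant leaves the score unchanged, which is what "scale-invariant" refers to. Improvement figures such as the one quoted for WSJ0-2mix are reported relative to the SI-SDR of the unprocessed mixture.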

Pages: 3221-3236

Page count: 15
