TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation

Cited by: 29
Authors
Wang Z.-Q. [1]
Cornell S. [2]
Choi S. [3]
Lee Y. [3]
Kim B.-Y. [3]
Watanabe S. [1]
Affiliations
[1] Carnegie Mellon University, The Language Technologies Institute, Pittsburgh, 15213, PA
[2] Università Politecnica delle Marche, The Department of Information Engineering, Ancona
[3] The Hyundai Motor Group and 42dot Inc., Seoul
Keywords
Acoustic beamforming; complex spectral mapping; full- and sub-band integration; speech separation
DOI
10.1109/TASLP.2023.3304482
Abstract
We propose TF-GridNet for speech separation. The model is a novel deep neural network (DNN) integrating full- and sub-band modeling in the time-frequency (T-F) domain. It stacks several blocks, each consisting of an intra-frame full-band module, a sub-band temporal module, and a cross-frame self-attention module. It is trained to perform complex spectral mapping, where the real and imaginary (RI) components of input signals are stacked as features to predict target RI components. We first evaluate it on monaural anechoic speaker separation. Without using data augmentation and dynamic mixing, it obtains a state-of-the-art 23.5 dB improvement in scale-invariant signal-to-distortion ratio (SI-SDR) on WSJ0-2mix, a standard dataset for two-speaker separation. To show its robustness to noise and reverberation, we evaluate it on monaural reverberant speaker separation using the SMS-WSJ dataset and on noisy-reverberant speaker separation using WHAMR!, and obtain state-of-the-art performance on both datasets. We then extend TF-GridNet to multi-microphone conditions through multi-microphone complex spectral mapping, and integrate it into a two-DNN system with a beamformer in between (named as MISO-BF-MISO in earlier studies), where the beamformer proposed in this article is a novel multi-frame Wiener filter computed based on the outputs of the first DNN. State-of-the-art performance is obtained on the multi-channel tasks of SMS-WSJ and WHAMR!. Besides speaker separation, we apply the proposed algorithms to speech dereverberation and noisy-reverberant speech enhancement. State-of-the-art performance is obtained on a dereverberation dataset and on the dataset of the recent L3DAS22 multi-channel speech enhancement challenge. © 2014 IEEE.
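The abstract reports results as improvements in scale-invariant signal-to-distortion ratio (SI-SDR) and describes stacking real and imaginary (RI) components as input features. The following is a minimal NumPy sketch of both ideas, not the authors' implementation. The function names `si_sdr` and `stack_ri`, the `eps` stabilizer, and the `(frames, freqs)` array layout are assumptions made for illustration, and this version does not remove the signal mean before scoring, which some toolkits do.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals of equal length.

    Hypothetical helper for illustration; eps only guards against division by zero.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Scale the target so that it best explains the estimate (least squares).
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target
    # Everything in the estimate not explained by the scaled target is distortion.
    distortion = estimate - projection
    ratio = (np.sum(projection ** 2) + eps) / (np.sum(distortion ** 2) + eps)
    return 10.0 * np.log10(ratio)


def stack_ri(spec):
    """Stack real and imaginary parts of a complex (frames, freqs) spectrogram.

    Returns a real array of shape (2, frames, freqs); the layout is an assumption.
    """
    spec = np.asarray(spec)
    return np.stack([spec.real, spec.imag], axis=0)
```

An SI-SDR improvement, as quoted in the abstract, would then be the SI-SDR of the separated estimate minus the SI-SDR of the unprocessed mixture, each measured against the same target signal.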
Pages: 3221-3236
Page count: 15
Related papers
50 in total
  • [31] Modeling Sub-Band Information Through Discrete Wavelet Transform to Improve Intelligibility Assessment of Dysarthric Speech
    Sahu, Laxmi Priya
    Pradhan, Gayadhar
    Singh, Jyoti Prakash
    INTERNATIONAL JOURNAL OF INTERACTIVE MULTIMEDIA AND ARTIFICIAL INTELLIGENCE, 2022, 7 (07): : 56 - 64
  • [32] DPT-FSNET: DUAL-PATH TRANSFORMER BASED FULL-BAND AND SUB-BAND FUSION NETWORK FOR SPEECH ENHANCEMENT
    Dang, Feng
    Chen, Hangting
    Zhang, Pengyuan
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6857 - 6861
  • [33] A probabilistic union model for sub-band based robust speech recognition
    Ming, J
    Smith, FJ
    2000 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, PROCEEDINGS, VOLS I-VI, 2000, : 1787 - 1790
  • [34] DESIGN OF SUB-BAND CODERS FOR LOW BIT RATE SPEECH COMMUNICATIONS
    CROCHIERE, RE
    JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 1976, 60 : S108 - S109
  • [35] Error Resilient Speech Coding Using Sub-band Hilbert Envelopes
    Ganapathy, Sriram
    Motlicek, Petr
    Hermansky, Hynek
    TEXT, SPEECH AND DIALOGUE, PROCEEDINGS, 2009, 5729 : 355 - +
  • [36] A new speech hiding scheme based upon sub-band coding
    Chang, CC
    Lee, RCT
    Xiao, GX
    Chen, TS
    ICICS-PCM 2003, VOLS 1-3, PROCEEDINGS, 2003, : 980 - 984
  • [37] Arithmetic Coding of Sub-band Residuals in FDLP Speech/Audio Codec
    Motlicek, Petr
    Ganapathy, Sriram
    Hermansky, Hynek
    INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 2555 - +
  • [38] Sub-band based histogram equalization in cepstral domain for speech recognition
    Joshi, Vikas
    Bilgi, Raghvendra
    Umesh, S.
    Garcia, Luz
    Benitez, Carmen
    SPEECH COMMUNICATION, 2015, 69 : 46 - 65
  • [39] Investigation of Sub-Band Discriminative Information between Spoofed and Genuine Speech
    Sriskandaraja, Kaavya
    Sethu, Vidhyasaharan
    Phu Ngoc Le
    Ambikairajah, Eliathamby
    17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 1710 - 1714
  • [40] Modeling of sub-band and diameter effect in carrier concentration of CNTFET
    Mouatsi, Abdelmalek
    Marir-Benabbas, Mimia
    MATERIALS SCIENCE IN SEMICONDUCTOR PROCESSING, 2014, 28 : 115 - 120