TF-CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single- and Multi-Channel Speaker Separation

Cited by: 0
Authors
Kalkhorani, Vahid Ahmadi [1]
Wang, Deliang [1, 2]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[2] Ohio State Univ, Ctr Cognit & Brain Sci, Columbus, OH 43210 USA
Funding
U.S. National Science Foundation
Keywords
Encoding; Training; Convolution; Correlation; Indexes; Time-frequency analysis; Feature extraction; Vectors; Time-domain analysis; Noise measurement; Complex spectral mapping; multi-channel; single-channel; speaker separation; time-frequency domain; SPEECH SEPARATION; DEREVERBERATION
DOI
10.1109/TASLP.2024.3492803
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
We introduce TF-CrossNet, a complex spectral mapping approach to speaker separation and enhancement in reverberant and noisy conditions. The proposed architecture comprises an encoder layer, a global multi-head self-attention module, a cross-band module, a narrow-band module, and an output layer. TF-CrossNet captures global, cross-band, and narrow-band correlations in the time-frequency domain. To address performance degradation in long utterances, we introduce a random chunk positional encoding. Experimental results on multiple datasets demonstrate the effectiveness and robustness of TF-CrossNet, achieving state-of-the-art performance in tasks including reverberant and noisy-reverberant speaker separation. Furthermore, TF-CrossNet exhibits faster and more stable training in comparison to recent baselines. Additionally, TF-CrossNet's high performance extends to multi-microphone conditions, demonstrating its versatility in various acoustic scenarios.
Pages: 4999-5009
Page count: 11
Related papers
1 in total
  • [1] MULTI-CHANNEL NARROW-BAND DEEP SPEECH SEPARATION WITH FULL-BAND PERMUTATION INVARIANT TRAINING
    Quan, Changsheng
    Li, Xiaofei
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 541 - 545