TF-CrossNet: Leveraging Global, Cross-Band, Narrow-Band, and Positional Encoding for Single- and Multi-Channel Speaker Separation

Cited by: 0

Authors:

Kalkhorani, Vahid Ahmadi ^{[1]}

Wang, Deliang ^{[1,2]}

Affiliations:

[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA

[2] Ohio State Univ, Ctr Cognit & Brain Sci, Columbus, OH 43210 USA

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2024 / Vol. 32

Funding:

U.S. National Science Foundation;

Keywords:

Encoding; Training; Convolution; Correlation; Indexes; Time-frequency analysis; Feature extraction; Vectors; Time-domain analysis; Noise measurement; Complex spectral mapping; multi-channel; single-channel; speaker separation; time-frequency domain; SPEECH SEPARATION; DEREVERBERATION;

DOI:

10.1109/TASLP.2024.3492803

Chinese Library Classification:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

We introduce TF-CrossNet, a complex spectral mapping approach to speaker separation and enhancement in reverberant and noisy conditions. The proposed architecture comprises an encoder layer, a global multi-head self-attention module, a cross-band module, a narrow-band module, and an output layer. TF-CrossNet captures global, cross-band, and narrow-band correlations in the time-frequency domain. To address performance degradation in long utterances, we introduce a random chunk positional encoding. Experimental results on multiple datasets demonstrate the effectiveness and robustness of TF-CrossNet, achieving state-of-the-art performance in tasks including reverberant and noisy-reverberant speaker separation. Furthermore, TF-CrossNet exhibits faster and more stable training in comparison to recent baselines. Additionally, TF-CrossNet's high performance extends to multi-microphone conditions, demonstrating its versatility in various acoustic scenarios.


Pages: 4999-5009

Page count: 11