Speech Separation for an Unknown Number of Speakers Using Transformers With Encoder-Decoder Attractors

Cited by: 1
Authors
Chetupalli, Srikanth Raj [1 ]
Habets, Emanuel A. P. [1 ]
Affiliations
[1] International Audio Laboratories Erlangen, Am Wolfsmantel 33, D-91058 Erlangen, Germany
Source
INTERSPEECH 2022
Keywords
source separation; speaker counting; attractors; transformers
DOI
10.21437/Interspeech.2022-10849
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
Speaker-independent speech separation of single-channel mixtures with an unknown number of speakers in the waveform domain is considered in this paper. To handle the unknown number of sources, we incorporate an encoder-decoder attractor (EDA) module into a speech separation network. The neural network architecture consists of a trainable encoder-decoder pair and a masking network. The masking network in the proposed approach is inspired by the transformer-based SepFormer separation system. It contains a dual-path block and a triple-path block, each modeling both short-time and long-time dependencies in the signal. The EDA module first summarises the dual-path block output using an LSTM encoder and then generates one attractor vector per speaker in the mixture using an LSTM decoder. The attractors are combined with the dual-path block output to generate speaker channels, which are processed jointly by the triple-path block to predict the masks. In addition, a linear-sigmoid layer, with the attractors as input, predicts a binary output that serves as the stopping criterion for attractor generation. The proposed approach is evaluated on the WSJ0-mix dataset with mixtures of up to five speakers. State-of-the-art results are obtained in both separation quality and speaker counting across all mixture types.
Pages: 5393-5397
Page count: 5
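
The attractor mechanism described in the abstract above can be illustrated with a minimal PyTorch sketch. It loosely follows the EEND-EDA formulation that the paper builds on: an LSTM encoder summarises the dual-path block output, an LSTM decoder emits one attractor per decoding step, and a linear-sigmoid head scores whether a further speaker exists. The class name, layer sizes, and the zero-vector decoder input are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class EncoderDecoderAttractor(nn.Module):
    # Hedged sketch of the EDA module: encoder summary -> decoder attractors
    # -> linear-sigmoid existence probabilities (stopping criterion).
    def __init__(self, feat_dim: int = 128, hidden_dim: int = 128):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)   # attractor projection
        self.exist = nn.Linear(feat_dim, 1)           # linear-sigmoid head

    def forward(self, feats: torch.Tensor, max_speakers: int = 5):
        # feats: (batch, time, feat_dim), the dual-path block output.
        _, state = self.encoder(feats)                # keep final (h, c) only
        zeros = feats.new_zeros(feats.size(0), 1, feats.size(2))
        attractors, exist_probs = [], []
        for _ in range(max_speakers + 1):             # +1 for a stop attractor
            out, state = self.decoder(zeros, state)   # one step per speaker
            a = self.proj(out)                        # (batch, 1, feat_dim)
            attractors.append(a)
            exist_probs.append(torch.sigmoid(self.exist(a)))
        return (torch.cat(attractors, dim=1),         # (batch, S+1, feat_dim)
                torch.cat(exist_probs, dim=1).squeeze(-1))

if __name__ == "__main__":
    eda = EncoderDecoderAttractor()
    feats = torch.randn(2, 200, 128)                  # toy dual-path output
    attractors, probs = eda(feats)
    print(attractors.shape, probs.shape)              # (2, 6, 128) (2, 6)

At inference, attractor generation would stop at the first attractor whose existence probability falls below a threshold (e.g. 0.5); the retained attractors are then combined with the dual-path block output to form the speaker channels processed by the triple-path block.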