Single-Channel Multi-Speaker Separation using Deep Clustering

Cited by: 239
Authors
Isik, Yusuf [1 ,2 ,3 ]
Le Roux, Jonathan [1 ]
Chen, Zhuo [1 ,4 ]
Watanabe, Shinji [1 ]
Hershey, John R. [1 ]
Affiliations
[1] Mitsubishi Elect Res Labs MERL, Cambridge, MA 02139 USA
[2] Sabanci Univ, Istanbul, Turkey
[3] TUBITAK BILGEM, Kocaeli, Turkey
[4] Columbia Univ, New York, NY USA
Keywords
single-channel speech separation; embedding; deep learning; supervised speech separation
DOI
10.21437/Interspeech.2016-1176
CLC number
O42 [Acoustics];
Subject classification codes
070206 ; 082403 ;
Abstract
Deep clustering is a recently introduced deep learning architecture that uses discriminatively trained embeddings as the basis for clustering. It was recently applied to spectrogram segmentation, yielding impressive results on speaker-independent multi-speaker separation. In this paper we extend the baseline system with an end-to-end signal approximation objective that greatly improves performance on a challenging speech separation task. We first significantly improve upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal-to-distortion ratio (SDR) of 10.3 dB compared to the baseline of 6.0 dB for two-speaker separation, as well as a 7.1 dB SDR improvement for three-speaker separation. We then extend the model to incorporate an enhancement layer to refine the signal estimates, and perform end-to-end training through both the clustering and enhancement stages to maximize signal fidelity. We evaluate the results using automatic speech recognition. The new signal approximation objective, combined with end-to-end training, produces unprecedented performance, reducing the word error rate (WER) from 89.1% down to 30.8%. This represents a major advancement towards solving the cocktail party problem.
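For readers unfamiliar with the deep clustering objective referenced above, the short NumPy sketch below illustrates the affinity-matching loss that trains the embeddings: each time-frequency bin receives an embedding, and the embeddings are pushed so that their pairwise similarity matrix matches the ideal speaker-assignment affinities. All names, shapes, and the toy data are illustrative assumptions, not the authors' implementation.

import numpy as np

def deep_clustering_loss(V, Y):
    # V: (N, D) unit-norm embedding per time-frequency bin (N = T*F bins).
    # Y: (N, C) one-hot ideal assignment of each bin to one of C speakers.
    # Returns ||V V^T - Y Y^T||_F^2, expanded so that no N x N matrix is formed:
    # ||V^T V||_F^2 - 2 ||V^T Y||_F^2 + ||Y^T Y||_F^2.
    return (np.sum((V.T @ V) ** 2)
            - 2.0 * np.sum((V.T @ Y) ** 2)
            + np.sum((Y.T @ Y) ** 2))

# Toy usage: 100 bins, 40-dimensional embeddings, 2 speakers.
rng = np.random.default_rng(0)
V = rng.standard_normal((100, 40))
V /= np.linalg.norm(V, axis=1, keepdims=True)
Y = np.eye(2)[rng.integers(0, 2, size=100)]
print(deep_clustering_loss(V, Y))

At test time the embeddings of a mixture are clustered (e.g., with k-means) into one group per speaker to form separation masks, which the enhancement stage described in the abstract then refines.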
Pages: 545 - 549
Page count: 5
Related Papers
50 records in total
  • [1] SOURCE-AWARE CONTEXT NETWORK FOR SINGLE-CHANNEL MULTI-SPEAKER SPEECH SEPARATION
    Li, Zeng-Xi
    Song, Yan
    Dai, Li-Rong
    McLoughlin, Ian
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 681 - 685
  • [2] Candidate Speech Extraction from Multi-speaker Single-Channel Audio Interviews
    Pandharipande, Meghna
    Kopparapu, Sunil Kumar
    [J]. SPEECH AND COMPUTER, SPECOM 2023, PT I, 2023, 14338 : 210 - 221
  • [3] Deep Clustering in Complex Domain for Single-Channel Speech Separation
    Liu, Runling
    Tang, Yu
    Mang, Hongwei
    [J]. 2022 IEEE 17TH CONFERENCE ON INDUSTRIAL ELECTRONICS AND APPLICATIONS (ICIEA), 2022, : 1463 - 1468
  • [4] MULTI-SCENARIO DEEP LEARNING FOR MULTI-SPEAKER SOURCE SEPARATION
    Zegers, Jeroen
    Van Hamme, Hugo
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5379 - 5383
  • [5] Single-Channel Speech Separation Based on Deep Clustering with Local Optimization
    Fu, Taotao
    Yu, Ge
    Guo, Lili
    Wang, Yan
    Liang, Ji
    [J]. 2017 3RD INTERNATIONAL CONFERENCE ON FRONTIERS OF SIGNAL PROCESSING (ICFSP), 2017, : 44 - 49
  • [6] Speaker Separation Using Visual Speech Features and Single-channel Audio
    Khan, Faheem
    Milner, Ben
    [J]. 14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 2013, : 3263 - 3267
  • [7] Automatic speaker clustering from multi-speaker utterances
    McLaughlin, J
    Reynolds, D
    Singer, E
    O'Leary, GC
    [J]. ICASSP '99: 1999 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, PROCEEDINGS VOLS I-VI, 1999, : 817 - 820
  • [8] Soft mask methods for single-channel speaker separation
    Reddy, Aarthi M.
    Raj, Bhiksha
    [J]. IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2007, 15 (06): : 1766 - 1776
  • [9] JOINT SINGLE-CHANNEL SPEECH SEPARATION AND SPEAKER IDENTIFICATION
    Mowlaee, P.
    Saeidi, R.
    Tan, Z. -H.
    Christensen, M. G.
    Franti, P.
    Jensen, S. H.
    [J]. 2010 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2010, : 4430 - 4433
  • [10] Single channel multi-speaker speech separation based on quantized ratio mask and residual network
    Ke, Shanfa
    Hu, Ruimin
    Wang, Xiaochen
    Wu, Tingzhao
    Li, Gang
    Wang, Zhongyuan
    [J]. Multimedia Tools and Applications, 2020, 79 : 32225 - 32241