A Regression Approach to Single-Channel Speech Separation Via High-Resolution Deep Neural Networks

Cited by: 76
Authors
Du, Jun [1 ]
Tu, Yanhui [1 ]
Dai, Li-Rong [1 ]
Lee, Chin-Hui [2 ]
Affiliations
[1] Univ Sci & Technol China, Natl Engn Lab Speech & Language Informat Proc, Hefei 230027, Peoples R China
[2] Georgia Inst Technol, Sch Elect & Comp Engn, Atlanta, GA 30332 USA
Keywords
Deep neural network; divide and conquer; dual outputs; robust speech recognition; speech separation; ALGORITHM; CASA
DOI
10.1109/TASLP.2016.2558822
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
We propose a novel data-driven approach to single-channel speech separation based on deep neural networks (DNNs) that directly models the highly nonlinear relationship between the speech features of a mixed signal, containing a target speaker and other interfering speakers, and the features of the individual sources. We focus our discussion on a semisupervised mode that separates the speech of the target speaker from an unknown interfering speaker, which is more flexible than the conventional supervised mode requiring known information about both the target and interfering speakers. Two key issues are investigated. First, we propose a DNN architecture with dual outputs, namely the features of both the target and interfering speakers, which is shown to achieve better generalization than an architecture that outputs features of the target speaker alone. Second, we propose using a set of multiple DNNs, each intended to be signal-noise-dependent (SND), to cope with the difficulty that a single general DNN cannot adequately accommodate all the speaker mixing variabilities across different signal-to-noise ratio (SNR) levels. Experimental results on the speech separation challenge (SSC) data demonstrate that the proposed framework achieves better separation results than other conventional approaches in either a supervised or a semisupervised mode. SND-DNNs also yield significant performance improvements over a general DNN for speech separation in low-SNR cases. Furthermore, for automatic speech recognition (ASR) following speech separation, this purely front-end processing, using a single set of speaker-independent ASR acoustic models, achieves a relative word error rate (WER) reduction of 11.6% over a state-of-the-art separation and recognition system that requires a complicated joint back-end decoding framework with multiple sets of speaker-dependent ASR acoustic models. When speaker-adaptive ASR acoustic models for the target speakers are adopted for the enhanced signals, a further 12.1% WER reduction over our best speaker-independent ASR system is achieved.
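To make the two architectural ideas in the abstract concrete, below is a minimal PyTorch sketch of a dual-output regression DNN together with an SND-style model selector. This is not the authors' implementation: the feature dimension, context window, layer widths, optimizer settings, and the `select_snd_model` helper are all illustrative assumptions.

```python
# Minimal sketch (NOT the paper's code) of a dual-output regression DNN for
# single-channel separation. Dimensions and hyperparameters are assumptions.
import torch
import torch.nn as nn

FEAT_DIM = 257   # assumed: spectral feature bins per frame
CONTEXT = 7      # assumed: frames in the input context window
HIDDEN = 2048    # assumed: hidden-layer width


class DualOutputDNN(nn.Module):
    """Regress from mixture features to the features of BOTH the target
    and the interfering speaker (the dual-output idea in the abstract)."""

    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(FEAT_DIM * CONTEXT, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
        )
        self.target_head = nn.Linear(HIDDEN, FEAT_DIM)      # target speaker
        self.interferer_head = nn.Linear(HIDDEN, FEAT_DIM)  # interfering speaker

    def forward(self, x):
        h = self.trunk(x)
        return self.target_head(h), self.interferer_head(h)


# One regression training step: MSE against clean features of both sources.
model = DualOutputDNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
mse = nn.MSELoss()

mix = torch.randn(32, FEAT_DIM * CONTEXT)  # dummy mixture features
tgt = torch.randn(32, FEAT_DIM)            # dummy target reference features
itf = torch.randn(32, FEAT_DIM)            # dummy interferer reference features

pred_tgt, pred_itf = model(mix)
loss = mse(pred_tgt, tgt) + mse(pred_itf, itf)  # joint dual-output objective
opt.zero_grad()
loss.backward()
opt.step()


# SND-DNN idea: train one such model per SNR band and, at test time, route
# each mixture to the model whose band matches an SNR estimate (the SNR
# estimator itself is assumed to exist and is not shown here).
def select_snd_model(models_by_band, snr_db):
    """models_by_band: list of (low_db, high_db, model) tuples (hypothetical)."""
    for low, high, m in models_by_band:
        if low <= snr_db < high:
            return m
    return models_by_band[-1][2]  # fall back to the last band
```

Predicting both sources from a shared trunk acts as a form of multi-task regularization, which is consistent with the abstract's claim that the dual-output architecture generalizes better than a target-only output.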
Pages: 1424-1437
Page count: 14