A Speaker-Dependent Approach to Single-Channel Joint Speech Separation and Acoustic Modeling Based on Deep Neural Networks for Robust Recognition of Multi-Talker Speech

Cited by: 6
Authors
Tu, Yan-Hui [1 ]
Du, Jun [2 ]
Lee, Chin-Hui [3 ]
Affiliations
[1] Univ Sci & Technol China, Hefei, Anhui, Peoples R China
[2] Univ Sci & Technol China, iFlytek Speech Lab, Hefei, Anhui, Peoples R China
[3] Georgia Inst Technol, Sch Elect & Comp Engn, Atlanta, GA 30332 USA
Funding
National Natural Science Foundation of China;
Keywords
Multi-talker speech recognition; Speaker-dependent model; Single-channel speech separation; Deep neural networks; Joint training; ENHANCEMENT;
DOI
10.1007/s11265-017-1295-x
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
We propose a novel speaker-dependent (SD) multi-condition (MC) training approach to joint learning of deep neural network (DNN) acoustic models and an explicit speech separation structure for recognizing multi-talker mixed speech in a single-channel setting. First, an MC acoustic modeling framework is established to train an SD-DNN model in multi-talker scenarios. Such a recognizer significantly reduces decoding complexity and improves recognition accuracy over recognizers that use speaker-independent DNN models with a complicated joint decoding structure, assuming the speaker identities in the mixed speech are known. In addition, an SD regression DNN that maps the acoustic features of mixed speech to the speech features of a target speaker is jointly trained with the SD-DNN-based acoustic models. Experimental results on the Speech Separation Challenge (SSC) small-vocabulary recognition task show that the proposed approach under multi-condition training achieves an average word error rate (WER) of 3.8%, yielding a relative WER reduction of 65.1% over a top-performing, DNN-based pre-processing-only approach we proposed earlier under clean-condition training (Tu et al. 2016). Furthermore, the proposed joint-training DNN framework yields a relative WER reduction of 13.2% over state-of-the-art systems under multi-condition training. Finally, the effectiveness of the proposed approach is also verified on the Wall Street Journal (WSJ0) medium-vocabulary continuous speech recognition task in a simulated multi-talker setting.
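
To make the joint-training idea in the abstract concrete, the sketch below shows, in PyTorch, a speaker-dependent regression DNN that maps mixed-speech features to target-speaker features, feeding a speaker-dependent DNN acoustic model, with both networks updated by a single combined loss. This is a minimal illustration only: all layer sizes, feature dimensions, the sigmoid activations, the auxiliary-loss weight `alpha`, and the function names are assumptions for exposition, not the paper's actual configuration.

```python
# Minimal sketch (assumed architecture, not the paper's exact setup):
# an SD regression DNN front-end followed by an SD DNN acoustic model,
# trained jointly so the recognition loss shapes the separation stage.
import torch
import torch.nn as nn

FEAT_DIM = 40 * 11   # assumed: 40-d log-Mel features with 11-frame context
NUM_SENONES = 2000   # assumed senone inventory for the SD acoustic model

class RegressionDNN(nn.Module):
    """SD front-end: mixed-speech features -> target-speaker features."""
    def __init__(self, dim=FEAT_DIM, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
            nn.Linear(hidden, dim),  # linear output layer for regression
        )
    def forward(self, x):
        return self.net(x)

class AcousticDNN(nn.Module):
    """SD acoustic model: separated features -> senone logits."""
    def __init__(self, dim=FEAT_DIM, hidden=2048, out=NUM_SENONES):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
            nn.Linear(hidden, out),
        )
    def forward(self, x):
        return self.net(x)

separator = RegressionDNN()
recognizer = AcousticDNN()
params = list(separator.parameters()) + list(recognizer.parameters())
optimizer = torch.optim.SGD(params, lr=1e-3, momentum=0.9)
ce_loss, mse_loss = nn.CrossEntropyLoss(), nn.MSELoss()
alpha = 0.1  # assumed weight on the auxiliary regression loss

def joint_training_step(mixed_feats, clean_feats, senone_labels):
    """One joint update: the cross-entropy recognition loss back-propagates
    through the acoustic model into the separation front-end."""
    optimizer.zero_grad()
    separated = separator(mixed_feats)
    logits = recognizer(separated)
    loss = ce_loss(logits, senone_labels) + alpha * mse_loss(separated, clean_feats)
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key design point this sketch tries to capture is that the separation front-end is not optimized for feature reconstruction alone: gradients from the recognition objective flow through the acoustic model into the regression DNN, so the two stages are tuned toward the final recognition task.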
Pages: 963-973
Page count: 11
Related Papers
50 in total
  • [1] A Speaker-Dependent Approach to Single-Channel Joint Speech Separation and Acoustic Modeling Based on Deep Neural Networks for Robust Recognition of Multi-Talker Speech
    Yan-Hui Tu
    Jun Du
    Chin-Hui Lee
    [J]. Journal of Signal Processing Systems, 2018, 90 : 963 - 973
  • [2] A Speaker-Dependent Deep Learning Approach to Joint Speech Separation and Acoustic Modeling for Multi-Talker Automatic Speech Recognition
    Tu, Yan-Hui
    Du, Jun
    Dai, Li-Rong
    Lee, Chin-Hui
    [J]. 2016 10TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2016,
  • [3] Deep Neural Networks for Single-Channel Multi-Talker Speech Recognition
    Weng, Chao
    Yu, Dong
    Seltzer, Michael L.
    Droppo, Jasha
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2015, 23 (10) : 1670 - 1679
  • [4] Deep neural networks based binary classification for single channel speaker independent multi-talker speech separation
    Saleem, Nasir
    Khattak, Muhammad Irfan
    [J]. APPLIED ACOUSTICS, 2020, 167
  • [5] Single-channel multi-talker speech recognition with permutation invariant training
    Qian, Yanmin
    Chang, Xuankai
    Yu, Dong
    [J]. SPEECH COMMUNICATION, 2018, 104 : 1 - 11
  • [6] Streaming Multi-talker Speech Recognition with Joint Speaker Identification
    Lu, Liang
    Kanda, Naoyuki
    Li, Jinyu
    Gong, Yifan
    [J]. INTERSPEECH 2021, 2021, : 1782 - 1786
  • [7] Multi-Channel Speaker Verification for Single and Multi-talker Speech
    Kataria, Saurabh
    Zhang, Shi-Xiong
    Yu, Dong
    [J]. INTERSPEECH 2021, 2021, : 4608 - 4612
  • [8] Improving End-to-End Single-Channel Multi-Talker Speech Recognition
    Zhang, Wangyou
    Chang, Xuankai
    Qian, Yanmin
    Watanabe, Shinji
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2020, 28 : 1385 - 1394
  • [9] ACOUSTIC MODELING FOR DISTANT MULTI-TALKER SPEECH RECOGNITION WITH SINGLE- AND MULTI-CHANNEL BRANCHES
    Kanda, Naoyuki
    Fujita, Yusuke
    Horiguchi, Shota
    Ikeshita, Rintaro
    Nagamatsu, Kenji
    Watanabe, Shinji
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6630 - 6634
  • [10] A Joint Approach for Single-Channel Speaker Identification and Speech Separation
    Mowlaee, Pejman
    Saeidi, Rahim
    Christensen, Mads Græsbøll
    Tan, Zheng-Hua
    Kinnunen, Tomi
    Fränti, Pasi
    Jensen, Søren Holdt
    [J]. IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2012, 20 (09): : 2586 - 2601