Speech Enhancement Based on Teacher-Student Deep Learning Using Improved Speech Presence Probability for Noise-Robust Speech Recognition

被引:62
|
作者
Tu, Yan-Hui [1 ]
Du, Jun [1 ]
Lee, Chin-Hui [2 ]
机构
[1] Univ Sci & Technol China, Hefei 230052, Anhui, Peoples R China
[2] Georgia Inst Technol, Sch Elect & Comp Engn, Atlanta, GA 30332 USA
基金
中国国家自然科学基金; 国家重点研发计划;
关键词
Speech enhancement; Speech recognition; Noise measurement; Adaptation models; Computational modeling; Training; Teacher-student learning; improved minima controlled recursive averaging; improved speech presence probability; deep learning based speech enhancement; noise-robust speech recognition; SUPPRESSION; SEPARATION;
D O I
10.1109/TASLP.2019.2940662
中图分类号
O42 [声学];
学科分类号
070206 ; 082403 ;
摘要
In this paper, we propose a novel teacher-student learning framework for the preprocessing of a speech recognizer, leveraging the online noise tracking capabilities of improved minima controlled recursive averaging (IMCRA) and deep learning of nonlinear interactions between speech and noise. First, a teacher model with deep architectures is built to learn the target of ideal ratio masks (IRMs) using simulated training pairs of clean and noisy speech data. Next, a student model is trained to learn an improved speech presence probability by incorporating the estimated IRMs from the teacher model into the IMCRA approach. The student model can be compactly designed in a causal processing mode having no latency with the guidance of a complex and noncausal teacher model. Moreover, the clean speech requirement, which is difficult to meet in real-world adverse environments, can be relaxed for training the student model, implying that noisy speech data can be directly used to adapt the regression-based enhancement model to further improve speech recognition accuracies for noisy speech collected in such conditions. Experiments on the CHiME-4 challenge task show that our best student model with bidirectional gated recurrent units (BGRUs) can achieve a relative word error rate (WER) reduction of 18.85% for the real test set when compared to unprocessed system without acoustic model retraining. However, the traditional teacher model degrades the performance of the unprocessed system in this case. In addition, the student model with a deep neural network (DNN) in causal mode having no latency yields a relative WER reduction of 7.94% over the unprocessed system with 670 times less computing cycles when compared to the BGRU-equipped student model. Finally, the conventional speech enhancement and IRM-based deep learning method destroyed the ASR performance when the recognition system became more powerful. While our proposed approach could still improve the ASR performance even in the more powerful recognition system.
引用
收藏
页码:2080 / 2091
页数:12
相关论文
共 50 条
  • [1] Robust Speech Recognition Using Teacher-Student Learning Domain Adaptation
    Ma, Han
    Zhang, Qiaoling
    Tang, Roubing
    Zhang, Lu
    Jia, Yubo
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2022, E105D (12) : 2112 - 2118
  • [2] Noise-Robust speech recognition of Conversational Telephone Speech
    Chen, Gang
    Tolba, Hesham
    O'Shaughnessy, Douglas
    INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, 2006, : 1101 - 1104
  • [3] Speech Enhancement for Noise-Robust Speech Synthesis using Wasserstein GAN
    Adiga, Nagaraj
    Pantazis, Yannis
    Tsiaras, Vassilis
    Stylianou, Yannis
    INTERSPEECH 2019, 2019, : 1821 - 1825
  • [4] Knowledge Distillation-Based Training of Speech Enhancement for Noise-Robust Automatic Speech Recognition
    Woo Lee, Geon
    Kook Kim, Hong
    Kong, Duk-Jo
    IEEE ACCESS, 2024, 12 : 72707 - 72720
  • [5] A Joint Speech Enhancement and Self-Supervised Representation Learning Framework for Noise-Robust Speech Recognition
    Zhu, Qiu-Shi
    Zhang, Jie
    Zhang, Zi-Qiang
    Dai, Li-Rong
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 1927 - 1939
  • [6] A speech emphasis method for noise-robust speech recognition by using repetitive phrase
    Hirai, Takanori
    Kuroiwa, Shingo
    Tsuge, Satoru
    Ren, Fuji
    Fattah, Mohamed Abdel
    2006 10TH INTERNATIONAL CONFERENCE ON COMMUNICATION TECHNOLOGY, VOLS 1 AND 2, PROCEEDINGS, 2006, : 1269 - +
  • [7] Deep Maxout Networks Applied to Noise-Robust Speech Recognition
    de-la-Calle-Silos, F.
    Gallardo-Antolin, A.
    Pelaez-Moreno, C.
    ADVANCES IN SPEECH AND LANGUAGE TECHNOLOGIES FOR IBERIAN LANGUAGES, IBERSPEECH 2014, 2014, 8854 : 109 - 118
  • [8] REINFORCEMENT LEARNING BASED SPEECH ENHANCEMENT FOR ROBUST SPEECH RECOGNITION
    Shen, Yih-Liang
    Huang, Chao-Yuan
    Wang, Syu-Siang
    Tsao, Yu
    Wang, Hsin-Min
    Chi, Tai-Shih
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6750 - 6754
  • [9] Deep maxout networks applied to noise-robust speech recognition
    de-la-Calle-Silos, F. (fsilos@tsc.uc3m.es), 1600, Springer Verlag (8854):
  • [10] Factorial Speech Processing Models for Noise-Robust Automatic Speech Recognition
    Khademian, Mahdi
    Homayounpour, Mohammad Mehdi
    2015 23RD IRANIAN CONFERENCE ON ELECTRICAL ENGINEERING (ICEE), 2015, : 637 - 642