A Spectral Masking Approach to Noise-Robust Speech Recognition Using Deep Neural Networks

Cited by: 37
Authors
Li, Bo [1]
Sim, Khe Chai [1]
Institution
[1] Natl Univ Singapore, Sch Comp, Singapore 117417, Singapore
Keywords
Deep neural network; noise robustness; spectral masking
DOI
10.1109/TASLP.2014.2329237
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline code
070206; 082403
Abstract
Improving the noise robustness of automatic speech recognition systems has been a challenging task for many years. Recently, Deep Neural Networks (DNNs) have been found to yield large performance gains over conventional GMM-HMM systems when used in both hybrid and tandem configurations. However, their performance still falls far short of human expectations, especially in adverse environments. Motivated by the separation-prior-to-recognition process of the human auditory system, we propose a robust spectral masking system in which power-spectral-domain masks are predicted by a DNN trained on the same filter-bank features used for acoustic modeling. To further improve performance, Linear Input Network (LIN) adaptation is applied to both the mask estimator and the acoustic model DNNs. Since estimating LINs for the mask estimator requires stereo data, which is not available during testing, we propose using the LINs estimated for the acoustic model DNNs to adapt the mask estimators. Furthermore, we use the same set of pre-trained weights for the input layers of both the mask estimator and the acoustic model DNNs to ensure better consistency when sharing LINs. Experimental results on the benchmark Aurora2 and Aurora4 tasks demonstrate the effectiveness of our system, which yields Word Error Rates (WERs) of 4.6% and 11.8%, respectively. Moreover, simple averaging of posteriors from the systems with and without spectral masking further reduces the WERs to 4.3% on Aurora2 and 11.4% on Aurora4.
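The abstract describes a two-stage pipeline: a mask-estimator DNN predicts per-bin masks from the same filter-bank features used for acoustic modeling, the masks are applied to the noisy power spectrum, and the masked features are decoded by the acoustic-model DNN, optionally combined with the unmasked system by posterior averaging. The sketch below is a minimal, self-contained NumPy illustration of that flow only; the layer sizes, random weights, feature dimensions, and function names are illustrative assumptions, and the toy one-hidden-layer networks stand in for the deep, pre-trained DNNs and LIN adaptation described in the paper.

```python
# Minimal sketch of DNN-based spectral masking for noise-robust ASR, loosely
# following the abstract above. All sizes and weights are illustrative
# assumptions, not the authors' actual configuration.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_frames, n_fft_bins, n_fbank, n_states = 100, 257, 40, 2000

# Noisy power spectrum and a stand-in mel filter-bank matrix (hypothetical data).
noisy_power = rng.random((n_frames, n_fft_bins))
mel_fbank = rng.random((n_fbank, n_fft_bins))
fbank_feats = np.log(noisy_power @ mel_fbank.T + 1e-8)  # features shared by both DNNs

# Toy one-hidden-layer "DNNs"; in the paper both networks are deep, pre-trained
# with shared input-layer weights, and adapted with a Linear Input Network (LIN).
W_lin = np.eye(n_fbank)  # LIN shared by mask estimator and acoustic model
W1_mask, W2_mask = rng.standard_normal((n_fbank, 512)), rng.standard_normal((512, n_fft_bins))
W1_am, W2_am = rng.standard_normal((n_fbank, 512)), rng.standard_normal((512, n_states))

def estimate_mask(feats):
    """Predict a per-bin mask in [0, 1] from filter-bank features."""
    h = np.tanh((feats @ W_lin) @ W1_mask)
    return sigmoid(h @ W2_mask)

def acoustic_posteriors(feats):
    """State posteriors from the acoustic-model DNN."""
    h = np.tanh((feats @ W_lin) @ W1_am)
    return softmax(h @ W2_am)

# 1) Mask the noisy power spectrum, 2) recompute filter-bank features,
# 3) decode with the acoustic model; finally average the posteriors of the
# masked and unmasked systems, as in the abstract.
mask = estimate_mask(fbank_feats)
enhanced_power = mask * noisy_power
enhanced_feats = np.log(enhanced_power @ mel_fbank.T + 1e-8)

post_masked = acoustic_posteriors(enhanced_feats)
post_plain = acoustic_posteriors(fbank_feats)
post_combined = 0.5 * (post_masked + post_plain)  # simple posterior averaging
```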
Pages: 1296-1305
Number of pages: 10