A Speech Recognition Method Using Long Short-Term Memory Network in Low Resources

Cited by: 3
Authors:
Shu F. [1 ]
Qu D. [1 ]
Zhang W. [1 ]
Zhou L. [1 ]
Guo W. [2 ]
Affiliations:
[1] Institute of Information System Engineering, PLA Information Engineering University, Zhengzhou
[2] Institute of Information Science and Technology, University of Science and Technology of China, Hefei
Keywords:
Long short-term memory; Low resource; Neural network; Speech recognition;
DOI: 10.7652/xjtuxb201710020
Abstract:
A speech recognition method using a long short-term memory network under low-resource conditions (the LSTM-LRASR method) is proposed to address the decline in recognition rate of automatic speech recognition systems caused by the scarcity of transcribed training data in low-resource environments. The method builds the acoustic model with a long short-term memory network and improves low-resource speech recognition performance from three aspects: feature extraction, data augmentation, and model optimization. Feature extraction derives language-independent, robust high-level parameters to reduce the acoustic model's dependence on training data. Data augmentation applies speed perturbation to the transcribed data and automatically recognizes the untranscribed data, so that more transcribed training data are created. Model optimization uses sequence-discriminative training to improve the ability to distinguish phonemes, and minimum Bayes-risk decoding is used to combine multiple systems and further improve recognition performance. Experimental results on the OpenKWS16 evaluation database show that the word error rate of the low-resource speech recognition system built with the proposed LSTM-LRASR method is 29.9% lower than that of the baseline system, and the actual term-weighted value (ATWV) increases by 60.3%. © 2017, Editorial Office of Journal of Xi'an Jiaotong University. All rights reserved.
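The speed-perturbation step of the data augmentation described above can be sketched as follows. This is a minimal, illustrative implementation, not the paper's code: real pipelines typically resample with tools such as SoX, while here a pure-Python linear interpolation stands in for the resampler, and the factor set (0.9, 1.0, 1.1) follows the common convention from the speed-perturbation literature (Ko et al., 2015).

```python
def speed_perturb(samples, factor):
    """Resample a waveform so it plays `factor` times faster.

    factor > 1 shortens the signal (faster speech); factor < 1
    lengthens it. Uses simple linear interpolation between samples;
    `samples` is a list of floats.
    """
    out_len = int(len(samples) / factor)
    out = []
    for i in range(out_len):
        pos = i * factor                      # position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)    # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def augment(utterances, factors=(0.9, 1.0, 1.1)):
    """Return perturbed copies of every utterance (original included
    via factor 1.0), tripling the transcribed training data."""
    return [speed_perturb(u, f) for u in utterances for f in factors]
```

Each transcribed utterance yields three versions sharing one transcript, which directly enlarges the transcribed training set without new annotation effort.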
Pages: 120-127
Page count: 7