Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition

Cited by: 84
Authors
Soltau, Hagen [1]
Liao, Hank [1]
Sak, Hasim [1]
Affiliations
[1] Google, New York, NY USA
DOI
10.21437/Interspeech.2017-1566
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present results that show it is possible to build a competitive, greatly simplified, large vocabulary continuous speech recognition system with whole words as acoustic units. We model the output vocabulary of about 100,000 words directly using deep bi-directional LSTM RNNs with CTC loss. The model is trained on 125,000 hours of semi-supervised acoustic training data, which enables us to alleviate the data sparsity problem for word models. We show that the CTC word models work very well as an end-to-end all-neural speech recognition model without the use of traditional context-dependent sub-word phone units that require a pronunciation lexicon, and without any language model, removing the need to decode. We demonstrate that the CTC word models perform better than a strong, more complex, state-of-the-art baseline with sub-word units.
Pages: 3707-3711
Number of pages: 5
Related papers
  • [1] LEVERAGING SEQUENCE-TO-SEQUENCE SPEECH SYNTHESIS FOR ENHANCING ACOUSTIC-TO-WORD SPEECH RECOGNITION
    Mimura, Masato
    Ueno, Sei
    Inaguma, Hirofumi
    Sakai, Shinsuke
    Kawahara, Tatsuya
    [J]. 2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 477 - 484
  • [2] Encoder Transfer for Attention-based Acoustic-to-word Speech Recognition
    Ueno, Sei
    Moriya, Takafumi
    Mimura, Masato
    Sakai, Shinsuke
    Shinohara, Yusuke
    Yamaguchi, Yoshikazu
    Aono, Yushi
    Kawahara, Tatsuya
    [J]. 19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 2424 - 2428
  • [3] Modular End-to-End Automatic Speech Recognition Framework for Acoustic-to-Word Model
    Liu, Qi
    Chen, Zhehuai
    Li, Hao
    Huang, Mingkun
    Lu, Yizhou
    Yu, Kai
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2020, 28 : 2174 - 2183
  • [4] End-to-End Speech Emotion Recognition Combined with Acoustic-to-Word ASR Model
    Feng, Han
    Ueno, Sei
    Kawahara, Tatsuya
    [J]. INTERSPEECH 2020, 2020, : 501 - 505
  • [5] MULTI-SPEAKER SEQUENCE-TO-SEQUENCE SPEECH SYNTHESIS FOR DATA AUGMENTATION IN ACOUSTIC-TO-WORD SPEECH RECOGNITION
    Ueno, Sei
    Mimura, Masato
    Sakai, Shinsuke
    Kawahara, Tatsuya
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6161 - 6165
  • [6] Boosting acoustic models in large vocabulary speech recognition
    Meyer, C
    Schramm, H
    [J]. PROCEEDINGS OF THE SIXTH IASTED INTERNATIONAL CONFERENCE ON SIGNAL AND IMAGE PROCESSING, 2004, : 255 - 260
  • [7] A word graph algorithm for large vocabulary continuous speech recognition
    Ortmanns, S
    Ney, H
    Aubert, X
    [J]. COMPUTER SPEECH AND LANGUAGE, 1997, 11 (01): : 43 - 72
  • [8] A Fast Approximate Acoustic Match for Large Vocabulary Speech Recognition
    Bahl, Lalit R.
    De Gennaro, Steven V.
    Gopalakrishnan, P. S.
    Mercer, Robert L.
    [J]. IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, 1993, 1 (01): : 59 - 67
  • [9] Building DNN acoustic models for large vocabulary speech recognition
    Maas, Andrew L.
    Qi, Peng
    Xie, Ziang
    Hannun, Awni Y.
    Lengerich, Christopher T.
    Jurafsky, Daniel
    Ng, Andrew Y.
    [J]. COMPUTER SPEECH AND LANGUAGE, 2017, 41 : 195 - 213
  • [10] Boosting HMM acoustic models in large vocabulary speech recognition
    Meyer, C
    Schramm, H
    [J]. SPEECH COMMUNICATION, 2006, 48 (05) : 532 - 548