PHONEME BASED NEURAL TRANSDUCER FOR LARGE VOCABULARY SPEECH RECOGNITION

Cited by: 7
Authors
Zhou, Wei [1 ,2 ]
Berger, Simon [1 ]
Schlueter, Ralf [1 ,2 ]
Ney, Hermann [1 ,2 ]
Affiliations
[1] Rhein Westfal TH Aachen, Comp Sci Dept, Human Language Technol & Pattern Recognit, D-52074 Aachen, Germany
[2] AppTek GmbH, D-52062 Aachen, Germany
Funding
European Research Council;
Keywords
phoneme; neural transducer; speech recognition; RNN-TRANSDUCER;
DOI
10.1109/ICASSP39728.2021.9413648
Chinese Library Classification
O42 [Acoustics];
Subject classification codes
070206 ; 082403 ;
Abstract
To join the advantages of classical and end-to-end approaches for speech recognition, we present a simple, novel and competitive approach for phoneme-based neural transducer modeling. Different alignment label topologies are compared and word-end-based phoneme label augmentation is proposed to improve performance. Utilizing the local dependency of phonemes, we adopt a simplified neural network structure and a straightforward integration with the external word-level language model to preserve the consistency of seq-to-seq modeling. We also present a simple, stable and efficient training procedure using frame-wise cross-entropy loss. A phonetic context size of one is shown to be sufficient for the best performance. A simplified scheduled sampling approach is applied for further improvement and different decoding approaches are briefly compared. The overall performance of our best model is comparable to state-of-the-art (SOTA) results for the TED-LIUM Release 2 and Switchboard corpora.
Pages: 5644 - 5648
Number of pages: 5
Related papers
50 in total
  • [1] ADAPTABLE PHONEME-BASED MODELS FOR LARGE-VOCABULARY SPEECH RECOGNITION
    BAMBERG, PG
    MANDEL, MA
    [J]. SPEECH COMMUNICATION, 1991, 10 (5-6) : 437 - 451
  • [2] Transcription of out-of-vocabulary words in large vocabulary speech recognition based on phoneme-to-grapheme conversion
    Decadt, B
    Duchateau, J
    Daelemans, W
    Wambacq, P
    [J]. 2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-IV, PROCEEDINGS, 2002, : 861 - 864
  • [3] Phoneme-to-grapheme conversion for out-of-vocabulary words in large vocabulary speech recognition
    Decadt, B
    Duchateau, J
    Daelemans, W
    Wambacq, P
    [J]. ASRU 2001: IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING, CONFERENCE PROCEEDINGS, 2001, : 413 - 416
  • [4] Phoneme Set and Pronouncing Dictionary Creation for Large Vocabulary Continuous Speech Recognition of Vietnamese
    Thien Chuong Nguyen
    Chaloupka, Josef
    [J]. TEXT, SPEECH, AND DIALOGUE, TSD 2013, 2013, 8082 : 394 - 401
  • [5] Training of across-word phoneme models for large vocabulary continuous speech recognition
    Sixtus, A
    Ney, H
    [J]. 2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-IV, PROCEEDINGS, 2002, : 849 - 852
  • [6] The Deep Tensor Neural Network With Applications to Large Vocabulary Speech Recognition
    Yu, Dong
    Deng, Li
    Seide, Frank
    [J]. IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2013, 21 (02) : 388 - 396
  • [7] Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition
    Jaitly, Navdeep
    Patrick Nguyen
    Senior, Andrew
    Vanhoucke, Vincent
    [J]. 13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, 2012, : 2577 - 2580
  • [8] Deep Spiking Neural Networks for Large Vocabulary Automatic Speech Recognition
    Wu, Jibin
    Yilmaz, Emre
    Zhang, Malu
    Li, Haizhou
    Tan, Kay Chen
    [J]. FRONTIERS IN NEUROSCIENCE, 2020, 14
  • [9] EXPLOITING SPARSENESS IN DEEP NEURAL NETWORKS FOR LARGE VOCABULARY SPEECH RECOGNITION
    Yu, Dong
    Seide, Frank
    Li, Gang
    Deng, Li
    [J]. 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2012, : 4409 - 4412
  • [10] Large Vocabulary Speech Recognition Using Deep Tensor Neural Networks
    Yu, Dong
    Deng, Li
    Seide, Frank
    [J]. 13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, 2012, : 6 - 9