EXPLORING NEURAL TRANSDUCERS FOR END-TO-END SPEECH RECOGNITION

Cited by: 0
Authors
Battenberg, Eric [1 ]
Chen, Jitong [1 ]
Child, Rewon [1 ]
Coates, Adam [1 ]
Gaur, Yashesh [1 ]
Li, Yi [1 ]
Liu, Hairong [1 ]
Satheesh, Sanjeev [1 ]
Sriram, Anuroop [1 ]
Zhu, Zhenyao [1 ]
Affiliations
[1] Baidu Silicon Valley AI Lab, Sunnyvale, CA 94089 USA
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
In this work, we perform an empirical comparison among the CTC, RNN-Transducer, and attention-based Seq2Seq models for end-to-end speech recognition. We show that, without any language model, Seq2Seq and RNN-Transducer models both outperform the best reported CTC models with a language model, on the popular Hub5'00 benchmark. On our internal diverse dataset, these trends continue: RNN-Transducer models rescored with a language model after beam search outperform our best CTC models. These results simplify the speech recognition pipeline so that decoding can now be expressed purely as neural network operations. We also study how the choice of encoder architecture affects the performance of the three models: when all encoder layers are forward only, and when encoders downsample the input representation aggressively.
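The CTC models compared above all share one post-processing rule at inference time: collapse consecutive repeated labels, then remove blanks. The sketch below illustrates that greedy (best-path) CTC decoding step; the function name, the choice of blank index, and the example token ids are illustrative assumptions, not taken from the paper.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Greedy CTC decoding: collapse repeats, then drop blanks.

    frame_ids: per-frame argmax token ids from an acoustic model
               (hypothetical input; blank index assumed to be 0).
    """
    out = []
    prev = None
    for t in frame_ids:
        if t != prev:          # collapse consecutive repeated labels
            if t != blank:     # drop the blank symbol
                out.append(t)
        prev = t
    return out

# e.g. frame labels [0, 3, 3, 0, 3, 5, 5, 0] decode to [3, 3, 5]:
# the blank between the two 3s keeps them as distinct output tokens.
print(ctc_greedy_decode([0, 3, 3, 0, 3, 5, 5, 0]))
```

Note that a blank separating two identical labels is what allows repeated output tokens, which is why the collapse must happen before blank removal rather than after.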
Pages: 206-213
Page count: 8
Related papers
50 entries in total
  • [1] Self-Attention Transducers for End-to-End Speech Recognition
    Tian, Zhengkun
    Yi, Jiangyan
    Tao, Jianhua
    Bai, Ye
    Wen, Zhengqi
    [J]. INTERSPEECH 2019, 2019, : 4395 - 4399
  • [2] Insights on Neural Representations for End-to-End Speech Recognition
    Ollerenshaw, Anna
    Jalal, Asif
    Hain, Thomas
    [J]. INTERSPEECH 2021, 2021, : 4079 - 4083
  • [3] End-to-End Neural Segmental Models for Speech Recognition
    Tang, Hao
    Lu, Liang
    Kong, Lingpeng
    Gimpel, Kevin
    Livescu, Karen
    Dyer, Chris
    Smith, Noah A.
    Renals, Steve
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2017, 11 (08) : 1254 - 1264
  • [4] Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition
    Shinohara, Yusuke
    Watanabe, Shinji
    [J]. INTERSPEECH 2022, 2022, : 2098 - 2102
  • [5] Segmental Recurrent Neural Networks for End-to-end Speech Recognition
    Lu, Liang
    Kong, Lingpeng
    Dyer, Chris
    Smith, Noah A.
    Renals, Steve
    [J]. 17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 385 - 389
  • [6] End-to-End Speech Emotion Recognition Based on Neural Network
    Zhu, Bing
    Zhou, Wenkai
    Wang, Yutian
    Wang, Hui
    Cai, Juan Juan
    [J]. 2017 17TH IEEE INTERNATIONAL CONFERENCE ON COMMUNICATION TECHNOLOGY (ICCT 2017), 2017, : 1634 - 1638
  • [7] Towards End-to-End Speech Recognition with Recurrent Neural Networks
    Graves, Alex
    Jaitly, Navdeep
    [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 32 (CYCLE 2), 2014, 32 : 1764 - 1772
  • [8] ESPRESSO: A FAST END-TO-END NEURAL SPEECH RECOGNITION TOOLKIT
    Wang, Yiming
    Chen, Tongfei
    Xu, Hainan
    Ding, Shuoyang
    Lv, Hang
    Shao, Yiwen
    Peng, Nanyun
    Xie, Lei
    Watanabe, Shinji
    Khudanpur, Sanjeev
    [J]. 2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 136 - 143
  • [9] Large-Scale Streaming End-to-End Speech Translation with Neural Transducers
    Xue, Jian
    Wang, Peidong
    Li, Jinyu
    Post, Matt
    Gaur, Yashesh
    [J]. INTERSPEECH 2022, 2022, : 3263 - 3267
  • [10] Exploring end-to-end framework towards Khasi speech recognition system
    Syiem, Bronson
    Singh, L. Joyprakash
    [J]. International Journal of Speech Technology, 2021, 24 : 419 - 424