Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM

Cited by: 152
Authors
Hori, Takaaki [1 ]
Watanabe, Shinji [1 ]
Zhang, Yu [2 ]
Chan, William [3 ]
Affiliations
[1] Mitsubishi Elect Res Labs, Cambridge, MA 02139 USA
[2] MIT, Cambridge, MA 02139 USA
[3] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Keywords
end-to-end speech recognition; encoder-decoder; connectionist temporal classification; attention model;
DOI
10.21437/Interspeech.2017-1296
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network. The encoder is a deep Convolutional Neural Network (CNN) based on the VGG network. The CTC network sits on top of the encoder and is jointly trained with the attention-based decoder. During the beam search process, we combine the CTC predictions, the attention-based decoder predictions and a separately trained LSTM language model. We achieve a 5-10% error reduction compared to prior systems on spontaneous Japanese and Chinese speech, and our end-to-end model beats out traditional hybrid ASR systems.
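The score combination described in the abstract can be illustrated with a small sketch. It assumes the three models' log-probabilities for a hypothesis are mixed log-linearly; the abstract does not state the formula, so the function names, the weights `ctc_weight` and `lm_weight`, and their default values are illustrative assumptions rather than the authors' settings. It also rescores complete hypotheses, which is a simplification of combining scores inside the beam search.

```python
def joint_score(log_p_ctc, log_p_att, log_p_lm, ctc_weight=0.3, lm_weight=0.1):
    """Log-linear mix of CTC, attention-decoder and language-model scores.

    All inputs are log-probabilities of the same hypothesis.
    The weights are hypothetical values chosen for illustration.
    """
    return (ctc_weight * log_p_ctc
            + (1.0 - ctc_weight) * log_p_att
            + lm_weight * log_p_lm)


def rescore(hypotheses, ctc_weight=0.3, lm_weight=0.1):
    """Rank hypotheses by combined score, best first.

    hypotheses: list of (text, log_p_ctc, log_p_att, log_p_lm) tuples.
    Returns a list of (combined_score, text) tuples.
    """
    scored = [
        (joint_score(ctc, att, lm, ctc_weight, lm_weight), text)
        for text, ctc, att, lm in hypotheses
    ]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Toy log-probabilities, invented for the example.
    toy = [("hypothesis a", -1.0, -2.0, -3.0),
           ("hypothesis b", -5.0, -1.0, -1.0)]
    for score, text in rescore(toy):
        print(f"{score:.3f}  {text}")
```

With these toy numbers, the attention decoder alone would prefer the second hypothesis, while the combined score prefers the first because the CTC score penalizes the second heavily.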
Pages: 949-953
Page count: 5
Related papers
50 in total
  • [1] STREAMING END-TO-END SPEECH RECOGNITION WITH JOINT CTC-ATTENTION BASED MODELS
    Moritz, Niko
    Hori, Takaaki
    Le Roux, Jonathan
    [J]. 2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 936 - 943
  • [2] Joint CTC-Attention End-to-End Speech Recognition with a Triangle Recurrent Neural Network Encoder
    Zhu T.
    Cheng C.
    [J]. Journal of Shanghai Jiaotong University (Science), 2020, 25 (01): : 70 - 75
  • [3] Investigating Joint CTC-Attention Models for End-to-End Russian Speech Recognition
    Markovnikov, Nikita
    Kipyatkova, Irina
    [J]. SPEECH AND COMPUTER, SPECOM 2019, 2019, 11658 : 337 - 347
  • [4] JOINT CTC-ATTENTION BASED END-TO-END SPEECH RECOGNITION USING MULTI-TASK LEARNING
    Kim, Suyoun
    Hori, Takaaki
    Watanabe, Shinji
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 4835 - 4839
  • [5] Hybrid CTC-Attention based End-to-End Speech Recognition using Subword Units
    Xiao, Zhangyu
    Ou, Zhijian
    Chu, Wei
    Lin, Hui
    [J]. 2018 11TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2018, : 146 - 150
  • [6] Improved CTC-Attention Based End-to-End Speech Recognition on Air Traffic Control
    Zhou, Kai
    Yang, Qun
    Sun, XiuSong
    Liu, ShaoHan
    Lu, JinJun
    [J]. INTELLIGENCE SCIENCE AND BIG DATA ENGINEERING: BIG DATA AND MACHINE LEARNING, PT II, 2019, 11936 : 187 - 196
  • [7] DISTILLING KNOWLEDGE FROM ENSEMBLES OF ACOUSTIC MODELS FOR JOINT CTC-ATTENTION END-TO-END SPEECH RECOGNITION
    Gao, Yan
    Parcollet, Titouan
    Lane, Nicholas D.
    [J]. 2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 138 - 145
  • [8] Joint CTC/attention decoding for end-to-end speech recognition
    Hori, Takaaki
    Watanabe, Shinji
    Hershey, John R.
    [J]. PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2017), VOL 1, 2017, : 518 - 529
  • [9] Hybrid CTC-Attention Network-Based End-to-End Speech Recognition System for Korean Language
    Park, Hosung
    Kim, Changmin
    Son, Hyunsoo
    Seo, Soonshin
    Kim, Ji-Hwan
    [J]. JOURNAL OF WEB ENGINEERING, 2022, 21 (02): : 265 - 284
  • [10] Hybrid CTC/Attention Architecture for End-to-End Speech Recognition
    Watanabe, Shinji
    Hori, Takaaki
    Kim, Suyoun
    Hershey, John R.
    Hayashi, Tomoki
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2017, 11 (08) : 1240 - 1253