Joint CTC/attention decoding for end-to-end speech recognition

Cited by: 67
Authors
Hori, Takaaki [1]
Watanabe, Shinji [1]
Hershey, John R. [1]
Affiliation
[1] Mitsubishi Elect Res Labs MERL, Cambridge, MA 02139 USA
DOI
10.18653/v1/P17-1048
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
End-to-end automatic speech recognition (ASR) has become a popular alternative to conventional DNN/HMM systems because it avoids the need for linguistic resources such as pronunciation dictionaries, tokenization, and context-dependency trees, leading to a greatly simplified model-building process. There are two major types of end-to-end architectures for ASR: attention-based methods use an attention mechanism to perform alignment between acoustic frames and recognized symbols, whereas connectionist temporal classification (CTC) uses Markov assumptions to efficiently solve sequential problems by dynamic programming. This paper proposes a joint decoding algorithm for end-to-end ASR with a hybrid CTC/attention architecture, which effectively utilizes the advantages of both in decoding. We have applied the proposed method to two ASR benchmarks (spontaneous Japanese and Mandarin Chinese), and it shows performance comparable to conventional state-of-the-art DNN/HMM ASR systems without linguistic resources.
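The abstract does not spell out how the two models are combined, so the following is only a minimal Python sketch of one plausible reading: each candidate transcription is scored by a weighted sum of its CTC log-probability and its attention-decoder log-probability, and the best-scoring candidate is kept. The function names, the interpolation weight `lam`, and the toy inputs are assumptions made for illustration, not details taken from this record. The CTC term is computed with the standard forward algorithm.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over a few values."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_log_prob(log_probs, labels, blank=0):
    """log p_ctc(labels | X) via the CTC forward algorithm.

    log_probs[t][k] is the log posterior of symbol k at frame t.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    size = len(ext)

    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [NEG_INF] * size
        for s in range(size):
            terms = [alpha[s]]
            if s >= 1:
                terms.append(alpha[s - 1])
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new

    if size > 1:
        return logsumexp(alpha[-1], alpha[-2])
    return alpha[-1]


def joint_rescore(hyps, log_probs, lam=0.5):
    """Pick the hypothesis with the best interpolated score.

    hyps: list of (labels, attention_log_prob) pairs (hypothetical input format).
    lam:  assumed weight on the CTC score; (1 - lam) goes to attention.
    """
    def score(hyp):
        labels, att_log_prob = hyp
        return lam * ctc_log_prob(log_probs, labels) + (1.0 - lam) * att_log_prob

    return max(hyps, key=score)[0]


# Toy example: 2 frames, 2 symbols (0 = blank, 1 = a label), uniform posteriors.
frames = [[math.log(0.5), math.log(0.5)]] * 2
candidates = [([1], math.log(0.4)), ([], math.log(0.6))]
best = joint_rescore(candidates, frames, lam=0.5)
```

In this toy case the attention scores alone would prefer the empty hypothesis, while the interpolated score prefers `[1]`, which illustrates how the CTC term can correct an attention-only ranking.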
Pages: 518-529
Page count: 12
Related papers
50 in total
  • [1] Investigating Joint CTC-Attention Models for End-to-End Russian Speech Recognition
    Markovnikov, Nikita
    Kipyatkova, Irina
    [J]. SPEECH AND COMPUTER, SPECOM 2019, 2019, 11658 : 337 - 347
  • [2] STREAMING END-TO-END SPEECH RECOGNITION WITH JOINT CTC-ATTENTION BASED MODELS
    Moritz, Niko
    Hori, Takaaki
    Le Roux, Jonathan
    [J]. 2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, : 936 - 943
  • [3] Hybrid CTC/Attention Architecture for End-to-End Speech Recognition
    Watanabe, Shinji
    Hori, Takaaki
    Kim, Suyoun
    Hershey, John R.
    Hayashi, Tomoki
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2017, 11 (08) : 1240 - 1253
  • [4] Online Hybrid CTC/Attention Architecture for End-to-end Speech Recognition
    Miao, Haoran
    Cheng, Gaofeng
    Zhang, Pengyuan
    Li, Ta
    Yan, Yonghong
    [J]. INTERSPEECH 2019, 2019, : 2623 - 2627
  • [5] Joint CTC-Attention End-to-End Speech Recognition with a Triangle Recurrent Neural Network Encoder
    Zhu T.
    Cheng C.
    [J]. Journal of Shanghai Jiaotong University (Science), 2020, 25 (01) : 70 - 75
  • [6] End-to-end recognition of streaming Japanese speech using CTC and local attention
    Chen, Jiahao
    Nishimura, Ryota
    Kitaoka, Norihide
    [J]. APSIPA TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING, 2020, 9 (01)
  • [7] Hybrid CTC/Attention End-to-End Chinese Speech Recognition Enhanced by Conformer
    [J]. 2024, 59 (04) : 97 - 103
  • [8] Online Hybrid CTC/Attention End-to-End Automatic Speech Recognition Architecture
    Miao, Haoran
    Cheng, Gaofeng
    Zhang, Pengyuan
    Yan, Yonghong
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2020, 28 : 1452 - 1465
  • [9] Exploring Hybrid CTC/Attention End-to-End Speech Recognition with Gaussian Processes
    Kuerzinger, Ludwig
    Watzel, Tobias
    Li, Lujun
    Baumgartner, Robert
    Rigoll, Gerhard
    [J]. SPEECH AND COMPUTER, SPECOM 2019, 2019, 11658 : 258 - 269
  • [10] JOINT CTC-ATTENTION BASED END-TO-END SPEECH RECOGNITION USING MULTI-TASK LEARNING
    Kim, Suyoun
    Hori, Takaaki
    Watanabe, Shinji
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 4835 - 4839