Exploring Hybrid CTC/Attention End-to-End Speech Recognition with Gaussian Processes

Cited by: 1
Authors
Kuerzinger, Ludwig [1]
Watzel, Tobias [1]
Li, Lujun [1]
Baumgartner, Robert [1]
Rigoll, Gerhard [1]
Affiliations
[1] Tech Univ Munich, Inst Human Machine Commun, Munich, Germany
Keywords
Connectionist Temporal Classification; Attention-based neural networks; End-to-end speech recognition; Gaussian process optimization; Multi-objective training; Hybrid CTC/attention; ALGORITHM
DOI
10.1007/978-3-030-26061-3_27
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Hybrid CTC/attention end-to-end speech recognition combines two powerful concepts. Given a speech feature sequence, the attention mechanism directly outputs a sequence of letters. Connectionist Temporal Classification (CTC) helps to bind the attention mechanism to sequential alignments. This hybrid architecture also gives more degrees of freedom in choosing parameter configurations. We applied Gaussian process optimization to estimate the impact of network parameters and of the language model weight in decoding on Character Error Rate (CER), as well as on attention accuracy. In total, we trained 70 hybrid CTC/attention networks and performed 590 beam search runs with an RNNLM as language model on the TEDlium v2 test set. To our surprise, the results challenge the assumption that CTC primarily regularizes the attention mechanism. We argue in an evidence-based manner that CTC instead regularizes the impact of language model feedback in a one-pass beam search, as letter hypotheses are fed back into the attention mechanism. Attention-only models without RNNLM already achieved 10.9% CER, or 22.4% Word Error Rate (WER), on the TEDlium v2 test set. Combined decoding of the same attention-only networks with RNNLM strongly underperformed, with at best 40.2% CER, or 49.3% WER. A combined hybrid CTC/attention model with RNNLM performed best, with 8.9% CER, or 17.6% WER.
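The abstract refers to multi-objective training and to a language model weight in decoding without giving formulas. The sketch below shows one common way such weights are applied in hybrid CTC/attention systems: a linear interpolation of the two training losses, and a weighted sum of log-probabilities when scoring a beam search hypothesis. The function names and the parameters ctc_weight and lm_weight are illustrative assumptions, not taken from this record, and the authors' exact formulation may differ.

```python
def hybrid_loss(ctc_loss: float, att_loss: float, ctc_weight: float) -> float:
    """Interpolate the CTC and attention training losses.

    ctc_weight = 0.0 gives an attention-only objective,
    ctc_weight = 1.0 gives a CTC-only objective.
    (Assumed formulation; hypothetical parameter name.)
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


def joint_score(log_p_ctc: float, log_p_att: float, log_p_lm: float,
                ctc_weight: float, lm_weight: float) -> float:
    """Score one hypothesis in a one-pass beam search.

    The CTC and attention log-probabilities are interpolated, and the
    language model log-probability is added with its own weight.
    (Assumed formulation; hypothetical parameter names.)
    """
    acoustic = ctc_weight * log_p_ctc + (1.0 - ctc_weight) * log_p_att
    return acoustic + lm_weight * log_p_lm


if __name__ == "__main__":
    # Attention-only decoding with a language model: the CTC term drops out.
    print(joint_score(-1.0, -3.0, -2.0, ctc_weight=0.0, lm_weight=0.5))
    # Hybrid decoding: CTC contributes to every hypothesis score.
    print(joint_score(-1.0, -3.0, -2.0, ctc_weight=0.5, lm_weight=0.5))
```

Under this formulation, ctc_weight and lm_weight are the kind of scalar settings that a Gaussian process optimizer could search over, with CER on a held-out set as the objective.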
Pages: 258-269
Number of pages: 12