Exploring Hybrid CTC/Attention End-to-End Speech Recognition with Gaussian Processes

Cited by: 1
Authors
Kuerzinger, Ludwig [1]
Watzel, Tobias [1]
Li, Lujun [1]
Baumgartner, Robert [1]
Rigoll, Gerhard [1]
Affiliations
[1] Tech Univ Munich, Inst Human Machine Commun, Munich, Germany
Keywords
Connectionist Temporal Classification; Attention-based neural networks; End-to-end speech recognition; Gaussian process optimization; Multi-objective training; Hybrid CTC/attention; ALGORITHM;
DOI
10.1007/978-3-030-26061-3_27
CLC (Chinese Library Classification)
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Hybrid CTC/attention end-to-end speech recognition combines two powerful concepts. Given a speech feature sequence, the attention mechanism directly outputs a sequence of letters. Connectionist Temporal Classification (CTC) helps to bind the attention mechanism to sequential alignments. This hybrid architecture also gives more degrees of freedom in choosing parameter configurations. We applied Gaussian process optimization to estimate the impact of network parameters and language model weight in decoding on Character Error Rate (CER), as well as attention accuracy. In total, we trained 70 hybrid CTC/attention networks and performed 590 beam search runs with an RNNLM as language model on the TEDlium v2 test set. To our surprise, the results challenge the assumption that CTC primarily regularizes the attention mechanism. We argue in an evidence-based manner that CTC instead regularizes the impact of language model feedback in a one-pass beam search, as letter hypotheses are fed back into the attention mechanism. Attention-only models without an RNNLM already achieved 10.9% CER, or 22.4% Word Error Rate (WER), on the TEDlium v2 test set. Combined decoding of the same attention-only networks with an RNNLM strongly underperformed, reaching at best 40.2% CER, or 49.3% WER. A combined hybrid CTC/attention model with an RNNLM performed best, with 8.9% CER, or 17.6% WER.
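The one-pass beam search the abstract describes scores each hypothesis by interpolating CTC, attention, and language model log-probabilities. A minimal sketch of that scoring rule, assuming the standard hybrid formulation (λ between CTC and attention, a separate weight β for the LM); all weights and probabilities below are illustrative placeholders, not values from the paper:

```python
import math

def hybrid_score(logp_ctc, logp_att, logp_lm, lam=0.5, beta=0.3):
    """Combined score for one hypothesis in a one-pass beam search.

    lam  -- interpolation weight between the CTC and attention scores
    beta -- language model weight applied during decoding
    (Both are hyperparameters of the kind the paper tunes with
    Gaussian process optimization.)
    """
    return lam * logp_ctc + (1.0 - lam) * logp_att + beta * logp_lm

# Toy hypotheses with made-up (ctc, att, lm) log-probabilities:
hyps = {
    "hello": (-2.0, -1.5, -0.8),
    "hallo": (-2.5, -1.2, -3.0),
}
best = max(hyps, key=lambda h: hybrid_score(*hyps[h]))
# With these toy numbers, "hello" wins despite the weaker attention
# score, because the LM term penalizes "hallo" heavily -- the kind of
# LM-feedback effect the paper argues CTC helps to regularize.
```

Setting β too high lets the LM term dominate the search, which is consistent with the abstract's finding that attention-only decoding with an RNNLM underperforms unless CTC is in the mix.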
Pages: 258-269 (12 pages)