Exploring Hybrid CTC/Attention End-to-End Speech Recognition with Gaussian Processes

Cited by: 1

Authors:

Kuerzinger, Ludwig [1]

Watzel, Tobias [1]

Li, Lujun [1]

Baumgartner, Robert [1]

Rigoll, Gerhard [1]

Affiliation:

[1] Tech Univ Munich, Inst Human Machine Commun, Munich, Germany

Source:

SPEECH AND COMPUTER, SPECOM 2019 | 2019 / Vol. 11658

Keywords:

Connectionist Temporal Classification; Attention-based neural networks; End-to-end speech recognition; Gaussian process optimization; Multi-objective training; Hybrid CTC/attention; ALGORITHM;

DOI:

10.1007/978-3-030-26061-3_27

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Hybrid CTC/attention end-to-end speech recognition combines two powerful concepts. Given a speech feature sequence, the attention mechanism directly outputs a sequence of letters. Connectionist Temporal Classification (CTC) helps to bind the attention mechanism to sequential alignments. This hybrid architecture also gives more degrees of freedom in choosing parameter configurations. We applied Gaussian process optimization to estimate the impact of network parameters and of the language model weight in decoding on Character Error Rate (CER), as well as on attention accuracy. In total, we trained 70 hybrid CTC/attention networks and performed 590 beam search runs with an RNNLM as language model on the TEDlium v2 test set. To our surprise, the results challenge the assumption that CTC primarily regularizes the attention mechanism. We argue, based on this evidence, that CTC instead regularizes the impact of language model feedback in a one-pass beam search, as letter hypotheses are fed back into the attention mechanism. Attention-only models without RNNLM already achieved 10.9% CER, or 22.4% Word Error Rate (WER), on the TEDlium v2 test set. Combined decoding of the same attention-only networks with the RNNLM strongly underperformed, with at best 40.2% CER, or 49.3% WER. A combined hybrid CTC/attention model with RNNLM performed best, with 8.9% CER, or 17.6% WER.
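The abstract does not spell out the objective or the decoding score. As a rough illustration, the sketch below assumes the formulation commonly used in the hybrid CTC/attention literature: a linear interpolation of the CTC and attention losses in training, and a weighted sum of CTC, attention, and language model log-probabilities when scoring a hypothesis in beam search. The function and parameter names (`hybrid_loss`, `joint_score`, `ctc_weight`, `lm_weight`) are hypothetical and not taken from the paper.

```python
def hybrid_loss(ctc_loss: float, att_loss: float, ctc_weight: float) -> float:
    """Interpolate the two training objectives (assumed formulation).

    ctc_weight = 0.0 corresponds to an attention-only model,
    ctc_weight = 1.0 to a CTC-only model.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


def joint_score(log_p_ctc: float, log_p_att: float, log_p_lm: float,
                ctc_weight: float, lm_weight: float) -> float:
    """Score one beam search hypothesis from three log-probabilities.

    lm_weight scales the language model contribution; lm_weight = 0.0
    corresponds to decoding without a language model.
    """
    acoustic = ctc_weight * log_p_ctc + (1.0 - ctc_weight) * log_p_att
    return acoustic + lm_weight * log_p_lm
```

Under these assumptions, the weights `ctc_weight` and `lm_weight` are the kind of parameters whose effect on CER a Gaussian process optimization could explore.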


Pages: 258-269

Page count: 12
