Non-autoregressive Deliberation-Attention based End-to-End ASR

Cited by: 1
Authors
Gao, Changfeng [1 ,2 ]
Cheng, Gaofeng [1 ,2 ]
Zhou, Jun [1 ,2 ]
Zhang, Pengyuan [1 ,2 ]
Yan, Yonghong [1 ,2 ]
Affiliations
[1] Inst Acoust, Key Lab Speech Acoust & Content Understanding, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
automatic speech recognition; end-to-end; non-autoregressive; highly paralleled;
DOI
10.1109/ISCSLP49672.2021.9362115
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Attention-based encoder-decoder end-to-end (E2E) automatic speech recognition (ASR) architectures have achieved state-of-the-art results on many ASR tasks. However, conventional attention-based E2E ASR models rely on an autoregressive decoder, which makes parallel computation during decoding difficult. In this paper, we propose a novel deliberation-attention (D-Att) based E2E ASR architecture that replaces the autoregressive attention-based decoder with a non-autoregressive frame-level D-Att decoder, thereby significantly accelerating GPU-parallel decoding. The D-Att decoder differs from the conventional attention decoder in two respects. First, the D-Att decoder uses frame-level text embeddings (FLTE) generated by an auxiliary ASR model, rather than the ground-truth transcripts or previous predictions required by the conventional attention decoder. Second, whereas the conventional attention decoder is trained in a left-to-right, label-synchronous manner, the D-Att decoder is trained under the supervision of a connectionist temporal classification (CTC) loss and uses the FLTE to provide text information. Our experiments on the Aishell, HKUST, and WSJ benchmarks show that the proposed D-Att E2E ASR models match the performance of state-of-the-art autoregressive attention-based Transformer E2E ASR baselines while decoding 10 times faster with GPU parallelism.
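To make the described architecture concrete, below is a minimal PyTorch sketch of the D-Att idea from the abstract: a non-autoregressive decoder that cross-attends from acoustic encoder frames to frame-level text embeddings (FLTE) supplied by an auxiliary ASR pass, and emits per-frame logits trained with a CTC loss, so decoding is one parallel pass rather than token-by-token autoregression. All module names, dimensions, and the single-layer structure are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a deliberation-attention (D-Att) style decoder.
# Names (DAttDecoder, flte, aux_frame_tokens) are illustrative only.
import torch
import torch.nn as nn

class DAttDecoder(nn.Module):
    def __init__(self, d_model=256, n_heads=4, vocab_size=5000):
        super().__init__()
        # FLTE: embed the auxiliary ASR model's per-frame token hypotheses.
        self.flte = nn.Embedding(vocab_size, d_model)
        # Cross-attention from acoustic frames to the frame-level text stream.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.out = nn.Linear(d_model, vocab_size)  # per-frame logits for CTC

    def forward(self, enc_out, aux_frame_tokens):
        # enc_out: (B, T, d_model) acoustic encoder features
        # aux_frame_tokens: (B, T) frame-level token ids from the auxiliary ASR
        text = self.flte(aux_frame_tokens)            # (B, T, d_model)
        x, _ = self.attn(enc_out, text, text)         # deliberation attention
        x = self.norm1(enc_out + x)
        x = self.norm2(x + self.ffn(x))
        return self.out(x)                            # (B, T, vocab_size)

# Training supervises the per-frame logits with a CTC loss; decoding is a
# single parallel pass (e.g. greedy CTC collapse), with no autoregression.
enc_out = torch.randn(2, 100, 256)
aux_tokens = torch.randint(0, 5000, (2, 100))
logits = DAttDecoder()(enc_out, aux_tokens)
log_probs = logits.log_softmax(-1).transpose(0, 1)    # (T, B, V) for CTCLoss
targets = torch.randint(1, 5000, (2, 20))
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           torch.full((2,), 100), torch.full((2,), 20))
```

Because the output is frame-synchronous, a greedy collapse of the per-frame argmax already yields a hypothesis for the whole utterance in one step; removing the token-by-token loop is what enables the roughly 10x GPU decoding speedup the abstract reports.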
Pages: 5
Related Papers
50 records in total
  • [1] Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models
    Wang, Tianzi
    Fujita, Yuya
    Chang, Xuankai
    Watanabe, Shinji
    INTERSPEECH 2021, 2021, : 3755 - 3759
  • [2] IMPROVED MASK-CTC FOR NON-AUTOREGRESSIVE END-TO-END ASR
    Higuchi, Yosuke
    Inaguma, Hirofumi
    Watanabe, Shinji
    Ogawa, Tetsuji
    Kobayashi, Tetsunori
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 8363 - 8367
  • [3] Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict
    Higuchi, Yosuke
    Watanabe, Shinji
    Chen, Nanxin
    Ogawa, Tetsuji
    Kobayashi, Tetsunori
    INTERSPEECH 2020, 2020, : 3655 - 3659
  • [4] BOUNDARY AND CONTEXT AWARE TRAINING FOR CIF-BASED NON-AUTOREGRESSIVE END-TO-END ASR
    Yu, Fan
    Luo, Haoneng
    Guo, Pengcheng
    Liang, Yuhao
    Yao, Zhuoyuan
    Xie, Lei
    Gao, Yingying
    Hou, Leijing
    Zhang, Shilei
    2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 328 - 334
  • [5] Achieving Timestamp Prediction While Recognizing with Non-autoregressive End-to-End ASR Model
    Shi, Xian
    Chen, Yanni
    Zhang, Shiliang
    Yan, Zhijie
    MAN-MACHINE SPEECH COMMUNICATION, NCMMSC 2022, 2023, 1765 : 89 - 100
  • [6] End-to-End Neural Speaker Diarization with an Iterative Refinement of Non-Autoregressive Attention-based Attractors
    Rybicka, Magdalena
    Villalba, Jesus
    Dehak, Najim
    Kowalczyk, Konrad
    INTERSPEECH 2022, 2022, : 5090 - 5094
  • [7] End-to-End Neural Speaker Diarization With Non-Autoregressive Attractors
    Rybicka, Magdalena
    Villalba, Jesus
    Thebaud, Thomas
    Dehak, Najim
    Kowalczyk, Konrad
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 3960 - 3973
  • [8] Non-autoregressive End-to-End TTS with Coarse-to-Fine Decoding
    Wang, Tao
    Liu, Xuefei
    Tao, Jianhua
    Yi, Jiangyan
    Fu, Ruibo
    Wen, Zhengqi
    INTERSPEECH 2020, 2020, : 3984 - 3988
  • [9] Analysis of expressivity transfer in non-autoregressive end-to-end multispeaker TTS systems
    Kulkarni, Ajinkya
    Colotte, Vincent
    Jouvet, Denis
    INTERSPEECH 2022, 2022, : 4581 - 4585
  • [10] FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis
    Wang, Yongqi
    Zhao, Zhou
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 5678 - 5687