Non-autoregressive Deliberation-Attention based End-to-End ASR

Cited by: 1
Authors
Gao, Changfeng [1 ,2 ]
Cheng, Gaofeng [1 ,2 ]
Zhou, Jun [1 ,2 ]
Zhang, Pengyuan [1 ,2 ]
Yan, Yonghong [1 ,2 ]
Affiliations
[1] Inst Acoust, Key Lab Speech Acoust & Content Understanding, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
automatic speech recognition; end-to-end; non-autoregressive; highly paralleled;
DOI
10.1109/ISCSLP49672.2021.9362115
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Attention-based encoder-decoder end-to-end (E2E) automatic speech recognition (ASR) architectures have achieved state-of-the-art results on many ASR tasks. However, conventional attention-based E2E ASR models rely on an autoregressive decoder, which makes parallel computation during decoding difficult. In this paper, we propose a novel deliberation-attention (D-Att) based E2E ASR architecture, which replaces the autoregressive attention-based decoder with a non-autoregressive frame-level D-Att decoder and thus significantly accelerates GPU parallel decoding. The D-Att decoder differs from the conventional attention decoder in two respects: first, the D-Att decoder uses a frame-level text embedding (FLTE) generated by an auxiliary ASR model, instead of the ground-truth transcripts or previous predictions required by the conventional attention decoder; second, whereas the conventional attention decoder is trained in a left-to-right, label-synchronous way, the D-Att decoder is trained under the supervision of a connectionist temporal classification (CTC) loss and relies on the FLTE to provide the text information. Our experiments on the Aishell, HKUST, and WSJ benchmarks show that the proposed D-Att E2E ASR models are comparable in performance to state-of-the-art autoregressive attention-based Transformer E2E ASR baselines, and are 10 times faster with GPU parallel decoding.
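The parallel-decoding property described in the abstract comes from the CTC-style frame-level readout: every frame's label is predicted independently, so the argmax over all frames can be computed at once, followed by the standard CTC collapse (merge repeated labels, drop blanks). The sketch below illustrates only that generic readout step, not the D-Att architecture itself; the logits, vocabulary, and blank index are invented for illustration.

```python
import numpy as np

BLANK = 0  # CTC blank index (assumed convention)

def ctc_greedy_decode(log_probs: np.ndarray) -> list:
    """Non-autoregressive greedy decode: one argmax per frame,
    computed for all frames at once (hence parallelizable),
    followed by the standard CTC collapse."""
    frame_labels = log_probs.argmax(axis=-1)  # every frame, no left-to-right dependency
    out, prev = [], BLANK
    for lab in frame_labels:
        # keep a label only when it starts a new run and is not blank
        if lab != prev and lab != BLANK:
            out.append(int(lab))
        prev = lab
    return out

# Toy frame-level scores for 6 frames over a 4-symbol vocabulary
# (blank, 'a'=1, 'b'=2, 'c'=3); the values are illustrative only.
logits = np.array([
    [0.1, 2.0, 0.0, 0.0],  # 'a'
    [0.1, 2.0, 0.0, 0.0],  # 'a' (repeat, merged by the collapse)
    [3.0, 0.0, 0.0, 0.0],  # blank
    [0.0, 0.0, 2.5, 0.0],  # 'b'
    [3.0, 0.0, 0.0, 0.0],  # blank
    [0.0, 0.0, 0.0, 2.2],  # 'c'
])
print(ctc_greedy_decode(logits))  # -> [1, 2, 3]
```

Because no step depends on the previous output symbol, the per-frame argmax maps directly onto a single batched GPU operation, which is the source of the decoding speedup the paper reports.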
Pages: 5