Improving Streaming End-to-End ASR on Transformer-based Causal Models with Encoder States Revision Strategies

Cited by: 0

Authors:

Li, Zehan [1,2]

Miao, Haoran [1,2]

Deng, Keqi [1,2]

Cheng, Gaofeng [1]

Tian, Sanli [1,2]

Li, Ta [1,2]

Yan, Yonghong [1,2]

Affiliations:

[1] Chinese Acad Sci, Inst Acoust, Key Lab Speech Acoust & Content Understanding, Beijing, Peoples R China

[2] Univ Chinese Acad Sci, Beijing, Peoples R China

Source:

INTERSPEECH 2022 | 2022

Keywords:

Streaming ASR; Causal model; Transformer; Encoder states revision; Speech recognition

DOI:

10.21437/Interspeech.2022-707

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Subject classification codes:

070206; 082403

Abstract:

There is often a trade-off between performance and latency in streaming automatic speech recognition (ASR). Traditional methods, such as look-ahead and chunk-based methods, usually require information from future frames to improve recognition accuracy, which incurs unavoidable latency even if the computation is fast enough. A causal model that computes without any future frames can avoid this latency, but its performance is significantly worse than that of traditional methods. In this paper, we propose revision strategies to improve the causal model. First, we introduce a real-time encoder states revision strategy to modify previous states. Encoder forward computation starts as soon as the data is received and revises the previous encoder states after several frames, so there is no need to wait for any right context. Furthermore, a CTC spike position alignment decoding algorithm is designed to reduce the time cost introduced by the proposed revision strategy. All experiments are conducted on the LibriSpeech datasets. With fine-tuning on the CTC-based wav2vec2.0 model, our best method achieves WERs of 3.7/9.2 on the test-clean/test-other sets and brings a 45% relative improvement for causal models, which is also competitive with chunk-based methods and knowledge distillation methods.


Pages: 1671-1675

Page count: 5
