TOWARDS FAST AND ACCURATE STREAMING END-TO-END ASR

被引:0
|
作者
Li, Bo [1 ]
Chang, Shuo-yiin [1 ]
Sainath, Tara N. [1 ]
Pang, Ruoming [1 ]
He, Yanzhang [1 ]
Strohman, Trevor [1 ]
Wu, Yonghui [1 ]
机构
[1] Google LLC, Mountain View, CA 94043 USA
关键词
RNN-T; Endpointer; Latency;
D O I
10.1109/icassp40776.2020.9054715
中图分类号
O42 [声学];
学科分类号
070206 ; 082403 ;
摘要
End-to-end (E2E) models fold the acoustic, pronunciation and language models of a conventional speech recognition model into one neural network with a much smaller number of parameters than a conventional ASR system, thus making it suitable for on-device applications. For example, recurrent neural network transducer (RNNT) as a streaming E2E model has shown promising potential for on-device ASR [1]. For such applications, quality and latency are two critical factors. We propose to reduce E2E model's latency by extending the RNN-T endpointer (RNN-T EP) model [2] with additional early and late penalties. By further applying the minimum word error rate (MWER) training technique [3], we achieved 8.0% relative word error rate (WER) reduction and 130ms 90-percentile latency reduction over [2] on a Voice Search test set. We also experimented with a second-pass Listen, Attend and Spell (LAS) rescorer [4]. Although it did not directly improve the first pass latency, the large WER reduction provides extra room to trade WER for latency. RNN-T EP+LAS, together with MWER training brings in 18.7% relative WER reduction and 160ms 90-percentile latency reductions compared to the original proposed RNN-T EP [2] model.
引用
收藏
页码:6069 / 6073
页数:5
相关论文
共 50 条
  • [1] A BETTER AND FASTER END-TO-END MODEL FOR STREAMING ASR
    Li, Bo
    Gulati, Anmol
    Yu, Jiahui
    Sainath, Tara N.
    Chiu, Chung-Cheng
    Narayanan, Arun
    Chang, Shuo-Yiin
    Pang, Ruoming
    He, Yanzhang
    Qin, James
    Han, Wei
    Liang, Qiao
    Zhang, Yu
    Strohman, Trevor
    Wu, Yonghui
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5634 - 5638
  • [2] Towards Lifelong Learning of End-to-end ASR
    Chang, Heng-Jui
    Lee, Hung-yi
    Lee, Lin-shan
    [J]. INTERSPEECH 2021, 2021, : 2551 - 2555
  • [3] Multiple Softmax Architecture for Streaming Multilingual End-to-End ASR Systems
    Joshi, Vikas
    Das, Amit
    Sun, Eric
    Mehta, Rupesh R.
    Li, Jinyu
    Gong, Yifan
    [J]. INTERSPEECH 2021, 2021, : 1767 - 1771
  • [4] ENDPOINT DETECTION FOR STREAMING END-TO-END MULTI-TALKER ASR
    Lu, Liang
    Li, Jinyu
    Gong, Yifan
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7312 - 7316
  • [5] COMPARATIVE STUDY OF DIFFERENT TOKENIZATION STRATEGIES FOR STREAMING END-TO-END ASR
    Singh, Sachin
    Gupta, Ashutosh
    Maghan, Aman
    Gowda, Dhananjaya
    Singh, Shatrughan
    Kim, Chanwoo
    [J]. 2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 388 - 394
  • [6] Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models
    Wang, Tianzi
    Fujita, Yuya
    Chang, Xuankai
    Watanabe, Shinji
    [J]. INTERSPEECH 2021, 2021, : 3755 - 3759
  • [7] DOES SPEECH ENHANCEMENTWORK WITH END-TO-END ASR OBJECTIVES?: EXPERIMENTAL ANALYSIS OF MULTICHANNEL END-TO-END ASR
    Ochiai, Tsubasa
    Watanabe, Shinji
    Katagiri, Shigeru
    [J]. 2017 IEEE 27TH INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, 2017,
  • [8] TOWARDS CODE-SWITCHING ASR FOR END-TO-END CTC MODELS
    Li, Ke
    Li, Jinyu
    Ye, Guoli
    Zhao, Rui
    Gong, Yifan
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6076 - 6080
  • [9] Streaming End-to-End ASR Using CTC Decoder and DRA for Linguistic Information Substitution
    Takagi, Tatsunari
    Ogawa, Atsunori
    Kitaoka, Norihide
    Wakabayashi, Yukoh
    [J]. 2023 ASIA PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE, APSIPA ASC, 2023, : 1779 - 1783
  • [10] STREAMING BILINGUAL END-TO-END ASR MODEL USING ATTENTION OVER MULTIPLE SOFTMAX
    Patil, Aditya
    Joshi, Vikas
    Agrawal, Purvi
    Mehta, Rupesh
    [J]. 2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 252 - 259