TOWARDS FAST AND ACCURATE STREAMING END-TO-END ASR

Cited by: 0
Authors:
Li, Bo [1 ]
Chang, Shuo-yiin [1 ]
Sainath, Tara N. [1 ]
Pang, Ruoming [1 ]
He, Yanzhang [1 ]
Strohman, Trevor [1 ]
Wu, Yonghui [1 ]
Affiliations:
[1] Google LLC, Mountain View, CA 94043 USA
Keywords:
RNN-T; Endpointer; Latency;
DOI
10.1109/icassp40776.2020.9054715
CLC number: O42 [Acoustics]
Discipline codes: 070206; 082403
Abstract
End-to-end (E2E) models fold the acoustic, pronunciation and language models of a conventional speech recognition system into one neural network with far fewer parameters, making them well suited to on-device applications. For example, the recurrent neural network transducer (RNN-T), a streaming E2E model, has shown promising potential for on-device ASR [1]. For such applications, quality and latency are two critical factors. We propose to reduce an E2E model's latency by extending the RNN-T endpointer (RNN-T EP) model [2] with additional early and late penalties. By further applying the minimum word error rate (MWER) training technique [3], we achieve an 8.0% relative word error rate (WER) reduction and a 130 ms 90th-percentile latency reduction over [2] on a Voice Search test set. We also experiment with a second-pass Listen, Attend and Spell (LAS) rescorer [4]. Although it does not directly improve first-pass latency, its large WER reduction provides extra room to trade WER for latency. RNN-T EP with LAS rescoring, together with MWER training, brings an 18.7% relative WER reduction and a 160 ms 90th-percentile latency reduction compared to the originally proposed RNN-T EP model [2].
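The early and late penalties described in the abstract discourage the model from emitting the end-of-query token too far before or after the reference endpoint. As a minimal illustrative sketch (not the paper's exact loss), one can picture a hinge-style penalty around a reference endpoint frame; the names `t_ref`, the grace buffers, and the weights below are assumptions for illustration only:

```python
def endpoint_penalty(t, t_ref, t_buffer_early=0, t_buffer_late=0,
                     alpha_early=1.0, alpha_late=1.0):
    """Hinge-style penalty for emitting the end-of-query token at frame t.

    Illustrative sketch only: the paper penalizes endpoint decisions that
    fall outside a window around the reference endpoint t_ref; the exact
    functional form and weights here are assumptions, not the paper's.
    """
    # Penalty grows linearly the earlier the endpoint fires before the window.
    early = alpha_early * max(0, (t_ref - t_buffer_early) - t)
    # Penalty grows linearly the later the endpoint fires after the window.
    late = alpha_late * max(0, t - (t_ref + t_buffer_late))
    return early + late
```

Firing exactly at (or within the buffer around) `t_ref` incurs no penalty; trading off `alpha_early` against `alpha_late` is one way to balance premature cutoffs against added latency.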
Pages: 6069-6073 (5 pages)