UNIFIED END-TO-END SPEECH RECOGNITION AND ENDPOINTING FOR FAST AND EFFICIENT SPEECH SYSTEMS

Cited by: 3

Authors:

Bijwadia, Shaan [1]

Chang, Shuo-yiin [1]

Li, Bo [1]

Sainath, Tara [1]

Zhang, Chao [1]

He, Yanzhang [1]

Affiliations:

[1] Google Inc, Mountain View, CA 94043 USA

Source:

2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT | 2022

Keywords:

endpointing; end-to-end speech recognition; voice activity detection; end of query detection; multitask

DOI:

10.1109/SLT54892.2023.10022338

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

Automatic speech recognition (ASR) systems typically rely on an external endpointer (EP) model to identify speech boundaries. In this work, we propose a method to jointly train the ASR and EP tasks in a single end-to-end (E2E) multitask model, improving EP quality by optionally leveraging information from the ASR audio encoder. We introduce a "switch" connection, which trains the EP to consume either the audio frames directly or low-level latent representations from the ASR model. This results in a single E2E model that can be used during inference to perform frame filtering at low cost, and also make high quality end-of-query (EOQ) predictions based on ongoing ASR computation. We present results on a voice search test set showing that, compared to separate single-task models, this approach reduces median endpoint latency by 120 ms (30.8% reduction), and 90th percentile latency by 170 ms (23.0% reduction), without regressing word error rate. For continuous recognition, WER improves by 10.6% (relative).
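The "switch" connection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy logistic endpointer, the assumption that both inputs have the same dimensionality, and the random per-example choice of switch position during training are all hypothetical and chosen only to show the routing idea.

```python
import math
import random


def switch(audio_features, encoder_latents, use_latents):
    """Route one of two same-sized inputs to the endpointer (hypothetical helper)."""
    return encoder_latents if use_latents else audio_features


def endpointer_score(features, weights, bias=0.0):
    """Toy endpointer: a logistic score standing in for P(end of query)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def training_inputs(batch, p_latent=0.5, rng=None):
    """Choose the switch position at random for each example.

    Assumption: sampling both positions during training is one way to make
    a single endpointer usable with either input type at inference time.
    """
    rng = rng or random.Random(0)
    return [switch(a, h, rng.random() < p_latent) for a, h in batch]


# Inference-time usage: raw audio features for cheap frame filtering,
# ASR encoder latents for end-of-query prediction.
audio = [0.2, -0.1]
latents = [1.5, 0.7]
weights = [0.8, 0.4]
frame_filter_score = endpointer_score(switch(audio, latents, False), weights)
eoq_score = endpointer_score(switch(audio, latents, True), weights)
```

In this sketch the same endpointer weights serve both paths, which mirrors the abstract's claim of a single model that supports low-cost frame filtering as well as EOQ prediction informed by ongoing ASR computation.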


Pages: 310-316

Number of pages: 7

