RESCOREBERT: DISCRIMINATIVE SPEECH RECOGNITION RESCORING WITH BERT

Cited by: 7

Authors:

Xu, Liyan [1,2]

Gu, Yile [1]

Kolehmainen, Jari [1]

Khan, Haidar [1]

Gandhe, Ankur [1]

Rastrow, Ariya [1]

Stolcke, Andreas [1]

Bulyko, Ivan [1]

Affiliations:

[1] Amazon Alexa AI, Seattle, WA 98121 USA

[2] Emory Univ, Atlanta, GA 30322 USA

Source:

2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022

Keywords:

masked language model; BERT; second-pass rescoring; pretrained model; minimum WER training

DOI:

10.1109/ICASSP43922.2022.9747118

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Second-pass rescoring is an important component of automatic speech recognition (ASR) systems: it improves the outputs of a first-pass decoder through lattice rescoring or n-best re-ranking. While pretraining with a masked language model (MLM) objective has achieved great success in various natural language understanding (NLU) tasks, it has not gained traction as a rescoring model for ASR. In particular, training a bidirectional model like BERT on a discriminative objective such as minimum WER (MWER) has not been explored. Here we show how to train a BERT-based rescoring model with MWER loss, to incorporate the improvements of a discriminative loss into the fine-tuning of deep bidirectional pretrained models for ASR. Specifically, we propose a fusion strategy that incorporates the MLM into the discriminative training process to effectively distill knowledge from a pretrained model. We further propose an alternative discriminative loss. This approach, which we call RescoreBERT, reduces WER by 6.6%/3.4% relative on the LibriSpeech clean/other test sets over a BERT baseline without a discriminative objective. We also evaluate our method on an internal dataset from a conversational agent and find that it reduces both latency and WER (by 3 to 8% relative) over an LSTM rescoring model.
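As an illustration of the n-best re-ranking and MWER training mentioned in the abstract, the following is a minimal sketch of the commonly used MWER formulation over an n-best list. It is not taken from the paper itself. The function names, the "higher score is better" convention, and the interpolation `weight` are assumptions made for this example, and the rescorer scores are plain numbers standing in for the output of a BERT-based model.

```python
import math


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def combine(first_pass_scores, rescorer_scores, weight=1.0):
    """Interpolate first-pass and rescorer scores (assumed: higher is better)."""
    return [f + weight * r for f, r in zip(first_pass_scores, rescorer_scores)]


def mwer_loss(first_pass_scores, rescorer_scores, word_errors, weight=1.0):
    """Expected word errors relative to the n-best average.

    Posteriors are a softmax over the combined scores of the n-best list;
    subtracting the mean error count is the usual variance-reduction step.
    """
    combined = combine(first_pass_scores, rescorer_scores, weight)
    top = max(combined)                      # subtract max for numerical stability
    exps = [math.exp(c - top) for c in combined]
    total = sum(exps)
    posteriors = [e / total for e in exps]
    mean_err = sum(word_errors) / len(word_errors)
    return sum(p * (e - mean_err) for p, e in zip(posteriors, word_errors))


def rerank(hyps, first_pass_scores, rescorer_scores, weight=1.0):
    """Return the hypothesis with the highest combined score."""
    combined = combine(first_pass_scores, rescorer_scores, weight)
    best = max(range(len(hyps)), key=combined.__getitem__)
    return hyps[best]
```

In this sketch the loss becomes negative when the rescorer shifts probability mass toward hypotheses with fewer word errors than the n-best average, which is the behavior a discriminative objective rewards. In practice the scores would come from differentiable model outputs in a deep learning framework so that the loss can be backpropagated.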


Pages: 6117-6121

Page count: 5

Related records (50 in total):

[21] Discriminative Named Entity Recognition of Speech Data using Speech Recognition Confidence
Sudoh, Katsuhito
Tsukada, Hajime
Isozaki, Hideki
INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, 2006: 337-340
[22] Incorporating Speech Recognition Confidence into Discriminative Named Entity Recognition of Speech Data
Sudoh, Katsuhito
Tsukada, Hajime
Isozaki, Hideki
COLING/ACL 2006, VOLS 1 AND 2, PROCEEDINGS OF THE CONFERENCE, 2006: 617-624
[23] Improved Mandarin Speech Recognition by Lattice Rescoring with Enhanced Tone Models
Wang, Huanliang
Qian, Yao
Soong, Frank
Zhou, Jian-Lai
Han, Jiqing
CHINESE SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, 2006, 4274: 445-+
[24] A Study on Knowledge Source Integration for Candidate Rescoring in Automatic Speech Recognition
Li, J
Tsao, Y
Lee, CH
2005 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS 1-5: SPEECH PROCESSING, 2005: 837-840
[25] Discriminative Output Coding Features for Speech Recognition
Dehzangi, Omid
Ma, Bin
Chng, Eng Siong
Li, Haizhou
2008 6TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, 2008: 89-92
[26] Jointly Optimized Discriminative Features for Speech Recognition
Ng, Tim
Zhang, Bing
Nguyen, Long
11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010: 2626-2629
[27] Improved Lattice Rescoring by Using Speech Attributes in Large Vocabulary Continuous Speech Recognition Systems
Gao, Xinglong
Zhang, Qingqing
Pan, Jielin
2013 6TH INTERNATIONAL CONGRESS ON IMAGE AND SIGNAL PROCESSING (CISP), VOLS 1-3, 2013: 143-147
[28] Discriminative Pronunciation Modeling for Dialectal Speech Recognition
Lehr, Maider
Gorman, Kyle
Shafran, Izhak
15TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2014), VOLS 1-4, 2014: 1458-1462
[29] Speech Emotion Recognition with Discriminative Feature Learning
Zhou, Huan
Liu, Kai
INTERSPEECH 2020, 2020: 4094-4097
[30] Using SVMs and Discriminative Models for Speech Recognition
Smith, ND
Gales, MJF
2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-IV, PROCEEDINGS, 2002: 77-80