Improving Mandarin End-to-End Speech Recognition With Word N-Gram Language Model

Cited by: 6
Authors
Tian, Jinchuan [1 ]
Yu, Jianwei [2 ,3 ]
Weng, Chao [2 ,3 ]
Zou, Yuexian [1 ]
Yu, Dong [3 ]
Affiliations
[1] Peking Univ, Shenzhen Grad Sch, Adv Data & Signal Proc Lab, Sch Elect & Comp Sci, Shenzhen 518055, Peoples R China
[2] Tencent AI Lab, Shenzhen, Peoples R China
[3] Tencent ASR Oteam, Shenzhen, Peoples R China
Keywords
Decoding; Lattices; Chaos; Artificial neural networks; Vocabulary; Transducers; Training; Speech recognition; language model;
DOI
10.1109/LSP.2022.3154241
Chinese Library Classification
TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline Codes
0808; 0809
Abstract
Despite the rapid progress of end-to-end (E2E) automatic speech recognition (ASR), it has been shown that incorporating external language models (LMs) into decoding can further improve the recognition performance of E2E ASR systems. To align with the modeling units adopted in E2E ASR systems, subword-level LMs (e.g., over characters or BPE units) are usually paired with current E2E ASR systems. However, subword-level LMs ignore word-level information, which may limit the strength of external LMs in E2E ASR. Although several methods have been proposed to incorporate word-level external LMs into E2E ASR, they are mainly designed for languages with clear word boundaries, such as English, and cannot be directly applied to languages like Mandarin, in which each character sequence can correspond to multiple word sequences. To this end, we propose a novel decoding algorithm in which a word-level lattice is constructed on-the-fly to consider all possible word sequences for each partial hypothesis. The LM score of a hypothesis is then obtained by intersecting the generated lattice with an external word N-gram LM. The proposed method is examined on both the Attention-based Encoder-Decoder (AED) and Neural Transducer (NT) frameworks. Experiments suggest that our method consistently outperforms subword-level LMs, including N-gram and neural network LMs. We achieve state-of-the-art results on both the Aishell-1 (CER 4.18%) and Aishell-2 (CER 5.06%) datasets and reduce CER by 14.8% relatively on a 21K-hour Mandarin dataset. Code is released.
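The core idea in the abstract can be illustrated with a minimal sketch (not the authors' released code): a Mandarin character hypothesis is expanded into its word lattice, i.e., all segmentations licensed by a word vocabulary, and the hypothesis LM score is taken as the best path score under a word N-gram LM. The vocabulary and bigram probabilities below are invented for the demo, and the lattice is enumerated naively rather than represented as a weighted FST as an efficient implementation would.

```python
# Toy illustration of word-lattice scoring for a Mandarin character hypothesis.
# Vocabulary and bigram log-probabilities are hypothetical demo values.
import math

VOCAB = {"我", "们", "我们", "去", "北", "京", "北京"}
BIGRAM = {
    ("<s>", "我们"): math.log(0.4), ("我们", "去"): math.log(0.5),
    ("去", "北京"): math.log(0.6),  ("<s>", "我"): math.log(0.2),
    ("我", "们"): math.log(0.1),
}
UNK = math.log(1e-4)  # crude back-off score for unseen bigrams

def word_lattice(chars):
    """All word segmentations of a character sequence (the lattice paths)."""
    if not chars:
        return [[]]
    paths = []
    for end in range(1, len(chars) + 1):
        word = "".join(chars[:end])
        if word in VOCAB:
            for rest in word_lattice(chars[end:]):
                paths.append([word] + rest)
    return paths

def lm_score(words):
    """Bigram LM log-score of one lattice path."""
    score, prev = 0.0, "<s>"
    for w in words:
        score += BIGRAM.get((prev, w), UNK)
        prev = w
    return score

def hypothesis_score(chars):
    """LM score of a hypothesis = best-path score through its word lattice."""
    paths = word_lattice(chars)
    return max(lm_score(p) for p in paths) if paths else UNK

# "我们去北京" has four segmentations; the best path is 我们/去/北京.
print(hypothesis_score(list("我们去北京")))
```

In the paper's setting this scoring runs incrementally during beam search on partial hypotheses, with the lattice built on-the-fly as characters are emitted; the exhaustive recursion here only conveys the intersection-with-N-gram idea.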
Pages: 812-816 (5 pages)
Related Papers (50 total)
  • [1] Residual Language Model for End-to-end Speech Recognition
    Tsunoo, Emiru
    Kashiwagi, Yosuke
    Narisetty, Chaitanya
    Watanabe, Shinji
    [J]. INTERSPEECH 2022, 2022, : 3899 - 3903
  • [2] Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition
    Meng, Zhong
    Wu, Yu
    Kanda, Naoyuki
    Lu, Liang
    Chen, Xie
    Ye, Guoli
    Sun, Eric
    Li, Jinyu
    Gong, Yifan
    [J]. INTERSPEECH 2021, 2021, : 2596 - 2600
  • [3] Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
    Amodei, Dario
    Ananthanarayanan, Sundaram
    Anubhai, Rishita
    Bai, Jingliang
    Battenberg, Eric
    Case, Carl
    Casper, Jared
    Catanzaro, Bryan
    Cheng, Qiang
    Chen, Guoliang
    Chen, Jie
    Chen, Jingdong
    Chen, Zhijie
    Chrzanowski, Mike
    Coates, Adam
    Diamos, Greg
    Ding, Ke
    Du, Niandong
    Elsen, Erich
    Engel, Jesse
    Fang, Weiwei
    Fan, Linxi
    Fougner, Christopher
    Gao, Liang
    Gong, Caixia
    Hannun, Awni
    Han, Tony
    Johannes, Lappi Vaino
    Jiang, Bing
    Ju, Cai
    Jun, Billy
    LeGresley, Patrick
    Lin, Libby
    Liu, Junjie
    Liu, Yang
    Li, Weigao
    Li, Xiangang
    Ma, Dongpeng
    Narang, Sharan
    Ng, Andrew
    Ozair, Sherjil
    Peng, Yiping
    Prenger, Ryan
    Qian, Sheng
    Quan, Zongfeng
    Raiman, Jonathan
    Rao, Vinay
    Satheesh, Sanjeev
    Seetapun, David
    Sengupta, Shubho
    [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 48, 2016, 48
  • [4] END-TO-END SPEECH RECOGNITION WITH WORD-BASED RNN LANGUAGE MODELS
    Hori, Takaaki
    Cho, Jaejin
    Watanabe, Shinji
    [J]. 2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 389 - 396
  • [5] End-to-End Speech Recognition of Tamil Language
    Changrampadi, Mohamed Hashim
    Shahina, A.
    Narayanan, M. Badri
    Khan, A. Nayeemulla
    [J]. INTELLIGENT AUTOMATION AND SOFT COMPUTING, 2022, 32 (02): : 1309 - 1323
  • [6] Semantic Data Augmentation for End-to-End Mandarin Speech Recognition
    Sun, Jianwei
    Tang, Zhiyuan
    Yin, Hengxin
    Wang, Wei
    Zhao, Xi
    Zhao, Shuaijiang
    Lei, Xiaoning
    Zou, Wei
    Li, Xiangang
    [J]. INTERSPEECH 2021, 2021, : 1269 - 1273
  • [7] End-to-End Mandarin Speech Recognition Combining CNN and BLSTM
    Wang, Dong
    Wang, Xiaodong
    Lv, Shaohe
    [J]. SYMMETRY-BASEL, 2019, 11 (05):
  • [8] IMPROVING UNSUPERVISED STYLE TRANSFER IN END-TO-END SPEECH SYNTHESIS WITH END-TO-END SPEECH RECOGNITION
    Liu, Da-Rong
    Yang, Chi-Yu
    Wu, Szu-Lin
    Lee, Hung-Yi
    [J]. 2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 640 - 647
  • [9] Improving Attention-based End-to-end ASR by Incorporating an N-gram Neural Network
    Ao, Junyi
    Ko, Tom
    [J]. 2021 12TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2021,
  • [10] An End-to-End Chinese Speech Recognition Algorithm Integrating Language Model
    Lü, Kun-Ru
    Wu, Chun-Guo
    Liang, Yan-Chun
    Yuan, Yu-Ping
    Ren, Zhi-Min
    Zhou, You
    Shi, Xiao-Hu
    [J]. Tien Tzu Hsueh Pao/Acta Electronica Sinica, 2021, 49 (11): : 2177 - 2185