A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training

Cited by: 8
Authors
Meng, Weijing [1 ,2 ]
Yolwas, Nurmemet [1 ,2 ]
Affiliations
[1] Xinjiang Multilingual Informat Technol Lab, Urumqi 830017, Peoples R China
[2] Xinjiang Univ, Coll Informat Sci & Engn, Urumqi 830017, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
automatic speech recognition; Factorized TDNN; unsupervised pre-training; speech synthesis; TEXT-TO-SPEECH;
DOI
10.3390/s23020870
CLC Number
O65 [Analytical Chemistry];
Subject Classification Codes
070302 ; 081704 ;
Abstract
Building a good speech recognition system usually requires large amounts of paired speech-text data, which poses a major challenge for low-resource languages such as Kazakh. In recent years, unsupervised pre-training has achieved strong performance in low-resource speech recognition, but it has rarely been applied to Kazakh and other Central and West Asian languages. In this paper, wav2vec 2.0 is improved by integrating a factorized TDNN layer to better preserve the temporal relationships in the speech signal before and after quantization; the resulting model is called wav2vec-F. An unsupervised pre-training strategy is used to learn latent speech representations from large amounts of unlabeled audio, applied to the cross-lingual ASR task, and optimized with a noise-contrastive binary classification objective. In addition, speech synthesis is used to boost recognition performance. Experiments show that wav2vec-F can effectively exploit unlabeled data from non-target languages and that multilingual pre-training clearly outperforms monolingual pre-training. Data augmentation with synthesized speech brings substantial gains. Compared with the baseline model, the word error rate on the LibriSpeech test-clean set is reduced by an average of 1.9%. On the Kazakh KSC test set, pre-training on Kazakh alone reduces the word error rate by 3.8%. Multilingual pre-training combined with a small amount of TTS-synthesized Kazakh speech achieves a word error rate of 8.6% on the KSC test set with only 10 h of labeled data, comparable to earlier end-to-end models trained with 30 times more labeled data.
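The noise-contrastive objective mentioned in the abstract is the core of wav2vec 2.0-style pre-training: for each masked time step, the model must identify the true quantized latent among a set of distractors by similarity to the context network's output. The following is a minimal NumPy sketch of that idea, not the paper's implementation; the `contrastive_loss` helper, dimensions, and temperature value are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors (small epsilon for stability).
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def contrastive_loss(context, target, distractors, kappa=0.1):
    """Noise-contrastive loss for one masked time step (hypothetical sketch):
    the true quantized target must be picked out among the distractors by
    temperature-scaled cosine similarity to the context representation."""
    sims = np.array([cosine(context, target)] +
                    [cosine(context, d) for d in distractors]) / kappa
    log_probs = sims - np.log(np.sum(np.exp(sims)))  # log-softmax over candidates
    return -log_probs[0]  # true target sits at index 0

rng = np.random.default_rng(0)
c = rng.standard_normal(16)
# Easy case: target matches the context exactly, distractors are random.
loss_easy = contrastive_loss(c, c, [rng.standard_normal(16) for _ in range(5)])
# Hard case: target is anti-correlated while all distractors match the context.
loss_hard = contrastive_loss(c, -c, [c.copy() for _ in range(5)])
print(loss_easy < loss_hard)
```

As expected, the loss is low when the context representation aligns with the true quantized target and high when a distractor is more similar, which is the pressure that drives the representations learned during pre-training.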
Pages: 13