A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training

Cited by: 8
Authors
Meng, Weijing [1 ,2 ]
Yolwas, Nurmemet [1 ,2 ]
Affiliations
[1] Xinjiang Multilingual Informat Technol Lab, Urumqi 830017, Peoples R China
[2] Xinjiang Univ, Coll Informat Sci & Engn, Urumqi 830017, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
automatic speech recognition; Factorized TDNN; unsupervised pre-training; speech synthesis; TEXT-TO-SPEECH;
DOI
10.3390/s23020870
CLC Number
O65 [Analytical Chemistry];
Subject Classification Codes
070302 ; 081704 ;
Abstract
Building a good speech recognition system usually requires large amounts of paired speech-text data, which poses a major challenge for low-resource languages such as Kazakh. In recent years, unsupervised pre-training has achieved strong performance in low-resource speech recognition, but it has rarely been applied to Kazakh and other Central and West Asian languages. In this paper, wav2vec 2.0 is improved by integrating a factorized TDNN layer to better preserve the temporal relationships in the speech signal before and after quantization; the resulting model is called wav2vec-F. An unsupervised pre-training strategy is used to learn latent speech representations from large amounts of unlabeled audio, applied to the cross-lingual ASR task, and optimized with a noise-contrastive binary classification objective. In addition, speech synthesis is used to boost recognition performance. Experiments show that wav2vec-F can effectively exploit unlabeled data from non-target languages and that multilingual pre-training clearly outperforms monolingual pre-training. Data augmentation with synthesized speech brings substantial gains. Compared with the baseline model, the word error rate on the LibriSpeech test-clean set is reduced by an average of 1.9%. On the Kazakh KSC test set, pre-training on Kazakh alone reduces the word error rate by 3.8%. Multilingual pre-training combined with a small amount of TTS-synthesized Kazakh speech achieves a word error rate of 8.6% on the KSC test set with only 10 h of labeled data, comparable to earlier end-to-end models trained with 30 times more labeled data.
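The noise-contrastive objective mentioned in the abstract is the core of wav2vec 2.0-style pre-training: for each masked time step, the model must identify the true quantized latent among a set of distractors by similarity to the context network's output. The following is a minimal NumPy sketch of that idea, not the paper's implementation; the `contrastive_loss` helper, dimensions, and temperature value are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors (small epsilon for stability).
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def contrastive_loss(context, target, distractors, kappa=0.1):
    """Noise-contrastive loss for one masked time step (hypothetical sketch):
    the true quantized target must be picked out among the distractors by
    temperature-scaled cosine similarity to the context representation."""
    sims = np.array([cosine(context, target)] +
                    [cosine(context, d) for d in distractors]) / kappa
    log_probs = sims - np.log(np.sum(np.exp(sims)))  # log-softmax over candidates
    return -log_probs[0]  # true target sits at index 0

rng = np.random.default_rng(0)
c = rng.standard_normal(16)
# Easy case: target matches the context exactly, distractors are random.
loss_easy = contrastive_loss(c, c, [rng.standard_normal(16) for _ in range(5)])
# Hard case: target is anti-correlated while all distractors match the context.
loss_hard = contrastive_loss(c, -c, [c.copy() for _ in range(5)])
print(loss_easy < loss_hard)
```

As expected, the loss is low when the context representation aligns with the true quantized target and high when a distractor is more similar, which is the pressure that drives the representations learned during pre-training.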
Pages: 13