A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training

Cited by: 8
Authors
Meng, Weijing [1 ,2 ]
Yolwas, Nurmemet [1 ,2 ]
Affiliations
[1] Xinjiang Multilingual Information Technology Laboratory, Urumqi 830017, People's Republic of China
[2] Xinjiang University, College of Information Science and Engineering, Urumqi 830017, People's Republic of China
Funding
National Natural Science Foundation of China;
Keywords
automatic speech recognition; Factorized TDNN; unsupervised pre-training; speech synthesis; text-to-speech;
DOI
10.3390/s23020870
Chinese Library Classification
O65 [Analytical Chemistry];
Discipline Codes
070302; 081704;
Abstract
Building a good speech recognition system usually requires a large amount of paired speech and text data, which poses a major challenge for low-resource languages such as Kazakh. In recent years, unsupervised pre-training has achieved good performance in low-resource speech recognition, but it has rarely been applied to Kazakh and other Central and West Asian languages. In this paper, wav2vec 2.0 is improved by integrating a Factorized TDNN layer to better preserve the temporal relationships in the speech representation before and after quantization; the resulting model is called wav2vec-F. An unsupervised pre-training strategy is used to learn latent speech representations from a large amount of unlabeled audio, applied to the cross-lingual ASR task, and optimized with a noise-contrastive binary classification objective. In addition, speech synthesis is used to boost speech recognition performance. The experiments show that wav2vec-F can effectively exploit unlabeled data from non-target languages, that multilingual pre-training clearly outperforms monolingual pre-training, and that data augmentation with synthesized speech yields substantial gains. Compared with the baseline model, the word error rate on the LibriSpeech test-clean set is reduced by an average of 1.9%. On the Kazakh KSC test set, pre-training on Kazakh alone reduces the word error rate by 3.8%. Multilingual pre-training combined with a small amount of TTS-synthesized Kazakh speech achieves a word error rate of 8.6% on the KSC test set with only 10 h of labeled data, comparable to previous end-to-end models that used roughly 30 times more labeled data.
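To make the architectural idea concrete, the following is a minimal PyTorch sketch of a factorized TDNN (TDNN-F) block of the kind the abstract describes inserting into wav2vec 2.0 before quantization. It is not the authors' implementation: the feature dimension (512), bottleneck width (128), kernel size, residual connection, and the simplified semi-orthogonality update are illustrative assumptions.

```python
# Minimal sketch (assumed details, not the paper's code) of a factorized TDNN
# (TDNN-F) block that could sit between a wav2vec 2.0-style feature encoder
# and its quantization module. All dimensions below are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FactorizedTDNNBlock(nn.Module):
    """A TDNN layer factorized into a dimension-reducing convolution
    (kept approximately semi-orthogonal) followed by an expanding convolution."""

    def __init__(self, feature_dim: int = 512, bottleneck_dim: int = 128,
                 kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2
        # Factor 1: project down to the bottleneck (no bias; the constrained factor).
        self.reduce = nn.Conv1d(feature_dim, bottleneck_dim, kernel_size,
                                dilation=dilation, padding=pad, bias=False)
        # Factor 2: project back up to the encoder's feature dimension.
        self.expand = nn.Conv1d(bottleneck_dim, feature_dim, kernel_size,
                                dilation=dilation, padding=pad)
        self.norm = nn.LayerNorm(feature_dim)

    def semi_orthogonal_step(self, lr: float = 0.125) -> None:
        # One simplified gradient step pulling the reduce factor toward
        # W W^T ~= scale * I (the semi-orthogonal constraint of TDNN-F),
        # intended to be called every few training iterations.
        with torch.no_grad():
            w = self.reduce.weight.view(self.reduce.out_channels, -1)
            p = w @ w.t()
            scale = torch.trace(p) / p.shape[0]
            grad = (p - scale * torch.eye(p.shape[0], device=w.device)) @ w
            w.sub_(lr * grad / scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feature_dim) latent frames from the feature encoder.
        y = x.transpose(1, 2)            # Conv1d expects (batch, channels, time)
        y = self.expand(self.reduce(y))
        y = y.transpose(1, 2)
        return self.norm(F.relu(y) + x)  # residual keeps the encoder output accessible


# The block is shape-preserving, so its output can be fed directly to the quantizer:
block = FactorizedTDNNBlock()
frames = torch.randn(4, 200, 512)        # dummy batch: 4 utterances, 200 frames
assert block(frames).shape == frames.shape
```

In this sketch the block operates on the continuous latent frames before quantization, which is one plausible reading of the abstract's goal of better preserving temporal relationships around the quantization step.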
Pages: 13