A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training

Times Cited: 8
Authors
Meng, Weijing [1 ,2 ]
Yolwas, Nurmemet [1 ,2 ]
Affiliations
[1] Xinjiang Multilingual Information Technology Laboratory, Urumqi 830017, People's Republic of China
[2] Xinjiang University, College of Information Science and Engineering, Urumqi 830017, People's Republic of China
Funding
National Natural Science Foundation of China;
Keywords
automatic speech recognition; Factorized TDNN; unsupervised pre-training; speech synthesis; text-to-speech
DOI
10.3390/s23020870
CLC Number
O65 [Analytical Chemistry];
Discipline Code
070302; 081704;
Abstract
Building a good speech recognition system usually requires a large amount of paired speech-text data, which poses a major challenge for low-resource languages such as Kazakh. In recent years, unsupervised pre-training has achieved good performance in low-resource speech recognition, but it has rarely been applied to Kazakh and other Central and West Asian languages. In this paper, wav2vec 2.0 is improved by integrating a Factorized TDNN layer that better preserves the relationship between the speech signal and the surrounding time steps before and after quantization; the resulting model is called wav2vec-F. An unsupervised pre-training strategy is used to learn latent speech representations from a large amount of unlabeled audio, applied to the cross-lingual ASR task and optimized with a noise-contrastive binary classification objective. In addition, speech synthesis is used to boost speech recognition performance. Experiments show that wav2vec-F can effectively exploit unlabeled data from non-target languages, and that multilingual pre-training clearly outperforms monolingual pre-training. Data augmentation with synthesized speech also brings substantial gains. Compared with the baseline model, the word error rate on the Librispeech test-clean set is reduced by an average of 1.9%. On the Kazakh KSC test set, pre-training on Kazakh alone reduces the word error rate by 3.8%. Multilingual pre-training combined with a small amount of TTS-synthesized Kazakh speech achieves a word error rate of 8.6% on the KSC test set with only 10 h of labeled data, comparable to the results of previous end-to-end models trained on roughly 30 times more labeled data.
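To make the two ingredients named in the abstract concrete, the following is a minimal PyTorch sketch of (a) a factorized TDNN block of the kind that could be inserted between a wav2vec 2.0 feature encoder and its quantizer, and (b) a wav2vec 2.0-style noise-contrastive objective. It is a sketch under stated assumptions, not the authors' implementation: all class names, dimensions, and hyperparameters are illustrative, the semi-orthogonal constraint used by Kaldi-style F-TDNN layers is omitted, and a real wav2vec 2.0 loss is computed only at masked positions.

    # Minimal sketch (PyTorch), illustrative only: F-TDNN block + contrastive loss.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FactorizedTDNNBlock(nn.Module):
        """1-D convolution factored into a low-rank pair (C -> R -> C), R << C."""

        def __init__(self, channels: int, bottleneck: int, kernel_size: int = 3):
            super().__init__()
            pad = kernel_size // 2
            self.down = nn.Conv1d(channels, bottleneck, kernel_size, padding=pad)
            self.up = nn.Conv1d(bottleneck, channels, kernel_size, padding=pad)
            self.norm = nn.LayerNorm(channels)

        def forward(self, x):                # x: (batch, time, channels)
            y = x.transpose(1, 2)            # -> (batch, channels, time)
            y = self.up(F.relu(self.down(y)))
            y = y.transpose(1, 2)
            return self.norm(x + y)          # residual path keeps temporal context

    def contrastive_loss(context, targets, negatives, temperature: float = 0.1):
        """wav2vec 2.0-style objective: identify the true quantized latent among distractors.

        context:   (batch, time, dim)     outputs of the context network
        targets:   (batch, time, dim)     quantized latents at the same positions
        negatives: (batch, time, K, dim)  distractor latents sampled from other time steps
        """
        candidates = torch.cat([targets.unsqueeze(2), negatives], dim=2)
        logits = F.cosine_similarity(context.unsqueeze(2), candidates, dim=-1)
        logits = logits / temperature                       # (batch, time, K + 1)
        labels = torch.zeros(logits.shape[:2], dtype=torch.long,
                             device=logits.device)          # true target sits at index 0
        return F.cross_entropy(logits.flatten(0, 1), labels.flatten())

    if __name__ == "__main__":
        B, T, C, K = 2, 50, 512, 10
        block = FactorizedTDNNBlock(channels=C, bottleneck=128)
        latents = block(torch.randn(B, T, C))               # encoder output passed through F-TDNN
        loss = contrastive_loss(latents, torch.randn(B, T, C),
                                torch.randn(B, T, K, C))
        print(latents.shape, loss.item())

In this reading, the factorized block adds cheap local temporal modelling (a bottlenecked convolution with a residual connection) ahead of quantization, while the contrastive term trains the context network to distinguish the true quantized latent from sampled negatives, which is the "noise-contrastive binary classification" idea the abstract refers to.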
Pages: 13