Hybrid end-to-end model for Kazakh speech recognition

Cited by: 12
Authors
Mamyrbayev O.Z. [1, 3]
Oralbekova D.O. [1, 2]
Alimhan K. [1, 4]
Nuranbayeva B.M. [5]
Affiliations
[1] Institute of Information and Computational Technologies CS MES RK, 28 Shevchenko Str., Almaty
[2] Satbayev University, Almaty
[3] Al-Farabi Kazakh National University, Almaty
[4] L.N. Gumilyov Eurasian National University, Satpayev Str., 2, Nur-Sultan
[5] Caspian University, Dostyk 85A, Almaty
Keywords
Attention; Automatic speech recognition; Connectionist temporal classification; End-to-end; Low-resource language
DOI
10.1007/s10772-022-09983-8
Abstract
Modern automatic speech recognition systems based on end-to-end (E2E) models achieve good recognition accuracy for languages that have large corpora with several thousand hours of speech available for training. Such models require a very large amount of training data, which is problematic for low-resource languages such as Kazakh. However, many studies have shown that combining connectionist temporal classification (CTC) with other E2E models improves system performance even with limited training data. To this end, a speech corpus of the Kazakh language was assembled and then expanded using data augmentation. Our work presents the implementation of a joint model of CTC and the attention mechanism for Kazakh speech recognition, which addresses the problem of fast decoding and training of the system. The results show that the proposed E2E model, combined with language models, improved system performance and gave the best result on our Kazakh dataset. In the experiments, the system achieved competitive results in Kazakh speech recognition. © 2022, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
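The abstract does not state how the CTC and attention branches are combined, so the following is only a minimal sketch of the interpolation commonly used in hybrid CTC/attention systems, not the authors' implementation. The function names, the dictionary keys, the weights `ctc_weight` and `lm_weight`, and all numeric values are illustrative assumptions.

```python
# Sketch under stated assumptions: weights and scores are illustrative,
# not values reported for the Kazakh system described in the abstract.

def joint_loss(ctc_loss: float, att_loss: float, ctc_weight: float = 0.3) -> float:
    """Interpolate the two training losses: w * L_ctc + (1 - w) * L_att."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


def joint_score(hyp: dict, ctc_weight: float = 0.3, lm_weight: float = 0.0) -> float:
    """Score one hypothesis from its CTC, attention and language-model log-probabilities."""
    acoustic = ctc_weight * hyp["ctc_logp"] + (1.0 - ctc_weight) * hyp["att_logp"]
    return acoustic + lm_weight * hyp["lm_logp"]


def best_hypothesis(hyps: list, ctc_weight: float = 0.3, lm_weight: float = 0.0) -> dict:
    """Return the hypothesis with the highest joint score."""
    return max(hyps, key=lambda h: joint_score(h, ctc_weight, lm_weight))


# Hypothetical log-probabilities for two candidate transcripts.
HYPS = [
    {"text": "hyp_a", "ctc_logp": -4.0, "att_logp": -2.0, "lm_logp": -3.0},
    {"text": "hyp_b", "ctc_logp": -2.5, "att_logp": -2.5, "lm_logp": -6.0},
]

if __name__ == "__main__":
    print(joint_loss(2.0, 1.0, ctc_weight=0.5))
    print(best_hypothesis(HYPS)["text"])
    print(best_hypothesis(HYPS, lm_weight=0.5)["text"])
```

In this toy setting the language-model weight changes which candidate wins, which illustrates in miniature why adding a language model can alter the decoding result.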
Pages: 261-270
Number of pages: 9
Related papers
  • [1] Development of CRF and CTC Based End-To-End Kazakh Speech Recognition System
    Oralbekova, Dina
    Mamyrbayev, Orken
    Othman, Mohamed
    Alimhan, Keylan
    Zhumazhanov, Bagashar
    Nuranbayeva, Bulbul
    [J]. INTELLIGENT INFORMATION AND DATABASE SYSTEMS, ACIIDS 2022, PT I, 2022, 13757 : 519 - 531
  • [2] An End-to-End model for Vietnamese speech recognition
    Van Huy Nguyen
    [J]. 2019 IEEE - RIVF INTERNATIONAL CONFERENCE ON COMPUTING AND COMMUNICATION TECHNOLOGIES (RIVF), 2019, : 307 - 312
  • [3] Combination of end-to-end and hybrid models for speech recognition
    Wong, Jeremy H. M.
    Gaur, Yashesh
    Zhao, Rui
    Lu, Liang
    Sun, Eric
    Li, Jinyu
    Gong, Yifan
    [J]. INTERSPEECH 2020, 2020, : 1783 - 1787
  • [4] A study of transformer-based end-to-end speech recognition system for Kazakh language
    Mamyrbayev, Orken
    Oralbekova, Dina
    Alimhan, Keylan
    Turdalykyzy, Tolganay
    Othman, Mohamed
    [J]. SCIENTIFIC REPORTS, 2022, 12 (01)
  • [5] A study of transformer-based end-to-end speech recognition system for Kazakh language
    Mamyrbayev Orken
    Oralbekova Dina
    Alimhan Keylan
    Turdalykyzy Tolganay
    Othman Mohamed
    [J]. Scientific Reports, 12
  • [6] Residual Language Model for End-to-end Speech Recognition
    Tsunoo, Emiru
    Kashiwagi, Yosuke
    Narisetty, Chaitanya
    Watanabe, Shinji
    [J]. INTERSPEECH 2022, 2022, : 3899 - 3903
  • [7] A SPELLING CORRECTION MODEL FOR END-TO-END SPEECH RECOGNITION
    Guo, Jinxi
    Sainath, Tara N.
    Weiss, Ron J.
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 5651 - 5655
  • [8] MULTILINGUAL SPEECH RECOGNITION WITH A SINGLE END-TO-END MODEL
    Toshniwal, Shubham
    Sainath, Tara N.
    Weiss, Ron J.
    Li, Bo
    Moreno, Pedro
    Weinstein, Eugene
    Rao, Kanishka
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 4904 - 4908
  • [9] Hybrid CTC/Attention Architecture for End-to-End Speech Recognition
    Watanabe, Shinji
    Hori, Takaaki
    Kim, Suyoun
    Hershey, John R.
    Hayashi, Tomoki
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2017, 11 (08) : 1240 - 1253
  • [10] END-TO-END MULTIMODAL SPEECH RECOGNITION
    Palaskar, Shruti
    Sanabria, Ramon
    Metze, Florian
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5774 - 5778