A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline

Cited by: 0
Authors
Khassanov, Yerbolat [1]
Mussakhojayeva, Saida [1]
Mirzakhmetov, Almas [1]
Adiyev, Alen [1]
Nurpeiissov, Mukhamet [1]
Varol, Huseyin Atakan [1]
Affiliation
[1] Nazarbayev Univ, Inst Smart Syst & Artificial Intelligence ISSAI, Nur Sultan, Kazakhstan
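DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract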
We present an open-source speech corpus for the Kazakh language. The Kazakh speech corpus (KSC) contains around 332 hours of transcribed audio comprising over 153,000 utterances spoken by participants from different regions and age groups, as well as both genders. It was carefully inspected by native Kazakh speakers to ensure high quality. The KSC is the largest publicly available database developed to advance various Kazakh speech and language processing applications. In this paper, we first describe the data collection and preprocessing procedures followed by a description of the database specifications. We also share our experience and challenges faced during the database construction, which might benefit other researchers planning to build a speech corpus for a low-resource language. To demonstrate the reliability of the database, we performed preliminary speech recognition experiments. The experimental results imply that the quality of audio and transcripts is promising (2.8% character error rate and 8.7% word error rate on the test set). To enable experiment reproducibility and ease the corpus usage, we also released an ESPnet recipe for our speech recognition models.
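The abstract reports a character error rate (CER) and a word error rate (WER). As a rough illustration of how such figures are commonly computed, the sketch below divides the total Levenshtein edit distance by the total reference length. It is not the authors' scoring code or the ESPnet recipe; the function names `edit_distance` and `error_rate` are hypothetical, and counting spaces as characters in CER is an assumption that differs between toolkits.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (substitutions, insertions, and deletions each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate: total edits / total reference length.

    unit="word" splits on whitespace (WER); unit="char" compares raw
    characters including spaces (CER, an assumption of this sketch).
    """
    total_edits = 0
    total_len = 0
    for ref, hyp in zip(refs, hyps):
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        total_edits += edit_distance(r, h)
        total_len += len(r)
    if total_len == 0:
        raise ValueError("reference transcripts are empty")
    return total_edits / total_len


if __name__ == "__main__":
    # Toy English example, not taken from the KSC.
    refs = ["the cat sat"]
    hyps = ["the cat sit"]
    print(f"WER: {error_rate(refs, hyps, 'word'):.3f}")
    print(f"CER: {error_rate(refs, hyps, 'char'):.3f}")
```

Because a single wrong character makes the whole word wrong, WER is typically higher than CER on the same output, which matches the pattern of the reported 8.7% WER and 2.8% CER.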
Pages: 697-706
Page count: 10