A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline

Cited by: 0

Authors:

Khassanov, Yerbolat ^{[1]}

Mussakhojayeva, Saida ^{[1]}

Mirzakhmetov, Almas ^{[1]}

Adiyev, Alen ^{[1]}

Nurpeiissov, Mukhamet ^{[1]}

Varol, Huseyin Atakan ^{[1]}

Affiliations:

[1] Nazarbayev Univ, Inst Smart Syst & Artificial Intelligence ISSAI, Nur Sultan, Kazakhstan

Source:

16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021) | 2021

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We present an open-source speech corpus for the Kazakh language. The Kazakh speech corpus (KSC) contains around 332 hours of transcribed audio comprising over 153,000 utterances spoken by participants from different regions and age groups, as well as both genders. It was carefully inspected by native Kazakh speakers to ensure high quality. The KSC is the largest publicly available database developed to advance various Kazakh speech and language processing applications. In this paper, we first describe the data collection and preprocessing procedures followed by a description of the database specifications. We also share our experience and challenges faced during the database construction, which might benefit other researchers planning to build a speech corpus for a low-resource language. To demonstrate the reliability of the database, we performed preliminary speech recognition experiments. The experimental results imply that the quality of audio and transcripts is promising (2.8% character error rate and 8.7% word error rate on the test set). To enable experiment reproducibility and ease the corpus usage, we also released an ESPnet recipe for our speech recognition models.
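
The character and word error rates quoted in the abstract are conventionally computed as the edit distance between reference and hypothesis transcripts, divided by the reference length. The following is a minimal sketch of that conventional definition, not the scoring code from the released ESPnet recipe. The function names `edit_distance` and `error_rate` are hypothetical, the example strings are made up, and counting spaces as characters for the character error rate is an assumption that scoring tools may handle differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate: total edits divided by total reference length.

    unit="word" splits on whitespace (WER); unit="char" compares raw
    characters, spaces included (CER). The latter is an assumption.
    """
    total_edits = 0
    total_len = 0
    for ref, hyp in zip(refs, hyps):
        r = ref.split() if unit == "word" else list(ref)
        h = hyp.split() if unit == "word" else list(hyp)
        total_edits += edit_distance(r, h)
        total_len += len(r)
    if total_len == 0:
        raise ValueError("reference transcripts are empty")
    return total_edits / total_len


# Made-up transcripts, for illustration only.
refs = ["a b c d"]
hyps = ["a x c d"]
wer = error_rate(refs, hyps, unit="word")
cer = error_rate(refs, hyps, unit="char")
```

Summing edits over the whole test set before dividing weights long utterances more heavily than averaging per-utterance rates would; which convention a given toolkit uses should be checked against its documentation.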


Pages: 697-706

Page count: 10
