A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline

Cited by: 0

Authors:

Khassanov, Yerbolat [1]

Mussakhojayeva, Saida [1]

Mirzakhmetov, Almas [1]

Adiyev, Alen [1]

Nurpeiissov, Mukhamet [1]

Varol, Huseyin Atakan [1]

Affiliations:

[1] Nazarbayev University, Institute of Smart Systems and Artificial Intelligence (ISSAI), Nur-Sultan, Kazakhstan

Source:

16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021) | 2021

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We present an open-source speech corpus for the Kazakh language. The Kazakh speech corpus (KSC) contains around 332 hours of transcribed audio comprising over 153,000 utterances spoken by participants from different regions and age groups, as well as both genders. It was carefully inspected by native Kazakh speakers to ensure high quality. The KSC is the largest publicly available database developed to advance various Kazakh speech and language processing applications. In this paper, we first describe the data collection and preprocessing procedures followed by a description of the database specifications. We also share our experience and challenges faced during the database construction, which might benefit other researchers planning to build a speech corpus for a low-resource language. To demonstrate the reliability of the database, we performed preliminary speech recognition experiments. The experimental results imply that the quality of audio and transcripts is promising (2.8% character error rate and 8.7% word error rate on the test set). To enable experiment reproducibility and ease the corpus usage, we also released an ESPnet recipe for our speech recognition models.

Pages: 697-706

Number of pages: 10
