A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline

Authors
Khassanov, Yerbolat [1]
Mussakhojayeva, Saida [1]
Mirzakhmetov, Almas [1]
Adiyev, Alen [1]
Nurpeiissov, Mukhamet [1]
Varol, Huseyin Atakan [1]
Affiliation
[1] Nazarbayev Univ, Inst Smart Syst & Artificial Intelligence ISSAI, Nur Sultan, Kazakhstan
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
We present an open-source speech corpus for the Kazakh language. The Kazakh speech corpus (KSC) contains around 332 hours of transcribed audio comprising over 153,000 utterances spoken by participants from different regions and age groups, as well as both genders. It was carefully inspected by native Kazakh speakers to ensure high quality. The KSC is the largest publicly available database developed to advance various Kazakh speech and language processing applications. In this paper, we first describe the data collection and preprocessing procedures followed by a description of the database specifications. We also share our experience and challenges faced during the database construction, which might benefit other researchers planning to build a speech corpus for a low-resource language. To demonstrate the reliability of the database, we performed preliminary speech recognition experiments. The experimental results imply that the quality of audio and transcripts is promising (2.8% character error rate and 8.7% word error rate on the test set). To enable experiment reproducibility and ease the corpus usage, we also released an ESPnet recipe for our speech recognition models.
Pages: 697-706
Page count: 10