KSC2: An Industrial-Scale Open-Source Kazakh Speech Corpus

Cited by: 3
Authors
Mussakhojayeva, Saida [1]
Khassanov, Yerbolat [1]
Varol, Huseyin Atakan [1]
Affiliations
[1] Nazarbayev Univ, Inst Smart Syst & Artificial Intelligence ISSAI, Nur Sultan, Kazakhstan
Source
Keywords
speech corpus; Kazakh; speech recognition; streaming ASR; spontaneous; code-switching; agglutinative; RECOGNITION
DOI
10.21437/Interspeech.2022-421
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
We present the first industrial-scale open-source Kazakh speech corpus for automatic speech recognition research and development. Our corpus subsumes two previously presented corpora: 1) Kazakh speech corpus (KSC) and 2) Kazakh text-to-speech 2 (KazakhTTS2). We also provide additional data from other sources, including television news, television and radio programs, parliament speeches, and podcasts. Our corpus, which we have named KSC2, contains over a thousand hours of high-quality transcribed data, which is triple the size of KSC. KSC2 was manually transcribed with the help of native Kazakh speakers and validated via preliminary speech recognition experiments on various evaluation sets. Moreover, it contains utterances with Kazakh-Russian code-switching, a conversational practice common among Kazakh speakers. We believe that our corpus will facilitate speech processing research for Kazakh, which is widely considered an under-resourced language. To ensure the reproducibility of experiments, we share the KSC2 corpus, training recipes, and pretrained models.
Pages: 1367-1371
Page count: 5
Related papers
  • [1] A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline
    Khassanov, Yerbolat
    Mussakhojayeva, Saida
    Mirzakhmetov, Almas
    Adiyev, Alen
    Nurpeiissov, Mukhamet
    Varol, Huseyin Atakan
    [J]. 16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021, : 697 - 706
  • [2] Developing an Open-Source Corpus of Yoruba Speech
    Gutkin, Alexander
    Demirsahin, Isin
    Kjartansson, Oddur
    Rivera, Clara
    Tubosun, Kola
    [J]. INTERSPEECH 2020, 2020, : 404 - 408
  • [3] KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset
    Mussakhojayeva, Saida
    Janaliyeva, Aigerim
    Mirzakhmetov, Almas
    Khassanov, Yerbolat
    Varol, Huseyin Atakan
    [J]. INTERSPEECH 2021, 2021, : 2786 - 2790
  • [4] KazakhTTS2: Extending the Open-Source Kazakh TTS Corpus With More Data, Speakers, and Topics
    Mussakhojayeva, Saida
    Khassanov, Yerbolat
    Varol, Huseyin Atakan
    [J]. LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 5404 - 5411
  • [5] AISHELL-1: AN OPEN-SOURCE MANDARIN SPEECH CORPUS AND A SPEECH RECOGNITION BASELINE
    Bu, Hui
    Du, Jiayu
    Na, Xingyu
    Wu, Bengu
    Zheng, Hao
    [J]. 2017 20TH CONFERENCE OF THE ORIENTAL CHAPTER OF THE INTERNATIONAL COORDINATING COMMITTEE ON SPEECH DATABASES AND SPEECH I/O SYSTEMS AND ASSESSMENT (O-COCOSDA), 2017, : 58 - 62
  • [6] Open-source industrial-scale module simulation: Paving the way towards the right configuration choice for membrane distillation
    Dong, Guangxi
    Cha-Umpong, Withita
    Hou, Jingwei
    Ji, Chao
    Chen, Vicki
    [J]. DESALINATION, 2019, 464 : 48 - 62
  • [7] Open-Source Boundary-Annotated Corpus for Arabic Speech and Language Processing
    Brierley, Claire
    Sawalha, Majdi
    Atwell, Eric
    [J]. LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2012, : 1011 - 1016
  • [8] An open-source data acquisition system for laboratory and industrial scale applications
    Niehaus, Konstantin
    Westhoff, Andreas
    [J]. MEASUREMENT SCIENCE AND TECHNOLOGY, 2023, 34 (02)
  • [9] Open-source benefits for industrial controllers
    Dehner, Bill
    [J]. Control Engineering, 2020, 67 (05): : 32 - 33
  • [10] Open{WSN| Mote}: Open-Source Industrial IoT
    Watteyne, Thomas
    Vilajosana, Xavier
    Tuset, Pere
    [J]. ERCIM NEWS, 2015, (101): : 11 - 12