Building a Speech and Text Corpus of Turkish: Large Corpus Collection with Initial Speech Recognition Results

Cited by: 7
Authors
Polat, Huseyin [1]
Oyucu, Saadin [1]
Affiliations
[1] Gazi Univ, Fac Technol, Dept Comp Engn, TR-06560 Ankara, Turkey
Source
SYMMETRY-BASEL | 2020 / Vol. 12 / Issue 02
Keywords
automatic speech recognition; speech corpus; text corpus; data acquisition; multi-layer neural network; natural language processing; BROADCAST NEWS; MODELS;
DOI
10.3390/sym12020290
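Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code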
07 ; 0710 ; 09 ;
Abstract
To build automatic speech recognition (ASR) systems with a low word error rate (WER), a large speech and text corpus is needed. Corpus preparation is the first step required for developing an ASR system for a language with few argument speech documents available. Turkish is a language with limited resources for ASR. Therefore, developing a Turkish transcribed speech corpus that is symmetric to the corpora of high-resource languages is crucial for improving and promoting Turkish speech recognition activities. In this study, we constructed a viable alternative to classical transcribed corpus preparation techniques for collecting Turkish speech data. The presented approach used three different methods. In the first step, subtitles, which are mainly supplied for people with hearing difficulties, were used as transcriptions for the speech utterances obtained from movies. In the second step, data were collected via a mobile application. In the third step, a transfer learning approach was applied to the Grand National Assembly of Turkey session records (videotext). We also provide the initial speech recognition results of artificial neural network and Gaussian mixture model-based acoustic models for Turkish. The models were trained on the newly collected corpus and on other existing corpora published by the Linguistic Data Consortium. In light of the test results on the other existing corpora, the current study showed the relative contribution of corpus variability in a symmetric speech recognition task. The decrease in WER after including the new corpus was more evident with increased verified data size, compensating for the status of Turkish as a low-resource language. The results also show the importance of the corpus and the language model for the success of a Turkish ASR system, which is relevant for further studies.
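The abstract reports results in terms of word error rate (WER). As a minimal sketch of the standard definition (word-level edit distance divided by the number of reference words), and not the scoring tool used by the authors, WER can be computed as follows. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Assumption: words are separated by whitespace; no text normalization
    (case folding, punctuation removal) is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One reference word is missing from the hypothesis.
    print(word_error_rate("bu bir test cümlesi", "bu test cümlesi"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many insertions.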
Pages: 19
Related papers
50 in total
  • [31] Dictionary Speech, Phraseological Mold and Text Corpus
    Zhu, Lichao
    LANGAGES, 2022, (225) : 127 - +
  • [32] A Large Scale Speech Sentiment Corpus
    Chen, Eric Y.
    Lu, Zhiyun
    Xu, Hao
    Cao, Liangliang
    Zhang, Yu
    Fan, James
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 6549 - 6555
  • [33] Chhattisgarhi speech corpus for research and development in automatic speech recognition
    Londhe, Narendra D.
    Kshirsagar, Ghanahshyam B.
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2018, 21 (02) : 193 - 210
  • [34] Crowd-Sourced, Automatic Speech-Corpora Collection - Building the Romanian Anonymous Speech Corpus
    Dumitrescu, Stefan Daniel
    Boros, Tiberiu
    Ion, Radu
    LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2014,
  • [35] RSC: A Romanian Read Speech Corpus for Automatic Speech Recognition
    Georgescu, Alexandru-Lucian
    Cucu, Horia
    Buzo, Andi
    Burileanu, Corneliu
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 6606 - 6612
  • [36] Bangladeshi Bangla speech corpus for automatic speech recognition research
    Kibria, Shafkat
    Samin, Ahnaf Mozib
    Kobir, M. Humayon
    Rahman, M. Shahidur
    Selim, M. Reza
    Iqbal, M. Zafar
    SPEECH COMMUNICATION, 2022, 136 : 84 - 97
  • [38] KsponSpeech: Korean Spontaneous Speech Corpus for Automatic Speech Recognition
    Bang, Jeong-Uk
    Yun, Seung
    Kim, Seung-Hi
    Choi, Mu-Yeol
    Lee, Min-Kyu
    Kim, Yeo-Jeong
    Kim, Dong-Hyun
    Park, Jun
    Lee, Young-Jik
    Kim, Sang-Hun
    APPLIED SCIENCES-BASEL, 2020, 10 (19): : 1 - 17
  • [39] COLLECTION AND ANNOTATION OF MALAY CONVERSATIONAL SPEECH CORPUS
    Chong, Tze Yuang
    Xiao, Xiong
    Tan, Tien-Ping
    Chng, Eng Siong
    Li, Haizhou
    2012 INTERNATIONAL CONFERENCE ON SPEECH DATABASE AND ASSESSMENTS, 2012, : 30 - 35
  • [40] An automatic speech recognition system for spontaneous Punjabi speech corpus
    Kumar Y.
    Singh N.
    International Journal of Speech Technology, 2017, 20 (2) : 297 - 303