Building a Speech and Text Corpus of Turkish: Large Corpus Collection with Initial Speech Recognition Results

Cited by: 7
Authors
Polat, Huseyin [1]
Oyucu, Saadin [1]
Affiliations
[1] Gazi Univ, Fac Technol, Dept Comp Engn, TR-06560 Ankara, Turkey
Source
SYMMETRY-BASEL | 2020 / Vol. 12 / Issue 02
Keywords
automatic speech recognition; speech corpus; text corpus; data acquisition; multi-layer neural network; natural language processing; BROADCAST NEWS; MODELS;
DOI
10.3390/sym12020290
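Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Subject classification codes
07; 0710; 09;
Abstract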
To build automatic speech recognition (ASR) systems with a low word error rate (WER), a large speech and text corpus is needed. Corpus preparation is the first step in developing an ASR system for a language with few speech documents available. Turkish is a language with limited resources for ASR. Therefore, developing a symmetric Turkish transcribed speech corpus in line with the corpora of high-resource languages is crucial for improving and promoting Turkish speech recognition activities. In this study, we constructed a viable alternative to classical transcribed corpus preparation techniques for collecting Turkish speech data. The presented approach used three different methods. First, subtitles, which are mainly supplied for people with hearing difficulties, were used as transcriptions for the speech utterances obtained from movies. Second, data were collected via a mobile application. Third, a transfer learning approach was applied to the Grand National Assembly of Turkey session records (videotext). We also provide initial speech recognition results for Turkish from acoustic models based on artificial neural networks and Gaussian mixture models. The models were trained on the newly collected corpus and on other existing corpora published by the Linguistic Data Consortium. In light of the test results on the other existing corpora, the study showed the relative contribution of corpus variability in a symmetric speech recognition task. The decrease in WER after including the new corpus became more evident as the size of the verified data increased, compensating for the status of Turkish as a low-resource language. The results also show the importance of the corpus and the language model to the success of a Turkish ASR system, which can inform further studies.
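The abstract reports its results as word error rate (WER). As a point of reference, the sketch below shows the standard WER definition: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. It is a minimal illustration under stated assumptions, not the authors' evaluation code; the function name `word_error_rate` and the sample sentences are hypothetical, and whitespace tokenization is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Assumption: words are separated by whitespace; no text normalization is done.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up example sentences, not taken from the corpus described above.
    reference = "bugün hava çok güzel"
    hypothesis = "bugün hava güzel"
    print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.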
Pages: 19
Related papers
50 in total
  • [1] SloParl - Slovenian Parliamentary speech and text corpus for large vocabulary continuous speech recognition
    Zgank, Andrej
    Rotovnik, Tomaz
    Grasic, Matej
    Kos, Marko
    Vlaj, Damjan
    Kacic, Zdravko
    INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, 2006, : 197 - 200
  • [2] Urdu Speech Corpus and Preliminary Results on Speech Recognition
    Ali, Hazrat
    Ahmad, Nasir
    Hafeez, Abdul
    ENGINEERING APPLICATIONS OF NEURAL NETWORKS, EANN 2016, 2016, 629 : 317 - 325
  • [3] On building phonetically and prosodically rich speech corpus for text-to-speech synthesis
    Matousek, Jindrich
    Romportl, Jan
    PROCEEDINGS OF THE SECOND IASTED INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE, 2006, : 442 - +
  • [4] An open and free Speech Corpus for Speaker Recognition: The FSCSR Speech Corpus
    Bouziane, Ayoub
    Kadi, Houda
    Hourri, Soufiane
    Kharroubi, Jamal
    2016 11TH INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS: THEORIES AND APPLICATIONS (SITA), 2016,
  • [5] Speech Command Recognition: Text-to-Speech and Speech Corpus Scraping Are All You Need
    Kuzdeuov, Askat
    Nurgaliyev, Shakhizat
    Turmakhan, Diana
    Laiyk, Nurkhan
    Varol, Huseyin Atakan
    2023 3RD INTERNATIONAL CONFERENCE ON ROBOTICS, AUTOMATION AND ARTIFICIAL INTELLIGENCE, RAAI 2023, 2023, : 286 - 291
  • [6] METU Turkish microphone speech corpus
    Salon, Ozgul
    Ciloglu, Tolga
    Demirekler, Mubeccel
    2006 IEEE 14TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS, VOLS 1 AND 2, 2006, : 33 - +
  • [7] Corpus for automatic speech recognition
    Adda-Decker, Martine
    REVUE FRANCAISE DE LINGUISTIQUE APPLIQUEE, 2007, 12 (01): : 71 - 84
  • [8] The Makerere Radio Speech Corpus: A Luganda Radio Corpus for Automatic Speech Recognition
    Mukiibi, Jonathan
    Katumba, Andrew
    Nakatumba-Nabende, Joyce
    Hussein, Ali
    Meyer, Josh
    LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 1945 - 1954
  • [9] Development of Text and Speech Corpus for Designing the Multilingual Recognition System
    Bansal, Shweta
    Agrawal, Shyam S.
    2018 ORIENTAL COCOSDA - INTERNATIONAL CONFERENCE ON SPEECH DATABASE AND ASSESSMENTS, 2018, : 1 - 7
  • [10] Modern Arabic speech corpus for Text to Speech synthesis
    Oumaima, Zine
    Meziane, Abdelouafi
    2020 IEEE INTERNATIONAL CONFERENCE ON TECHNOLOGY MANAGEMENT, OPERATIONS AND DECISIONS (ICTMOD), 2020,