Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition

Authors
Zevallos, Rodolfo [1, 2]
Camacho, Luis [1, 2]
Melgarejo, Nelsi [1, 2]
Affiliations
[1] Pontif Catholica Univ Peru, Lima, Peru
[2] Pompeu Fabra Univ, Barcelona, Spain
Keywords
Speech Corpus; Speech Recognition; Low-resource Languages
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
The Huqariq corpus is a multilingual collection of speech from native Peruvian languages. The transcribed corpus is intended for the research and development of speech technologies to preserve endangered languages in Peru. Huqariq is primarily designed for the development of automatic speech recognition, language identification and text-to-speech tools. In order to achieve corpus collection sustainably, we employ the crowdsourcing methodology. Huqariq includes four native languages of Peru, and it is expected that by the end of the year 2022, it can reach up to 20 native languages out of the 48 native languages in Peru. The corpus has 220 hours of transcribed audio recorded by more than 500 volunteers, making it the largest speech corpus for native languages in Peru. In order to verify the quality of the corpus, we present speech recognition experiments using 220 hours of fully transcribed audio.
Pages: 5029-5034
Page count: 6