Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition

Cited by: 0

Authors:

Zevallos, Rodolfo [1,2]

Camacho, Luis [1,2]

Melgarejo, Nelsi [1,2]

Affiliations:

[1] Pontif Catholica Univ Peru, Lima, Peru

[2] Pompeu Fabra Univ, Barcelona, Spain

Source:

LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2022

Keywords:

Speech Corpus; Speech Recognition; Low-resource Languages

DOI:

Not available

Chinese Library Classification (CLC) number:

TP39 [Computer applications]

Discipline classification codes:

081203; 0835

Abstract:

The Huqariq corpus is a multilingual collection of speech from native Peruvian languages. The transcribed corpus is intended for the research and development of speech technologies to preserve endangered languages in Peru. Huqariq is primarily designed for the development of automatic speech recognition, language identification, and text-to-speech tools. To make corpus collection sustainable, we use crowdsourcing. Huqariq currently includes four native languages of Peru and is expected to cover up to 20 of Peru's 48 native languages by the end of 2022. The corpus has 220 hours of transcribed audio recorded by more than 500 volunteers, making it the largest speech corpus for native languages in Peru. To verify the quality of the corpus, we present speech recognition experiments using the 220 hours of fully transcribed audio.

Pages: 5029-5034

Page count: 6
