Common Voice: A Massively-Multilingual Speech Corpus

Cited by: 0
Authors
Ardila, Rosana [1]
Branson, Megan [1]
Davis, Kelly [1]
Henretty, Michael [4]
Kohler, Michael [4]
Meyer, Josh [3]
Morais, Reuben [1]
Saunders, Lindsay [1]
Tyers, Francis M. [2]
Weber, Gregor [1]
Affiliations
[1] Mozilla, Bloomington, IN 47408 USA
[2] Indiana Univ, Bloomington, IN USA
[3] Artie Inc, Bloomington, IN USA
[4] Various Cities, Los Angeles, CA USA
Funding
National Science Foundation (USA)
Keywords
spoken corpus; Automatic Speech Recognition; low-resource languages;
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
The Common Voice corpus is a massively-multilingual collection of transcribed speech intended for speech technology research and development. Common Voice is designed for Automatic Speech Recognition purposes but can be useful in other domains (e.g. language identification). To achieve scale and sustainability, the Common Voice project employs crowdsourcing for both data collection and data validation. The most recent release includes 29 languages, and as of November 2019 there are a total of 38 languages collecting data. Over 50,000 individuals have participated so far, resulting in 2,500 hours of collected audio. To our knowledge this is the largest audio corpus in the public domain for speech recognition, both in terms of number of hours and number of languages. As an example use case for Common Voice, we present speech recognition experiments using Mozilla's DeepSpeech Speech-to-Text toolkit. By applying transfer learning from a source English model, we find an average Character Error Rate improvement of 5.99 +/- 5.48 for twelve target languages (German, French, Italian, Turkish, Catalan, Slovenian, Welsh, Irish, Breton, Tatar, Chuvash, and Kabyle). For most of these languages, these are the first ever published results on end-to-end Automatic Speech Recognition.
Pages: 4218-4222
Page count: 5
Related papers
50 in total
  • [1] CVSS Corpus and Massively Multilingual Speech-to-Speech Translation
    Jia, Ye
    Ramanovich, Michelle Tadmor
    Wang, Quan
    Zen, Heiga
    [J]. LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 6691 - 6703
  • [2] ParCourE: A Parallel Corpus Explorer for a Massively Multilingual Corpus
    Imani, Ayyoob
    Sabet, Masoud Jalili
    Duller, Philipp
    Cysouw, Michael
    Schuetze, Hinrich
    [J]. ACL-IJCNLP 2021: THE JOINT CONFERENCE OF THE 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING: PROCEEDINGS OF THE SYSTEM DEMONSTRATIONS, 2021, : 63 - 72
  • [3] Massively Multilingual Adversarial Speech Recognition
    Adams, Oliver
    Wiesner, Matthew
    Watanabe, Shinji
    Yarowsky, David
    [J]. 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, 2019, : 96 - 108
  • [4] A Multilingual to Polyglot Speech Synthesizer for Indian Languages Using a Voice-Converted Polyglot Speech Corpus
    P. Vijayalakshmi
    B. Ramani
    M. P. Actlin Jeeva
    T. Nagarajan
    [J]. Circuits, Systems, and Signal Processing, 2018, 37 : 2142 - 2163
  • [5] A Multilingual to Polyglot Speech Synthesizer for Indian Languages Using a Voice-Converted Polyglot Speech Corpus
    Vijayalakshmi, P.
    Ramani, B.
    Jeeva, M. P. Actlin
    Nagarajan, T.
    [J]. CIRCUITS SYSTEMS AND SIGNAL PROCESSING, 2018, 37 (05) : 2142 - 2163
  • [6] CoVoST 2 and Massively Multilingual Speech Translation
    Wang, Changhan
    Wu, Anne
    Gu, Jiatao
    Pino, Juan
    [J]. INTERSPEECH 2021, 2021, : 2247 - 2251
  • [7] Euronews: a multilingual speech corpus for ASR
    Gretter, Roberto
    [J]. LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2014, : 2635 - 2638
  • [8] Multilingual Speech Synthesis for Voice Cloning
    Seong, Jiwon
    Lee, WooKey
    Lee, Suan
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP 2021), 2021, : 313 - 316
  • [9] PSEUDO-LABELING FOR MASSIVELY MULTILINGUAL SPEECH RECOGNITION
    Lugosch, Loren
    Likhomanenko, Tatiana
    Synnaeve, Gabriel
    Collobert, Ronan
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7687 - 7691
  • [10] The Multilingual TEDx Corpus for Speech Recognition and Translation
    Salesky, Elizabeth
    Wiesner, Matthew
    Bremerman, Jacob
    Cattoni, Roldano
    Negri, Matteo
    Turchi, Marco
    Oard, Douglas W.
    Post, Matt
    [J]. INTERSPEECH 2021, 2021, : 3655 - 3659