Massively Multilingual Pronunciation Mining with WikiPron

Cited by: 0
Authors
Lee, Jackson L.
Ashby, Lucas F. E. [1]
Garza, M. Elizabeth [1]
Lee-Sikka, Yeonju [1]
Miller, Sean [1]
Wong, Alan [1]
McCarthy, Arya D. [2]
Gorman, Kyle [1]
Affiliations
[1] CUNY, Grad Ctr, New York, NY 10021 USA
[2] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
Keywords
speech; pronunciation; grapheme-to-phoneme; g2p; MODELS;
DOI
Not available
Chinese Library Classification
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
We introduce WikiPron, an open-source command-line tool for extracting pronunciation data from Wiktionary, a collaborative multilingual online dictionary. We first describe the design and use of WikiPron. We then discuss the challenges faced in scaling this tool to create an automatically generated database of 1.7 million pronunciations from 165 languages. Finally, we validate the pronunciation database by using it to train and evaluate a collection of generic grapheme-to-phoneme models. The software, pronunciation data, and models are all made available under permissive open-source licenses.
Pages: 4223-4228
Page count: 6