Massively Multilingual Pronunciation Mining with WikiPron

Cited by: 0
Authors
Lee, Jackson L.
Ashby, Lucas F. E. [1]
Garza, M. Elizabeth [1]
Lee-Sikka, Yeonju [1]
Miller, Sean [1]
Wong, Alan [1]
McCarthy, Arya D. [2]
Gorman, Kyle [1]
Affiliations
[1] CUNY, Grad Ctr, New York, NY 10021 USA
[2] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
Keywords
speech; pronunciation; grapheme-to-phoneme; g2p; MODELS;
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
We introduce WikiPron, an open-source command-line tool for extracting pronunciation data from Wiktionary, a collaborative multilingual online dictionary. We first describe the design and use of WikiPron. We then discuss the challenges faced in scaling this tool to create an automatically generated database of 1.7 million pronunciations from 165 languages. Finally, we validate the pronunciation database by using it to train and evaluate a collection of generic grapheme-to-phoneme models. The software, pronunciation data, and models are all made available under permissive open-source licenses.
Pages: 4223-4228
Number of pages: 6