A Corpus for Large-Scale Phonetic Typology

Cited by: 0
Authors
Salesky, Elizabeth [1]
Chodroff, Eleanor [2]
Pimentel, Tiago [3]
Wiesner, Matthew [1]
Cotterell, Ryan [3, 4]
Black, Alan W. [5]
Eisner, Jason [1]
Affiliations
[1] Johns Hopkins Univ, Baltimore, MD 21218 USA
[2] Univ York, York, N Yorkshire, England
[3] Univ Cambridge, Cambridge, England
[4] Swiss Fed Inst Technol, Zurich, Switzerland
[5] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Keywords
VARIABILITY; DISPERSION; SHAPES
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
A major hurdle in data-driven research on typology is having sufficient data in many languages to draw meaningful conclusions. We present VoxClamantis V1.0, the first large-scale corpus for phonetic typology, with aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants. Access to such data can greatly facilitate investigation of phonetic typology at a large scale and across many languages. However, it is non-trivial and computationally intensive to obtain such alignments for hundreds of languages, many of which have few to no resources presently available. We describe the methodology to create our corpus, discuss caveats with current methods and their impact on the utility of this data, and illustrate possible research directions through a series of case studies on the 48 highest-quality readings. Our corpus and scripts are publicly available for non-commercial use at https://voxclamantisproject.github.io.
Pages: 4526-4546
Page count: 21