Mandarin-English code-switching speech corpus in South-East Asia: SEAME

Cited by: 30
Authors
Lyu, Dau-Cheng [1]
Tan, Tien-Ping [4]
Chng, Eng-Siong [1,2]
Li, Haizhou [1,2,3,5]
Affiliations
[1] Nanyang Technol Univ, Temasek Labs, Singapore 639798, Singapore
[2] Nanyang Technol Univ, Sch Comp Engn, Singapore 639798, Singapore
[3] Inst Infocomm Res, Singapore 138632, Singapore
[4] Univ Sains Malaysia, Sch Comp Sci, USM 11800, Penang, Malaysia
[5] Univ New S Wales, Sydney, NSW 2052, Australia
Keywords
Code-switching speech; Spontaneous spoken corpus development; Mandarin-English; Speech recognition; Language recognition; LANGUAGE IDENTIFICATION; RECOGNITION;
DOI
10.1007/s10579-015-9303-x
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
This paper introduces the South East Asia Mandarin-English corpus, a 63-h spontaneous Mandarin-English code-switching transcribed speech corpus suitable for LVCSR and language change detection/identification research. The corpus was recorded in unscripted interview and conversational settings from 157 Singaporean and Malaysian speakers who spoke a mixture of Mandarin and English within a single sentence. About 82% of the transcribed utterances are intra-sentential code-switching speech, and the corpus will be released by LDC in 2015. This paper presents an analysis of the code-switching statistics of the corpus, such as the duration of monolingual segments and the frequency of language turns in code-switch utterances. We also summarize the development effort, including details such as the processing time for transcription, validation and language boundary labelling. Lastly, we present textual analyses of code-switch segments, examining the word length of monolingual segments in code-switch utterances and the most common single words and two-word phrases in such segments.
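The abstract mentions statistics such as the number of language turns and the length of monolingual segments in a code-switch utterance. The following is a minimal Python sketch of how such counts could be computed from a transcript line. It is not the authors' procedure: the function names are hypothetical, language is guessed from script alone (a CJK ideograph is treated as Mandarin, a run of ASCII letters as English), and Mandarin is counted per character because no word segmenter is applied.

```python
import re

# Assumption: script decides language. One CJK ideograph = one Mandarin token,
# one run of ASCII letters (with optional apostrophe part) = one English token.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z]+(?:'[A-Za-z]+)?")


def tag_tokens(utterance):
    """Return a list of (token, language) pairs, language being 'zh' or 'en'."""
    tokens = []
    for match in _TOKEN.finditer(utterance):
        tok = match.group(0)
        lang = "zh" if "\u4e00" <= tok[0] <= "\u9fff" else "en"
        tokens.append((tok, lang))
    return tokens


def monolingual_segments(utterance):
    """Group consecutive same-language tokens into (language, tokens) segments."""
    segments = []
    for tok, lang in tag_tokens(utterance):
        if segments and segments[-1][0] == lang:
            segments[-1][1].append(tok)
        else:
            segments.append((lang, [tok]))
    return segments


def language_turns(utterance):
    """Count the switches between languages inside one utterance."""
    return max(len(monolingual_segments(utterance)) - 1, 0)


def is_intra_sentential(utterance):
    """True if the utterance contains at least one language turn."""
    return language_turns(utterance) > 0


if __name__ == "__main__":
    example = "我 觉得 这个 project 很 interesting"
    for lang, toks in monolingual_segments(example):
        print(lang, len(toks), toks)
    print("turns:", language_turns(example))
```

Applied to every transcribed utterance, `is_intra_sentential` would give a rough share of code-switch utterances, and the segment lengths from `monolingual_segments` a rough length distribution. Real transcripts would also need handling for fillers, numerals, and other markup that this sketch ignores.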
Pages: 581-600
Page count: 20
Related papers
50 in total
  • [1] SEAME: a Mandarin-English Code-switching Speech Corpus in South-East Asia
    Lyu, Dau-Cheng
    Tan, Tien-Ping
    Chng, Eng-Siong
    Li, Haizhou
    [J]. 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 1986 - +
  • [2] Mandarin–English code-switching speech corpus in South-East Asia: SEAME
    Dau-Cheng Lyu
    Tien-Ping Tan
    Eng-Siong Chng
    Haizhou Li
    [J]. Language Resources and Evaluation, 2015, 49 : 581 - 600
  • [3] A Review of the Mandarin-English Code-switching Corpus: SEAME
    Lee, Grandee
    Ho, Thi-Nga
    Chng, Eng-Siong
    Li, Haizhou
    [J]. 2017 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2017, : 210 - 213
  • [4] A Mandarin-English Code-Switching Corpus
    Li, Ying
    Yu, Yue
    Fung, Pascale
    [J]. LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2012, : 2515 - 2519
  • [5] Mandarin-English Code-switching Speech Recognition
    Xu, Haihua
    Van Tung Pham
    Kyaw, Zin Tun
    Lim, Zhi Hao
    Chng, Eng Siong
    Li, Haizhou
    [J]. 19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 554 - 555
  • [6] Pronunciation augmentation for Mandarin-English code-switching speech recognition
    Long, Yanhua
    Wei, Shuang
    Lian, Jie
    Li, Yijie
    [J]. EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2021, 2021 (01)
  • [7] Pronunciation augmentation for Mandarin-English code-switching speech recognition
    Yanhua Long
    Shuang Wei
    Jie Lian
    Yijie Li
    [J]. EURASIP Journal on Audio, Speech, and Music Processing, 2021
  • [8] TALCS: AN OPEN-SOURCE MANDARIN-ENGLISH CODE-SWITCHING CORPUS AND A SPEECH RECOGNITION BASELINE
    Li, Chengfei
    Deng, Shuhao
    Wang, Yaoping
    Wang, Guangjing
    Gong, Yaguang
    Chen, Changbin
    Bai, Jinfeng
    [J]. INTERSPEECH 2022, 2022, : 1741 - 1745
  • [9] NON-AUTOREGRESSIVE MANDARIN-ENGLISH CODE-SWITCHING SPEECH RECOGNITION
    Chuang, Shun-Po
    Chang, Heng-Jui
    Huang, Sung-Feng
    Lee, Hung-yi
    [J]. 2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 465 - 472
  • [10] Acoustic data augmentation for Mandarin-English code-switching speech recognition
    Long, Yanhua
    Li, Yijie
    Zhang, Qiaozheng
    Wei, Shuang
    Ye, Hong
    Yang, Jichen
    [J]. APPLIED ACOUSTICS, 2020, 161