Synthesising isiZulu-English code-switch bigrams using word embeddings

Cited by: 10
Authors
van der Westhuizen, Ewald [1]
Niesler, Thomas [1]
Affiliations
[1] Stellenbosch Univ, Dept Elect & Elect Engn, Stellenbosch, South Africa
Keywords
code-switching; word vectors; word embeddings; Zulu; IsiZulu; spontaneous;
DOI
10.21437/Interspeech.2017-1437
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Code-switching is prevalent among South African speakers, and presents a challenge to automatic speech recognition systems. It is predominantly a spoken phenomenon and generally does not occur in textual form. Therefore, a particularly serious challenge is the extreme lack of training material for language modelling. We investigate the use of word embeddings to synthesise isiZulu-to-English code-switch bigrams with which to augment such sparse language model training data. A variety of word embeddings are trained on a monolingual English web text corpus, and subsequently queried to synthesise code-switch bigrams. Our evaluation is performed on language models trained on a new, although small, English-isiZulu code-switch corpus compiled from South African soap operas. This data is characterised by fast, spontaneously spoken speech containing frequent code-switching. We show that the augmentation of the training data with code-switched bigrams synthesised in this way leads to a reduction in perplexity.
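The abstract does not spell out how the embeddings are queried, so the following is only a minimal sketch of one plausible reading: for each observed isiZulu-to-English bigram, look up the English words nearest to the observed English word in embedding space and pair them with the same isiZulu word. The toy vectors, the placeholder tokens zu_word_a and zu_word_b, and the function names nearest_neighbours and synthesise_bigrams are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

# Toy English embeddings; the values are invented for illustration only.
embeddings = {
    "phone":  np.array([0.9, 0.1, 0.0]),
    "laptop": np.array([0.8, 0.2, 0.1]),
    "happy":  np.array([0.0, 0.9, 0.3]),
    "glad":   np.array([0.1, 0.8, 0.4]),
}


def nearest_neighbours(word, k=1):
    """Return the k English words closest to `word` by cosine similarity."""
    query = embeddings[word]
    scores = {}
    for other, vec in embeddings.items():
        if other == word:
            continue
        cosine = np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec))
        scores[other] = float(cosine)
    return sorted(scores, key=scores.get, reverse=True)[:k]


def synthesise_bigrams(seen_bigrams, k=1):
    """Pair each isiZulu word with English neighbours of its observed partner.

    Bigrams already present in the training data are not emitted again.
    """
    seen = set(seen_bigrams)
    synthesised = []
    for zulu_word, english_word in seen_bigrams:
        if english_word not in embeddings:
            continue  # no vector available, so nothing to query
        for neighbour in nearest_neighbours(english_word, k):
            candidate = (zulu_word, neighbour)
            if candidate not in seen and candidate not in synthesised:
                synthesised.append(candidate)
    return synthesised


# Placeholder tokens stand in for isiZulu words observed before a switch.
observed = [("zu_word_a", "phone"), ("zu_word_b", "happy")]
new_bigrams = synthesise_bigrams(observed)
print(new_bigrams)
```

In such a scheme the synthesised bigrams would be added to the sparse code-switched training text before the language model is estimated; how many neighbours to keep per word (k here) is a tuning choice that this sketch leaves open.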
Pages: 72-76
Page count: 5