ArzEn: A Speech Corpus for Code-switched Egyptian Arabic-English

Cited by: 0
Authors
Hamed, Injy [1, 2]
Vu, Ngoc Thang [2]
Abdennadher, Slim [1]
Affiliations
[1] German Univ Cairo, Comp Sci Dept, Cairo, Egypt
[2] Univ Stuttgart, Inst Nat Language Proc, Stuttgart, Germany
Keywords
Arabic-English; Dialectal Egyptian Arabic; code-switching; speech corpus; spontaneous speech
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
In this paper, we present our ArzEn corpus, an Egyptian Arabic-English code-switching (CS) spontaneous speech corpus. The corpus was collected through informal interviews with 38 Egyptian bilingual university students and employees, held in a soundproof room. A total of 12 hours were recorded, transcribed, validated, and sentence-segmented. The corpus is designed mainly for use in Automatic Speech Recognition (ASR) systems; however, it also provides a useful resource for analyzing the CS phenomenon from linguistic, sociological, and psychological perspectives. We first discuss the CS phenomenon in Egypt and the factors that gave rise to the current language. We then provide a detailed description of how the corpus was collected, giving an overview of the participants involved. We also present statistics on the CS in the corpus, as well as a summary of the effort exerted in the corpus development, in terms of the number of hours required for transcription, validation, segmentation, and speaker annotation. Finally, we discuss some factors contributing to the complexity of the corpus, as well as Arabic-English CS behaviour that could pose potential challenges to ASR systems.
Pages: 4237-4246
Number of pages: 10
Related papers
50 in total
  • [41] Modeling the auxiliary phrase asymmetry in code-switched Spanish-English
    Tsoukala, Chara
    Frank, Stefan L.
    Van den Bosch, Antal
    Kroff, Jorge Valdes
    Broersma, Mirjam
    BILINGUALISM-LANGUAGE AND COGNITION, 2021, 24 (02) : 271 - 280
  • [42] Code-switched English Pronunciation Modeling for Swahili Spoken Term Detection
    Kleynhans, Neil
    Hartman, William
    van Niekerk, Daniel
    van Heerden, Charl
    Schwartz, Rich
    Tsakalidis, Stavros
    Davel, Marelie
    SLTU-2016 5TH WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGIES FOR UNDER-RESOURCED LANGUAGES, 2016, 81 : 128 - 135
  • [43] Word recognition of code-switched words by Chinese-English bilinguals
    Li, P
    JOURNAL OF MEMORY AND LANGUAGE, 1996, 35 (06) : 757 - 774
  • [44] Joint Part-of-Speech and Language ID Tagging for Code-Switched Data
    Soto, Victor
    Hirschberg, Julia
    COMPUTATIONAL APPROACHES TO LINGUISTIC CODE-SWITCHING, 2018, : 1 - 10
  • [45] Multilingual Neural Network Acoustic Modelling for ASR of Under-Resourced English-isiZulu Code-Switched Speech
    Biswas, Astik
    de Wet, Febe
    van der Westhuizen, Ewald
    Yılmaz, Emre
    Niesler, Thomas
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 2603 - 2607
  • [46] Language Identification of Intra-Word Code-Switching for Arabic-English
    Sabty, Caroline
    Mesabah, Islam
    Cetinoglu, Oezlem
    Abdennadher, Slim
    ARRAY, 2021, 12
  • [47] ArzEn-MultiGenre: An aligned parallel dataset of Egyptian Arabic song lyrics, novels, and subtitles, with English translations
    Al-Sabbagh, Rania
    DATA IN BRIEF, 2024, 54
  • [48] Training Hybrid Models on Noisy Transliterated Transcripts for Code-Switched Speech Recognition
    Wiesner, Matthew
    Sarma, Mousmita
    Arora, Ashish
    Raj, Desh
    Gao, Dongji
    Huang, Ruizhe
    Preet, Supreet
    Johnson, Moris
    Iqbal, Zikra
    Goel, Nagendra
    Trmal, Jan
    Garcia, Paola
    Khudanpur, Sanjeev
    INTERSPEECH 2021, 2021, : 2906 - 2910
  • [49] COMPARISON OF DATA AUGMENTATION AND ADAPTATION STRATEGIES FOR CODE-SWITCHED AUTOMATIC SPEECH RECOGNITION
    Ma, Min
    Ramabhadran, Bhuvana
    Emond, Jesse
    Rosenberg, Andrew
    Biadsy, Fadi
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6081 - 6085
  • [50] The effect of lexical triggers on Spanish-English code-switched judgment tasks
    Koronkiewicz, Bryan
    Delgado, Rodrigo
    FRONTIERS IN PSYCHOLOGY, 2024, 15