ArzEn: A Speech Corpus for Code-switched Egyptian Arabic-English

Authors
Hamed, Injy [1, 2]
Vu, Ngoc Thang [2]
Abdennadher, Slim [1]
Affiliations
[1] German Univ Cairo, Comp Sci Dept, Cairo, Egypt
[2] Univ Stuttgart, Inst Nat Language Proc, Stuttgart, Germany
Keywords
Arabic-English; Dialectal Egyptian Arabic; code-switching; speech corpus; spontaneous speech
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
In this paper, we present our ArzEn corpus, an Egyptian Arabic-English code-switching (CS) spontaneous speech corpus. The corpus was collected through informal interviews with 38 Egyptian bilingual university students and employees, held in a soundproof room. A total of 12 hours of speech were recorded, transcribed, validated, and sentence-segmented. The corpus is mainly designed for use in Automatic Speech Recognition (ASR) systems; however, it also provides a useful resource for analyzing the CS phenomenon from linguistic, sociological, and psychological perspectives. We first discuss the CS phenomenon in Egypt and the factors that gave rise to the current language. We then provide a detailed description of how the corpus was collected, giving an overview of the participants involved. We also present statistics on the CS in the corpus, as well as a summary of the effort exerted in corpus development, in terms of the number of hours required for transcription, validation, segmentation, and speaker annotation. Finally, we discuss some factors contributing to the complexity of the corpus, as well as Arabic-English CS behaviour that could pose challenges to ASR systems.
Pages: 4237-4246
Page count: 10
Related papers
50 in total
  • [31] CONVERSATION ANALYSIS OF CODE-SWITCHED UTTERANCES IN THE SPEECH OF KAZAKH BILINGUALS
    Akynova, Damira
    Agmanova, Atirkul
    Zhuravleva, Yevgeniya
    Bayekeyeva, Zhuldyz
    PROCEEDINGS OF INTCESS 2019- 6TH INTERNATIONAL CONFERENCE ON EDUCATION AND SOCIAL SCIENCES, 2019, : 901 - 907
  • [32] Arabic-English Corpus for Cross-Language Textual Similarity Detection
    Aljuaid, Hanan
    INFORMATION SCIENCE AND APPLICATIONS, 2020, 621 : 527 - 536
  • [33] Grammatical Error Correction for Code-Switched Sentences by Learners of English
    ALTA Institute & Computer Laboratory, University of Cambridge, United Kingdom
    Unknown
    Unknown
    arXiv, 1600,
  • [34] Malayalam-English Code-Switched: Grapheme to Phoneme System
    Manghat, Sreeja
    Manghat, Sreeram
    Schultz, Tanja
    INTERSPEECH 2020, 2020, : 4133 - 4137
  • [35] Building a First Language Model for Code-switch Arabic-English
    Hamed, Injy
    Elmahdy, Mohamed
    Abdennadher, Slim
    ARABIC COMPUTATIONAL LINGUISTICS (ACLING 2017), 2017, 117 : 208 - 216
  • [36] Named Entity Recognition on Arabic-English Code-Mixed Data
    Sabty, Caroline
    Elmahdy, Mohamed
    Abdennadher, Slim
    2019 13TH IEEE INTERNATIONAL CONFERENCE ON SEMANTIC COMPUTING (ICSC), 2019, : 93 - 97
  • [37] Code-switched automatic speech recognition in five South African languages
    Biswas, Astik
    Yilmaz, Emre
    van der Westhuizen, Ewald
    de Wet, Febe
    Niesler, Thomas
    COMPUTER SPEECH AND LANGUAGE, 2022, 71
  • [38] TRANSLITERATION BASED APPROACHES TO IMPROVE CODE-SWITCHED SPEECH RECOGNITION PERFORMANCE
    Emond, Jesse
    Ramabhadran, Bhuvana
    Roark, Brian
    Moreno, Pedro
    Ma, Min
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 448 - 455
  • [39] Detecting Offensive Tweets in Hindi-English Code-Switched Language
    Mathur, Puneet
    Shah, Rajiv Ratn
    Sawhney, Ramit
    Mahata, Debanjan
    NATURAL LANGUAGE PROCESSING FOR SOCIAL MEDIA (AFNLP SIG SOCIALNLP), 2018, : 18 - 26
  • [40] Deep Learning Approaches for English-Marathi Code-Switched Detection
    Bhimanwar S.
    Viralekar O.
    Anturkar K.
    Kulkarni A.
    EAI Endorsed Transactions on Scalable Information Systems, 2024, 11 (03) : 1 - 9