ArzEn: A Speech Corpus for Code-switched Egyptian Arabic-English

Authors
Hamed, Injy [1, 2]
Ngoc Thang Vu [2]
Abdennadher, Slim [1]
Affiliations
[1] German Univ Cairo, Comp Sci Dept, Cairo, Egypt
[2] Univ Stuttgart, Inst Nat Language Proc, Stuttgart, Germany
Keywords
Arabic-English; Dialectal Egyptian Arabic; code-switching; speech corpus; spontaneous speech;
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
In this paper, we present our ArzEn corpus, an Egyptian Arabic-English code-switching (CS) spontaneous speech corpus. The corpus is collected through informal interviews with 38 Egyptian bilingual university students and employees held in a soundproof room. A total of 12 hours are recorded, transcribed, validated and sentence segmented. The corpus is mainly designed to be used in Automatic Speech Recognition (ASR) systems; however, it also provides a useful resource for analyzing the CS phenomenon from linguistic, sociological, and psychological perspectives. We first discuss the CS phenomenon in Egypt and the factors that gave rise to the current language. We then provide a detailed description of how the corpus was collected, giving an overview of the participants involved. We also present statistics on the CS involved in the corpus, as well as a summary of the effort exerted in the corpus development, in terms of the number of hours required for transcription, validation, segmentation and speaker annotation. Finally, we discuss some factors contributing to the complexity of the corpus, as well as Arabic-English CS behaviour that could pose potential challenges to ASR systems.
Pages: 4237 - 4246
Number of pages: 10