ArzEn: A Speech Corpus for Code-switched Egyptian Arabic-English

Authors
Hamed, Injy [1, 2]
Ngoc Thang Vu [2 ]
Abdennadher, Slim [1 ]
Affiliations
[1] German Univ Cairo, Comp Sci Dept, Cairo, Egypt
[2] Univ Stuttgart, Inst Nat Language Proc, Stuttgart, Germany
Keywords
Arabic-English; Dialectal Egyptian Arabic; code-switching; speech corpus; spontaneous speech;
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
In this paper, we present our ArzEn corpus, an Egyptian Arabic-English code-switching (CS) spontaneous speech corpus. The corpus was collected through informal interviews with 38 Egyptian bilingual university students and employees, held in a soundproof room. A total of 12 hours were recorded, transcribed, validated, and sentence-segmented. The corpus is designed mainly for use in Automatic Speech Recognition (ASR) systems; however, it also provides a useful resource for analyzing the CS phenomenon from linguistic, sociological, and psychological perspectives. We first discuss the CS phenomenon in Egypt and the factors that gave rise to the current language. We then provide a detailed description of how the corpus was collected, giving an overview of the participants involved. We also present statistics on the CS in the corpus, as well as a summary of the effort exerted in corpus development, in terms of the number of hours required for transcription, validation, segmentation, and speaker annotation. Finally, we discuss some factors contributing to the complexity of the corpus, as well as Arabic-English CS behaviour that could pose challenges to ASR systems.
Pages: 4237-4246
Number of pages: 10