Cairo Student Code-Switch (CSCS) Corpus: An Annotated Egyptian Arabic-English Corpus

Cited by: 0
Authors
Balabel, Mohamed [1 ,2 ]
Hamed, Injy [2 ,3 ]
Abdennadher, Slim [3 ]
Ngoc Thang Vu [2 ]
Cetinoglu, Oezlem [2 ]
Affiliations
[1] IBM Germany Res & Dev, Boblingen, Germany
[2] Univ Stuttgart, Inst Nat Language Proc, Stuttgart, Germany
[3] German Univ Cairo, Comp Sci Dept, Cairo, Egypt
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
Code-switching has become a prevalent phenomenon across many communities. It poses a challenge to NLP researchers, mainly due to the lack of available data needed for training and testing applications. In this paper, we introduce a new resource: a corpus of Egyptian Arabic code-switch speech data that is fully tokenized, lemmatized and annotated for part-of-speech tags. Besides the corpus itself, we provide annotation guidelines to address the unique challenges of annotating code-switch data. Another challenge that we address is the fact that Egyptian Arabic orthography and grammar are not standardized.
Pages: 3973 - 3977
Page count: 5
Related papers
12 in total
  • [1] Collection and Analysis of Code-switch Egyptian Arabic-English Speech Corpus
    Hamed, Injy
    Elmahdy, Mohamed
    Abdennadher, Slim
    [J]. PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 3805 - 3809
  • [2] ArzEn: A Speech Corpus for Code-switched Egyptian Arabic-English
    Hamed, Injy
    Ngoc Thang Vu
    Abdennadher, Slim
    [J]. PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 4237 - 4246
  • [3] ZAEBUC: An Annotated Arabic-English Bilingual Writer Corpus
    Habash, Nizar
    Palfreyman, David
    [J]. LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 79 - 88
  • [4] Building a First Language Model for Code-switch Arabic-English
    Hamed, Injy
    Elmahdy, Mohamed
    Abdennadher, Slim
    [J]. ARABIC COMPUTATIONAL LINGUISTICS (ACLING 2017), 2017, 117 : 208 - 216
  • [5] A Morphologically Annotated Corpus and a Morphological Analyzer for Egyptian Arabic
    Fashwan, Amany
    Alansary, Sameh
    [J]. AI IN COMPUTATIONAL LINGUISTICS, 2021, 189 : 203 - 210
  • [6] Modeling Code-Switch Languages Using Bilingual Parallel Corpus
    Lee, Grandee
    Li, Haizhou
    [J]. 58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 860 - 870
  • [7] Arabic-English Corpus for Cross-Language Textual Similarity Detection
    Aljuaid, Hanan
    [J]. INFORMATION SCIENCE AND APPLICATIONS, 2020, 621 : 527 - 536
  • [8] Exploring Segmentation Approaches for Neural Machine Translation of Code-Switched Egyptian Arabic-English Text
    Gaser, Marwa
    Mager, Manuel
    Hamed, Injy
    Habash, Nizar
    Abdennadher, Slim
    Vu, Ngoc Thang
    [J]. 17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023, : 3523 - 3538
  • [9] An interdisciplinary corpus-based analysis of the translation of (karama, 'dignity') and its collocates in Arabic-English constitutions
    Brierley, Claire
    El-Farahaty, Hanem
    [J]. JOURNAL OF SPECIALISED TRANSLATION, 2019, (32): 121 - 145
  • [10] Code-Switching Language Modeling with Bilingual Word Embeddings: A Case Study for Egyptian Arabic-English
    Hamed, Injy
    Zhu, Moritz
    Elmahdy, Mohamed
    Abdennadher, Slim
    Vu, Ngoc Thang
    [J]. SPEECH AND COMPUTER, SPECOM 2019, 2019, 11658 : 160 - 170