Simplified guidelines for the creation of Large Scale Dialectal Arabic Annotations

Authors
Elfardy, Heba [1]
Diab, Mona [1]
Affiliations
[1] Columbia Univ, Ctr Computat Learning Syst, New York, NY 10115 USA
Keywords
Linguistic Code Switching; Dialectal Arabic; Annotation Guidelines;
DOI
Not available
Chinese Library Classification number
H0 [Linguistics];
Discipline classification codes
030303 ; 0501 ; 050102 ;
Abstract
The Arabic language is a collection of dialectal variants along with the standard form, Modern Standard Arabic (MSA). MSA is used in official settings, while the dialectal variants (DA) correspond to the native tongue of Arabic speakers. Arabic speakers typically code switch between DA and MSA, which is reflected extensively in written online social media. Automatic processing of this genre of Arabic is very difficult for NLP tools since the linguistic difference between MSA and DA is quite profound. However, no annotated resources exist for marking the regions of such switches in the utterance. In this paper, we present a simplified set of guidelines for detecting code switching in Arabic on the word/token level. We use these guidelines in annotating a corpus that is rich in DA with frequent code switching to MSA. We present both a quantitative and qualitative analysis of the annotations.
Pages: 371 - 378
Page count: 8