Arabic Dialect Identification

Cited by: 96
Authors
Zaidan, Omar F. [1]
Callison-Burch, Chris [2]
Affiliations
[1] Microsoft Res, Seattle, WA USA
[2] Univ Penn, Comp & Informat Sci Dept, Philadelphia, PA 19104 USA
Keywords
LANGUAGE IDENTIFICATION; AGREEMENT;
DOI
10.1162/COLI_a_00169
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The written form of the Arabic language, Modern Standard Arabic (MSA), differs in a non-trivial manner from the various spoken regional dialects of Arabic, the true native languages of Arabic speakers. Those dialects, in turn, differ quite a bit from each other. However, due to MSA's prevalence in written form, almost all Arabic data sets have predominantly MSA content. In this article, we describe the creation of a novel Arabic resource with dialect annotations. We have created a large monolingual data set rich in dialectal Arabic content called the Arabic On-line Commentary Data set (Zaidan and Callison-Burch 2011). We describe our annotation effort to identify the dialect level (and dialect itself) in each of more than 100,000 sentences from the data set by crowdsourcing the annotation task, and delve into interesting annotator behaviors (like over-identification of one's own dialect). Using this new annotated data set, we consider the task of Arabic dialect identification: Given the word sequence forming an Arabic sentence, determine the variety of Arabic in which it is written. We use the data to train and evaluate automatic classifiers for dialect identification, and establish that classifiers using dialectal data significantly and dramatically outperform baselines that use MSA-only data, achieving near-human classification accuracy. Finally, we apply our classifiers to discover dialectical data from a large Web crawl consisting of 3.5 million pages mined from on-line Arabic newspapers.
Pages: 171-202
Number of pages: 32
Related papers
  • [1] Spoken Arabic Algerian Dialect Identification
    Bougrine, Soumia
    Cherroun, Hadda
    Abdelali, Ahmed
    2018 2ND INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE AND SPEECH PROCESSING (ICNLSP), 2018, : 96 - 101
  • [2] ADIDA: Automatic Dialect Identification for Arabic
    Obeid, Ossama
    Salameh, Mohammad
    Bouamor, Houda
    Habash, Nizar
    NAACL HLT 2019: THE 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES: PROCEEDINGS OF THE DEMONSTRATIONS SESSION, 2019, : 6 - 11
  • [3] Transformer-based Arabic Dialect Identification
    Lin, Wanqiu
    Madhavi, Maulik
    Das, Rohan Kumar
    Li, Haizhou
    2020 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP 2020), 2020, : 192 - 196
  • [4] Arabic Dialect Identification for Travel and Twitter Text
    Mishra, Pruthwik
    Mujadia, Vandan
    FOURTH ARABIC NATURAL LANGUAGE PROCESSING WORKSHOP (WANLP 2019), 2019, : 234 - 238
  • [5] IADD: An integrated Arabic dialect identification dataset
    Zahir, Jihad
    DATA IN BRIEF, 2022, 40
  • [6] Using Prosody and Phonotactics in Arabic Dialect Identification
    Biadsy, Fadi
    Hirschberg, Julia
    INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 208 - 211
  • [7] Hierarchical Deep Learning for Arabic Dialect Identification
    de Francony, Gael
    Guichard, Victor
    Joshi, Praveen
    Afli, Haithem
    Bouchekif, Abdessalam
    FOURTH ARABIC NATURAL LANGUAGE PROCESSING WORKSHOP (WANLP 2019), 2019, : 249 - 253
  • [8] A Character Level Convolutional BiLSTM for Arabic Dialect Identification
    Elaraby, Mohamed
    Zahran, Ahmed Ismail
    FOURTH ARABIC NATURAL LANGUAGE PROCESSING WORKSHOP (WANLP 2019), 2019, : 274 - 278
  • [9] Learning Intonation Pattern Embeddings for Arabic Dialect Identification
    Alvarez, Aitor Arronte
    Issa, Elsayed Sabry Abdelaal
    INTERSPEECH 2020, 2020, : 472 - 476
  • [10] Arabic Dialect Identification - 'Is the Secret in the Silence?' and Other Observations
    Boril, Hynek
    Sangwan, Abhijeet
    Hansen, John H. L.
    13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, 2012, : 30 - 33