A Spelling Correction Corpus for Multiple Arabic Dialects

Cited by: 0
Authors
Eryani, Fadhl [1]
Habash, Nizar [1]
Bouamor, Houda [2]
Khalifa, Salam [1]
Affiliations
[1] New York University Abu Dhabi, Computational Approaches to Modeling Language (CAMeL) Lab, Abu Dhabi, United Arab Emirates
[2] Carnegie Mellon University in Qatar, Ar Rayyan, Qatar
Keywords
Dialects; Corpora; Spelling Correction; Conventional Orthography for Dialectal Arabic;
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
Arabic dialects are the non-standard varieties of Arabic commonly spoken - and increasingly written on social media - across the Arab world. Arabic dialects do not have standard orthographies, a challenge for natural language processing applications. In this paper, we present the MADAR CODA Corpus, a collection of 10,000 sentences from five Arabic city dialects (Beirut, Cairo, Doha, Rabat, and Tunis) represented in the Conventional Orthography for Dialectal Arabic (CODA) in parallel with their Raw original form. The sentences come from the Multi-Arabic Dialect Applications and Resources (MADAR) Project and are in parallel across the cities (2,000 sentences from each city). This publicly available resource is intended to support research on spelling correction and text normalization for Arabic dialects. We present results on a bootstrapping technique we use to speed up the CODA annotation, as well as on the degree of similarity across the dialects before and after CODA annotation.
Pages: 4130-4138
Number of pages: 9
Related papers
50 in total
  • [1] Shami: A Corpus of Levantine Arabic Dialects
    Abu Kwaik, Kathrein
    Saad, Motaz
    Chatzikyriakidis, Stergios
    Dobnik, Simon
    [J]. PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018 : 3645 - 3652
  • [2] Arabic spelling error detection and correction
    Attia, Mohammed
    Pecina, Pavel
    Samih, Younes
    Shaalan, Khaled
    Van Genabith, Josef
    [J]. NATURAL LANGUAGE ENGINEERING, 2016, 22 (05) : 751 - 773
  • [3] Building a Corpus for Arabic Dialects using Games With A Purpose
    Osman, Maya
    Sabty, Caroline
    Sharaf, Nada
    Abdennadher, Slim
    [J]. 2015 FIRST INTERNATIONAL CONFERENCE ON ARABIC COMPUTATIONAL LINGUISTICS (ACLING 2015): ADVANCES IN ARABIC COMPUTATIONAL LINGUISTICS, 2015 : 21 - 25
  • [4] Automatic Building of a Large Arabic Spelling Error Corpus
    Aichaoui S.B.
    Hiri N.
    Dahou A.H.
    Cheragui M.A.
    [J]. SN Computer Science, 4 (2)
  • [5] AUTOMATIC CORRECTION OF SPELLING-ERRORS IN ARABIC
    ALFEDAGHI, S
    AMIN, A
    [J]. JOURNAL OF THE UNIVERSITY OF KUWAIT-SCIENCE, 1992, 19 (02) : 175 - 194
  • [6] The Corpus Based Approach to Sentiment Analysis in Modern Standard Arabic and Arabic Dialects: A Literature Review
    Alnawas, Anwar
    Arici, Nursal
    [J]. JOURNAL OF POLYTECHNIC-POLITEKNIK DERGISI, 2018, 21 (02) : 461 - 470
  • [7] Spelling Error Detection and Correction for Arabic Using NooJ
    Kassmi, Rafik
    Mbarki, Samir
    Mouloudi, Abdelaziz
    [J]. FORMALIZING NATURAL LANGUAGES: APPLICATIONS TO NATURAL LANGUAGE PROCESSING AND DIGITAL HUMANITIES, NOOJ 2023, 2024, 1816 : 202 - 212
  • [8] ARABIC SOFT SPELLING CORRECTION WITH T5
    Al-Qaraghuli, Mohammed
    Jaafar, Ola Arif
    [J]. JORDANIAN JOURNAL OF COMPUTERS AND INFORMATION TECHNOLOGY, 2024, 10 (01) : 46 - 57
  • [9] Spelling Error Detection and Correction for Arabic Using NooJ
    Kassmi, Rafik
    Mbarki, Samir
    Mouloudi, Abdelaziz
    [J]. Communications in Computer and Information Science, 2024, 1816 CCIS : 202 - 212
  • [10] A Large-Scale Query Spelling Correction Corpus
    Hagen, Matthias
    Potthast, Martin
    Gohsen, Marcel
    Rathgeber, Anja
    Stein, Benno
    [J]. SIGIR'17: PROCEEDINGS OF THE 40TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2017 : 1261 - 1264