A Large Scale Corpus of Gulf Arabic

Authors
Khalifa, Salam [1]
Habash, Nizar [1]
Abdulrahim, Dana [2]
Hassan, Sara [1]
Affiliations
[1] New York University Abu Dhabi, Computational Approaches to Modeling Language Lab, Abu Dhabi, United Arab Emirates
[2] University of Bahrain, Zallaq, Bahrain
Keywords
Arabic Dialects; Corpus; Large-Scale; Gulf Arabic
DOI
Not available
Chinese Library Classification number
H [Language, Writing]
Discipline classification code
05
Abstract
Most Arabic natural language processing tools and resources are developed to serve Modern Standard Arabic (MSA), which is the official written language in the Arab World. Some Dialectal Arabic varieties, notably Egyptian Arabic, have received some attention lately and have a growing collection of resources that include annotated corpora and morphological analyzers and taggers. Gulf Arabic, however, lags behind in that respect. In this paper, we present the Gumar Corpus, a large-scale corpus of Gulf Arabic consisting of 110 million words from 1,200 forum novels. We annotate the corpus for sub-dialect information at the document level. We also present results of a preliminary study in the morphological annotation of Gulf Arabic which includes developing guidelines for a conventional orthography. The text of the corpus is publicly browsable through a web interface we developed for it.
Pages: 4282-4289
Number of pages: 8