A Large Scale Corpus of Gulf Arabic

Cited by: 0
Authors
Khalifa, Salam [1]
Habash, Nizar [1]
Abdulrahim, Dana [2]
Hassan, Sara [1]
Affiliations
[1] New York University Abu Dhabi, Computational Approaches to Modeling Language Lab, Abu Dhabi, United Arab Emirates
[2] University of Bahrain, Zallaq, Bahrain
Keywords
Arabic Dialects; Corpus; Large-Scale; Gulf Arabic
DOI
Not available
Chinese Library Classification (CLC) number
H [Language, Writing]
Discipline classification code
05
Abstract
Most Arabic natural language processing tools and resources are developed to serve Modern Standard Arabic (MSA), which is the official written language in the Arab World. Some Dialectal Arabic varieties, notably Egyptian Arabic, have received some attention lately and have a growing collection of resources that include annotated corpora and morphological analyzers and taggers. Gulf Arabic, however, lags behind in that respect. In this paper, we present the Gumar Corpus, a large-scale corpus of Gulf Arabic consisting of 110 million words from 1,200 forum novels. We annotate the corpus for sub-dialect information at the document level. We also present results of a preliminary study in the morphological annotation of Gulf Arabic which includes developing guidelines for a conventional orthography. The text of the corpus is publicly browsable through a web interface we developed for it.
Pages: 4282-4289
Page count: 8