A Large Scale Corpus of Gulf Arabic

Cited by: 0

Authors:

Khalifa, Salam [1]

Habash, Nizar [1]

Abdulrahim, Dana [2]

Hassan, Sara [1]

Affiliations:

[1] New York University Abu Dhabi, Computational Approaches to Modeling Language Lab, Abu Dhabi, United Arab Emirates

[2] University of Bahrain, Zallaq, Bahrain

Source:

LREC 2016 - TENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2016

Keywords:

Arabic Dialects; Corpus; Large-Scale; Gulf Arabic

DOI:

Not available

Chinese Library Classification number:

H [Language, Writing]

Subject classification code:

05

Abstract:

Most Arabic natural language processing tools and resources are developed to serve Modern Standard Arabic (MSA), which is the official written language in the Arab World. Some Dialectal Arabic varieties, notably Egyptian Arabic, have received some attention lately and have a growing collection of resources that include annotated corpora and morphological analyzers and taggers. Gulf Arabic, however, lags behind in that respect. In this paper, we present the Gumar Corpus, a large-scale corpus of Gulf Arabic consisting of 110 million words from 1,200 forum novels. We annotate the corpus for sub-dialect information at the document level. We also present results of a preliminary study in the morphological annotation of Gulf Arabic which includes developing guidelines for a conventional orthography. The text of the corpus is publicly browsable through a web interface we developed for it.


Pages: 4282-4289

Page count: 8

