ZAEBUC: An Annotated Arabic-English Bilingual Writer Corpus

Authors
Habash, Nizar [1]
Palfreyman, David [2]
Affiliations
[1] New York University Abu Dhabi, Abu Dhabi, United Arab Emirates
[2] Zayed University, Abu Dhabi, United Arab Emirates
Keywords
Annotated Corpus; Learner Corpus; CEFR; Arabic; English
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer Applications]
Discipline classification code
081203; 0835
Abstract
We present ZAEBUC, an annotated Arabic-English bilingual writer corpus comprising short essays by first-year university students at Zayed University in the United Arab Emirates. We describe and discuss the guidelines and pipeline processes we followed to create and quality-check the annotations. The annotations include spelling and grammar correction, morphological tokenization, part-of-speech (POS) tagging, lemmatization, and Common European Framework of Reference (CEFR) ratings. All annotations are applied to both the Arabic and English texts, using consistent guidelines as far as possible, with tracked alignments among the different annotations and to the original raw texts. For morphological tokenization, POS tagging, and lemmatization, we use existing automatic annotation tools followed by manual correction. We also present various measurements and correlations, with preliminary insights drawn from the data and annotations. The publicly available ZAEBUC corpus and its annotations are intended as stepping stones for additional annotations.
Pages: 79-88
Number of pages: 10