Wojood: Nested Arabic Named Entity Corpus and Recognition using BERT

Cited by: 0
Authors
Jarrar, Mustafa [1 ]
Khalilia, Mohammed [1 ]
Ghanem, Sana [1 ]
Affiliations
[1] Birzeit Univ, Birzeit, Palestine
Keywords
Named Entity Recognition; Multi-Task Learning; Nested Entities; BERT; Arabic NER Corpus;
DOI
Not available
Chinese Library Classification
TP39 [Computer applications];
Discipline classification codes
081203 ; 0835 ;
Abstract
This paper presents Wojood, a corpus for Arabic nested Named Entity Recognition (NER). Nested entities occur when one entity mention is embedded inside another entity mention. Wojood consists of about 550K Modern Standard Arabic (MSA) and dialect tokens that are manually annotated with 21 entity types, including person, organization, location, event and date. More importantly, the corpus is annotated with nested entities instead of the more common flat annotations. The data contains about 75K entities, 22.5% of which are nested. The inter-annotator evaluation of the corpus demonstrated strong agreement, with a Cohen's Kappa of 0.979 and an F1-score of 0.976. To validate our data, we used the corpus to train a nested NER model based on multi-task learning using the pre-trained AraBERT (Arabic BERT). The model achieved an overall micro F1-score of 0.884. Our corpus, the annotation guidelines, the source code and the pre-trained model are publicly available.
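As an illustration of the terms used in the abstract, the following minimal Python sketch shows one possible way to represent nested entity mentions as token spans and to compute a micro-averaged F1-score over exact span matches. It is not the authors' code or evaluation script: the `Span` type, the helper functions, the tag names `ORG` and `LOC`, and the English example sentence are all hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A hypothetical entity mention over a token sequence."""
    start: int   # index of the first token (inclusive)
    end: int     # index one past the last token (exclusive)
    label: str   # entity type; tag names here are assumptions


def is_nested(inner: Span, outer: Span) -> bool:
    """True if `inner` lies inside `outer` and has different boundaries."""
    return (
        outer.start <= inner.start
        and inner.end <= outer.end
        and (inner.start, inner.end) != (outer.start, outer.end)
    )


def nested_fraction(spans: list[Span]) -> float:
    """Share of mentions that are embedded inside some other mention."""
    if not spans:
        return 0.0
    nested = [s for s in spans if any(is_nested(s, o) for o in spans)]
    return len(nested) / len(spans)


def micro_f1(gold: set[Span], pred: set[Span]) -> float:
    """Micro-averaged F1 over exact (start, end, label) matches."""
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative tokens: ["Birzeit", "University", "in", "Palestine"]
gold = {
    Span(0, 2, "ORG"),  # "Birzeit University"
    Span(0, 1, "LOC"),  # "Birzeit", nested inside the organization mention
    Span(3, 4, "LOC"),  # "Palestine"
}
# A flat prediction that misses the nested mention.
pred = {Span(0, 2, "ORG"), Span(3, 4, "LOC")}

print(nested_fraction(list(gold)))
print(micro_f1(gold, pred))
```

In this toy case a flat tagger can recover the two outer mentions but not the embedded one, which lowers recall even though every predicted span is correct.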
Pages: 3626-3636
Number of pages: 11