Towards a Balanced Named Entity Corpus for Dutch

Cited by: 0
Authors
Desmet, Bart [1, 2]
Hoste, Veronique [1, 2]
Affiliations
[1] Univ Coll Ghent, Language & Translat Technol Team, B-9000 Ghent, Belgium
[2] Univ Ghent, Dept Appl Math & Comp Sci, B-9000 Ghent, Belgium
DOI
Not available
Chinese Library Classification (CLC) number
H [Language, Writing]
Discipline classification code
05
Abstract
This paper introduces a new named entity corpus for Dutch. State-of-the-art named entity recognition systems require a substantial annotated corpus to be trained on. Such corpora exist for English, but not for Dutch. The STEVIN-funded SoNaR project aims to produce a diverse 500-million-word reference corpus of written Dutch, with four semantic annotation layers: named entities, coreference relations, semantic roles and spatiotemporal expressions. A 1-million-word subset will be manually corrected. Named entity annotation guidelines for Dutch were developed, adapted from the MUC and ACE guidelines. Adaptations include the annotation of products and events, the classification into subtypes, and the markup of metonymic usage. Inter-annotator agreement experiments were conducted to corroborate the reliability of the guidelines, which yielded satisfactory results (Kappa scores above 0.90). We are building a NER system, trained on the 1-million-word subcorpus, to automatically classify the remainder of the SoNaR corpus. To this end, experiments with various classification algorithms (MBL, SVM, CRF) and features have been carried out and evaluated.
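The abstract reports inter-annotator agreement as Kappa scores above 0.90 but does not say which variant was computed. As a rough illustration only, the sketch below assumes Cohen's kappa for two annotators labelling the same tokens. The function name `cohen_kappa` and the example labels (PER, LOC, ORG, O) are hypothetical and not taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items.

    Assumption: the paper does not specify its kappa variant; this is
    the standard two-annotator formula, shown for illustration only.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two
    # annotators' marginal proportions, summed over all labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Hypothetical token-level annotations from two annotators.
annotator_1 = ["PER", "LOC", "ORG", "PER", "O", "O"]
annotator_2 = ["PER", "LOC", "ORG", "ORG", "O", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually preferred over plain accuracy when most tokens belong to a single dominant class such as "not an entity".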
Pages: 7