Towards a Balanced Named Entity Corpus for Dutch

Authors
Desmet, Bart [1, 2]
Hoste, Veronique [1, 2]
Affiliations
[1] Univ Coll Ghent, Language & Translat Technol Team, B-9000 Ghent, Belgium
[2] Univ Ghent, Dept Appl Math & Comp Sci, B-9000 Ghent, Belgium
DOI
Not available
Chinese Library Classification number
H [Language and Writing]
Subject classification code
05
Abstract
This paper introduces a new named entity corpus for Dutch. State-of-the-art named entity recognition systems require a substantial annotated corpus to be trained on. Such corpora exist for English, but not for Dutch. The STEVIN-funded SoNaR project aims to produce a diverse 500-million-word reference corpus of written Dutch, with four semantic annotation layers: named entities, coreference relations, semantic roles and spatiotemporal expressions. A 1-million-word subset will be manually corrected. Named entity annotation guidelines for Dutch were developed, adapted from the MUC and ACE guidelines. Adaptations include the annotation of products and events, the classification into subtypes, and the markup of metonymic usage. Inter-annotator agreement experiments were conducted to corroborate the reliability of the guidelines, which yielded satisfactory results (Kappa scores above 0.90). We are building a NER system, trained on the 1-million-word subcorpus, to automatically classify the remainder of the SoNaR corpus. To this end, experiments with various classification algorithms (MBL, SVM, CRF) and features have been carried out and evaluated.
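The inter-annotator agreement figure quoted in the abstract can be illustrated with a minimal sketch. The abstract does not say which kappa variant was used, so this assumes Cohen's kappa for two annotators labelling the same tokens. The function name `cohen_kappa`, the label set (PER, LOC, ORG, O) and the toy annotations are hypothetical and are not taken from the SoNaR data or guidelines.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    Assumption: two annotators, one categorical label per item.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)

    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal proportions, summed over all labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one and the same label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy annotations, for illustration only.
annotator_1 = ["PER", "LOC", "ORG", "O", "O", "PER", "LOC", "O"]
annotator_2 = ["PER", "LOC", "ORG", "O", "O", "PER", "ORG", "O"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is generally preferred over plain percentage agreement when some labels are much more frequent than others.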
Pages: 7