Towards a Balanced Named Entity Corpus for Dutch

Authors
Desmet, Bart [1, 2]
Hoste, Veronique [1, 2]
Affiliations
[1] Univ Coll Ghent, Language & Translat Technol Team, B-9000 Ghent, Belgium
[2] Univ Ghent, Dept Appl Math & Comp Sci, B-9000 Ghent, Belgium
DOI
Not available
Chinese Library Classification
H [Language, Writing]
Discipline classification code
05
Abstract
This paper introduces a new named entity corpus for Dutch. State-of-the-art named entity recognition systems require a substantial annotated corpus to be trained on. Such corpora exist for English, but not for Dutch. The STEVIN-funded SoNaR project aims to produce a diverse 500-million-word reference corpus of written Dutch, with four semantic annotation layers: named entities, coreference relations, semantic roles and spatiotemporal expressions. A 1-million-word subset will be manually corrected. Named entity annotation guidelines for Dutch were developed, adapted from the MUC and ACE guidelines. Adaptations include the annotation of products and events, the classification into subtypes, and the markup of metonymic usage. Inter-annotator agreement experiments were conducted to corroborate the reliability of the guidelines, which yielded satisfactory results (Kappa scores above 0.90). We are building a NER system, trained on the 1-million-word subcorpus, to automatically classify the remainder of the SoNaR corpus. To this end, experiments with various classification algorithms (MBL, SVM, CRF) and features have been carried out and evaluated.
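The abstract reports Kappa scores above 0.90 for inter-annotator agreement but does not say which variant was used or at what unit agreement was measured. As a minimal sketch, assuming Cohen's kappa for two annotators labelling the same tokens, the statistic compares observed agreement with the agreement expected by chance. The function name, the label set and the sample annotations below are hypothetical and are not taken from the SoNaR guidelines.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)

    # Observed agreement: share of items that received the same label twice.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement, estimated from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical token-level labels (assumed tag names, "O" = not an entity).
annotator_1 = ["PER", "O", "O", "LOC", "ORG", "O", "O", "PRO", "O", "EVE"]
annotator_2 = ["PER", "O", "O", "LOC", "LOC", "O", "O", "PRO", "O", "EVE"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

In this toy sample the annotators agree on 9 of 10 tokens, yet kappa comes out lower than the raw agreement of 0.90 because part of that agreement is attributed to chance, mostly through the frequent "O" label.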
Pages: 7