Czech Historical Named Entity Corpus v 1.0

Authors
Hubkova, Helena [1]
Kral, Pavel [1]
Pettersson, Eva [2]
Affiliations
[1] Univ West Bohemia, Fac Appl Sci, Dept Comp Sci & Engn, Univ 2732-8, Plzen 30100, Czech Republic
[2] Uppsala Univ, Dept Linguist & Philol, POB 256, SE-75105 Uppsala, Sweden
Keywords
Historical Czech; Historical Named Entity Corpus; LSTM; Named Entity Recognition; Neural Networks
DOI
Not available
Chinese Library Classification
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
As the number of digitized archival documents grows rapidly, named entity recognition (NER) in historical documents has become important for information extraction and data mining. This task requires an annotated corpus, which has so far been missing for Czech. In this paper we present a new annotated data collection for historical NER, composed of Czech historical newspapers. The corpus is freely available for research purposes at http://chnec.kiv.zcu.cz/. For this corpus, we defined relevant domain-specific named entity types and created an annotation manual for corpus labelling. We also ran experiments on the corpus with recurrent neural networks to establish baseline results. We experimented with randomly initialized embeddings and with static and dynamic fastText word embeddings. We achieved an F1 score of 0.73 with a bidirectional LSTM model using static fastText embeddings.
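For readers unfamiliar with how an F1 score is typically computed for NER, the following is a minimal sketch of entity-level (exact span match) F1 over BIO-tagged sequences. It is an illustration only: the abstract does not state which tagging scheme or matching criterion the authors used, and the tag names `PER` and `LOC` as well as the function names `extract_spans` and `span_f1` are hypothetical.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive)
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence.

    Assumption: an I- tag that does not continue a span of the same
    type is treated leniently as the start of a new span.
    """
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" acts as a sentinel that closes any open span.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((label, start, i))
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
        else:
            start, label = None, None
    return spans


def span_f1(gold: List[str], pred: List[str]) -> float:
    """Harmonic mean of precision and recall over exactly matching spans."""
    gold_spans = extract_spans(gold)
    pred_spans = extract_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the prediction finds the person but misses the location.
gold_tags = ["B-PER", "I-PER", "O", "B-LOC"]
pred_tags = ["B-PER", "I-PER", "O", "O"]
score = span_f1(gold_tags, pred_tags)  # precision 1.0, recall 0.5
```

Under this criterion a predicted entity counts as correct only if both its boundaries and its type match the gold annotation, which is stricter than token-level accuracy.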
Pages: 4458-4465
Page count: 8