Czech Historical Named Entity Corpus v 1.0

Citations: 0
Authors
Hubkova, Helena [1 ]
Kral, Pavel [1 ]
Pettersson, Eva [2 ]
Affiliations
[1] Univ West Bohemia, Fac Appl Sci, Dept Comp Sci & Engn, Univ 2732-8, Plzen 30100, Czech Republic
[2] Uppsala Univ, Dept Linguist & Philol, POB 256, SE-75105 Uppsala, Sweden
Keywords
Historical Czech; Historical Named Entity Corpus; LSTM; Named Entity Recognition; Neural Networks;
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
As the number of digitized archival documents increases very rapidly, named entity recognition (NER) in historical documents has become very important for information extraction and data mining. For this task an annotated corpus is needed, which has up to now been missing for Czech. In this paper we present a new annotated data collection for historical NER, composed of Czech historical newspapers. This corpus is freely available for research purposes at http://chnec.kiv.zcu.cz/. For this corpus, we have defined relevant domain-specific named entity types and created an annotation manual for corpus labelling. We further conducted experiments on this corpus using recurrent neural networks in order to show baseline results on this dataset. We experimented with randomly initialized embeddings and with static and dynamic fastText word embeddings. We achieved an F1 score of 0.73 with a bidirectional LSTM model using static fastText embeddings.
Pages: 4458 - 4465
Page count: 8
Related papers
50 in total
  • [1] Named Entity Linking in English-Czech Parallel Corpus
    Neverilova, Zuzana
    Zizkova, Hana
    TEXT, SPEECH, AND DIALOGUE, TSD 2024, PT I, 2024, 15048 : 147 - 158
  • [2] Uzbek news corpus for named entity recognition
    Yusufu, Aizihaierjiang
    Aziz, Kamran
    Yusufu, Aizierguli
    Ainiwaer, Abidan
    Li, Fei
    Ji, Donghong
    LANGUAGE RESOURCES AND EVALUATION, 2024,
  • [3] A Twitter Corpus for Named Entity Recognition in Turkish
    Carik, Buse
    Yeniterzi, Reyyan
    LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 4546 - 4551
  • [4] Towards a Balanced Named Entity Corpus for Dutch
    Desmet, Bart
    Hoste, Veronique
    LREC 2010 - SEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2010,
  • [5] Thai Nested Named Entity Recognition Corpus
    Buaphet, Weerayut
    Udomcharoenchaikit, Can
    Limkonchotiwat, Peerat
    Rutherford, Attapol T.
    Nutanong, Sarana
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), 2022, : 1473 - 1486
  • [6] A Finnish news corpus for named entity recognition
    Teemu Ruokolainen
    Pekka Kauppinen
    Miikka Silfverberg
    Krister Lindén
    Language Resources and Evaluation, 2020, 54 : 247 - 272
  • [7] Introducing RONEC - the Romanian Named Entity Corpus
    Dumitrescu, Stefan Daniel
    Avram, Andrei-Marius
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 4436 - 4443
  • [8] GerNED: A German Corpus for Named Entity Disambiguation
    Ploch, Danuta
    Hennig, Leonhard
    Duka, Angelina
    De Luca, Ernesto William
    Albayrak, Sahin
    LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2012, : 3886 - 3893
  • [9] A Finnish news corpus for named entity recognition
    Ruokolainen, Teemu
    Kauppinen, Pekka
    Silfverberg, Miikka
    Linden, Krister
    LANGUAGE RESOURCES AND EVALUATION, 2020, 54 (01) : 247 - 272
  • [10] FEATURES FOR NAMED ENTITY RECOGNITION IN CZECH LANGUAGE
    Kral, Pavel
    KEOD 2011: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON KNOWLEDGE ENGINEERING AND ONTOLOGY DEVELOPMENT, 2011, : 437 - 441