Annotation tools for syntax and named entities in the National Corpus of Polish

Cited by: 9
Authors
Waszczuk, Jakub [1, 2]
Glowinska, Katarzyna [1]
Savary, Agata [3]
Przepiorkowski, Adam [1, 2]
Lenart, Michal [2]
Affiliations
[1] Polish Acad Sci, Inst Comp Sci, Ul Ordona 21, PL-01237 Warsaw, Poland
[2] Univ Warsaw, Inst Informat, PL-02097 Warsaw, Poland
[3] Univ Francois Rabelais Tours, Lab Dinformat, F-41000 Blois, France
Keywords
corpus annotation; National Corpus of Polish; shallow parsing; chunking; named entity recognition; NER
DOI
10.1504/IJDMMM.2013.053691
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The ongoing National Corpus of Polish project assumes several levels of linguistic annotation. We present the technical environment and methodological background developed for the three upper annotation levels: the levels of syntactic words, syntactic groups and named entities. We show how knowledge-based platforms Spejd and Sprout are used for the automatic pre-annotation of the corpus and discuss some particular problems faced during the preparation of the parser grammar, which contains over 1,000 rules and is one of the largest chunking grammars for Polish. We also show how the tree editor TrEd has been customised for manual post-editing of annotations and for further revision of discrepancies. Our XML format converters and customised archiving repository ensure an automatic data flow and efficient corpus file management. We discuss the inter-annotator agreement in the manually annotated data, and present the first results of a CRF classifier trained on these data.
Pages: 103-122
Page count: 20
Related papers
  • [1] Towards the Annotation of Named Entities in the National Corpus of Polish
    Savary, Agata
    Waszczuk, Jakub
    Przepiorkowski, Adam
    [J]. LREC 2010 - SEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2010
  • [2] Towards named entity annotation of Latvian National Library corpus
    Paikens, Peteris
    Auzina, Ilze
    Garkaje, Ginta
    Paegle, Madara
    [J]. HUMAN LANGUAGE TECHNOLOGIES: THE BALTIC PERSPECTIVE, 2012, 247 : 169 - 175
  • [3] The Design of Syntactic Annotation Levels in the National Corpus of Polish
    Glowinska, Katarzyna
    Przepiorkowski, Adam
    [J]. LREC 2010 - SEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2010 : 1816 - 1821
  • [4] Towards a double annotation of Named Entities
    Ehrmann, Maud
    Jacquet, Guillaume
    [J]. TRAITEMENT AUTOMATIQUE DES LANGUES, 2006, 47 (03) : 63 - 88
  • [5] Temporal Role Annotation for Named Entities
    Koutraki, Maria
    Bakhshandegan-Moghaddam, Farshad
    Sack, Harald
    [J]. PROCEEDINGS OF THE 14TH INTERNATIONAL CONFERENCE ON SEMANTIC SYSTEMS, 2018, 137 : 223 - 234
  • [6] Named Entities in Court: The MarineLives Corpus
    Ritze, Dominique
    Zirn, Caecilia
    Greenstreet, Colin
    Eckert, Kai
    Ponzetto, Simone Paolo
    [J]. LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2014
  • [7] TEI-friendly annotation scheme for medieval named entities: a case on a Spanish medieval corpus
    Alvarez-Mellado, Elena
    Diez-Platas, Maria Luisa
    Ruiz-Fabo, Pablo
    Bermudez, Helena
    Ros, Salvador
    Gonzalez-Blanco, Elena
    [J]. LANGUAGE RESOURCES AND EVALUATION, 2021, 55 (02) : 525 - 549
  • [8] TEI-friendly annotation scheme for medieval named entities: a case on a Spanish medieval corpus
    Elena Álvarez-Mellado
    María Luisa Díez-Platas
    Pablo Ruiz-Fabo
    Helena Bermúdez
    Salvador Ros
    Elena González-Blanco
    [J]. Language Resources and Evaluation, 2021, 55 : 525 - 549
  • [9] Search rules of annotation for the recognition of named entities
    Nouvel, Damien
    Antoine, Jean-Yves
    Friburger, Nathalie
    Soulet, Arnaud
    [J]. TRAITEMENT AUTOMATIQUE DES LANGUES, 2013, 54 (02) : 13 - 41
  • [10] Automatic Semantic Web Annotation of Named Entities
    Charton, Eric
    Gagnon, Michel
    Ozell, Benoit
    [J]. ADVANCES IN ARTIFICIAL INTELLIGENCE, 2011, 6657 : 74 - 85