Annotation tools for syntax and named entities in the National Corpus of Polish

Cited by: 9
Authors
Waszczuk, Jakub [1, 2]
Glowinska, Katarzyna [1]
Savary, Agata [3]
Przepiorkowski, Adam [1, 2]
Lenart, Michal [2]
Affiliations
[1] Polish Acad Sci, Inst Comp Sci, Ul Ordona 21, PL-01237 Warsaw, Poland
[2] Univ Warsaw, Inst Informat, PL-02097 Warsaw, Poland
[3] Univ Francois Rabelais Tours, Lab Dinformat, F-41000 Blois, France
Keywords
corpus annotation; National Corpus of Polish; shallow parsing; chunking; named entity recognition; NER
DOI
10.1504/IJDMMM.2013.053691
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The ongoing National Corpus of Polish project assumes several levels of linguistic annotation. We present the technical environment and methodological background developed for the three upper annotation levels: the levels of syntactic words, syntactic groups and named entities. We show how knowledge-based platforms Spejd and Sprout are used for the automatic pre-annotation of the corpus and discuss some particular problems faced during the preparation of the parser grammar, which contains over 1,000 rules and is one of the largest chunking grammars for Polish. We also show how the tree editor TrEd has been customised for manual post-editing of annotations and for further revision of discrepancies. Our XML format converters and customised archiving repository ensure an automatic data flow and efficient corpus file management. We discuss the inter-annotator agreement in the manually annotated data, and present the first results of a CRF classifier trained on these data.
Pages: 103-122
Page count: 20