Annotation tools for syntax and named entities in the National Corpus of Polish

Cited by: 9
Authors
Waszczuk, Jakub [1, 2]
Glowinska, Katarzyna [1]
Savary, Agata [3]
Przepiorkowski, Adam [1, 2]
Lenart, Michal [2]
Affiliations
[1] Polish Acad Sci, Inst Comp Sci, Ul Ordona 21, PL-01237 Warsaw, Poland
[2] Univ Warsaw, Inst Informat, PL-02097 Warsaw, Poland
[3] Univ Francois Rabelais Tours, Lab Dinformat, F-41000 Blois, France
Keywords
corpus annotation; National Corpus of Polish; shallow parsing; chunking; named entity recognition; NER
DOI
10.1504/IJDMMM.2013.053691
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The ongoing National Corpus of Polish project assumes several levels of linguistic annotation. We present the technical environment and methodological background developed for the three upper annotation levels: the levels of syntactic words, syntactic groups and named entities. We show how knowledge-based platforms Spejd and Sprout are used for the automatic pre-annotation of the corpus and discuss some particular problems faced during the preparation of the parser grammar, which contains over 1,000 rules and is one of the largest chunking grammars for Polish. We also show how the tree editor TrEd has been customised for manual post-editing of annotations and for further revision of discrepancies. Our XML format converters and customised archiving repository ensure an automatic data flow and efficient corpus file management. We discuss the inter-annotator agreement in the manually annotated data, and present the first results of a CRF classifier trained on these data.
Pages: 103-122
Page count: 20
Related papers
50 in total
  • [41] Towards the Integration of Synthetic SL Animation with Avatars into Corpus Annotation Tools
    Elliott, Ralph
    Bueno, Javier
    Kennaway, Richard
    Glauert, John
    [J]. LREC 2010 - SEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2010, : A84 - A87
  • [42] Comparing NERP-CRF with Publicly Available Portuguese Named Entities Recognition Tools
    do Amaral, Daniela O. F.
    Fonseca, Evandro
    Lopes, Lucelene
    Vieira, Renata
    [J]. COMPUTATIONAL PROCESSING OF THE PORTUGUESE LANGUAGE, 2014, 8775 : 244 - 249
  • [43] Reducing Human Effort in Named Entity Corpus Construction Based on Ensemble Learning and Annotation Categorization
    Lu, Tingming
    Zhu, Man
    Gao, Zhiqiang
    [J]. NATURAL LANGUAGE UNDERSTANDING AND INTELLIGENT APPLICATIONS (NLPCC 2016), 2016, 10102 : 263 - 274
  • [44] Thai Named Entity Corpus Annotation Scheme and Self Verification by BiLSTM-CNN-CRF
    Sornlertlamvanich, Virach
    Suriyachay, Kitiya
    Charoenporn, Thatsanee
    [J]. HUMAN LANGUAGE TECHNOLOGY: CHALLENGES FOR COMPUTER SCIENCE AND LINGUISTICS, LTC 2019, 2022, 13212 : 143 - 160
  • [45] COPIOUS: A gold standard corpus of named entities towards extracting species occurrence from biodiversity literature
    Nguyen, Nhung T. H.
    Gabud, Roselyn S.
    Ananiadou, Sophia
    [J]. BIODIVERSITY DATA JOURNAL, 2019, 7
  • [46] Domain-related Annotation of Polish Spoken Dialogue Corpus LUNA.PL
    Mykowiecka, Agnieszka
    Glowinska, Katarzyna
    Rabiega-Wisniewska, Joanna
    [J]. LREC 2010 - SEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2010, : 2097 - 2102
  • [47] National Corpus of the Tatar Language "Tugan Tel": Grammatical Annotation and Implementation
    Suleymanov, Dzhavdet
    Nevzorova, Olga
    Gatiatullin, Ayrat
    Gilmullin, Rinat
    Khakimov, Bulat
    [J]. CORPUS RESOURCES FOR DESCRIPTIVE AND APPLIED STUDIES. CURRENT CHALLENGES AND FUTURE DIRECTIONS: SELECTED PAPERS FROM THE 5TH INTERNATIONAL CONFERENCE ON CORPUS LINGUISTICS (CILC2013), 2013, 95 : 68 - 74
  • [48] DeIDNER Corpus: Annotation of Clinical Discharge Summary Notes for Named Entity Recognition Using BRAT Tool
    Syed, Mahanazuddin
    Al-Shukri, Shaymaa
    Syed, Shorabuddin
    Sexton, Kevin
    Greer, Melody L.
    Zozus, Meredith
    Bhattacharyya, Sudeepa
    Prior, Fred
    [J]. PUBLIC HEALTH AND INFORMATICS, PROCEEDINGS OF MIE 2021, 2021, 281 : 432 - 436
  • [49] Semi-Automatic Corpus Expansion and Extraction of Uyghur-Named Entities and Relations Based on a Hybrid Method
    Halike, Ayiguli
    Abiderexiti, Kahaerjiang
    Yibulayin, Tuergen
    [J]. INFORMATION, 2020, 11 (01)
  • [50] Assigning Wh-Questions to Verbal Arguments: Annotation Tools Evaluation and Corpus Building
    Duran, Magali Sanches
    Amancio, Marcelo Adriano
    Aluisio, Sandra Maria
    [J]. LREC 2010 - SEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2010, : 1445 - 1451