Annotation tools for syntax and named entities in the National Corpus of Polish

Cited by: 9

Authors:

Waszczuk, Jakub [1,2]

Glowinska, Katarzyna [1]

Savary, Agata [3]

Przepiorkowski, Adam [1,2]

Lenart, Michal [2]

Affiliations:

[1] Polish Acad Sci, Inst Comp Sci, Ul Ordona 21, PL-01237 Warsaw, Poland

[2] Univ Warsaw, Inst Informat, PL-02097 Warsaw, Poland

[3] Univ Francois Rabelais Tours, Lab Dinformat, F-41000 Blois, France

Source:

INTERNATIONAL JOURNAL OF DATA MINING MODELLING AND MANAGEMENT | 2013, Vol. 5, No. 2

Keywords:

corpus annotation; National Corpus of Polish; shallow parsing; chunking; named entity recognition; NER

DOI:

10.1504/IJDMMM.2013.053691

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

The ongoing National Corpus of Polish project assumes several levels of linguistic annotation. We present the technical environment and methodological background developed for the three upper annotation levels: the levels of syntactic words, syntactic groups and named entities. We show how knowledge-based platforms Spejd and Sprout are used for the automatic pre-annotation of the corpus and discuss some particular problems faced during the preparation of the parser grammar, which contains over 1,000 rules and is one of the largest chunking grammars for Polish. We also show how the tree editor TrEd has been customised for manual post-editing of annotations and for further revision of discrepancies. Our XML format converters and customised archiving repository ensure an automatic data flow and efficient corpus file management. We discuss the inter-annotator agreement in the manually annotated data, and present the first results of a CRF classifier trained on these data.


Pages: 103-122

Page count: 20

