From 0 to 10 million annotated words: part-of-speech tagging for Middle High German

Cited by: 2

Authors:

Schulz, Sarah [1]

Ketschik, Nora [2]

Affiliations:

[1] Univ Stuttgart, Inst Nat Language Proc IMS, Pfaffenwaldring 5B, D-70569 Stuttgart, Germany

[2] Univ Stuttgart, Inst Literary Studies ILW, Keplerstr 17, D-70174 Stuttgart, Germany

Source:

LANGUAGE RESOURCES AND EVALUATION | 2019 / Vol. 53 / Issue 04

Keywords:

Historical language; Part-of-speech tagging; Digital Humanities; Non-standard text processing; Middle High German;

DOI:

10.1007/s10579-019-09462-8

Chinese Library Classification number:

TP39 [Computer applications];

Discipline classification codes:

081203 ; 0835 ;

Abstract:

By building a part-of-speech (POS) tagger for Middle High German (MHG), we investigate strategies for dealing with a low-resource, diverse and non-standard language in the domain of natural language processing. We highlight various aspects such as the data quantity needed for training and the influence of data quality on tagger performance. Since the lack of annotated resources poses a problem for training a tagger, we exemplify how existing resources can be adapted fruitfully to serve as additional training data. The resulting POS model achieves a tagging accuracy of about 91% on a diverse test set representing the different genres, time periods and varieties of MHG. In order to verify its general applicability, we evaluate the performance on different genres, authors and varieties of MHG separately. We explore self-learning techniques, which offer the advantage that unannotated data can be utilized to improve tagging performance on specific subcorpora.

Citation

Pages: 837 - 863

Page count: 27

28 items in total

[1] From 0 to 10 million annotated words: part-of-speech tagging for Middle High German
Sarah Schulz
Nora Ketschik
Language Resources and Evaluation, 2019, 53 : 837 - 863
[2] High performance part-of-speech tagging of Bulgarian
Doychinova, V
Mihov, S
ARTIFICIAL INTELLIGENCE: METHODOLOGY, SYSTEMS, AND APPLICATIONS, PROCEEDINGS, 2004, 3192 : 246 - 255
[3] A hybrid part-of-speech tagger with annotated Kurdish corpus: advancements in POS tagging
Maulud, Dastan
Jacksi, Karwan
Ali, Ismael
DIGITAL SCHOLARSHIP IN THE HUMANITIES, 2023, 38 (04) : 1604 - 1612
[4] An automatic part-of-speech tagger for Middle Low German
Koleva, Mariya
Farasyn, Melissa
Desmet, Bart
Breitbarth, Anne
Hoste, Veronique
INTERNATIONAL JOURNAL OF CORPUS LINGUISTICS, 2017, 22 (01) : 107 - 140
[5] FOLK-Gold - A GOLD standard for Part-of-Speech Tagging of Spoken German
Westpfahl, Swantje
Schmidt, Thomas
LREC 2016 - TENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2016, : 1493 - 1499
[6] A TENGRAM method based part-of-speech tagging of multi-category words in Hindi language
Gupta, J. P.
Tayal, Devendra K.
Gupta, Arti
EXPERT SYSTEMS WITH APPLICATIONS, 2011, 38 (12) : 15084 - 15093
[7] Automatic measurement of propositional idea density from part-of-speech tagging
Cati Brown
Tony Snodgrass
Susan J. Kemper
Ruth Herman
Michael A. Covington
Behavior Research Methods, 2008, 40 : 540 - 545
[8] Automatic measurement of propositional idea density from part-of-speech tagging
Brown, Cati
Snodgrass, Tony
Kemper, Susan J.
Herman, Ruth
Covington, Michael A.
BEHAVIOR RESEARCH METHODS, 2008, 40 (02) : 540 - 545
[9] Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?
Manning, Christopher D.
COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, PT I, 2011, 6608 : 171 - 189
[10] Building a Corpus from Handwritten Picture Postcards: Transcription, Annotation and Part-of-Speech Tagging
Sugisaki, Kyoko
Wiedmer, Nicolas
Hausendorf, Heiko
PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 255 - 259
