From 0 to 10 million annotated words: part-of-speech tagging for Middle High German

Times cited: 2
Authors
Schulz, Sarah [1]
Ketschik, Nora [2]
Affiliations
[1] Univ Stuttgart, Inst Nat Language Proc IMS, Pfaffenwaldring 5B, D-70569 Stuttgart, Germany
[2] Univ Stuttgart, Inst Literary Studies ILW, Keplerstr 17, D-70174 Stuttgart, Germany
Keywords
Historical language; Part-of-speech tagging; Digital Humanities; Non-standard text processing; Middle High German
DOI
10.1007/s10579-019-09462-8
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
By building a part-of-speech (POS) tagger for Middle High German (MHG), we investigate strategies for dealing with a low-resource, diverse and non-standard language in the domain of natural language processing. We highlight various aspects such as the data quantity needed for training and the influence of data quality on tagger performance. Since the lack of annotated resources poses a problem for training a tagger, we exemplify how existing resources can be adapted fruitfully to serve as additional training data. The resulting POS model achieves a tagging accuracy of about 91% on a diverse test set representing the different genres, time periods and varieties of MHG. In order to verify its general applicability, we evaluate the performance on different genres, authors and varieties of MHG separately. We also explore self-learning techniques, which allow unannotated data to be used to improve tagging performance on specific subcorpora.
Pages: 837-863
Page count: 27