From 0 to 10 million annotated words: part-of-speech tagging for Middle High German

Cited by: 2
Authors
Schulz, Sarah [1]
Ketschik, Nora [2]
Affiliations
[1] Univ Stuttgart, Inst Nat Language Proc IMS, Pfaffenwaldring 5B, D-70569 Stuttgart, Germany
[2] Univ Stuttgart, Inst Literary Studies ILW, Keplerstr 17, D-70174 Stuttgart, Germany
Keywords
Historical language; Part-of-speech tagging; Digital Humanities; Non-standard text processing; Middle High German
DOI
10.1007/s10579-019-09462-8
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
By building a part-of-speech (POS) tagger for Middle High German (MHG), we investigate strategies for dealing with a low-resource, diverse and non-standard language in the domain of natural language processing. We highlight various aspects such as the data quantity needed for training and the influence of data quality on tagger performance. Since the lack of annotated resources poses a problem for training a tagger, we exemplify how existing resources can be adapted fruitfully to serve as additional training data. The resulting POS model achieves a tagging accuracy of about 91% on a diverse test set representing the different genres, time periods and varieties of MHG. To verify its general applicability, we evaluate the performance on different genres, authors and varieties of MHG separately. We explore self-learning techniques, which have the advantage that unannotated data can be used to improve tagging performance on specific subcorpora.
Pages: 837-863
Page count: 27
Related papers
28 items in total
  • [1] From 0 to 10 million annotated words: part-of-speech tagging for Middle High German
    Sarah Schulz
    Nora Ketschik
    Language Resources and Evaluation, 2019, 53 : 837 - 863
  • [2] High performance part-of-speech tagging of Bulgarian
    Doychinova, V
    Mihov, S
    ARTIFICIAL INTELLIGENCE: METHODOLOGY, SYSTEMS, AND APPLICATIONS, PROCEEDINGS, 2004, 3192 : 246 - 255
  • [3] A hybrid part-of-speech tagger with annotated Kurdish corpus: advancements in POS tagging
    Maulud, Dastan
    Jacksi, Karwan
    Ali, Ismael
    DIGITAL SCHOLARSHIP IN THE HUMANITIES, 2023, 38 (04) : 1604 - 1612
  • [4] An automatic part-of-speech tagger for Middle Low German
    Koleva, Mariya
    Farasyn, Melissa
    Desmet, Bart
    Breitbarth, Anne
    Hoste, Veronique
    INTERNATIONAL JOURNAL OF CORPUS LINGUISTICS, 2017, 22 (01) : 107 - 140
  • [5] FOLK-Gold - A GOLD standard for Part-of-Speech Tagging of Spoken German
    Westpfahl, Swantje
    Schmidt, Thomas
    LREC 2016 - TENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2016, : 1493 - 1499
  • [6] A TENGRAM method based part-of-speech tagging of multi-category words in Hindi language
    Gupta, J. P.
    Tayal, Devendra K.
    Gupta, Arti
    EXPERT SYSTEMS WITH APPLICATIONS, 2011, 38 (12) : 15084 - 15093
  • [7] Automatic measurement of propositional idea density from part-of-speech tagging
    Cati Brown
    Tony Snodgrass
    Susan J. Kemper
    Ruth Herman
    Michael A. Covington
    Behavior Research Methods, 2008, 40 : 540 - 545
  • [8] Automatic measurement of propositional idea density from part-of-speech tagging
    Brown, Cati
    Snodgrass, Tony
    Kemper, Susan J.
    Herman, Ruth
    Covington, Michael A.
    BEHAVIOR RESEARCH METHODS, 2008, 40 (02) : 540 - 545
  • [9] Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?
    Manning, Christopher D.
    COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, PT I, 2011, 6608 : 171 - 189
  • [10] Building a Corpus from Handwritten Picture Postcards: Transcription, Annotation and Part-of-Speech Tagging
    Sugisaki, Kyoko
    Wiedmer, Nicolas
    Hausendorf, Heiko
    PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 255 - 259