Introducing DeReKoGram: A Novel Frequency Dataset with Lemma and Part-of-Speech Information for German

Cited by: 2
Authors
Wolfer, Sascha [1]
Koplenig, Alexander [1]
Kupietz, Marc [1]
Mueller-Spitzer, Carolin [1]
Affiliations
[1] Leibniz Inst German Language IDS, D-68161 Mannheim, Germany
Keywords
language; n-grams; corpus frequency; dataset; German; vocabulary growth; EYE-MOVEMENTS; PREDICTABILITY; LENGTH
DOI
10.3390/data8110170
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset contains information based on a corpus of 43.2 billion tokens and is divided into 16 parts based on 16 corpus folds. We describe how the dataset was created and structured. By evaluating the distribution over the 16 folds, we show that it is possible to work with a subset of the folds in many use cases (e.g., to save computational resources). In a case study, we investigate the growth of vocabulary (as well as the number of hapax legomena) as an increasing number of folds are included in the analysis. We cross-combine this with the various cleaning stages of the dataset. We also give some guidance in the form of Python, R, and Stata markdown scripts on how to work with the resource.
Dataset: https://www.owid.de/plus/derekogram/ (along with information and sample code).
Dataset License: DeReKo license (non-commercial, academic).
Pages: 10