Introducing DeReKoGram: A Novel Frequency Dataset with Lemma and Part-of-Speech Information for German

Cited by: 2
Authors
Wolfer, Sascha [1]
Koplenig, Alexander [1]
Kupietz, Marc [1]
Mueller-Spitzer, Carolin [1]
Affiliations
[1] Leibniz Institute for the German Language (IDS), D-68161 Mannheim, Germany
Keywords
language; n-grams; corpus frequency; dataset; German; vocabulary growth; eye movements; predictability; length
DOI
10.3390/data8110170
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset is based on a corpus of 43.2 billion tokens and is divided into 16 parts corresponding to 16 corpus folds. We describe how the dataset was created and structured. By evaluating the distribution over the 16 folds, we show that it is possible to work with a subset of the folds in many use cases (e.g., to save computational resources). In a case study, we investigate the growth of vocabulary (as well as the number of hapax legomena) as an increasing number of folds are included in the analysis. We cross-combine this with the various cleaning stages of the dataset. We also give some guidance in the form of Python, R, and Stata markdown scripts on how to work with the resource.
Dataset: https://www.owid.de/plus/derekogram/ (along with information and sample code).
Dataset License: DeReKo license (non-commercial, academic).
Pages: 10