Introducing DeReKoGram: A Novel Frequency Dataset with Lemma and Part-of-Speech Information for German

Cited by: 2
Authors
Wolfer, Sascha [1]
Koplenig, Alexander [1]
Kupietz, Marc [1]
Mueller-Spitzer, Carolin [1]
Affiliations
[1] Leibniz Inst German Language IDS, D-68161 Mannheim, Germany
Keywords
language; n-grams; corpus frequency; dataset; German; vocabulary growth; EYE-MOVEMENTS; PREDICTABILITY; LENGTH
DOI
10.3390/data8110170
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
We introduce DeReKoGram, a novel frequency dataset containing lemma and part-of-speech (POS) information for 1-, 2-, and 3-grams from the German Reference Corpus. The dataset contains information based on a corpus of 43.2 billion tokens and is divided into 16 parts based on 16 corpus folds. We describe how the dataset was created and structured. By evaluating the distribution over the 16 folds, we show that it is possible to work with a subset of the folds in many use cases (e.g., to save computational resources). In a case study, we investigate the growth of vocabulary (as well as the number of hapax legomena) as an increasing number of folds are included in the analysis. We cross-combine this with the various cleaning stages of the dataset. We also give some guidance in the form of Python, R, and Stata markdown scripts on how to work with the resource.
Dataset: https://www.owid.de/plus/derekogram/ (along with information and sample code).
Dataset License: DeReKo license (non-commercial, academic).
Pages: 10
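
The case study mentioned in the abstract tracks vocabulary size and the number of hapax legomena as more folds are included. The following is a minimal Python sketch of that kind of cumulative count, not the authors' scripts. It assumes each fold is already available as a mapping from word type to frequency; the function name `vocabulary_growth` and the toy data are hypothetical and do not reflect the actual DeReKoGram file format.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Tuple


def vocabulary_growth(
    folds: Iterable[Mapping[str, int]],
) -> List[Tuple[int, int, int]]:
    """Return (folds included, vocabulary size, hapax count) after each fold."""
    cumulative: Counter = Counter()
    growth = []
    for n_folds, fold_counts in enumerate(folds, start=1):
        # Add this fold's frequencies to the running totals.
        cumulative.update(fold_counts)
        vocab_size = len(cumulative)
        # A hapax legomenon is a type whose cumulative frequency is exactly 1.
        hapaxes = sum(1 for freq in cumulative.values() if freq == 1)
        growth.append((n_folds, vocab_size, hapaxes))
    return growth


# Toy data, invented for illustration only (not taken from DeReKoGram).
toy_folds = [
    {"haus": 3, "baum": 1, "gehen": 1},
    {"haus": 2, "baum": 1, "laufen": 1},
    {"gehen": 4, "laufen": 2, "wolke": 1},
]

for n_folds, vocab_size, hapaxes in vocabulary_growth(toy_folds):
    print(n_folds, vocab_size, hapaxes)
```

Note that a type counted as a hapax in an early fold can stop being one once later folds are merged, so the hapax count need not grow monotonically even though the vocabulary size does.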