The Influence of Corpus Quality on Statistical Measurements on Language Resources

Authors
Eckart, Thomas [1]
Quasthoff, Uwe [1]
Goldhahn, Dirk [1]
Affiliations
[1] Univ Leipzig, Nat Language Proc Grp, D-04103 Leipzig, Germany
Keywords
Corpus quality; Standardization; Statistical evaluation
DOI
Not available
Chinese Library Classification
H0 [Linguistics]
Discipline classification codes
030303; 0501; 050102
Abstract
The quality of statistical measurements on corpora depends strongly on a strict definition of the measuring process and on corpus quality. When results are inspected repeatedly, exact measurement of previously specified parameters ensures that measurements performed by different researchers, possibly on different objects, remain compatible. Comparing different values therefore requires an exact description of the measuring process. To illustrate this dependence, the influence of different definitions of the concepts word and sentence is shown for several properties of large text corpora. It is also shown that corpus pre-processing strongly influences both corpus size and corpus quality. For example, near-duplicate sentences are identified as a source of many statistical irregularities. The problem of strongly varying results holds especially for Web corpora, which require a large set of pre-processing steps. Here, well-defined and language-independent pre-processing is indispensable for comparing languages on the basis of measured values. Conversely, irregularities found in such measurements are often the result of poor pre-processing, so such measurements can help to improve corpus quality.
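The dependence described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the two word definitions, the normalization used to detect near duplicates, and all function names and sample sentences are assumptions chosen for illustration only.

```python
import re
from collections import Counter

def tokens_whitespace(text):
    """Assumed definition A: a word is any whitespace-delimited string."""
    return text.split()

def tokens_alpha(text):
    """Assumed definition B: a word is a maximal run of letters, lowercased."""
    return re.findall(r"[^\W\d_]+", text.lower())

def corpus_stats(sentences, tokenize):
    """Count tokens and distinct types under a given word definition."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    return {"tokens": sum(counts.values()), "types": len(counts)}

def near_duplicate_key(sentence):
    """Hypothetical normalization: lowercase, map digit runs to 0,
    strip punctuation, collapse whitespace."""
    s = sentence.lower()
    s = re.sub(r"\d+", "0", s)
    s = re.sub(r"[^\w\s]", "", s)
    return " ".join(s.split())

def drop_near_duplicates(sentences):
    """Keep the first sentence for each normalized key, preserving order."""
    seen = set()
    kept = []
    for s in sentences:
        key = near_duplicate_key(s)
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept

# Invented sample sentences, not data from the paper.
SAMPLE = [
    "The price rose by 3 percent.",
    "The price rose by 5 percent.",
    "Pre-processing matters; the price matters.",
]

if __name__ == "__main__":
    print(corpus_stats(SAMPLE, tokens_whitespace))
    print(corpus_stats(SAMPLE, tokens_alpha))
    print(drop_near_duplicates(SAMPLE))
```

On the same three sentences the two word definitions yield different token and type counts, and the first two sentences collapse into one near duplicate, which is why a reported value is only comparable when the definitions and pre-processing steps behind it are stated.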
Pages: 2318-2321
Number of pages: 4