The Influence of Corpus Quality on Statistical Measurements on Language Resources

Authors
Eckart, Thomas [1]
Quasthoff, Uwe [1]
Goldhahn, Dirk [1]
Affiliations
[1] Univ Leipzig, Nat Language Proc Grp, D-04103 Leipzig, Germany
Keywords
Corpus quality; Standardization; Statistical Evaluation
DOI
Not available
Chinese Library Classification (CLC) number
H0 [Linguistics]
Discipline classification codes
030303; 0501; 050102
Abstract
The quality of statistical measurements on corpora is strongly related to a strict definition of the measuring process and to corpus quality. In the case of multiple result inspections, an exact measurement of previously specified parameters ensures compatibility of the different measurements performed by different researchers on possibly different objects. Hence, the comparison of different values requires an exact description of the measuring process. To illustrate this correlation, the influence of different definitions for the concepts word and sentence is shown for several properties of large text corpora. It is also shown that corpus pre-processing strongly influences corpus size and quality as well. As an example, near-duplicate sentences are identified as a source of many statistical irregularities. The problem of strongly varying results especially holds for Web corpora with a large set of pre-processing steps. Here, a well-defined and language-independent pre-processing is indispensable for language comparison based on measured values. Conversely, irregularities found in such measurements are often a result of poor pre-processing, and therefore such measurements can help to improve corpus quality.
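The dependence on word definitions and near-duplicate sentences described in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: the two tokenizers, the digit-masking duplicate key, the function names, and the toy sentences are all assumptions chosen only to show how the same text yields different type counts under different definitions, and how near duplicates inflate corpus size.

```python
import re
from collections import Counter


def tokenize_whitespace(text):
    """Hypothetical definition A: a word is a run of non-whitespace characters."""
    return text.split()


def tokenize_alnum(text):
    """Hypothetical definition B: a word is a lowercased run of \\w characters
    (letters, digits, underscore); punctuation is dropped."""
    return re.findall(r"\w+", text.lower())


def corpus_stats(sentences, tokenize):
    """Count running words (tokens) and distinct words (types)."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    return {"tokens": sum(counts.values()), "types": len(counts)}


def drop_near_duplicates(sentences):
    """Keep the first sentence per normalized key. The key masks digit runs and
    ignores case and punctuation; this is an illustrative heuristic only."""
    seen = set()
    kept = []
    for s in sentences:
        key = re.sub(r"\d+", "0", s.lower())
        key = " ".join(re.findall(r"\w+", key))
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept


# Toy data (invented for illustration, not taken from the paper).
SENTENCES = [
    "The weather in Leipzig: 12 degrees, cloudy.",
    "The weather in Leipzig: 14 degrees, cloudy.",
    "The weather in Leipzig: 9 degrees, cloudy.",
    "Corpus quality matters, and corpus size matters too.",
]

if __name__ == "__main__":
    print("whitespace:", corpus_stats(SENTENCES, tokenize_whitespace))
    print("alnum:     ", corpus_stats(SENTENCES, tokenize_alnum))
    deduped = drop_near_duplicates(SENTENCES)
    print("deduped:   ", corpus_stats(deduped, tokenize_alnum))
```

On this toy input the two definitions agree on the token count but disagree on the number of types, and removing the templated weather sentences roughly halves the corpus. Values measured on different corpora are therefore only comparable when the word definition and the pre-processing steps are stated explicitly.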
Pages: 2318-2321
Page count: 4