The Influence of Corpus Quality on Statistical Measurements on Language Resources

Authors
Eckart, Thomas [1]
Quasthoff, Uwe [1]
Goldhahn, Dirk [1]
Affiliation
[1] Univ Leipzig, Nat Language Proc Grp, D-04103 Leipzig, Germany
Keywords
Corpus quality; Standardization; Statistical evaluation
DOI
Not available
Chinese Library Classification number
H0 [Linguistics]
Subject classification codes
030303; 0501; 050102
Abstract
The quality of statistical measurements on corpora depends strongly on a strict definition of the measuring process and on corpus quality. When results are inspected multiple times, exact measurement of previously specified parameters ensures that measurements performed by different researchers on possibly different objects remain compatible. Hence, comparing different values requires an exact description of the measuring process. To illustrate this relationship, the influence of different definitions of the concepts word and sentence is shown for several properties of large text corpora. It is also shown that corpus pre-processing strongly influences corpus size and quality. As an example, near-duplicate sentences are identified as a source of many statistical irregularities. The problem of strongly varying results holds especially for Web corpora, which involve a large set of pre-processing steps. Here, well-defined and language-independent pre-processing is indispensable for language comparison based on measured values. Conversely, irregularities found in such measurements are often a result of poor pre-processing, and such measurements can therefore help to improve corpus quality.
Pages: 2318-2321
Page count: 4