The Influence of Corpus Quality on Statistical Measurements on Language Resources

Cited by: 0

Authors:

Eckart, Thomas [1]

Quasthoff, Uwe [1]

Goldhahn, Dirk [1]

Affiliation:

[1] Univ Leipzig, Nat Language Proc Grp, D-04103 Leipzig, Germany

Source:

LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2012

Keywords:

Corpus quality; Standardization; Statistical Evaluation;

DOI:

Not available

Chinese Library Classification number:

H0 [Linguistics];

Discipline classification codes:

030303 ; 0501 ; 050102 ;

Abstract:

The quality of statistical measurements on corpora is strongly related to a strict definition of the measuring process and to corpus quality. In the case of multiple result inspections, an exact measurement of previously specified parameters ensures compatibility of the different measurements performed by different researchers on possibly different objects. Hence, the comparison of different values requires an exact description of the measuring process. To illustrate this correlation, the influence of different definitions for the concepts word and sentence is shown for several properties of large text corpora. It is also shown that corpus pre-processing strongly influences corpus size and quality. As an example, near-duplicate sentences are identified as a source of many statistical irregularities. The problem of strongly varying results holds especially for Web corpora, which pass through a large set of pre-processing steps. Here, a well-defined and language-independent pre-processing is indispensable for language comparison based on measured values. Conversely, irregularities found in such measurements are often a result of poor pre-processing, and therefore such measurements can help to improve corpus quality.
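
The dependence on the definitions of word and sentence can be illustrated with a minimal Python sketch. It is not the authors' procedure: the sample text, the four definitions, and the normalization used to spot near-duplicate sentences are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import re
from collections import Counter

# Hypothetical sample text (not taken from the paper).
TEXT = "The U.S. economy grew. It didn't shrink, e.g. in 2011. The economy grew!"


def tokens_whitespace(text):
    """Word definition A: any whitespace-delimited string."""
    return text.split()


def tokens_alpha(text):
    """Word definition B: a maximal run of ASCII letters, lowercased."""
    return re.findall(r"[a-z]+", text.lower())


def sentences_naive(text):
    """Sentence definition A: a sentence ends at every '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]


def sentences_strict(text):
    """Sentence definition B: end mark, whitespace, then an uppercase letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)


def corpus_stats(tokens):
    """Token and type counts for one tokenization."""
    return {"tokens": len(tokens), "types": len(Counter(tokens))}


def dedupe_near(sentences):
    """Keep the first sentence of each group that is identical after
    lowercasing and collapsing non-alphanumeric characters (an assumed,
    deliberately simple notion of 'near duplicate')."""
    seen, kept = set(), []
    for s in sentences:
        key = re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept


if __name__ == "__main__":
    print(corpus_stats(tokens_whitespace(TEXT)))
    print(corpus_stats(tokens_alpha(TEXT)))
    print(len(sentences_naive(TEXT)), len(sentences_strict(TEXT)))
    print(dedupe_near(["Read more about it here.", "Read more about it here!"]))
```

On this sample the two word definitions give 13 and 15 tokens, and the two sentence definitions give 7 and 3 sentences, so values measured under different definitions are not directly comparable.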


Pages: 2318-2321

Page count: 4
