The Influence of Corpus Quality on Statistical Measurements on Language Resources

Cited by: 0

Authors:

Eckart, Thomas [1]

Quasthoff, Uwe [1]

Goldhahn, Dirk [1]

Affiliations:

[1] Univ Leipzig, Nat Language Proc Grp, D-04103 Leipzig, Germany

Source:

LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2012

Keywords:

Corpus quality; Standardization; Statistical Evaluation

DOI:

Not available

Chinese Library Classification (CLC) number:

H0 [Linguistics]

Discipline classification codes:

030303; 0501; 050102

Abstract:

The quality of statistical measurements on corpora is strongly related to a strict definition of the measuring process and to corpus quality. In the case of multiple result inspections, an exact measurement of previously specified parameters ensures compatibility of the different measurements performed by different researchers on possibly different objects. Hence, the comparison of different values requires an exact description of the measuring process. To illustrate this correlation, the influence of different definitions of the concepts word and sentence is shown for several properties of large text corpora. It is also shown that corpus pre-processing strongly influences corpus size and quality. As an example, near-duplicate sentences are identified as a source of many statistical irregularities. The problem of strongly varying results holds especially for Web corpora with a large set of pre-processing steps. Here, well-defined and language-independent pre-processing is indispensable for language comparison based on measured values. Conversely, irregularities found in such measurements are often a result of poor pre-processing, and such measurements can therefore help to improve corpus quality.


Pages: 2318-2321

Number of pages: 4
