The Influence of Corpus Quality on Statistical Measurements on Language Resources

Cited by: 0

Authors:

Eckart, Thomas [1]

Quasthoff, Uwe [1]

Goldhahn, Dirk [1]

Affiliations:

[1] University of Leipzig, Natural Language Processing Group, D-04103 Leipzig, Germany

Source:

LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2012

Keywords:

Corpus quality; Standardization; Statistical Evaluation

DOI:

Not available

Chinese Library Classification (CLC) number:

H0 [Linguistics]

Discipline classification codes:

030303; 0501; 050102

Abstract:

The quality of statistical measurements on corpora depends strongly on a strict definition of the measuring process and on corpus quality. When results are inspected more than once, an exact measurement of previously specified parameters ensures that measurements performed by different researchers, possibly on different objects, remain compatible. Hence, comparing different values requires an exact description of the measuring process. To illustrate this relationship, the influence of different definitions of the concepts word and sentence is shown for several properties of large text corpora. It is also shown that corpus pre-processing strongly influences both corpus size and corpus quality. As an example, near-duplicate sentences are identified as a source of many statistical irregularities. The problem of strongly varying results holds especially for Web corpora, which involve a large set of pre-processing steps. Here, well-defined and language-independent pre-processing is indispensable for language comparison based on measured values. Conversely, irregularities found in such measurements are often a result of poor pre-processing, and such measurements can therefore help to improve corpus quality.
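
The dependence on definitions described in the abstract can be illustrated with a small sketch. It is not the authors' procedure: the two word definitions, the normalization used to detect near-duplicate sentences, the toy corpus, and all function names (`tokens_whitespace`, `tokens_alpha`, `normalize`, `drop_near_duplicates`, `stats`) are assumptions made for illustration only.

```python
import re


def tokens_whitespace(text):
    # Assumed definition A: a word is any whitespace-delimited string,
    # so attached punctuation and case differences create distinct types.
    return text.split()


def tokens_alpha(text):
    # Assumed definition B: a word is a maximal run of letters, lowercased.
    # Digits, punctuation, and apostrophes act as separators.
    return re.findall(r"[^\W\d_]+", text.lower())


def normalize(sentence):
    # Crude near-duplicate key (an assumption, not the paper's method):
    # lowercase, mask digit runs, drop punctuation, collapse whitespace.
    s = re.sub(r"\d+", "0", sentence.lower())
    s = re.sub(r"[^\w\s]", "", s)
    return " ".join(s.split())


def drop_near_duplicates(sentences):
    # Keep the first sentence for each normalized key.
    seen, kept = set(), []
    for s in sentences:
        key = normalize(s)
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept


def stats(sentences, tokenize):
    # Basic corpus measurements under a given word definition.
    toks = [t for s in sentences for t in tokenize(s)]
    return {"sentences": len(sentences), "tokens": len(toks), "types": len(set(toks))}


# Toy corpus with typical Web boilerplate that differs only in a number.
corpus = [
    "Page 1 of 20: Read more...",
    "Page 2 of 20: Read more...",
    "The cat sat on the mat.",
    "The dog didn't bark.",
]

for name, tok in [("whitespace", tokens_whitespace), ("alpha", tokens_alpha)]:
    print(name, "raw:    ", stats(corpus, tok))
    print(name, "deduped:", stats(drop_near_duplicates(corpus), tok))
```

Even on four sentences, the token and type counts differ between the two word definitions and change again once near-duplicates are removed, which is why a measured value is only comparable when the definitions and pre-processing steps are reported alongside it.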

Pages: 2318-2321

Page count: 4
