The Influence of Corpus Quality on Statistical Measurements on Language Resources

Cited by: 0
Authors
Eckart, Thomas [1]
Quasthoff, Uwe [1]
Goldhahn, Dirk [1]
Affiliations
[1] Univ Leipzig, Nat Language Proc Grp, D-04103 Leipzig, Germany
Keywords
Corpus quality; Standardization; Statistical evaluation
DOI
Not available
Chinese Library Classification (CLC) number
H0 [Linguistics]
Discipline classification codes
030303; 0501; 050102
Abstract
The quality of statistical measurements on corpora is strongly related to a strict definition of the measuring process and to corpus quality. When results are inspected multiple times, an exact measurement of previously specified parameters ensures compatibility of the measurements performed by different researchers on possibly different objects. Hence, comparing different values requires an exact description of the measuring process. To illustrate this relationship, the influence of different definitions of the concepts word and sentence is shown for several properties of large text corpora. It is also shown that corpus pre-processing strongly influences corpus size and quality. As an example, near-duplicate sentences are identified as a source of many statistical irregularities. The problem of strongly varying results holds especially for Web corpora, which involve a large set of pre-processing steps. Here, well-defined and language-independent pre-processing is indispensable for language comparison based on measured values. Conversely, irregularities found in such measurements are often a result of poor pre-processing, and therefore such measurements can help to improve corpus quality.
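The dependence of corpus statistics on the definition of "word" and on near-duplicate filtering can be illustrated with a minimal Python sketch. It is not the authors' method: the two tokenizers, the normalization key used to detect near-duplicates (lowercasing, mapping digits to 0, dropping punctuation), the function names, and the four-sentence toy corpus are all assumptions chosen only for illustration.

```python
import re
from collections import Counter


def tokens_whitespace(text):
    """Definition A (assumed): a word is a maximal run of non-whitespace characters."""
    return text.split()


def tokens_alphabetic(text):
    """Definition B (assumed): a word is a lowercased run of letters only."""
    return re.findall(r"[^\W\d_]+", text.lower())


def normalize(sentence):
    """Hypothetical near-duplicate key: lowercase, digits mapped to 0, punctuation dropped."""
    lowered = re.sub(r"\d", "0", sentence.lower())
    return " ".join(re.findall(r"[^\W_]+", lowered))


def drop_near_duplicates(sentences):
    """Keep only the first sentence seen for each normalized key."""
    seen, kept = set(), []
    for sentence in sentences:
        key = normalize(sentence)
        if key not in seen:
            seen.add(key)
            kept.append(sentence)
    return kept


def stats(sentences, tokenize):
    """Count running words (tokens) and distinct words (types) under a tokenizer."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    return {"tokens": sum(counts.values()), "types": len(counts)}


# Toy corpus (invented): two boilerplate lines differing only in a digit,
# and two sentences differing only in case and punctuation.
corpus = [
    "Page 1 of 20: Click here to read more.",
    "Page 2 of 20: Click here to read more.",
    "The weather in Leipzig is fine.",
    "the weather in Leipzig is fine",
]

if __name__ == "__main__":
    for name, tokenize in [("whitespace", tokens_whitespace),
                           ("alphabetic", tokens_alphabetic)]:
        print(name, "raw:", stats(corpus, tokenize))
        print(name, "deduplicated:", stats(drop_near_duplicates(corpus), tokenize))
```

On this toy input the token and type counts differ between the two word definitions, and both change again once near-duplicates are removed, which is why measured values are only comparable when the measuring process is described exactly.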
Pages: 2318-2321
Number of pages: 4
Related papers
50 items in total
  • [31] Improving Parallel Corpus Quality for Chinese-Vietnamese Statistical Machine Translation
    Tran H.-A.
    Guo Y.
    Jian P.
    Shi S.
    Huang H.
    Journal of Beijing Institute of Technology (English Edition), 2018, 27 (01): : 127 - 136
  • [32] Statistical analysis of orthographic and phonemic language corpus for word-based and phoneme-based Polish language modelling
    Klosowski, Piotr
    EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2017,
  • [33] Language Teaching Resources and Quality Assurance in Higher Technical Education
    Greculescu, A.
    Todorescu, L.
    NEW APPROACHES IN SOCIAL AND HUMANISTIC SCIENCES, 2016, : 249 - 253
  • [34] Statistical analysis of orthographic and phonemic language corpus for word-based and phoneme-based Polish language modelling
    Piotr Kłosowski
    EURASIP Journal on Audio, Speech, and Music Processing, 2017
  • [35] Votter Corpus: A Corpus of Social Polling Language
    Green, Nathan David
    Larasati, Septina Dian
    LREC 2014 - NINTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2014, : 3693 - 3697
  • [36] Corpus of the Georgian Language
    Doborjginidze, Nino
    Lobzhanidze, Irina
    PROCEEDINGS OF THE XVII EURALEX INTERNATIONAL CONGRESS: LEXICOGRAPHY AND LINGUISTIC DIVERSITY, 2016, : 328 - 334
  • [37] The Nisvai Corpus of Oral Narrative Practices from Malekula (Vanuatu) and its Associated Language Resources
    Aznar, Jocelyn
    Gala, Nuria
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 2649 - 2656
  • [38] Somali Information Retrieval Corpus: Bridging the Gap between Query Translation and Dedicated Language Resources
    Badel, Abdisalam Mahamed
    Zhong, Ting
    Tai, Wenxin
    Zhou, Fan
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 7463 - 7469
  • [39] Impact of training corpus size on the quality of different types of language models for Serbian
    Ostrogonac, Stevan
    Secujski, Milan
    Miskovic, Dragisa
    2012 20TH TELECOMMUNICATIONS FORUM (TELFOR), 2012, : 720 - 723
  • [40] LANGUAGE TECHNOLOGIES AND RESOURCES - NEW ADVANCES IN BULGARIAN LANGUAGE TEACHING (THE BULGARIAN LEXICAL SEMANTIC NET BULNET AND THE BULGARIAN NATIONAL CORPUS)
    Koeva, Svetla
    Leseva, Svetlozara
    Stoyanova, Ivelina
    Todorova, Maria
    BULGARSKI EZIK I LITERATURA-BULGARIAN LANGUAGE AND LITERATURE, 2016, 58 (04): : 377 - 393