Comparing web-crawled and traditional corpora

Cited by: 7
Authors
Cvrcek, Vaclav [3]
Komrskova, Zuzana [3]
Lukes, David [3]
Poukarova, Petra [3]
Rehorkova, Anna [3]
Zasina, Adrian Jan [3]
Benko, Vladimir [1, 2]
Affiliations
[1] Slovak Acad Sci, L Stur Inst Linguist, Bratislava, Slovakia
[2] Comenius Univ, UNESCO Chair Plurilingual & Multicultural Commun, Bratislava, Slovakia
[3] Charles Univ Prague, Fac Arts, Inst Czech Natl Corpus, Prague, Czech Republic
Keywords
Web corpus; Crawling; Register; Variation; Multi-dimensional analysis; Czech
DOI
10.1007/s10579-020-09487-4
Chinese Library Classification
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Using a multi-dimensional (MD) analysis of register variability, the study compares two corpora of Czech: Koditex, a "traditional" corpus carefully designed using various sources with rich metadata, and Araneum Bohemicum Maximum, a web-crawled corpus with an opportunistic composition representative of the "searchable" web. Both types of corpora are projected onto the space induced by the MD model, with the main objective being to find out whether they overlap in the linguistic variation they cover, or whether one introduces some specific variation which cannot be found in the other. We also document a crucial methodological point which has broader relevance for MD analyses in general, namely that texts have to be of similar lengths in order for their scores on the dimensions to be comparable. Results indicate that some traditional text categories, such as journalism or non-fiction, are characterized by language phenomena which are equally well covered by web-crawled data, though of course traditional corpora keep their edge in terms of the richness of the accompanying metadata. But overall, the range of variation in Koditex is broader as it contains texts which have no adequate substitute (i.e. texts with a comparable set of linguistic characteristics, regardless of their extratextual label) in data acquired through general-purpose web-crawling techniques. These include informal conversations, private correspondence, some types of fiction, but also user-generated content (comments on Facebook, forums etc.).
Pages: 713-745
Page count: 33
Related papers (50 in total)
  • [1] Comparing web-crawled and traditional corpora
    Václav Cvrček
    Zuzana Komrsková
    David Lukeš
    Petra Poukarová
    Anna Řehořková
    Adrian Jan Zasina
    Vladimír Benko
    [J]. Language Resources and Evaluation, 2020, 54 : 713 - 745
  • [2] Human evaluation of web-crawled parallel corpora for machine translation
    Ramirez-Sanchez, Gema
    Banon, Marta
    Zaragoza-Bernabeu, Jaume
    Ortiz-Rojas, Sergio
    [J]. PROCEEDINGS OF THE 2ND WORKSHOP ON HUMAN EVALUATION OF NLP SYSTEMS (HUMEVAL 2022), 2022, : 32 - 41
  • [3] The Aranea Corpora Family: Ten plus Years of Processing Web-Crawled Data
    Benko, Vladimir
    [J]. TEXT, SPEECH, AND DIALOGUE, TSD 2024, PT I, 2024, 15048 : 55 - 70
  • [4] The WaCky wide web: a collection of very large linguistically processed web-crawled corpora
    Marco Baroni
    Silvia Bernardini
    Adriano Ferraresi
    Eros Zanchetta
    [J]. Language Resources and Evaluation, 2009, 43 : 209 - 226
  • [5] The WaCky wide web: a collection of very large linguistically processed web-crawled corpora
    Baroni, Marco
    Bernardini, Silvia
    Ferraresi, Adriano
    Zanchetta, Eros
    [J]. LANGUAGE RESOURCES AND EVALUATION, 2009, 43 (03) : 209 - 226
  • [6] In the melting pot of web-crawled texts: The challenges of extracting English words from Croatian corpora
    Colakovac, Jasmina Jelcic
    Borucinsky, Mirjana
    [J]. INTERNATIONAL JOURNAL OF APPLIED LINGUISTICS, 2024, 34 (01) : 166 - 182
  • [7] Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
    Kreutzer, Julia
    Caswell, Isaac
    Wang, Lisa
    Wahab, Ahsan
    van Esch, Daan
    Ulzii-Orshikh, Nasanbayar
    Tapo, Allahsera
    Subramani, Nishant
    Sokolov, Artem
    Sikasote, Claytone
    Setyawan, Monang
    Sarin, Supheakmungkol
    Samb, Sokhar
    Sagot, Benoit
    Rivera, Clara
    Rios, Annette
    Papadimitriou, Isabel
    Osei, Salomey
    Ortiz Suarez, Pedro
    Orife, Iroro
    Ogueji, Kelechi
    Rubungo, Andre Niyongabo
    Nguyen, Toan Q.
    Mueller, Mathias
    Mueller, Andre
    Hassan Muhammad, Shamsuddeen
    Muhammad, Nanda
    Mnyakeni, Ayanda
    Mirzakhalov, Jamshidbek
    Matangira, Tapiwanashe
    Leong, Colin
    Lawson, Nze
    Kudugunta, Sneha
    Jernite, Yacine
    Jenny, Mathias
    Firat, Orhan
    Dossou, Bonaventure F. P.
    Dlamini, Sakhile
    de Silva, Nisansa
    Balli, Sakine Cabuk
    Biderman, Stella
    Battisti, Alessia
    Baruwa, Ahmed
    Bapna, Ankur
    Baljekar, Pallavi
    Abebe Azime, Israel
    Awokoya, Ayodele
    Ataman, Duygu
    Ahia, Orevaoghene
    Ahia, Oghenefego
    [J]. TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2022, 10 : 50 - 72
  • [8] A Corpus-Based Study of the Concept of 'Luxury' Using Web-Crawled Corpora, enTenTen 2013 and ukWaC
    Kondo, Yukie
    [J]. CORPUS PRAGMATICS, 2019, 3 (01) : 1 - 20
  • [9] A Corpus-Based Study of the Concept of ‘Luxury’ Using Web-Crawled Corpora, enTenTen 2013 and ukWaC
    Yukie Kondo
    [J]. Corpus Pragmatics, 2019, 3 : 1 - 20
  • [10] Weakly Supervised Semantic Segmentation using Web-Crawled Videos
    Hong, Seunghoon
    Yeo, Donghun
    Kwak, Suha
    Lee, Honglak
    Han, Bohyung
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 2224 - 2232