Comparing web-crawled and traditional corpora

Cited by: 7

Authors:

Cvrcek, Vaclav [3]

Komrskova, Zuzana [3]

Lukes, David [3]

Poukarova, Petra [3]

Rehorkova, Anna [3]

Zasina, Adrian Jan [3]

Benko, Vladimir [1, 2]

Affiliations:

[1] Slovak Acad Sci, L Stur Inst Linguist, Bratislava, Slovakia

[2] Comenius Univ, UNESCO Chair Plurilingual & Multicultural Commun, Bratislava, Slovakia

[3] Charles Univ Prague, Fac Arts, Inst Czech Natl Corpus, Prague, Czech Republic

Source:

LANGUAGE RESOURCES AND EVALUATION | 2020 / Vol. 54 / Issue 03

Keywords:

Web corpus; Crawling; Register; Variation; Multi-dimensional analysis; Czech

DOI:

10.1007/s10579-020-09487-4

Chinese Library Classification (CLC) number:

TP39 [Computer applications]

Discipline classification codes:

081203; 0835

Abstract:

Using a multi-dimensional (MD) analysis of register variability, the study compares two corpora of Czech: Koditex, a "traditional" corpus carefully designed using various sources with rich metadata, and Araneum Bohemicum Maximum, a web-crawled corpus with an opportunistic composition representative of the "searchable" web. Both types of corpora are projected onto the space induced by the MD model, with the main objective being to find out whether they overlap in the linguistic variation they cover, or whether one introduces some specific variation which cannot be found in the other. We also document a crucial methodological point which has broader relevance for MD analyses in general, namely that texts have to be of similar lengths in order for their scores on the dimensions to be comparable. Results indicate that some traditional text categories, such as journalism or non-fiction, are characterized by language phenomena which are equally well covered by web-crawled data, though of course traditional corpora keep their edge in terms of the richness of the accompanying metadata. But overall, the range of variation in Koditex is broader as it contains texts which have no adequate substitute (i.e. texts with a comparable set of linguistic characteristics, regardless of their extratextual label) in data acquired through general-purpose web-crawling techniques. These include informal conversations, private correspondence, some types of fiction, but also user-generated content (comments on Facebook, forums etc.).


Pages: 713-745

Page count: 33
