Comparing web-crawled and traditional corpora

Cited by: 7
Authors
Cvrcek, Vaclav [3]
Komrskova, Zuzana [3]
Lukes, David [3]
Poukarova, Petra [3]
Rehorkova, Anna [3]
Zasina, Adrian Jan [3]
Benko, Vladimir [1, 2]
Affiliations
[1] Slovak Acad Sci, L Stur Inst Linguist, Bratislava, Slovakia
[2] Comenius Univ, UNESCO Chair Plurilingual & Multicultural Commun, Bratislava, Slovakia
[3] Charles Univ Prague, Fac Arts, Inst Czech Natl Corpus, Prague, Czech Republic
Keywords
Web corpus; Crawling; Register; Variation; Multi-dimensional analysis; Czech
DOI
10.1007/s10579-020-09487-4
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Using a multi-dimensional (MD) analysis of register variability, the study compares two corpora of Czech: Koditex, a "traditional" corpus carefully designed using various sources with rich metadata, and Araneum Bohemicum Maximum, a web-crawled corpus with an opportunistic composition representative of the "searchable" web. Both types of corpora are projected onto the space induced by the MD model, with the main objective being to find out whether they overlap in the linguistic variation they cover, or whether one introduces some specific variation which cannot be found in the other. We also document a crucial methodological point which has broader relevance for MD analyses in general, namely that texts have to be of similar lengths in order for their scores on the dimensions to be comparable. Results indicate that some traditional text categories, such as journalism or non-fiction, are characterized by language phenomena which are equally well covered by web-crawled data, though of course traditional corpora keep their edge in terms of the richness of the accompanying metadata. But overall, the range of variation in Koditex is broader as it contains texts which have no adequate substitute (i.e. texts with a comparable set of linguistic characteristics, regardless of their extratextual label) in data acquired through general-purpose web-crawling techniques. These include informal conversations, private correspondence, some types of fiction, but also user-generated content (comments on Facebook, forums etc.).
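The abstract's methodological point, that texts must be of similar length for their dimension scores to be comparable, can be illustrated with a minimal sketch. This is not the authors' procedure, which the abstract does not describe; the function names `split_into_chunks` and `rate_per_thousand`, the chunk size, and the word-list notion of a "feature" are all assumptions made here for illustration. The idea is simply to cut every text into equal-sized chunks before computing normalized feature rates.

```python
from typing import List, Set


def split_into_chunks(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of exactly chunk_size tokens.

    A trailing remainder shorter than chunk_size is dropped, so every
    returned chunk has the same length (an illustrative choice, not
    necessarily what the study did).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    n_full = len(tokens) // chunk_size
    return [tokens[i * chunk_size:(i + 1) * chunk_size] for i in range(n_full)]


def rate_per_thousand(chunk: List[str], feature_words: Set[str]) -> float:
    """Frequency of a hypothetical word-list feature per 1,000 tokens."""
    if not chunk:
        raise ValueError("chunk must not be empty")
    hits = sum(1 for tok in chunk if tok.lower() in feature_words)
    return 1000.0 * hits / len(chunk)
```

With chunks of equal size, feature rates from a long web page and a short conversation transcript are computed over the same amount of text, which is the precondition the abstract names for comparing scores on the dimensions.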
Pages: 713-745
Page count: 33