A Dataset and Strong Baselines for Classification of Czech News Texts

Cited by: 0
Authors
Kydlicek, Hynek [1]
Libovicky, Jindrich [1]
Affiliation
[1] Charles Univ Prague, Inst Formal & Appl Linguist, Fac Math & Phys, Malostranske Nam 25, CR-11800 Prague, Czech Republic
Source
TEXT, SPEECH, AND DIALOGUE, TSD 2023 | 2023 / Vol. 14102
Keywords
News classification; NLP in Czech; News Dataset
DOI
10.1007/978-3-031-40498-6_4
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Pre-trained models for Czech Natural Language Processing are often evaluated on purely linguistic tasks (POS tagging, parsing, NER) and relatively simple classification tasks such as sentiment classification or article classification from a single news source. As an alternative, we present the CZEch NEws Classification dataset (CZE-NEC), one of the largest Czech classification datasets, composed of news articles from various sources spanning over twenty years, which allows a more rigorous evaluation of such models. We define four classification tasks: news source, news category, inferred author's gender, and day of the week. To verify the task difficulty, we conducted a human evaluation, which revealed that human performance lags behind strong machine-learning baselines built upon pre-trained transformer models. Furthermore, we show that language-specific pre-trained encoder analysis outperforms selected commercially available large-scale generative language models.
Pages: 33-44
Number of pages: 12