A Dataset and Strong Baselines for Classification of Czech News Texts

Cited by: 0
Authors
Kydlicek, Hynek [1]
Libovicky, Jindrich [1]
Affiliations
[1] Charles Univ Prague, Inst Formal & Appl Linguist, Fac Math & Phys, Malostranske Nam 25, CR-11800 Prague, Czech Republic
Source
TEXT, SPEECH, AND DIALOGUE, TSD 2023 | 2023 / Vol. 14102
Keywords
News classification; NLP in Czech; News Dataset;
DOI
10.1007/978-3-031-40498-6_4
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Pre-trained models for Czech Natural Language Processing are often evaluated on purely linguistic tasks (POS tagging, parsing, NER) and relatively simple classification tasks such as sentiment classification or article classification from a single news source. As an alternative, we present the CZEch NEws Classification dataset (CZE-NEC), one of the largest Czech classification datasets, composed of news articles from various sources spanning over twenty years, which allows a more rigorous evaluation of such models. We define four classification tasks: news source, news category, inferred author's gender, and day of the week. To verify the task difficulty, we conducted a human evaluation, which revealed that human performance lags behind strong machine-learning baselines built upon pre-trained transformer models. Furthermore, we show that language-specific pre-trained encoder analysis outperforms selected commercially available large-scale generative language models.
Pages: 33-44
Page count: 12