A Dataset and Strong Baselines for Classification of Czech News Texts

Cited by: 0
Authors
Kydlicek, Hynek [1]
Libovicky, Jindrich [1]
Affiliations
[1] Charles Univ Prague, Inst Formal & Appl Linguist, Fac Math & Phys, Malostranske Nam 25, CR-11800 Prague, Czech Republic
Source
TEXT, SPEECH, AND DIALOGUE, TSD 2023 | 2023, vol. 14102
Keywords
News classification; NLP in Czech; News Dataset;
DOI
10.1007/978-3-031-40498-6_4
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Pre-trained models for Czech Natural Language Processing are often evaluated on purely linguistic tasks (POS tagging, parsing, NER) and relatively simple classification tasks such as sentiment classification or article classification from a single news source. As an alternative, we present the CZEch NEws Classification dataset (CZE-NEC), one of the largest Czech classification datasets, composed of news articles from various sources spanning over twenty years, which allows a more rigorous evaluation of such models. We define four classification tasks: news source, news category, inferred author's gender, and day of the week. To verify the task difficulty, we conducted a human evaluation, which revealed that human performance lags behind strong machine-learning baselines built upon pre-trained transformer models. Furthermore, we show that language-specific pre-trained encoders outperform selected commercially available large-scale generative language models.
Pages: 33-44
Number of pages: 12
Related papers
50 in total
  • [1] Czech-ing the News: Article Trustworthiness Dataset for Czech
    Boháček, Matyáš
    Bravanský, Michal
    Trhlík, Filip
    Moravec, Václav
    Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2023, : 96 - 109
  • [2] New Dataset and Strong Baselines for the Grammatical Error Correction of Russian
    Trinh, Viet Anh
    Rozovskaya, Alla
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 4103 - 4111
  • [3] Czech news dataset for semantic textual similarity
    Sido, Jakub
    Sejak, Michal
    Prazak, Ondrej
    Konopik, Miloslav
    Moravec, Vaclav
    LANGUAGE RESOURCES AND EVALUATION, 2024,
  • [4] Towards Non-IID image classification: A dataset and baselines
    He, Yue
    Shen, Zheyan
    Cui, Peng
    PATTERN RECOGNITION, 2021, 110
  • [5] Are Strong Baselines Enough? False News Detection with Machine Learning
    Aslan, Lara
    Ptaszynski, Michal
    Jauhiainen, Jukka
    FUTURE INTERNET, 2024, 16 (09)
  • [6] An open dataset of φ-OTDR events with two classification models as baselines
    Cao, Xiaomin
    Su, Yunsheng
    Jin, Zhiyan
    Yu, Kuanglu
    RESULTS IN OPTICS, 2024, 10
  • [7] Sentiment Classification of the Slovenian News Texts
    Bucar, Joze
    Povh, Janez
    Znidarsic, Martin
    PROCEEDINGS OF THE 9TH INTERNATIONAL CONFERENCE ON COMPUTER RECOGNITION SYSTEMS, CORES 2015, 2016, 403 : 777 - 787
  • [8] SumeCzech: Large Czech News-Based Summarization Dataset
    Straka, Milan
    Mediankin, Nikita
    Kocmi, Tom
    Zabokrtsky, Zdenek
    Hudecek, Vojtech
    Hajic, Jan
    PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 3488 - 3495
  • [9] RibSeg Dataset and Strong Point Cloud Baselines for Rib Segmentation from CT Scans
    Yang, Jiancheng
    Gu, Shixuan
    Wei, Donglai
    Pfister, Hanspeter
    Ni, Bingbing
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2021, PT I, 2021, 12901 : 611 - 621
  • [10] Classification and Detection of Emotions in Czech News Headlines
    Burget, Radim
    Smekal, Zdenek
    Karasek, Jan
    TSP 2010: 33RD INTERNATIONAL CONFERENCE ON TELECOMMUNICATIONS AND SIGNAL PROCESSING, 2010, : 64 - 68