A Dataset and Strong Baselines for Classification of Czech News Texts

Cited by: 0
Authors
Kydlicek, Hynek [1]
Libovicky, Jindrich [1]
Affiliations
[1] Charles Univ Prague, Inst Formal & Appl Linguist, Fac Math & Phys, Malostranske Nam 25, CR-11800 Prague, Czech Republic
Source
TEXT, SPEECH, AND DIALOGUE, TSD 2023 | 2023 / Vol. 14102
Keywords
News classification; NLP in Czech; News Dataset;
DOI
10.1007/978-3-031-40498-6_4
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Pre-trained models for Czech Natural Language Processing are often evaluated on purely linguistic tasks (POS tagging, parsing, NER) and relatively simple classification tasks such as sentiment classification or article classification from a single news source. As an alternative, we present the CZEch NEws Classification dataset (CZE-NEC), one of the largest Czech classification datasets, composed of news articles from various sources spanning over twenty years, which allows a more rigorous evaluation of such models. We define four classification tasks: news source, news category, inferred author's gender, and day of the week. To verify the task difficulty, we conducted a human evaluation, which revealed that human performance lags behind strong machine-learning baselines built upon pre-trained transformer models. Furthermore, we show that language-specific pre-trained encoder analysis outperforms selected commercially available large-scale generative language models.
Pages: 33-44
Number of pages: 12