A Dataset and Strong Baselines for Classification of Czech News Texts

Cited by: 0

Authors:

Kydlicek, Hynek [1]

Libovicky, Jindrich [1]

Affiliations:

[1] Charles Univ Prague, Inst Formal & Appl Linguist, Fac Math & Phys, Malostranske Nam 25, CR-11800 Prague, Czech Republic

Source:

TEXT, SPEECH, AND DIALOGUE, TSD 2023 | 2023 / Vol. 14102

Keywords:

News classification; NLP in Czech; News Dataset;

DOI:

10.1007/978-3-031-40498-6_4

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Pre-trained models for Czech Natural Language Processing are often evaluated on purely linguistic tasks (POS tagging, parsing, NER) and relatively simple classification tasks such as sentiment classification or article classification from a single news source. As an alternative, we present the CZEch NEws Classification dataset (CZE-NEC), one of the largest Czech classification datasets, composed of news articles from various sources spanning over twenty years, which allows a more rigorous evaluation of such models. We define four classification tasks: news source, news category, inferred author's gender, and day of the week. To verify the task difficulty, we conducted a human evaluation, which revealed that human performance lags behind strong machine-learning baselines built upon pre-trained transformer models. Furthermore, we show that language-specific pre-trained encoders outperform selected commercially available large-scale generative language models.


Pages: 33-44

Page count: 12
