A Dataset and Strong Baselines for Classification of Czech News Texts

Cited by: 0

Authors:

Kydlicek, Hynek [1]

Libovicky, Jindrich [1]

Affiliations:

[1] Charles Univ Prague, Inst Formal & Appl Linguist, Fac Math & Phys, Malostranske Nam 25, CR-11800 Prague, Czech Republic

Source:

TEXT, SPEECH, AND DIALOGUE, TSD 2023 | 2023 / Vol. 14102

Keywords:

News classification; NLP in Czech; News Dataset

DOI:

10.1007/978-3-031-40498-6_4

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Pre-trained models for Czech Natural Language Processing are often evaluated on purely linguistic tasks (POS tagging, parsing, NER) and relatively simple classification tasks such as sentiment classification or article classification from a single news source. As an alternative, we present the CZEch NEws Classification dataset (CZE-NEC), one of the largest Czech classification datasets, composed of news articles from various sources spanning over twenty years, which allows a more rigorous evaluation of such models. We define four classification tasks: news source, news category, inferred author's gender, and day of the week. To verify the task difficulty, we conducted a human evaluation, which revealed that human performance lags behind strong machine-learning baselines built upon pre-trained transformer models. Furthermore, we show that language-specific pre-trained encoder analysis outperforms selected commercially available large-scale generative language models.

Pages: 33-44

Page count: 12

