Czech news dataset for semantic textual similarity

Cited by: 0

Authors:

Sido, Jakub [1,2]

Sejak, Michal [1]

Prazak, Ondrej [1,2]

Konopik, Miloslav [1,2]

Moravec, Vaclav [3]

Affiliations:

[1] NTIS New Technol Informat Soc, Plzen, Czech Republic

[2] Univ West Bohemia, Fac Appl Sci, Dept Comp Sci & Engn, Plzen, Czech Republic

[3] Charles Univ Prague, Fac Social Sci, Prague, Czech Republic

Source:

LANGUAGE RESOURCES AND EVALUATION | 2024

Keywords:

Semantic textual similarity; BERT model; Czech dataset; Annotations

DOI:

10.1007/s10579-024-09795-z

Chinese Library Classification (CLC) number:

TP39 [Computer applications]

Discipline classification codes:

081203; 0835

Abstract:

This paper describes a novel dataset consisting of sentences with two different semantic similarity annotations: with and without surrounding context. The data originate from the journalistic domain in the Czech language. The final dataset contains 138,556 human annotations divided into train and test sets. In total, 485 journalism students participated in the creation process. To increase the reliability of the test set, we compute the final annotations as an average of 9 individual annotation scores. We evaluate the dataset quality by measuring inter- and intra-annotator agreement. Besides agreement numbers, we provide detailed statistics of the collected dataset. We conclude our paper with a baseline experiment of building a system for predicting the semantic similarity of sentences. Due to the massive number of training annotations (116,956), the model significantly outperforms an average annotator (a Pearson's correlation coefficient of 0.92 versus 0.86).


Pages: 18

50 items in total

[31] A proposal for annotation, semantic similarity and classification of textual documents
Nauer, Emmanuel
Napoli, Amedeo
ARTIFICIAL INTELLIGENCE: METHODOLOGY, SYSTEMS, AND APPLICATIONS, PROCEEDINGS, 2006, 4183 : 201 - 212
[32] Evaluating Multimodal Representations on Visual Semantic Textual Similarity
de Lacalle, Oier Lopez
Salaberria, Ander
Soroa, Aitor
Azkune, Gorka
Agirre, Eneko
ECAI 2020: 24TH EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, 325 : 1990 - 1997
[33] Calculation of Textual Similarity Using Semantic Relatedness Functions
Kairaldeen, Ammar Riadh
Ercan, Gonenc
COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING (CICLING 2015), PT II, 2015, 9042 : 516 - 524
[34] C-STS: Conditional Semantic Textual Similarity
Deshpande, Ameet
Jimenez, Carlos E.
Chen, Howard
Murahari, Vishvak
Graf, Victoria
Rajpurohit, Tanmay
Kalyan, Ashwin
Chen, Danqi
Narasimhan, Karthik
2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 5669 - 5690
[35] Exploiting Syntactic and Semantic Information for Textual Similarity Estimation
Luo, Jiajia
Shan, Hongtao
Zhang, Gaoyu
Yuan, George
Zhang, Shuyi
Yan, Fengting
Li, Zhiwei
MATHEMATICAL PROBLEMS IN ENGINEERING, 2021, 2021
[36] UESTS: An Unsupervised Ensemble Semantic Textual Similarity Method
Hassan, Basma
Abdelrahman, Samir E.
Bahgat, Reem
Farag, Ibrahim
IEEE ACCESS, 2019, 7 : 85462 - 85482
[37] A Combination of Enhanced WordNet and BERT for Semantic Textual Similarity
Ramaiah Institute of Technology, India
Unknown
ACM Int. Conf. Proc. Ser., (191-198):
[38] Fine-grained Semantic Textual Similarity for Serbian
Batanovic, Vuk
Cvetanovic, Milos
Nikolic, Bosko
PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 1370 - 1378
[39] A Dataset and Strong Baselines for Classification of Czech News Texts
Kydlicek, Hynek
Libovicky, Jindrich
TEXT, SPEECH, AND DIALOGUE, TSD 2023, 2023, 14102 : 33 - 44
[40] A semantic textual similarity measurement model based on the syntactic-semantic representation
Tang, Zhuo
Xiao, Qi
Zhu, Li
Li, Kenli
Li, Keqin
INTELLIGENT DATA ANALYSIS, 2019, 23 (04) : 933 - 950
