Czech news dataset for semantic textual similarity

Authors
Sido, Jakub [1, 2]
Sejak, Michal [1]
Prazak, Ondrej [1, 2]
Konopik, Miloslav [1, 2]
Moravec, Vaclav [3]
Affiliations
[1] NTIS New Technol Informat Soc, Plzen, Czech Republic
[2] Univ West Bohemia, Fac Appl Sci, Dept Comp Sci & Engn, Plzen, Czech Republic
[3] Charles Univ Prague, Fac Social Sci, Prague, Czech Republic
Keywords
Semantic textual similarity; BERT model; Czech dataset; Annotations
DOI
10.1007/s10579-024-09795-z
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
This paper describes a novel dataset consisting of sentences with two different semantic similarity annotations: with and without surrounding context. The data originate from the journalistic domain in the Czech language. The final dataset contains 138,556 human annotations divided into train and test sets. In total, 485 journalism students participated in the creation process. To increase the reliability of the test set, we compute the final annotations as an average of 9 individual annotation scores. We evaluate the dataset quality by measuring inter- and intra-annotator agreement. Besides agreement numbers, we provide detailed statistics of the collected dataset. We conclude our paper with a baseline experiment of building a system for predicting the semantic similarity of sentences. Due to the massive number of training annotations (116,956), the model significantly outperforms an average annotator (Pearson's correlation coefficient of 0.92 versus 0.86).
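The two computations named in the abstract, averaging several annotator scores per sentence pair and comparing predictions to those averages with Pearson's correlation coefficient, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy scores, the score scale, and the model outputs are all hypothetical.

```python
from math import sqrt


def aggregate_scores(annotations):
    """Average the individual annotator scores collected for each sentence pair."""
    return [sum(scores) / len(scores) for scores in annotations]


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical toy data: 3 sentence pairs with 9 annotator scores each.
    # The score scale used here is an assumption, not taken from the paper.
    toy_annotations = [
        [0, 1, 0, 0, 1, 0, 0, 0, 1],
        [3, 4, 3, 3, 2, 4, 3, 3, 3],
        [6, 5, 6, 6, 6, 5, 6, 6, 6],
    ]
    gold = aggregate_scores(toy_annotations)
    model_predictions = [0.5, 3.0, 5.5]  # made-up model outputs
    print(pearson(gold, model_predictions))
```

Averaging several scores per pair reduces the influence of any single annotator, which is the stated motivation for using 9 scores in the test set.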
Pages: 18
Related papers
50 in total
  • [31] A proposal for annotation, semantic similarity and classification of textual documents
    Nauer, Emmanuel
    Napoli, Amedeo
    ARTIFICIAL INTELLIGENCE: METHODOLOGY, SYSTEMS, AND APPLICATIONS, PROCEEDINGS, 2006, 4183 : 201 - 212
  • [32] Evaluating Multimodal Representations on Visual Semantic Textual Similarity
    de Lacalle, Oier Lopez
    Salaberria, Ander
    Soroa, Aitor
    Azkune, Gorka
    Agirre, Eneko
    ECAI 2020: 24TH EUROPEAN CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, 325 : 1990 - 1997
  • [33] Calculation of Textual Similarity Using Semantic Relatedness Functions
    Kairaldeen, Ammar Riadh
    Ercan, Gonenc
    COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING (CICLING 2015), PT II, 2015, 9042 : 516 - 524
  • [34] C-STS: Conditional Semantic Textual Similarity
    Deshpande, Ameet
    Jimenez, Carlos E.
    Chen, Howard
    Murahari, Vishvak
    Graf, Victoria
    Rajpurohit, Tanmay
    Kalyan, Ashwin
    Chen, Danqi
    Narasimhan, Karthik
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 5669 - 5690
  • [35] Exploiting Syntactic and Semantic Information for Textual Similarity Estimation
    Luo, Jiajia
    Shan, Hongtao
    Zhang, Gaoyu
    Yuan, George
    Zhang, Shuyi
    Yan, Fengting
    Li, Zhiwei
    MATHEMATICAL PROBLEMS IN ENGINEERING, 2021, 2021
  • [36] UESTS: An Unsupervised Ensemble Semantic Textual Similarity Method
    Hassan, Basma
    Abdelrahman, Samir E.
    Bahgat, Reem
    Farag, Ibrahim
    IEEE ACCESS, 2019, 7 : 85462 - 85482
  • [37] A Combination of Enhanced WordNet and BERT for Semantic Textual Similarity
    Ramaiah Institute of Technology, India
    Unknown
    ACM Int. Conf. Proc. Ser., (191-198):
  • [38] Fine-grained Semantic Textual Similarity for Serbian
    Batanovic, Vuk
    Cvetanovic, Milos
    Nikolic, Bosko
    PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 1370 - 1378
  • [39] A Dataset and Strong Baselines for Classification of Czech News Texts
    Kydlicek, Hynek
    Libovicky, Jindrich
    TEXT, SPEECH, AND DIALOGUE, TSD 2023, 2023, 14102 : 33 - 44
  • [40] A semantic textual similarity measurement model based on the syntactic-semantic representation
    Tang, Zhuo
    Xiao, Qi
    Zhu, Li
    Li, Kenli
    Li, Keqin
    INTELLIGENT DATA ANALYSIS, 2019, 23 (04) : 933 - 950