Czech news dataset for semantic textual similarity

Authors
Sido, Jakub [1, 2]
Sejak, Michal [1]
Prazak, Ondrej [1, 2]
Konopik, Miloslav [1, 2]
Moravec, Vaclav [3]
Affiliations
[1] NTIS New Technol Informat Soc, Plzen, Czech Republic
[2] Univ West Bohemia, Fac Appl Sci, Dept Comp Sci & Engn, Plzen, Czech Republic
[3] Charles Univ Prague, Fac Social Sci, Prague, Czech Republic
Keywords
Semantic textual similarity; BERT model; Czech dataset; Annotations
DOI
10.1007/s10579-024-09795-z
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
This paper describes a novel dataset consisting of sentences with two different semantic similarity annotations: with and without surrounding context. The data originate from the journalistic domain in the Czech language. The final dataset contains 138,556 human annotations divided into train and test sets. In total, 485 journalism students participated in the creation process. To increase the reliability of the test set, we compute the final annotations as an average of 9 individual annotation scores. We evaluate the dataset quality by measuring inter- and intra-annotator agreement. Besides agreement numbers, we provide detailed statistics of the collected dataset. We conclude our paper with a baseline experiment of building a system for predicting the semantic similarity of sentences. Due to the massive number of training annotations (116,956), the model significantly outperforms an average annotator (a Pearson correlation coefficient of 0.92 versus 0.86).
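The two computations named in the abstract, averaging several annotator scores into one gold score and comparing predictions against gold scores with Pearson's correlation coefficient, can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the toy scores, and the toy predictions are all assumptions made for the example, and the score range used here is arbitrary.

```python
from math import sqrt
from statistics import mean


def average_annotations(scores_per_pair):
    """Collapse each sentence pair's individual annotator scores into one gold score."""
    return [mean(scores) for scores in scores_per_pair]


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: three sentence pairs, nine annotator scores each.
toy_annotations = [
    [0, 1, 0, 0, 1, 0, 0, 1, 0],
    [3, 4, 3, 3, 4, 3, 4, 3, 3],
    [6, 5, 6, 6, 6, 5, 6, 6, 6],
]
gold = average_annotations(toy_annotations)

# Hypothetical model outputs for the same three pairs.
toy_predictions = [0.5, 3.2, 5.9]
r = pearson(gold, toy_predictions)
print(f"gold scores: {gold}")
print(f"Pearson r: {r:.3f}")
```

Averaging over several annotators reduces the influence of any single annotator's noise on the gold score, which is the stated motivation for using 9 scores per test item.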
Pages: 18