Czech news dataset for semantic textual similarity

Authors
Sido, Jakub [1, 2]
Sejak, Michal [1]
Prazak, Ondrej [1, 2]
Konopik, Miloslav [1, 2]
Moravec, Vaclav [3]
Affiliations
[1] NTIS New Technol Informat Soc, Plzen, Czech Republic
[2] Univ West Bohemia, Fac Appl Sci, Dept Comp Sci & Engn, Plzen, Czech Republic
[3] Charles Univ Prague, Fac Social Sci, Prague, Czech Republic
Keywords
Semantic textual similarity; BERT model; Czech dataset; Annotations
DOI
10.1007/s10579-024-09795-z
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
This paper describes a novel dataset consisting of sentences with two different semantic similarity annotations: with and without surrounding context. The data originate from the journalistic domain in the Czech language. The final dataset contains 138,556 human annotations divided into train and test sets. In total, 485 journalism students participated in the creation process. To increase the reliability of the test set, we compute the final annotations as an average of 9 individual annotation scores. We evaluate the dataset quality by measuring inter- and intra-annotator agreement. Besides agreement numbers, we provide detailed statistics of the collected dataset. We conclude our paper with a baseline experiment of building a system for predicting the semantic similarity of sentences. Due to the massive number of training annotations (116,956), the model significantly outperforms an average annotator (a Pearson correlation coefficient of 0.92 versus 0.86).
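The two computations the abstract mentions, averaging several annotator scores into one gold score per sentence pair and comparing predictions with the gold scores by Pearson's correlation coefficient, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the score values, and the use of 3 scores per pair (the paper uses 9 for the test set) are all assumptions made for the example.

```python
from math import sqrt
from statistics import mean


def aggregate_scores(annotations):
    """Average the individual annotator scores for each sentence pair."""
    return [mean(scores) for scores in annotations]


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: 4 sentence pairs, 3 annotator scores each.
# The values and the score scale are made up for illustration.
raw_annotations = [
    [5, 6, 5],
    [1, 0, 2],
    [3, 4, 3],
    [2, 2, 1],
]
gold = aggregate_scores(raw_annotations)

# Hypothetical model predictions for the same 4 pairs.
predictions = [5.1, 0.8, 3.6, 1.9]

print(f"gold scores: {gold}")
print(f"Pearson r:   {pearson(predictions, gold):.3f}")
```

Averaging several scores per pair reduces the influence of any single annotator's noise, which is why the abstract describes it as a way to make the test set more reliable.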
Pages: 18