Authorship Attribution on Short Texts in the Slovenian Language

Authors
Gabrovsek, Gregor [1]
Peer, Peter [1]
Emersic, Ziga [1]
Batagelj, Borut [1]
Affiliations
[1] Univ Ljubljana, Fac Comp & Informat Sci, SI-1000 Ljubljana, Slovenia
Source
APPLIED SCIENCES-BASEL | 2023 / Vol. 13 / Issue 19
Keywords
authorship attribution; BERT model fine-tuning; dataset construction
DOI
10.3390/app131910965
Chinese Library Classification
O6 [Chemistry]
Discipline classification code
0703
Abstract
The study investigates the task of authorship attribution on short texts in Slovenian using the BERT language model. Authorship attribution is the task of attributing a written text to its author, frequently using stylometry or computational techniques. We create five custom datasets for different numbers of included text authors and fine-tune two BERT models, SloBERTa and BERT Multilingual (mBERT), to evaluate their performance in closed-class and open-class problems with varying numbers of authors. Our models achieved an F1 score of approximately 0.95 when using the dataset with the comments of the top five users by the number of written comments. Training on datasets that include comments written by an increasing number of people results in models with a gradually decreasing F1 score. Including out-of-class comments in the evaluation decreases the F1 score by approximately 0.05. The study demonstrates the feasibility of using BERT models for authorship attribution in short texts in the Slovenian language.
Pages: 14