Authorship Attribution on Short Texts in the Slovenian Language

Authors
Gabrovsek, Gregor [1 ]
Peer, Peter [1 ]
Emersic, Ziga [1 ]
Batagelj, Borut [1 ]
Affiliations
[1] Univ Ljubljana, Fac Comp & Informat Sci, SI-1000 Ljubljana, Slovenia
Source
APPLIED SCIENCES-BASEL | 2023 / Volume 13 / Issue 19
Keywords
authorship attribution; BERT model fine-tuning; dataset construction
DOI
10.3390/app131910965
Chinese Library Classification
O6 [Chemistry]
Discipline classification code
0703
Abstract
The study investigates the task of authorship attribution on short texts in Slovenian using the BERT language model. Authorship attribution is the task of attributing a written text to its author, frequently using stylometry or computational techniques. We create five custom datasets for different numbers of included text authors and fine-tune two BERT models, SloBERTa and BERT Multilingual (mBERT), to evaluate their performance in closed-class and open-class problems with varying numbers of authors. Our models achieved an F1 score of approximately 0.95 when using the dataset with the comments of the top five users by the number of written comments. Training on datasets that include comments written by an increasing number of people results in models with a gradually decreasing F1 score. Including out-of-class comments in the evaluation decreases the F1 score by approximately 0.05. The study demonstrates the feasibility of using BERT models for authorship attribution in short texts in the Slovenian language.
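The evaluation setup described in the abstract can be illustrated with a minimal sketch. The abstract does not state how F1 is averaged or how out-of-class comments are detected, so both choices here are assumptions: F1 is macro-averaged, and a comment is assigned to a hypothetical "unknown" class when the top predicted probability falls below a threshold. The probabilities and labels are made-up toy values, not results from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def predict_open_class(probs, authors, threshold, unknown="unknown"):
    """Return the most probable author, or `unknown` when confidence is low.

    Thresholding is an assumed rejection rule, not the paper's stated method.
    """
    best = max(range(len(probs)), key=probs.__getitem__)
    return authors[best] if probs[best] >= threshold else unknown


# Toy classifier outputs for four comments (hypothetical values).
authors = ["author_a", "author_b"]
probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.55, 0.45]]
y_true = ["author_a", "author_a", "author_b", "unknown"]

# Closed-class: only comments by known authors, no rejection (threshold 0).
closed_pred = [predict_open_class(p, authors, 0.0) for p in probs[:3]]
closed_f1 = macro_f1(y_true[:3], closed_pred)

# Open-class: the out-of-class comment is included and rejection is enabled.
open_pred = [predict_open_class(p, authors, 0.7) for p in probs]
open_f1 = macro_f1(y_true, open_pred)

print(f"closed-class macro F1: {closed_f1:.3f}")
print(f"open-class macro F1:   {open_f1:.3f}")
```

In this toy case the rejection threshold also discards one correctly attributed but low-confidence comment, which shows one way an open-class evaluation can score lower than a closed-class one.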
Pages: 14