Authorship Attribution on Short Texts in the Slovenian Language

Authors
Gabrovsek, Gregor [1]
Peer, Peter [1]
Emersic, Ziga [1]
Batagelj, Borut [1]
Affiliations
[1] Univ Ljubljana, Fac Comp & Informat Sci, SI-1000 Ljubljana, Slovenia
Source
APPLIED SCIENCES-BASEL | 2023 / Volume 13 / Issue 19
Keywords
authorship attribution; BERT model fine-tuning; dataset construction;
DOI
10.3390/app131910965
Chinese Library Classification number
O6 [Chemistry];
Discipline classification code
0703;
Abstract
The study investigates the task of authorship attribution on short texts in Slovenian using the BERT language model. Authorship attribution is the task of attributing a written text to its author, frequently using stylometry or computational techniques. We create five custom datasets for different numbers of included text authors and fine-tune two BERT models, SloBERTa and BERT Multilingual (mBERT), to evaluate their performance in closed-class and open-class problems with varying numbers of authors. Our models achieved an F1 score of approximately 0.95 when using the dataset with the comments of the top five users by the number of written comments. Training on datasets that include comments written by an increasing number of people results in models with a gradually decreasing F1 score. Including out-of-class comments in the evaluation decreases the F1 score by approximately 0.05. The study demonstrates the feasibility of using BERT models for authorship attribution in short texts in the Slovenian language.
Pages: 14