Authorship Attribution in Portuguese Using Character N-grams

Cited by: 13
Authors
Markov, Ilia [1]
Baptista, Jorge [2, 3]
Pichardo-Lagunas, Obdulia [4]
Affiliations
[1] IPN, CIC, Av Juan de Dios Batiz S-N, Mexico City 07738, DF, Mexico
[2] Univ Algarve, FCHS, Campus Gambelas, P-8005139 Faro, Portugal
[3] INESC ID Lisboa L2F, Campus Gambelas, P-8005139 Faro, Portugal
[4] IPN, UPIITA, Av Inst Politecn Nacl 2580, Mexico City 07340, DF, Mexico
Keywords
authorship attribution; character n-grams; Portuguese; stylometry; computational linguistics; machine learning; language
DOI
10.12700/APH.14.3.2017.3.4
Chinese Library Classification (CLC) number
T [Industrial Technology]
Discipline classification code
08
Abstract
For the Authorship Attribution (AA) task, character n-grams are considered among the most predictive features. For English, it has also been shown that some types of character n-grams perform better than others. This paper tackles the AA task in Portuguese by examining the performance of different types of character n-grams and of various combinations of them. The paper also experiments with different feature representations and machine-learning algorithms. Moreover, it demonstrates that the performance of the character n-gram approach can be improved by fine-tuning the feature set and by appropriately selecting the length and type of character n-grams. This relatively simple and language-independent approach to the AA task outperforms both a bag-of-words baseline and other approaches evaluated on the same corpus.
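As a rough illustration of what a character n-gram feature looks like, the sketch below counts overlapping character trigrams and attributes a text to the author with the most similar n-gram profile. It is not the paper's method: the abstract mentions typed n-grams, several feature representations, and machine-learning classifiers, whereas this example swaps in a plain cosine-similarity nearest-profile rule. The function names (`char_ngrams`, `cosine`, `attribute`) and the toy Portuguese snippets are hypothetical and chosen only for illustration.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Return a Counter of overlapping character n-grams of length n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(text, profiles, n=3):
    """Return the author whose n-gram profile is most similar to text."""
    query = char_ngrams(text, n)
    return max(profiles, key=lambda author: cosine(query, profiles[author]))


# Toy, made-up training snippets (hypothetical, not from the paper's corpus).
training = {
    "author_a": "o gato dorme no sofá e o gato come peixe",
    "author_b": "a bolsa caiu hoje e os mercados fecharam em baixa",
}
profiles = {author: char_ngrams(text) for author, text in training.items()}

print(attribute("o gato come no sofá", profiles))
```

Changing the `n` argument is the simplest way to explore the n-gram length that the abstract reports as one of the tunable choices; a real experiment would need a proper corpus and a held-out evaluation.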
Pages: 59-78
Number of pages: 20