Authorship Attribution: A Comparative Study of Three Text Corpora and Three Languages

Cited by: 14
Author
Savoy, Jacques [1]
Affiliation
[1] Univ Neuchatel, Dept Comp Sci, CH-2000 Neuchatel, Switzerland
Keywords
DELTA;
DOI
10.1080/09296174.2012.659003
Chinese Library Classification number
H0 [Linguistics];
Subject classification codes
030303 ; 0501 ; 050102 ;
Abstract
The first objective of this paper is to carry out three experiments intended to evaluate authorship attribution methods on three test collections available in three different languages (English, French, and German). In the first, we represent and categorize 52 text excerpts written by nine authors and taken from 19th century English novels. In the second, we work with 44 segments from French novels written by eleven authors, mostly from the 19th century. In the third, we extract 59 German text excerpts from novels written by 15 authors and published mainly during the 19th and the beginning of the 20th century. The second objective is to analyse the performance differences obtained when using word types or lemmas as text representations. The third objective is to evaluate three authorship attribution schemes: the first uses principal component analysis (PCA), the second applies the Delta approach, and the third is a new authorship attribution method based on specific vocabulary. The specific vocabulary is computed for a given text (or author profile) by comparing it with the entire corpus. Based on this information, we show how a distance measure can be derived, and by means of the nearest neighbor approach we suggest a simple and efficient authorship attribution scheme. Based on the three test collections and using either word types or lemmas as features, we demonstrate that the suggested classification scheme performs better than the PCA method, and slightly better than the Delta approach.
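The Delta approach combined with nearest-neighbor attribution, as mentioned in the abstract, can be illustrated with a small sketch. It assumes that "Delta" refers to the commonly used Burrows-style formulation (the mean absolute difference of z-scored relative frequencies of the most frequent words); the abstract does not spell out the exact variant, feature count, or preprocessing used in the paper. The function names `relative_freqs` and `delta_attribution`, the parameter `n_features`, and the toy token lists are hypothetical and serve only as an illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in vocab]


def delta_attribution(profiles, query_tokens, n_features=50):
    """Assign query_tokens to the nearest author profile under a Delta-style distance.

    profiles: dict mapping author name -> list of tokens (word types or lemmas).
    Returns (best_author, distances). Illustrative sketch, not the paper's code.
    """
    # Features: the n_features most frequent words over all author profiles.
    corpus_counts = Counter()
    for tokens in profiles.values():
        corpus_counts.update(tokens)
    vocab = [w for w, _ in corpus_counts.most_common(n_features)]

    freqs = {a: relative_freqs(t, vocab) for a, t in profiles.items()}

    # Per-feature mean and standard deviation across the author profiles.
    columns = list(zip(*freqs.values()))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) for c in columns]

    # Skip features with zero variance; they cannot be z-scored.
    keep = [i for i, s in enumerate(sigmas) if s > 0]
    if not keep:
        raise ValueError("no feature varies across the author profiles")

    def z_scores(vec):
        return [(vec[i] - mus[i]) / sigmas[i] for i in keep]

    z_query = z_scores(relative_freqs(query_tokens, vocab))

    # Delta: mean absolute difference between z-score vectors.
    distances = {
        a: mean(abs(x - y) for x, y in zip(z_scores(f), z_query))
        for a, f in freqs.items()
    }
    # Nearest neighbor: the author with the smallest distance.
    best_author = min(distances, key=distances.get)
    return best_author, distances


# Toy data (hypothetical, not from the paper's collections).
profiles = {
    "A": "the the the the of a a".split(),
    "B": "of of of of the a and and".split(),
}
query = "the the the of a".split()
best, distances = delta_attribution(profiles, query, n_features=4)
print(best, distances)
```

In this sketch the same token lists could hold either word types or lemmas, which is how the two text representations compared in the abstract would plug into one distance computation.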
Pages: 132-161
Page count: 30