An efficient classification approach in imbalanced datasets for intrinsic plagiarism detection

Cited by: 8
Authors
Polydouri, Andrianna [1]
Vathi, Eleni [1]
Siolas, Georgios [1]
Stafylopatis, Andreas [1]
Affiliations
[1] Natl & Tech Univ Athens, Sch Elect & Comp Engn, Intelligent Syst Content & Interact Lab, Athens, Greece
Keywords
Intrinsic plagiarism detection; Stylometry; Supervised learning; Unbalanced training data; SMOTE; PAN Webis
DOI
10.1007/s12530-018-9232-1
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The ever-increasing volume of information brought about by the widespread use of computers and the web has made effective plagiarism detection methods a necessity. Plagiarism occurs in many settings and forms: in literature, in academic papers, even in programming code. Intrinsic plagiarism detection is the task of discovering plagiarized passages in a text document by identifying stylistic changes and inconsistencies within the document itself, given that no reference corpus is available. The main idea is to profile the style of the original author and mark the passages that differ significantly from it. In this work, we follow a supervised machine learning classification approach. We consider, for the first time, the imbalance of the data as a crucial parameter of the problem and experiment with various balancing techniques. In addition, we propose some novel stylistic features. We combine our features and imbalanced-dataset treatment with various classification methods. Our detection system is tested on the data corpora of the PAN Webis intrinsic plagiarism detection shared tasks. It is compared to the best-performing detection systems on these datasets and surpasses the best reported scores.
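The keywords name SMOTE as one of the balancing techniques. As a rough illustration of the general idea only (not the authors' implementation, whose details are not given in this record), the sketch below creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The function name `smote_sketch`, the two-dimensional feature vectors, and the feature meanings in the comments are all hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Simplified SMOTE-style oversampling: pick a minority sample, pick one of
    its k nearest minority neighbours, and place a new point at a random
    position on the line segment joining the two.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Nearest minority neighbours of `base`, excluding `base` itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(
            tuple(b + gap * (o - b) for b, o in zip(base, other))
        )
    return synthetic


# Hypothetical stylometric vectors: (mean word length, mean sentence length).
original_passages = [
    (4.1, 18.0), (4.3, 17.5), (4.0, 19.2), (4.2, 18.4),
    (4.4, 17.9), (4.1, 18.8), (4.2, 17.1), (4.3, 19.0),
]
plagiarized_passages = [(5.2, 25.0), (5.0, 24.0), (5.4, 26.5), (5.1, 23.5)]

# Oversample the minority (plagiarized) class until both classes match.
n_missing = len(original_passages) - len(plagiarized_passages)
synthetic_passages = smote_sketch(plagiarized_passages, n_missing)
balanced_minority = plagiarized_passages + synthetic_passages
```

Because each synthetic point lies on a segment between two real minority samples, the oversampled class stays inside the region the minority already occupies instead of duplicating existing points exactly. In practice, a maintained library implementation would usually be preferable to a hand-written sketch like this one.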
Pages: 503-515
Page count: 13
Source
Evolving Systems, 2020, 11: 503-515