A Ranking-Based Text Matching Approach for Plagiarism Detection

Cited by: 3
Authors
Kong, Leilei [1]
Han, Zhongyuan [1]
Qi, Haoliang [2]
Lu, Zhimao [3]
Affiliations
[1] Heilongjiang Inst Technol, Harbin, Heilongjiang, Peoples R China
[2] State Key Lab Digital Publishing Technol China, Harbin, Heilongjiang, Peoples R China
[3] Dalian Univ Technol, Dalian, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
plagiarism detection; plagiarism text matching; high-obfuscation plagiarism; ranking; Meteor; n-grams
DOI
10.1587/transfun.E101.A.799
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
This paper addresses text matching for plagiarism detection. The task is to identify the matching plagiarized segments in a pair consisting of a suspicious document and its plagiarism source document. To date, heuristic-based methods have mainly been used to solve this problem. However, these heuristics rely on expert experience and fail to integrate the additional features needed to detect highly obfuscated plagiarism matches. This paper proposes a statistical machine learning approach, the Ranking-based Text Matching Approach for Plagiarism Detection, to handle high-obfuscation plagiarism detection. Plagiarism text matching is formalized as a ranking problem, and a pairwise learning-to-rank algorithm is used to identify the most probable plagiarism matches for a given suspicious segment. In particular, the Meteor evaluation metrics from machine translation are incorporated into the proposed method to capture lexical and semantic text similarity. The proposed method is evaluated on the PAN12 and PAN13 text alignment corpora for plagiarism detection and compared with the methods that achieved the best performance in PAN12, PAN13 and PAN14. Experimental results show that the proposed method achieves statistically significantly better performance than the baseline methods on all twelve document collections, which belong to five different plagiarism categories. Notably, on the PAN12 Artificial-high Obfuscation sub-corpus and the PAN13 Summary Obfuscation sub-corpus, the PlagDet scores (the main evaluation metric) of the proposed method show relative improvements of 22% and 43%, respectively, over the baselines. Moreover, the proposed method is also more efficient than the baseline methods.
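To make the ranking formulation in the abstract concrete, the following is a minimal sketch of pairwise learning to rank for segment matching. It is not the authors' implementation: the n-gram Jaccard features are a simple stand-in for the Meteor-based features, the perceptron-style update is one of many possible pairwise learners, and all function names (`features`, `train_pairwise`, `rank`) and the toy sentences are hypothetical.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def features(suspicious: str, candidate: str) -> List[float]:
    # Assumption: unigram and bigram overlap stand in for the paper's
    # Meteor-based lexical and semantic similarity features.
    return [jaccard(ngrams(suspicious, n), ngrams(candidate, n)) for n in (1, 2)]


def train_pairwise(triples, epochs: int = 10, lr: float = 0.1) -> List[float]:
    """Learn weights from (suspicious, true_match, non_match) triples.

    Perceptron-style pairwise update: whenever the true match does not
    score strictly higher than the non-match, move the weights toward
    the feature difference of the pair.
    """
    w = [0.0, 0.0]
    for _ in range(epochs):
        for susp, pos, neg in triples:
            diff = [p - q for p, q in zip(features(susp, pos), features(susp, neg))]
            if sum(wi * di for wi, di in zip(w, diff)) <= 0:
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


def rank(w: List[float], suspicious: str, candidates: List[str]) -> List[str]:
    """Order candidate source segments by learned score, best first."""
    def score(c: str) -> float:
        return sum(wi * fi for wi, fi in zip(w, features(suspicious, c)))
    return sorted(candidates, key=score, reverse=True)


# Toy training data (hypothetical, for illustration only).
training = [(
    "the quick brown fox jumps over the lazy dog",
    "a quick brown fox jumps over a lazy dog",
    "stock markets closed higher on friday",
)]
weights = train_pairwise(training)
ranked = rank(weights, "the cat sat on the mat",
              ["dogs bark loudly at night", "a cat sat on a mat"])
```

In this sketch the top-ranked candidate for a suspicious segment would be reported as its most probable plagiarism match; a real system would add richer features and a threshold for rejecting segments that have no match at all.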
Pages: 799-810
Page count: 12
Related papers
50 in total
  • [31] Ranking-based evaluation of regression models
    Saharon Rosset
    Claudia Perlich
    Bianca Zadrozny
    Knowledge and Information Systems, 2007, 12 : 331 - 353
  • [32] Implementing Ranking-Based Semantics in ConArg
    Bistarelli, Stefano
    Faloci, Francesco
    Taticchi, Carlo
    2019 IEEE 31ST INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2019), 2019, : 1180 - 1187
  • [33] Ranking-based evaluation of regression models
    Rosset, S
    Perlich, C
    Zadrozny, B
    Fifth IEEE International Conference on Data Mining, Proceedings, 2005, : 370 - 377
  • [34] Neural Compatibility Ranking for Text-based Fashion Matching
    Chaidaroon, Suthee
    Fang, Yi
    Xie, Mix
    Magnani, Alessandro
    PROCEEDINGS OF THE 42ND INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '19), 2019, : 1229 - 1232
  • [35] An Improved SRL based Plagiarism Detection Technique using Sentence Ranking
    Paul, Merin
    Jamal, Sangeetha
    PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON INFORMATION AND COMMUNICATION TECHNOLOGIES, ICICT 2014, 2015, 46 : 223 - 230
  • [36] Differential Evolution With Ranking-Based Mutation Operators
    Gong, Wenyin
    Cai, Zhihua
    IEEE TRANSACTIONS ON CYBERNETICS, 2013, 43 (06) : 2066 - 2081
  • [37] Ranking-based instance selection for pattern classification
    Cavalcanti, George D. C.
    Soares, Rodolfo J. O.
    EXPERT SYSTEMS WITH APPLICATIONS, 2020, 150
  • [38] On the Characteristics of Ranking-based Gender Bias Measures
    Klasnja, Anja
    Arabzadeh, Negar
    Mehrvarz, Mahbod
    Bagheri, Ebrahim
    PROCEEDINGS OF THE 14TH ACM WEB SCIENCE CONFERENCE, WEBSCI 2022, 2022, : 245 - 249
  • [39] Viewpoints Using Ranking-Based Argumentation Semantics
    Yun, Bruno
    Vesic, Srdjan
    Croitoru, Madalina
    Bisquert, Pierre
    COMPUTATIONAL MODELS OF ARGUMENT (COMMA 2018), 2018, 305 : 381 - 392
  • [40] Proximity ranking-based multimodal differential evolution
    Zhang, Junna
    Chen, Degang
    Yang, Qiang
    Wang, Yiqiao
    Liu, Dong
    Jeon, Sang-Woon
    Zhang, Jun
    SWARM AND EVOLUTIONARY COMPUTATION, 2023, 78