Towards Improving Multiple Authorship Attribution of Source Code

Cited by: 1
Authors
Hao, Pengnan [1]
Li, Zhen [1]
Liu, Cui [1]
Wen, Yu [1]
Liu, Fanming [1]
Affiliations
[1] Hebei Univ, Dept Cyber Secur & Comp, Baoding, Hebei, Peoples R China
Keywords
Multiple authorship attribution; Siamese network; machine learning; IDENTIFICATION
DOI
10.1109/QRS57517.2022.00059
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Source code authorship attribution helps resolve copyright infringement disputes and detect plagiarism. However, most software projects are developed collaboratively, so it is necessary to study multiple authorship attribution. Existing methods are not reliable in this setting for two reasons: i) it is challenging to determine the boundaries between code written by different authors in a sample; ii) the code segments belonging to different authors in a sample are usually small or incomplete. This paper proposes a method to address these challenges. We first divide the code sample into individual lines, then use Siamese networks to integrate code lines with similar author styles into code segments. Finally, we use a path-based code representation and machine learning to identify the authors. Experimental results show that the method achieves an accuracy of 87.35% on a C/C++ dataset and 91.35% on a Java dataset, outperforming existing methods.
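The line-grouping step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: `toy_similarity` is a hypothetical placeholder standing in for a trained Siamese network's score, `group_lines` and `threshold` are invented names, and merging only neighbouring lines is a simplification that the abstract does not confirm. It is not the authors' implementation.

```python
from typing import Callable, List


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical placeholder for a trained Siamese network's score.

    Returns 1.0 when both lines share the same indentation style
    (tab-indented vs. anything else), otherwise 0.0.
    """
    return 1.0 if a.startswith("\t") == b.startswith("\t") else 0.0


def group_lines(
    lines: List[str],
    similarity: Callable[[str, str], float] = toy_similarity,
    threshold: float = 0.5,
) -> List[List[str]]:
    """Merge consecutive lines into one segment while the score between
    a line and the last line of the current segment meets the threshold."""
    segments: List[List[str]] = []
    for line in lines:
        if segments and similarity(segments[-1][-1], line) >= threshold:
            segments[-1].append(line)  # same apparent style: extend segment
        else:
            segments.append([line])  # style change: start a new segment
    return segments


sample = [
    "\tint total = 0;",
    "\ttotal += x;",
    "    int n = 0;",
    "    n++;",
]
segments = group_lines(sample)
print(len(segments))  # number of style-consistent segments found
```

In this sketch each resulting segment would then be passed to a separate author classifier; the path-based representation and the classifier themselves are not shown.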
Pages: 516-526
Number of pages: 11