Language and Obfuscation Oblivious Source Code Authorship Attribution

Cited by: 6
Authors
Zafar, Sarim [1 ]
Sarwar, Muhammad Usman [1 ]
Salem, Saeed [1 ]
Malik, Muhammad Zubair [1 ]
Affiliations
[1] North Dakota State Univ, Dept Comp Sci, Fargo, ND 58105 USA
Source
IEEE ACCESS | 2020 / Vol. 8 / Issue 08
Keywords
Software engineering; natural language processing; artificial neural networks; IDENTIFICATION;
DOI
10.1109/ACCESS.2020.3034932
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Source Code Authorship Attribution can answer many interesting questions, such as: Who wrote the malicious source code? Is the source code plagiarized, and does it infringe on copyright? Source Code Authorship Attribution is done by observing distinctive patterns of style in source code whose author is unknown and comparing them with patterns learned from the source code of known authors. In this paper, we present an efficient approach to learn a novel representation using deep metric learning. Existing state-of-the-art approaches tokenize the source code and work at the keyword level, limiting the elements of style they can consider. Our approach uses the raw character stream of the source code. It can examine keywords as well as other stylistic features, such as variable naming conventions or the use of tabs vs. spaces, enabling us to learn a richer representation than keyword-based approaches. Our approach uses a character-level Convolutional Neural Network (CNN). We train the CNN to map the input character stream to a dense vector, so that source code written by the same author is mapped close together, while source code written by different programmers is mapped farther apart in the embedding space. We then feed these source code vectors into a K-nearest neighbor (KNN) classifier that uses Manhattan distance to perform authorship attribution. We validated our approach on the Google Code Jam (GCJ) dataset across three different programming languages. We prepare our large-scale dataset in such a way that it does not induce type-I error. Our approach is more scalable and efficient than existing methods. We were able to achieve an accuracy of 84.94% across 20,458 authors, which is more than twice the scale of any previous study, under a much more challenging setting.
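The final classification step described in the abstract (a KNN classifier over embedding vectors, using Manhattan distance) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `manhattan` and `knn_attribute`, the value of `k`, the author labels, and the 3-dimensional toy vectors are all assumptions made for illustration. In the paper the vectors would come from the trained character-level CNN, which is not reproduced here.

```python
from collections import Counter


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_attribute(query, labeled_embeddings, k=3):
    """Return the majority author among the k embeddings nearest to `query`.

    `labeled_embeddings` is a list of (vector, author) pairs. Both the name
    and the data layout are hypothetical, chosen only for this sketch.
    """
    ranked = sorted(labeled_embeddings, key=lambda pair: manhattan(query, pair[0]))
    votes = Counter(author for _, author in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up embeddings standing in for CNN outputs: files by the same
# author are assumed to lie close together in the embedding space.
known = [
    ([0.1, 0.9, 0.2], "alice"),
    ([0.2, 0.8, 0.1], "alice"),
    ([0.9, 0.1, 0.7], "bob"),
    ([0.8, 0.2, 0.9], "bob"),
]

predicted = knn_attribute([0.15, 0.85, 0.2], known, k=3)
print(predicted)
```

Because attribution reduces to a nearest-neighbor lookup, adding a new author in such a setup only requires embedding that author's files and appending them to the labeled set, without retraining a classifier head.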
Pages: 197581-197596
Page count: 16