Large-Scale and Language-Oblivious Code Authorship Identification

Cited by: 48
Authors
Abuhamad, Mohammed [1]
AbuHmed, Tamer [1]
Mohaisen, Aziz [2]
Nyang, DaeHun [1]
Affiliations
[1] Inha Univ, Incheon, South Korea
[2] Univ Cent Florida, Orlando, FL 32816 USA
Funding
National Research Foundation of Singapore;
Keywords
Code Authorship Identification; program features; deep learning identification; software forensics; ATTRIBUTION;
DOI
10.1145/3243734.3243738
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Efficient extraction of code authorship attributes is key for successful identification. However, the extraction of such attributes is very challenging, due to various programming language specifics, the limited number of available code samples per author, and the average code lines per file, among others. To this end, this work proposes a Deep Learning-based Code Authorship Identification System (DL-CAIS) for code authorship attribution that facilitates large-scale, language-oblivious, and obfuscation-resilient code authorship identification. The deep learning architecture adopted in this work includes TF-IDF-based deep representation using multiple Recurrent Neural Network (RNN) layers and fully-connected layers dedicated to authorship attribution learning. The deep representation then feeds into a random forest classifier for scalability to de-anonymize the author. Comprehensive experiments are conducted to evaluate DL-CAIS over the entire Google Code Jam (GCJ) dataset across all years (from 2008 to 2016) and over real-world code samples from 1987 public repositories on GitHub. The results of our work show high accuracy despite requiring a smaller number of files per author. Namely, we achieve an accuracy of 96% when experimenting with 1,600 authors for GCJ, and 94.38% for the real-world dataset for 745 C programmers. Our system also allows us to identify 8,903 authors, the largest-scale dataset used by far, with an accuracy of 92.3%. Moreover, our technique is resilient to language specifics, and thus it can identify authors across four programming languages (C, C++, Java, and Python), and authors writing in mixed languages (e.g., Java/C++, Python/C++). Finally, our system is resistant to sophisticated obfuscation (e.g., using C Tigress) with an accuracy of 93.42% for a set of 120 authors.
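The abstract names TF-IDF as the input representation but does not state the tokenization or the weighting variant used. As a rough illustration only, the sketch below computes TF-IDF vectors for a few code samples under two assumptions that are not from the paper: tokens are whitespace-separated, and the inverse document frequency is the unsmoothed log(N / df). The function names `tokenize` and `tf_idf` are hypothetical. The RNN layers and the random forest classifier described in the abstract are not shown.

```python
import math
from collections import Counter


def tokenize(source):
    # Assumption: whitespace tokenization. The abstract does not say
    # how DL-CAIS splits source code into terms.
    return source.split()


def tf_idf(samples):
    """Return (vocabulary, vectors) for a list of source-code strings.

    Weight of term t in sample d = (count of t in d / tokens in d)
                                   * log(N / number of samples containing t)
    This is one common unsmoothed TF-IDF variant, chosen for illustration.
    """
    docs = [Counter(tokenize(s)) for s in samples]
    n_docs = len(docs)

    # Document frequency: in how many samples each term occurs.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())

    vocab = sorted(doc_freq)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        row = []
        for term in vocab:
            tf = doc[term] / total if total else 0.0
            idf = math.log(n_docs / doc_freq[term])
            row.append(tf * idf)
        vectors.append(row)
    return vocab, vectors
```

With this variant, a term that occurs in every sample gets weight zero, so only terms that differ between samples contribute to a vector. In a pipeline like the one the abstract describes, vectors of this kind would be the input to the representation-learning stage, not the final features.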
Pages: 101-114
Number of pages: 14