Source Code Authorship Attribution Using Long Short-Term Memory Based Networks

Cited by: 44
Authors
Alsulami, Bander [1]
Dauber, Edwin [1]
Harang, Richard [2]
Mancoridis, Spiros [1]
Greenstadt, Rachel [1]
Affiliations
[1] Drexel Univ, Philadelphia, PA 19104 USA
[2] Sophos, Abingdon, Oxon, England
Keywords
Source code authorship attribution; Code stylometry; Long short-term memory; Abstract syntax tree; Security; Privacy; Backpropagation
DOI
10.1007/978-3-319-66402-6_6
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Machine learning approaches to source code authorship attribution attempt to find statistical regularities in human-generated source code that can identify the author or authors of that code. This has applications in plagiarism detection, intellectual property infringement, and post-incident forensics in computer security. The introduction of features derived from the Abstract Syntax Tree (AST) of source code has recently set new benchmarks in this area, significantly improving over previous work that relied on lexical and formatting features of program source code, which are easy to obfuscate. However, these AST-based approaches rely on hand-constructed features derived from such trees, and often include ancillary information such as function and variable names that may be obfuscated or manipulated. In this work, we provide novel contributions to AST-based source code authorship attribution using deep neural networks. We implement Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BiLSTM) models to automatically extract relevant features from the AST representation of programmers' source code. We show that our models can automatically learn efficient representations of AST-based features without needing the hand-constructed ancillary information used by previous methods. Our empirical study on multiple datasets with different programming languages shows that our proposed approach achieves state-of-the-art performance for source code authorship attribution on AST-based features, despite not leveraging information that was previously thought to be required for high-confidence classification.
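To make the idea of AST-based input without identifier names concrete, the following is a minimal sketch and not the authors' pipeline. It assumes Python source and the standard-library `ast` module; the helper names `node_type_sequence` and `encode` are hypothetical. It flattens a syntax tree into a depth-first sequence of node type names, discarding function and variable names, and maps that sequence to integer ids of the kind a sequence model such as an LSTM could consume.

```python
import ast


def node_type_sequence(source):
    """Return AST node type names in depth-first (pre-order) order.

    Identifier names are never recorded; only the node types are kept.
    """
    tree = ast.parse(source)
    sequence = []

    def visit(node):
        sequence.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)
    return sequence


def encode(sequence, vocab):
    """Map node type names to integer ids, growing vocab as needed.

    Ids start at 1 so that 0 stays free for padding (an assumption of this sketch).
    """
    return [vocab.setdefault(name, len(vocab) + 1) for name in sequence]


# Two snippets that differ only in function and variable names.
a = "def add(x, y):\n    return x + y\n"
b = "def plus(first, second):\n    return first + second\n"

seq_a = node_type_sequence(a)
seq_b = node_type_sequence(b)

vocab = {}
ids_a = encode(seq_a, vocab)
ids_b = encode(seq_b, vocab)
```

Because only node types are kept, renaming identifiers leaves the sequence unchanged, so `seq_a` and `seq_b` are identical here. A real attribution model would need many samples per author and a learned classifier on top of such sequences.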
Pages: 65-82
Number of pages: 18