Detecting code vulnerabilities by learning from large-scale open source repositories

Cited by: 5
Authors
Xu, Rongze [1]
Tang, Zhanyong [1]
Ye, Guixin [1]
Wang, Huanting [1]
Ke, Xin [1]
Fang, Dingyi [1]
Wang, Zheng [2]
Affiliations
[1] Northwest Univ, Xian, Peoples R China
[2] Univ Leeds, Leeds, England
Funding
National Natural Science Foundation of China;
Keywords
Code vulnerability detection; Deep learning; Attention mechanism; Software vulnerability; LSTM;
DOI
10.1016/j.jisa.2022.103293
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Machine learning methods are widely used to identify common, repeatedly occurring bugs and code vulnerabilities. The performance of a machine-learned model is bounded by the quality and quantity of its training data and by the model's capability to extract and capture the essential information of the problem domain. Unfortunately, there is a shortage of high-quality samples for training code vulnerability detection models, and existing machine learning methods are inadequate at capturing code vulnerability patterns. We present DEVELOPER, a novel learning framework for building code vulnerability detection models. To address the data scarcity challenge, DEVELOPER automatically gathers training samples from open-source projects and applies constraint rules to the collected data to filter out noisy data and improve the quality of the collected samples. The collected data provides many real-world vulnerable code training samples to complement the samples available in standard vulnerability databases. To build an effective code vulnerability detection model, DEVELOPER employs a convolutional neural network architecture with attention mechanisms to extract a code representation from the program's abstract syntax tree. The extracted program representation is then fed to a downstream network - a bidirectional long short-term memory architecture - to predict whether the target code contains a vulnerability. We apply DEVELOPER to identify vulnerabilities at the program source-code level. Our evaluation shows that DEVELOPER outperforms state-of-the-art methods by uncovering more vulnerabilities with a lower false-positive rate.
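The abstract states that the model's input is derived from the program's abstract syntax tree but does not describe the preprocessing. As a rough illustration only, the sketch below flattens an AST into a fixed-length sequence of integer ids, the kind of input a convolutional or recurrent network could consume. Everything here is an assumption for illustration: it uses Python's standard `ast` module as a stand-in parser (the abstract does not name a target language or parser), and the helper names `ast_node_sequence`, `build_vocab`, and `encode` are hypothetical, not part of DEVELOPER.

```python
import ast
from typing import Dict, List


def ast_node_sequence(source: str) -> List[str]:
    """Return AST node type names in pre-order (hypothetical helper).

    Assumption: Python's own parser stands in for whatever parser
    the original framework uses.
    """
    sequence: List[str] = []

    def visit(node: ast.AST) -> None:
        sequence.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return sequence


def build_vocab(sequences: List[List[str]]) -> Dict[str, int]:
    """Map each node type name to an integer id; 0 is padding, 1 is unknown."""
    vocab: Dict[str, int] = {"<pad>": 0, "<unk>": 1}
    for seq in sequences:
        for token in seq:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(seq: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Truncate or right-pad a node sequence to max_len integer ids."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in seq[:max_len]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


if __name__ == "__main__":
    sample = "def copy(buf, data):\n    buf[0:len(data)] = data\n"
    nodes = ast_node_sequence(sample)
    vocab = build_vocab([nodes])
    print(nodes[:5])
    print(encode(nodes, vocab, max_len=32))
```

The encoded id sequences would then be passed through an embedding layer into the feature-extraction network; that part is omitted here because the abstract gives no layer sizes or hyperparameters.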
Pages: 14
Related papers
50 in total
  • [41] Special issue on towards advancements in machine learning for exploiting large-scale and heterogeneous repositories
    Sajid Anwar
    Álvaro Rocha
    Neural Computing and Applications, 2023, 35 : 7909 - 7911
  • [42] Code4ML: a large-scale dataset of annotated Machine Learning code
    Drozdova, Anastasia
    Trofimova, Ekaterina
    Guseva, Polina
    Scherbakova, Anna
    Ustyuzhanin, Andrey
    PEERJ COMPUTER SCIENCE, 2023, 9
  • [43] Code4ML: a large-scale dataset of annotated Machine Learning code
    Drozdova A.
    Trofimova E.
    Guseva P.
    Scherbakova A.
    Ustyuzhanin A.
    PeerJ Computer Science, 2023, 9
  • [44] An empirical analysis of the open source development process based on mining of source code repositories
    Scotto, Marco
    Sillitti, Alberto
    Succi, Giancarlo
    INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING, 2007, 17 (02) : 231 - 247
  • [45] Inferring Behavioral Specifications from Large-scale Repositories by Leveraging Collective Intelligence
    Rajan, Hridesh
    Nguyen, Tien N.
    Leavens, Gary T.
    Dyer, Robert
    2015 IEEE/ACM 37TH IEEE INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, VOL 2, 2015, : 579 - 582
  • [46] Intelligent Code Review Assignment for Large Scale Open Source Software Stacks
    Aryendu, Ishan
    Wang, Ying
    Elkourdi, Farah
    AlOmar, Eman
    PROCEEDINGS OF THE 37TH IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING, ASE 2022, 2022,
  • [47] Toward a Large-Scale Open Learning System for Data Management
    Murthy, Sean
    Figueroa, Andrew
    Rollo, Steven
    PROCEEDINGS OF THE FIFTH ANNUAL ACM CONFERENCE ON LEARNING AT SCALE (L@S'18), 2018,
  • [48] Large-scale and Robust Code Authorship Identification with Deep Feature Learning
    Abuhamad, Mohammed
    Abuhmed, Tamer
    Mohaisen, David
    Nyang, Daehun
    ACM TRANSACTIONS ON PRIVACY AND SECURITY, 2021, 24 (04)
  • [49] BRAN: Reduce Vulnerability Search Space in Large Open Source Repositories by Learning Bug Symptoms
    Meng, Dongyu
    Guerriero, Michele
    Machiry, Aravind
    Aghakhani, Hojjat
    Bose, Priyanka
    Continella, Andrea
    Kruegel, Christopher
    Vigna, Giovanni
    ASIA CCS'21: PROCEEDINGS OF THE 2021 ACM ASIA CONFERENCE ON COMPUTER AND COMMUNICATIONS SECURITY, 2021, : 731 - 743
  • [50] An open source analysis framework for large-scale building energy modeling
    Ball, Brian L.
    Long, Nicholas
    Fleming, Katherine
    Balbach, Chris
    Lopez, Phylroy
    JOURNAL OF BUILDING PERFORMANCE SIMULATION, 2020, 13 (05) : 487 - 500