Detecting code vulnerabilities by learning from large-scale open source repositories

Cited by: 5
Authors
Xu, Rongze [1]
Tang, Zhanyong [1]
Ye, Guixin [1]
Wang, Huanting [1]
Ke, Xin [1]
Fang, Dingyi [1]
Wang, Zheng [2]
Affiliations
[1] Northwest Univ, Xian, Peoples R China
[2] Univ Leeds, Leeds, England
Funding
National Natural Science Foundation of China;
Keywords
Code vulnerability detection; Deep learning; Attention mechanism; Software vulnerability; LSTM;
DOI
10.1016/j.jisa.2022.103293
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Machine learning methods are widely used to identify common, repeatedly occurring bugs and code vulnerabilities. The performance of a machine-learned model is bounded by the quality and quantity of its training data and by the model's ability to extract and capture the essential information of the problem domain. Unfortunately, there is a shortage of high-quality samples for training code vulnerability detection models, and existing machine learning methods are inadequate at capturing code vulnerability patterns. We present DEVELOPER, a novel learning framework for building code vulnerability detection models. To address the data scarcity challenge, DEVELOPER automatically gathers training samples from open-source projects and applies constraint rules to the collected data to filter out noisy samples and improve their quality. The collected data provides many real-world vulnerable code samples that complement those available in standard vulnerability databases. To build an effective code vulnerability detection model, DEVELOPER employs a convolutional neural network architecture with attention mechanisms to extract a code representation from the program's abstract syntax tree. The extracted representation is then fed to a downstream network, a bidirectional long short-term memory architecture, to predict whether the target code contains a vulnerability. We apply DEVELOPER to identify vulnerabilities at the program source-code level. Our evaluation shows that DEVELOPER outperforms state-of-the-art methods by uncovering more vulnerabilities with a lower false-positive rate.
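The abstract does not spell out the constraint rules used to filter collected samples, so the following is only a minimal sketch of what rule-based filtering of commits might look like. The `Commit` type, the keyword pattern, the line limit, and the file extensions are all hypothetical choices made for illustration, not the rules used by DEVELOPER.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical keyword pattern: the actual constraint rules are not
# described in the abstract, so these terms are illustrative only.
SECURITY_PATTERN = re.compile(
    r"\b(CVE-\d{4}-\d+|overflow|use[- ]after[- ]free|vulnerab\w*|out[- ]of[- ]bounds)\b",
    re.IGNORECASE,
)


@dataclass
class Commit:
    """Hypothetical record for one commit gathered from an open-source project."""
    message: str
    files_changed: List[str]
    lines_changed: int


def is_candidate(commit: Commit, max_lines: int = 50,
                 extensions=(".c", ".h", ".cpp")) -> bool:
    """Return True if the commit passes every (assumed) constraint rule."""
    # Rule 1: the commit message must mention a security-related term.
    if not SECURITY_PATTERN.search(commit.message):
        return False
    # Rule 2: very large commits tend to mix unrelated changes, so drop them.
    if commit.lines_changed > max_lines:
        return False
    # Rule 3: the commit must touch at least one file, and only source files.
    if not commit.files_changed:
        return False
    return all(path.endswith(extensions) for path in commit.files_changed)


def filter_commits(commits: List[Commit]) -> List[Commit]:
    """Keep only the commits that satisfy all rules."""
    return [c for c in commits if is_candidate(c)]
```

In this sketch each rule only discards samples, so adding a rule can reduce noise but never admits new data; which rules are worth that trade-off is an empirical question the abstract does not address.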
Pages: 14