Detecting code vulnerabilities by learning from large-scale open source repositories

Cited by: 5
Authors
Xu, Rongze [1]
Tang, Zhanyong [1]
Ye, Guixin [1]
Wang, Huanting [1]
Ke, Xin [1]
Fang, Dingyi [1]
Wang, Zheng [2]
Affiliations
[1] Northwest University, Xi'an, People's Republic of China
[2] University of Leeds, Leeds, England
Funding
National Natural Science Foundation of China
Keywords
Code vulnerability detection; Deep learning; Attention mechanism; Software vulnerability; LSTM
DOI
10.1016/j.jisa.2022.103293
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Machine learning methods are widely used to identify common, recurring bugs and code vulnerabilities. The performance of a machine-learned model is bounded by the quality and quantity of its training data and by the model's ability to extract and capture the essential information of the problem domain. Unfortunately, there is a shortage of high-quality samples for training code vulnerability detection models, and existing machine learning methods are inadequate at capturing code vulnerability patterns. We present DEVELOPER, a novel learning framework for building code vulnerability detection models. To address the data scarcity challenge, DEVELOPER automatically gathers training samples from open-source projects and applies constraint rules to the collected data to filter out noisy samples and improve data quality. The collected data provides many real-world vulnerable code samples that complement those available in standard vulnerability databases. To build an effective code vulnerability detection model, DEVELOPER employs a convolutional neural network architecture with attention mechanisms to extract a code representation from the program's abstract syntax tree. The extracted representation is then fed to a downstream network, a bidirectional long short-term memory (LSTM) architecture, which predicts whether the target code contains a vulnerability. We apply DEVELOPER to identify vulnerabilities at the program source-code level. Our evaluation shows that DEVELOPER outperforms state-of-the-art methods by uncovering more vulnerabilities with a lower false-positive rate.
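The abstract does not give implementation details, so the following is only a minimal sketch of one idea it mentions: attention over a representation derived from the abstract syntax tree. It is not DEVELOPER's model. It omits the convolutional and bidirectional LSTM stages entirely, uses Python's standard `ast` module as a stand-in parser, and relies on a hypothetical `toy_embedding` function and a hand-picked query vector in place of learned parameters.

```python
import ast
import math


def node_type_sequence(source):
    """Flatten a program's AST into a list of node-type names."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def toy_embedding(name, dim=4):
    """Hypothetical fixed embedding derived from character codes (not learned)."""
    base = sum(ord(c) for c in name)
    return [((base * (d + 1)) % 7) / 7.0 for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Score each vector against the query, then return the weighted sum."""
    scores = [sum(v_i * q_i for v_i, q_i in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    pooled = [
        sum(w * v[d] for w, v in zip(weights, vectors))
        for d in range(len(query))
    ]
    return pooled, weights


# Assumed example input; any parseable Python snippet works here.
sequence = node_type_sequence("buf = data[index]")
vectors = [toy_embedding(name) for name in sequence]
query = [1.0, 0.5, 0.25, 0.125]  # assumption: would be learned in a real model
pooled, weights = attention_pool(vectors, query)
```

In a trained model the embeddings and the query would be learned, and the pooled vector would feed a downstream classifier rather than being inspected directly.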
Pages: 14