Detecting code vulnerabilities by learning from large-scale open source repositories

Cited by: 5
Authors
Xu, Rongze [1]
Tang, Zhanyong [1]
Ye, Guixin [1]
Wang, Huanting [1]
Ke, Xin [1]
Fang, Dingyi [1]
Wang, Zheng [2]
Affiliations
[1] Northwest University, Xi'an, People's Republic of China
[2] University of Leeds, Leeds, England
Funding
National Natural Science Foundation of China
Keywords
Code vulnerability detection; Deep learning; Attention mechanism; Software vulnerability; LSTM;
DOI
10.1016/j.jisa.2022.103293
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Machine learning methods are widely used to identify common, repeatedly occurring bugs and code vulnerabilities. The performance of a machine-learned model is bounded by the quality and quantity of its training data and by the model's capability to extract and capture the essential information of the problem domain. Unfortunately, there is a shortage of high-quality samples for training code vulnerability detection models, and existing machine learning methods are inadequate in capturing code vulnerability patterns. We present DEVELOPER, a novel learning framework for building code vulnerability detection models. To address the data scarcity challenge, DEVELOPER automatically gathers training samples from open-source projects and applies constraint rules to the collected data to filter out noisy samples and improve data quality. The collected data provides many real-world vulnerable code training samples to complement the samples available in standard vulnerability databases. To build an effective code vulnerability detection model, DEVELOPER employs a convolutional neural network architecture with attention mechanisms to extract a code representation from the program's abstract syntax tree. The extracted program representation is then fed to a downstream network - a bidirectional long short-term memory architecture - to predict whether the target code contains a vulnerability. We apply DEVELOPER to identify vulnerabilities at the program source-code level. Our evaluation shows that DEVELOPER outperforms state-of-the-art methods by uncovering more vulnerabilities with a lower false-positive rate.
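As a rough illustration of the idea of deriving a model input from a program's abstract syntax tree, the sketch below flattens an AST into a sequence of node-type names and maps them to integer ids. It is not the preprocessing used by DEVELOPER, which the abstract does not describe: Python's standard `ast` module stands in for whatever parser the authors use, and the helper names `ast_node_sequence` and `encode`, the sample snippet, and the choice of reserving id 0 for padding are all assumptions made for this example.

```python
import ast


def ast_node_sequence(source):
    """Return the AST node type names of `source` in depth-first pre-order."""
    tree = ast.parse(source)
    sequence = []

    def visit(node):
        # Record the node type, then descend into its children in field order.
        sequence.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)
    return sequence


def encode(sequence, vocab):
    """Map node type names to integer ids, growing `vocab` as new names appear.

    Ids start at 1 so that 0 stays free for padding (an assumption of this sketch).
    """
    return [vocab.setdefault(name, len(vocab) + 1) for name in sequence]


# Hypothetical sample: a tiny function with an unchecked indexed write.
snippet = "def copy(buf, data):\n    buf[0] = data[0]\n"

vocab = {}
seq = ast_node_sequence(snippet)
ids = encode(seq, vocab)
```

An id sequence of this kind is the sort of input an embedding layer followed by convolutional, attention, and recurrent layers could consume; those layers are left out here because the abstract gives no details about them.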
Pages: 14