Detecting code vulnerabilities by learning from large-scale open source repositories

Cited by: 5

Authors:

Xu, Rongze [1]

Tang, Zhanyong [1]

Ye, Guixin [1]

Wang, Huanting [1]

Ke, Xin [1]

Fang, Dingyi [1]

Wang, Zheng [2]

Affiliations:

[1] Northwest Univ, Xian, Peoples R China

[2] Univ Leeds, Leeds, England

Source:

JOURNAL OF INFORMATION SECURITY AND APPLICATIONS | 2022 / Vol. 69

Funding:

National Natural Science Foundation of China;

Keywords:

Code vulnerability detection; Deep learning; Attention mechanism; Software vulnerability; LSTM;

DOI:

10.1016/j.jisa.2022.103293

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

Machine learning methods are widely used to identify common, repeatedly occurring bugs and code vulnerabilities. The performance of a machine-learned model is bounded by the quality and quantity of its training data and by the model's capability to extract and capture the essential information of the problem domain. Unfortunately, there is a shortage of high-quality samples for training code vulnerability detection models, and existing machine learning methods are inadequate at capturing code vulnerability patterns. We present DEVELOPER, a novel learning framework for building code vulnerability detection models. To address the data scarcity challenge, DEVELOPER automatically gathers training samples from open-source projects and applies constraint rules to the collected data to filter out noisy samples and improve data quality. The collected data provides many real-world vulnerable code training samples to complement the samples available in standard vulnerability databases. To build an effective code vulnerability detection model, DEVELOPER employs a convolutional neural network architecture with attention mechanisms to extract a code representation from the program's abstract syntax tree. The extracted program representation is then fed to a downstream network, a bidirectional long short-term memory architecture, to predict whether the target code contains a vulnerability. We apply DEVELOPER to identify vulnerabilities at the program source-code level. Our evaluation shows that DEVELOPER outperforms state-of-the-art methods by uncovering more vulnerabilities with a lower false-positive rate.
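The abstract says the model consumes a representation derived from the program's abstract syntax tree. As a rough illustration only, the sketch below shows one common way to turn an AST into a fixed-length sequence of integer ids that a neural network could take as input. It is not DEVELOPER's pipeline: it uses Python's standard `ast` module on Python source purely for convenience, and the helper names `ast_node_sequence`, `build_vocab`, and `encode`, as well as the `<pad>`/`<unk>` tokens, are hypothetical choices made for this example.

```python
import ast
from typing import Dict, List


def ast_node_sequence(source: str) -> List[str]:
    """Return AST node type names of `source` in depth-first pre-order."""
    sequence: List[str] = []

    def visit(node: ast.AST) -> None:
        sequence.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return sequence


def build_vocab(sequences: List[List[str]]) -> Dict[str, int]:
    """Map each node type name to an integer id (0 = padding, 1 = unknown)."""
    vocab: Dict[str, int] = {"<pad>": 0, "<unk>": 1}
    for seq in sequences:
        for token in seq:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(seq: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Truncate or right-pad `seq` to `max_len` integer ids."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in seq[:max_len]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


if __name__ == "__main__":
    # Hypothetical toy samples; real training data would be mined code.
    samples = ["x = 1", "def f(buf, n):\n    return buf[n]"]
    sequences = [ast_node_sequence(s) for s in samples]
    vocab = build_vocab(sequences)
    for seq in sequences:
        print(encode(seq, vocab, max_len=8))
```

In a setup like the one the abstract describes, id sequences of this kind would feed an embedding layer followed by the convolutional, attention, and recurrent layers; those layers are omitted here because the record gives no architectural details beyond the abstract.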

Pages: 14
