Software Vulnerability Discovery via Learning Multi-Domain Knowledge Bases

Cited by: 77

Authors:

Lin, Guanjun [1]

Zhang, Jun [1]

Luo, Wei [2]

Pan, Lei [2]

De Vel, Olivier [3]

Montague, Paul [3]

Xiang, Yang [1]

Affiliations:

[1] Swinburne Univ Technol, Sch Software & Elect Engn, Melbourne, Vic 3122, Australia

[2] Deakin Univ, Sch Informat Technol, Geelong, Vic 3216, Australia

[3] Def Sci & Technol Grp DSTG, Dept Def, Canberra, ACT 2610, Australia

Source:

IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING | 2021 / Vol. 18 / No. 05

Keywords:

Software; Feature extraction; Deep learning; Feeds; Task analysis; Neural networks; Data mining; Vulnerability discovery; Representation learning

DOI:

10.1109/TDSC.2019.2954088

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

Machine learning (ML) has great potential in automated code vulnerability discovery. However, automated discovery applications driven by off-the-shelf machine learning tools often perform poorly due to the shortage of high-quality training data. The scarcity of vulnerability data is almost always a problem for a developing software project during its early stages, which is referred to as the cold-start problem. This article proposes a framework that utilizes transferable knowledge from pre-existing data sources. To improve detection performance, multiple vulnerability-relevant data sources were selected to form a broader base for learning transferable knowledge. The selected data sources are cross-domain, including historical vulnerability data from different software projects and data from the Software Assurance Reference Dataset (SARD), which consists of synthetic vulnerability examples and proof-of-concept test cases. To extract the information applicable to vulnerability detection from the cross-domain data sets, we designed a deep-learning-based framework with Long Short-Term Memory (LSTM) cells. Our framework combines the heterogeneous data sources to learn unified representations of the patterns of vulnerable source code. Empirical studies showed that the unified representations generated by the proposed deep learning networks are feasible and effective, and are transferable to real-world vulnerability detection. Our experiments demonstrated that, by leveraging two heterogeneous data sources, our vulnerability detection outperformed the static vulnerability discovery tool Flawfinder. The findings of this article may stimulate further research in ML-based vulnerability detection using heterogeneous data sources.


Pages: 2469 - 2485

Page count: 17

50 items in total

[1] A uniform human knowledge interface to the multi-domain knowledge bases in the National Knowledge Infrastructure
Feng, QG
Cao, CN
Si, JX
Zheng, YF
APPLICATIONS AND INNOVATIONS IN INTELLIGENT SYSTEMS X, 2003, : 163 - 176
[2] Multi-Domain Sequential Recommendation via Domain Space Learning
Hwang, Junyoung
Ju, Hyunjun
Kang, SeongKu
Jang, Sanghwan
Yu, Hwanjo
PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024, 2024, : 2134 - 2144
[3] Discovery of multi-domain spatiotemporal associations
Walkikar, Prathamesh
Shi, Lei
Tama, Bayu Adhi
Janeja, Vandana P.
GEOINFORMATICA, 2024, 28 (03) : 353 - 379
[4] Semantic Segmentation via Multi-task, Multi-domain Learning
Fourure, Damien
Emonet, Remi
Fromont, Elisa
Muselet, Damien
Tremeau, Alain
Wolf, Christian
STRUCTURAL, SYNTACTIC, AND STATISTICAL PATTERN RECOGNITION, S+SSPR 2016, 2016, 10029 : 333 - 343
[5] Learning Low-dimensional Multi-domain Knowledge Graph Embedding via Dual Archimedean Spirals
Li, Jiang
Su, Xiangdong
Zhang, Fujun
Gao, Guanglai
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 1982 - 1994
[6] Software expert discovery via knowledge domain embeddings in a collaborative network
Huang, Chaoran
Yao, Lina
Wang, Xianzhi
Benatallah, Boualem
Zhang, Xiang
PATTERN RECOGNITION LETTERS, 2020, 130 : 46 - 53
[7] Trajectories of Software Piracy and Multi-Domain Predictors
Lee, Yeungjeom
Kim, Jihoon
Jennings, Wesley
Wu, Ethan Yih Chian
CRIME & DELINQUENCY, 2024,
[8] Multi-domain Software Defined Network Provisioning
Wibowo, Franciscus X. A.
Gregory, Mark A.
2018 28TH INTERNATIONAL TELECOMMUNICATION NETWORKS AND APPLICATIONS CONFERENCE (ITNAC), 2018, : 81 - 87
[9] Multi-Domain Causal Representation Learning via Weak Distributional Invariances
Ahuja, Kartik
Mansouri, Amin
Wang, Yixin
INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 238, 2024, 238
[10] Multi-Domain Active Learning for Recommendation
Zhang, Zihan
Jin, Xiaoming
Li, Lianghao
Ding, Guiguang
Yang, Qiang
THIRTIETH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2016, : 2358 - 2364
