Software Vulnerability Discovery via Learning Multi-Domain Knowledge Bases

Cited by: 77
Authors
Lin, Guanjun [1]
Zhang, Jun [1]
Luo, Wei [2]
Pan, Lei [2]
De Vel, Olivier [3]
Montague, Paul [3]
Xiang, Yang [1]
Affiliations
[1] Swinburne Univ Technol, Sch Software & Elect Engn, Melbourne, Vic 3122, Australia
[2] Deakin Univ, Sch Informat Technol, Geelong, Vic 3216, Australia
[3] Def Sci & Technol Grp DSTG, Dept Def, Canberra, ACT 2610, Australia
Keywords
Software; Feature extraction; Deep learning; Feeds; Task analysis; Neural networks; Data mining; Vulnerability discovery; Representation learning
DOI
10.1109/TDSC.2019.2954088
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Machine learning (ML) has great potential in automated code vulnerability discovery. However, automated discovery applications driven by off-the-shelf machine learning tools often perform poorly due to a shortage of high-quality training data. Vulnerability data is almost always scarce in the early stages of a developing software project, a situation referred to as the cold-start problem. This article proposes a framework that utilizes transferable knowledge from pre-existing data sources. To improve detection performance, multiple vulnerability-relevant data sources were selected to form a broader base for learning transferable knowledge. The selected data sources are cross-domain: historical vulnerability data from different software projects, and data from the Software Assurance Reference Database (SARD), which consists of synthetic vulnerability examples and proof-of-concept test cases. To extract the information applicable to vulnerability detection from the cross-domain data sets, we designed a deep-learning-based framework with Long Short-Term Memory (LSTM) cells. Our framework combines the heterogeneous data sources to learn unified representations of the patterns of vulnerable source code. Empirical studies showed that the unified representations generated by the proposed deep learning networks are feasible and effective, and are transferable for real-world vulnerability detection. Our experiments demonstrated that, by leveraging two heterogeneous data sources, our vulnerability detection outperformed the static vulnerability discovery tool Flawfinder. The findings of this article may stimulate further research in ML-based vulnerability detection using heterogeneous data sources.
Pages: 2469-2485
Page count: 17
Related papers
50 in total
  • [1] A uniform human knowledge interface to the multi-domain knowledge bases in the National Knowledge Infrastructure
    Feng, QG
    Cao, CN
    Si, JX
    Zheng, YF
    APPLICATIONS AND INNOVATIONS IN INTELLIGENT SYSTEMS X, 2003, : 163 - 176
  • [2] Multi-Domain Sequential Recommendation via Domain Space Learning
    Hwang, Junyoung
    Ju, Hyunjun
    Kang, SeongKu
    Jang, Sanghwan
    Yu, Hwanjo
    PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024, 2024, : 2134 - 2144
  • [3] Discovery of multi-domain spatiotemporal associations
    Walkikar, Prathamesh
    Shi, Lei
    Tama, Bayu Adhi
    Janeja, Vandana P.
    GEOINFORMATICA, 2024, 28 (03) : 353 - 379
  • [4] Semantic Segmentation via Multi-task, Multi-domain Learning
    Fourure, Damien
    Emonet, Remi
    Fromont, Elisa
    Muselet, Damien
    Tremeau, Alain
    Wolf, Christian
    STRUCTURAL, SYNTACTIC, AND STATISTICAL PATTERN RECOGNITION, S+SSPR 2016, 2016, 10029 : 333 - 343
  • [5] Learning Low-dimensional Multi-domain Knowledge Graph Embedding via Dual Archimedean Spirals
    Li, Jiang
    Su, Xiangdong
    Zhang, Fujun
    Gao, Guanglai
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 1982 - 1994
  • [6] Software expert discovery via knowledge domain embeddings in a collaborative network
    Huang, Chaoran
    Yao, Lina
    Wang, Xianzhi
    Benatallah, Boualem
    Zhang, Xiang
    PATTERN RECOGNITION LETTERS, 2020, 130 : 46 - 53
  • [7] Trajectories of Software Piracy and Multi-Domain Predictors
    Lee, Yeungjeom
    Kim, Jihoon
    Jennings, Wesley
    Wu, Ethan Yih Chian
    CRIME & DELINQUENCY, 2024,
  • [8] Multi-domain Software Defined Network Provisioning
    Wibowo, Franciscus X. A.
    Gregory, Mark A.
    2018 28TH INTERNATIONAL TELECOMMUNICATION NETWORKS AND APPLICATIONS CONFERENCE (ITNAC), 2018, : 81 - 87
  • [9] Multi-Domain Causal Representation Learning via Weak Distributional Invariances
    Ahuja, Kartik
    Mansouri, Amin
    Wang, Yixin
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 238, 2024, 238
  • [10] Multi-Domain Active Learning for Recommendation
    Zhang, Zihan
    Jin, Xiaoming
    Li, Lianghao
    Ding, Guiguang
    Yang, Qiang
    THIRTIETH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2016, : 2358 - 2364