Software Vulnerability Discovery via Learning Multi-Domain Knowledge Bases

Cited by: 77
Authors
Lin, Guanjun [1]
Zhang, Jun [1]
Luo, Wei [2]
Pan, Lei [2]
De Vel, Olivier [3]
Montague, Paul [3]
Xiang, Yang [1]
Affiliations
[1] Swinburne Univ Technol, Sch Software & Elect Engn, Melbourne, Vic 3122, Australia
[2] Deakin Univ, Sch Informat Technol, Geelong, Vic 3216, Australia
[3] Def Sci & Technol Grp DSTG, Dept Def, Canberra, ACT 2610, Australia
Keywords
Software; Feature extraction; Deep learning; Feeds; Task analysis; Neural networks; Data mining; Vulnerability discovery; Representation learning
DOI
10.1109/TDSC.2019.2954088
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
Machine learning (ML) has great potential in automated code vulnerability discovery. However, automated discovery applications driven by off-the-shelf machine learning tools often perform poorly due to the shortage of high-quality training data. Scarcity of vulnerability data is almost always a problem for a software project in its early stages of development, a situation referred to as the cold-start problem. This article proposes a framework that utilizes transferable knowledge from pre-existing data sources. To improve detection performance, multiple vulnerability-relevant data sources were selected to form a broader base for learning transferable knowledge. The selected data sources are cross-domain, including historical vulnerability data from different software projects and data from the Software Assurance Reference Dataset (SARD), which consists of synthetic vulnerability examples and proof-of-concept test cases. To extract the information applicable to vulnerability detection from the cross-domain data sets, we designed a deep-learning-based framework with Long Short-Term Memory (LSTM) cells. Our framework combines the heterogeneous data sources to learn unified representations of the patterns of vulnerable source code. Empirical studies showed that the unified representations generated by the proposed deep learning networks are feasible and effective, and are transferable to real-world vulnerability detection. Our experiments demonstrated that, by leveraging two heterogeneous data sources, our vulnerability detection outperformed the static vulnerability discovery tool Flawfinder. The findings of this article may stimulate further research in ML-based vulnerability detection using heterogeneous data sources.
Pages: 2469-2485
Number of pages: 17