D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis

Cited by: 79
Authors
Zheng, Yunhui [1]
Pujar, Saurabh [1]
Lewis, Burn [1]
Buratti, Luca [1]
Epstein, Edward [1]
Yang, Bo [1]
Laredo, Jim [1]
Morari, Alessandro [1]
Su, Zhong [1]
Affiliations
[1] IBM Res, Armonk, NY 10504 USA
Keywords
dataset; vulnerability detection; auto-labeler; static analysis
DOI
10.1109/ICSE-SEIP52600.2021.00020
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
Static analysis tools are widely used for vulnerability detection as they understand programs with complex behavior and millions of lines of code. Despite their popularity, static analysis tools are known to generate an excess of false positives. The recent ability of Machine Learning models to understand programming languages opens new possibilities when applied to static analysis. However, existing datasets to train models for vulnerability identification suffer from multiple limitations such as limited bug context, limited size, and synthetic and unrealistic source code. We propose D2A, a differential analysis based approach to label issues reported by static analysis tools. The D2A dataset is built by analyzing version pairs from multiple open source projects. From each project, we select bug-fixing commits and we run static analysis on the versions before and after such commits. If some issues detected in a before-commit version disappear in the corresponding after-commit version, they are very likely to be real bugs that got fixed by the commit. We use D2A to generate a large labeled dataset to train models for vulnerability identification. We show that the dataset can be used to build a classifier to identify possible false alarms among the issues reported by static analysis, hence helping developers prioritize and investigate potential true positives first.
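The labeling rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Issue` signature, the `label_issues` helper, the label strings, and the sample file and function names are all hypothetical, chosen only to show the before/after comparison. The abstract states only that issues that disappear after a bug-fixing commit are very likely real bugs; treating the remaining issues as "unresolved" is an assumption made here.

```python
from typing import Dict, Iterable, NamedTuple


class Issue(NamedTuple):
    # Hypothetical issue signature; a real static analyzer reports far
    # richer data (traces, line numbers, severity, and so on).
    bug_type: str
    file: str
    function: str


def label_issues(before: Iterable[Issue], after: Iterable[Issue]) -> Dict[Issue, str]:
    """Label every issue reported on the before-commit version.

    An issue that is absent from the after-commit report is labeled
    "likely_real_bug" (the rule stated in the abstract). An issue that is
    still reported is labeled "unresolved" (an assumption of this sketch).
    """
    still_reported = set(after)
    labels = {}
    for issue in before:
        if issue in still_reported:
            labels[issue] = "unresolved"
        else:
            labels[issue] = "likely_real_bug"
    return labels


# Made-up analyzer output for one bug-fixing commit.
before_commit = [
    Issue("BUFFER_OVERRUN", "src/parser.c", "read_header"),
    Issue("NULL_DEREFERENCE", "src/util.c", "make_buffer"),
]
after_commit = [
    Issue("NULL_DEREFERENCE", "src/util.c", "make_buffer"),
]

labels = label_issues(before_commit, after_commit)
for issue, label in labels.items():
    print(issue.bug_type, issue.file, label)
```

Matching issues by an exact (type, file, function) tuple is the simplest possible choice; in practice a commit can move or rename code, so a more tolerant matching scheme would likely be needed.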
Pages: 111-120
Number of pages: 10