D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis

Cited by: 79
Authors
Zheng, Yunhui [1]
Pujar, Saurabh [1]
Lewis, Burn [1]
Buratti, Luca [1]
Epstein, Edward [1]
Yang, Bo [1]
Laredo, Jim [1]
Morari, Alessandro [1]
Su, Zhong [1]
Affiliations
[1] IBM Res, Armonk, NY 10504 USA
Keywords
dataset; vulnerability detection; auto-labeler; static analysis
DOI
10.1109/ICSE-SEIP52600.2021.00020
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
Static analysis tools are widely used for vulnerability detection as they understand programs with complex behavior and millions of lines of code. Despite their popularity, static analysis tools are known to generate an excess of false positives. The recent ability of Machine Learning models to understand programming languages opens new possibilities when applied to static analysis. However, existing datasets to train models for vulnerability identification suffer from multiple limitations such as limited bug context, limited size, and synthetic and unrealistic source code. We propose D2A, a differential analysis based approach to label issues reported by static analysis tools. The D2A dataset is built by analyzing version pairs from multiple open source projects. From each project, we select bug-fixing commits and we run static analysis on the versions before and after such commits. If some issues detected in a before-commit version disappear in the corresponding after-commit version, they are very likely to be real bugs that got fixed by the commit. We use D2A to generate a large labeled dataset to train models for vulnerability identification. We show that the dataset can be used to build a classifier to identify possible false alarms among the issues reported by static analysis, hence helping developers prioritize and investigate potential true positives first.
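The labeling rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Issue` signature, the `split_by_fix` helper, and the sample reports are all hypothetical, and the sketch assumes that two reports can be matched across versions by bug type, file, and function name.

```python
from __future__ import annotations

from typing import NamedTuple


class Issue(NamedTuple):
    """Hypothetical signature used to match one report across two versions."""
    bug_type: str
    file: str
    function: str


def split_by_fix(before: set[Issue], after: set[Issue]) -> tuple[set[Issue], set[Issue]]:
    """Compare reports from the versions before and after a bug-fixing commit.

    Returns (disappeared, persisting): issues reported only before the commit,
    and issues reported in both versions.
    """
    disappeared = before - after  # very likely real bugs fixed by the commit
    persisting = before & after   # still reported after the commit
    return disappeared, persisting


# Made-up reports for illustration only.
before = {
    Issue("BUFFER_OVERRUN", "src/record.c", "read_record"),
    Issue("NULL_DEREFERENCE", "src/cert.c", "parse_cert"),
}
after = {
    Issue("NULL_DEREFERENCE", "src/cert.c", "parse_cert"),
}

disappeared, persisting = split_by_fix(before, after)
```

Here `disappeared` holds the issue that vanished after the commit, which the abstract treats as very likely a real bug. How to label the `persisting` issues is not stated in the abstract, so the sketch leaves them unlabeled.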
Pages: 111-120
Number of pages: 10