D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis

Citations: 79
Authors
Zheng, Yunhui [1]
Pujar, Saurabh [1]
Lewis, Burn [1]
Buratti, Luca [1]
Epstein, Edward [1]
Yang, Bo [1]
Laredo, Jim [1]
Morari, Alessandro [1]
Su, Zhong [1]
Affiliation
[1] IBM Res, Armonk, NY 10504 USA
Keywords
dataset; vulnerability detection; auto-labeler; static analysis
DOI
10.1109/ICSE-SEIP52600.2021.00020
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
Static analysis tools are widely used for vulnerability detection as they understand programs with complex behavior and millions of lines of code. Despite their popularity, static analysis tools are known to generate an excess of false positives. The recent ability of Machine Learning models to understand programming languages opens new possibilities when applied to static analysis. However, existing datasets to train models for vulnerability identification suffer from multiple limitations such as limited bug context, limited size, and synthetic and unrealistic source code. We propose D2A, a differential-analysis-based approach to label issues reported by static analysis tools. The D2A dataset is built by analyzing version pairs from multiple open source projects. From each project, we select bug-fixing commits and we run static analysis on the versions before and after such commits. If some issues detected in a before-commit version disappear in the corresponding after-commit version, they are very likely to be real bugs that got fixed by the commit. We use D2A to generate a large labeled dataset to train models for vulnerability identification. We show that the dataset can be used to build a classifier to identify possible false alarms among the issues reported by static analysis, hence helping developers prioritize and investigate potential true positives first.
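The labeling rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Issue` type, its fields, and the `label_issues` function are hypothetical names introduced here, and matching issues by (bug type, file, function) is an assumption made because line numbers may shift between versions. Treating issues that persist after the fix as likely false alarms is likewise an assumption suggested by the abstract's false-alarm classifier, not something the abstract states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    """Hypothetical issue signature; a real static analyzer reports more detail."""
    bug_type: str
    file: str
    function: str


def label_issues(before, after):
    """Label each issue found before a bug-fixing commit.

    An issue that no longer appears after the commit is labeled True
    (likely a real bug fixed by the commit). An issue that persists is
    labeled False (assumed here to be a likely false alarm).
    """
    still_present = set(after)
    return [(issue, issue not in still_present) for issue in before]


# Example: two issues reported before the fix, one of them gone afterwards.
before = [
    Issue("BUFFER_OVERRUN", "a.c", "parse_header"),
    Issue("NULL_DEREFERENCE", "b.c", "load_config"),
]
after = [Issue("NULL_DEREFERENCE", "b.c", "load_config")]

for issue, likely_real in label_issues(before, after):
    print(issue.bug_type, "likely real bug" if likely_real else "likely false alarm")
```

Frozen dataclasses are hashable, which lets the after-commit issues be stored in a set so each before-commit issue is checked in constant time.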
Pages: 111-120
Page count: 10