How secure is AI-generated code: a large-scale comparison of large language models

Cited by: 1
Authors
Tihanyi, Norbert [1 ,2 ]
Bisztray, Tamas [3 ]
Ferrag, Mohamed Amine [4 ]
Jain, Ridhi [2 ]
Cordeiro, Lucas C. [5 ,6 ]
Affiliations
[1] Eotvos Lorand Univ, Budapest, Hungary
[2] Technol Innovat Inst TII, Abu Dhabi, U Arab Emirates
[3] Univ Oslo, Oslo, Norway
[4] Guelma Univ, Guelma, Algeria
[5] Univ Manchester, Manchester, England
[6] Fed Univ Amazonas Manaus, Manaus, Brazil
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Large language models; Vulnerability classification; Formal verification; Software security; Artificial intelligence; Dataset; CHECKING;
DOI
10.1007/s10664-024-10590-1
CLC Number
TP31 [Computer Software];
Discipline Codes
081202; 0835;
Abstract
This study compares state-of-the-art Large Language Models (LLMs) on their tendency to generate vulnerabilities when writing C programs from a neutral zero-shot prompt. Tihanyi et al. introduced the FormAI dataset at PROMISE '23, featuring 112,000 C programs generated by GPT-3.5-turbo, of which over 51.24% were identified as vulnerable. We extend that research with a large-scale study involving nine state-of-the-art models, including OpenAI's GPT-4o-mini, Google's Gemini Pro 1.0, TII's 180-billion-parameter Falcon, Meta's 13-billion-parameter Code Llama, and several other compact models. Additionally, we introduce the FormAI-v2 dataset, comprising 331,000 compilable C programs generated by these LLMs. Each program in the dataset is labeled based on the vulnerabilities detected in its source code through formal verification using the Efficient SMT-based Context-Bounded Model Checker (ESBMC). This technique minimizes false positives by providing a counterexample for each reported vulnerability and reduces false negatives by running the verification process to completion. Our study reveals that at least 62.07% of the generated programs are vulnerable. The differences between the models are minor, as they all exhibit similar coding errors with slight variations. Our research highlights that while LLMs offer promising capabilities for code generation, deploying their output in a production environment requires proper risk assessment and validation.
Pages: 42
Related Papers
50 records total
  • [21] How Useful Are Educational Questions Generated by Large Language Models?
    Elkins, Sabina
    Kochmar, Ekaterina
    Serban, Iulian
    Cheung, Jackie C. K.
    ARTIFICIAL INTELLIGENCE IN EDUCATION. POSTERS AND LATE BREAKING RESULTS, WORKSHOPS AND TUTORIALS, INDUSTRY AND INNOVATION TRACKS, PRACTITIONERS, DOCTORAL CONSORTIUM AND BLUE SKY, AIED 2023, 2023, 1831 : 536 - 542
  • [22] A LARGE-SCALE STUDY OF LANGUAGE MODELS FOR CHORD PREDICTION
    Korzeniowski, Filip
    Sears, David R. W.
    Widmer, Gerhard
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 91 - 95
  • [23] Training Large-Scale Foundation Models on Emerging AI Chips
    Muhamed, Aashiq
    Bock, Christian
    Solanki, Rahul
    Park, Youngsuk
    Wang, Yida
    Huan, Jun
    PROCEEDINGS OF THE 29TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, KDD 2023, 2023, : 5821 - 5822
  • [24] Leveraging Large Language Models for Improved Patient Access and Self-Management: Assessor-Blinded Comparison Between Expert- and AI-Generated Content
    Lv, Xiaolei
    Zhang, Xiaomeng
    Li, Yuan
    Ding, Xinxin
    Lai, Hongchang
    Shi, Junyu
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2024, 26
  • [25] MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models
    Cai, Yan
    Wang, Linlin
    Wang, Ye
    de Melo, Gerard
    Zhang, Ya
    Wang, Yanfeng
    He, Liang
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 16, 2024, : 17709 - 17717
  • [26] Visual Comparison of Text Sequences Generated by Large Language Models
    Sevastjanova, Rita
    Vogelbacher, Simon
    Spitz, Andreas
    Keim, Daniel
    El-Assady, Mennatallah
    2023 IEEE VISUALIZATION IN DATA SCIENCE, VDS, 2023, : 11 - 20
  • [27] Aggregation models in ensemble learning: A large-scale comparison
    Campagner, Andrea
    Ciucci, Davide
    Cabitza, Federico
    INFORMATION FUSION, 2023, 90 : 241 - 252
  • [28] Comparison of Large-Scale Fading Models with RSRP Measurements
    Fastenbauer, Agnes
    Eller, Lukas
    Svoboda, Philipp
    Rupp, Markus
2024 IEEE 99TH VEHICULAR TECHNOLOGY CONFERENCE, VTC2024-SPRING, 2024
  • [29] On the Multilingual Capabilities of Very Large-Scale English Language Models
    Armengol-Estape, Jordi
    de Gibert Bonet, Ona
    Melero, Maite
    LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 3056 - 3068
  • [30] StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models
    Guo, Zhicheng
    Cheng, Sijie
    Wang, Hao
    Liang, Shihao
    Qin, Yujia
    Li, Peng
    Liu, Zhiyuan
    Sun, Maosong
    Liu, Yang
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 11143 - 11156