How secure is AI-generated code: a large-scale comparison of large language models

Cited by: 1
|
Authors
Tihanyi, Norbert [1 ,2 ]
Bisztray, Tamas [3 ]
Ferrag, Mohamed Amine [4 ]
Jain, Ridhi [2 ]
Cordeiro, Lucas C. [5 ,6 ]
Affiliations
[1] Eotvos Lorand Univ, Budapest, Hungary
[2] Technol Innovat Inst TII, Abu Dhabi, U Arab Emirates
[3] Univ Oslo, Oslo, Norway
[4] Guelma Univ, Guelma, Algeria
[5] Univ Manchester, Manchester, England
[6] Fed Univ Amazonas Manaus, Manaus, Brazil
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Large language models; Vulnerability classification; Formal verification; Software security; Artificial intelligence; Dataset; CHECKING;
DOI
10.1007/s10664-024-10590-1
Chinese Library Classification
TP31 [Computer Software];
Discipline classification codes
081202; 0835;
Abstract
This study compares state-of-the-art Large Language Models (LLMs) on their tendency to introduce vulnerabilities when writing C programs from a neutral zero-shot prompt. Tihanyi et al. introduced the FormAI dataset at PROMISE '23, featuring 112,000 C programs generated by GPT-3.5-turbo, of which over 51.24% were identified as vulnerable. We extended that research with a large-scale study of nine state-of-the-art models, including OpenAI's GPT-4o-mini, Google's Gemini Pro 1.0, TII's 180-billion-parameter Falcon, Meta's 13-billion-parameter Code Llama, and several other compact models. Additionally, we introduce the FormAI-v2 dataset, comprising 331,000 compilable C programs generated by these LLMs. Each program in the dataset is labeled with the vulnerabilities detected in its source code through formal verification using the Efficient SMT-based Context-Bounded Model Checker (ESBMC). This technique minimizes false positives by providing a counterexample for each reported vulnerability and reduces false negatives by running the verification process to completion. Our study reveals that at least 62.07% of the generated programs are vulnerable. The differences between the models are minor: all exhibit similar coding errors, with slight variations. Our research highlights that while LLMs offer promising code-generation capabilities, deploying their output in a production environment requires proper risk assessment and validation.
Pages: 42