How secure is AI-generated code: a large-scale comparison of large language models

Cited by: 1
|
Authors
Tihanyi, Norbert [1 ,2 ]
Bisztray, Tamas [3 ]
Ferrag, Mohamed Amine [4 ]
Jain, Ridhi [2 ]
Cordeiro, Lucas C. [5 ,6 ]
Affiliations
[1] Eotvos Lorand Univ, Budapest, Hungary
[2] Technol Innovat Inst TII, Abu Dhabi, U Arab Emirates
[3] Univ Oslo, Oslo, Norway
[4] Guelma Univ, Guelma, Algeria
[5] Univ Manchester, Manchester, England
[6] Fed Univ Amazonas Manaus, Manaus, Brazil
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Large language models; Vulnerability classification; Formal verification; Software security; Artificial intelligence; Dataset; CHECKING;
DOI
10.1007/s10664-024-10590-1
Chinese Library Classification
TP31 [Computer Software];
Discipline classification codes
081202; 0835;
Abstract
This study compares state-of-the-art Large Language Models (LLMs) on their tendency to introduce vulnerabilities when writing C programs from a neutral zero-shot prompt. Tihanyi et al. introduced the FormAI dataset at PROMISE '23, featuring 112,000 C programs generated by GPT-3.5-turbo, of which over 51.24% were identified as vulnerable. We extended that research with a large-scale study of nine state-of-the-art models, including OpenAI's GPT-4o-mini, Google's Gemini Pro 1.0, TII's 180-billion-parameter Falcon, Meta's 13-billion-parameter Code Llama, and several other compact models. Additionally, we introduce the FormAI-v2 dataset, comprising 331,000 compilable C programs generated by these LLMs. Each program in the dataset is labeled with the vulnerabilities detected in its source code through formal verification using the Efficient SMT-based Context-Bounded Model Checker (ESBMC). This technique minimizes false positives by providing a counterexample for each reported vulnerability and reduces false negatives by running the verification process to completion. Our study reveals that at least 62.07% of the generated programs are vulnerable. The differences between the models are minor: all exhibit similar coding errors, with slight variations. Our research highlights that while LLMs offer promising code-generation capabilities, deploying their output in a production environment requires proper risk assessment and validation.
Pages: 42