How secure is AI-generated code: a large-scale comparison of large language models

Cited by: 1
Authors
Tihanyi, Norbert [1 ,2 ]
Bisztray, Tamas [3 ]
Ferrag, Mohamed Amine [4 ]
Jain, Ridhi [2 ]
Cordeiro, Lucas C. [5 ,6 ]
Affiliations
[1] Eötvös Loránd University, Budapest, Hungary
[2] Technology Innovation Institute (TII), Abu Dhabi, United Arab Emirates
[3] University of Oslo, Oslo, Norway
[4] Guelma University, Guelma, Algeria
[5] University of Manchester, Manchester, England
[6] Federal University of Amazonas, Manaus, Brazil
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
Large language models; Vulnerability classification; Formal verification; Software security; Artificial intelligence; Dataset; Checking;
DOI
10.1007/s10664-024-10590-1
Chinese Library Classification
TP31 [Computer Software];
Discipline classification codes
081202; 0835;
Abstract
This study compares state-of-the-art Large Language Models (LLMs) on their tendency to generate vulnerabilities when writing C programs from a neutral zero-shot prompt. Tihanyi et al. introduced the FormAI dataset at PROMISE '23, featuring 112,000 C programs generated by GPT-3.5-turbo, of which at least 51.24% were identified as vulnerable. We extend that research with a large-scale study involving nine state-of-the-art models, including OpenAI's GPT-4o-mini, Google's Gemini Pro 1.0, TII's 180-billion-parameter Falcon, Meta's 13-billion-parameter Code Llama, and several other compact models. In addition, we introduce the FormAI-v2 dataset, which comprises 331,000 compilable C programs generated by these LLMs. Each program in the dataset is labeled according to the vulnerabilities detected in its source code through formal verification with the Efficient SMT-based Context-Bounded Model Checker (ESBMC). This technique minimizes false positives by producing a concrete counterexample for each reported vulnerability and reduces false negatives by running the verification to completion. Our study reveals that at least 62.07% of the generated programs are vulnerable. The differences between the models are minor; all exhibit similar coding errors with slight variations. Our research highlights that while LLMs offer promising capabilities for code generation, deploying their output in a production environment requires proper risk assessment and validation.
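For context on the kind of defect being counted, the sketch below is hypothetical and not taken from the FormAI-v2 dataset; buffer sizes and names are invented for illustration. It shows a classic out-of-bounds write (CWE-787) of the sort a bounded model checker such as ESBMC can flag with a concrete counterexample, here any sufficiently long input line:

/* Illustrative sketch only -- not a program from the FormAI-v2
 * dataset; buffer sizes and names are invented for this example.
 * It demonstrates an out-of-bounds write (CWE-787) that a bounded
 * model checker can report with a violating input trace. */
#include <stdio.h>
#include <string.h>

int main(void) {
    char name[8];
    char input[64];

    /* Read up to 63 characters plus a NUL into the larger buffer. */
    if (fgets(input, sizeof(input), stdin) == NULL)
        return 1;

    /* Vulnerable: strcpy performs no bounds check, so any line
     * longer than 7 characters overflows 'name'. */
    strcpy(name, input);

    printf("Hello, %s", name);
    return 0;
}

On a program like this, a bounded model checker symbolically explores executions up to a fixed bound and, when a safety property such as an array bounds check fails, emits the input trace that triggers the violation; this counterexample-backed reporting is what the abstract credits with keeping false positives low.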
Pages: 42