How secure is AI-generated code: a large-scale comparison of large language models

Cited by: 1

Authors:

Tihanyi, Norbert ^{[1,2]}

Bisztray, Tamas ^{[3]}

Ferrag, Mohamed Amine ^{[4]}

Jain, Ridhi ^{[2]}

Cordeiro, Lucas C. ^{[5,6]}

Affiliations:

[1] Eotvos Lorand Univ, Budapest, Hungary

[2] Technol Innovat Inst TII, Abu Dhabi, U Arab Emirates

[3] Univ Oslo, Oslo, Norway

[4] Guelma Univ, Guelma, Algeria

[5] Univ Manchester, Manchester, England

[6] Fed Univ Amazonas Manaus, Manaus, Brazil

Source:

EMPIRICAL SOFTWARE ENGINEERING | 2025 / Vol. 30 / Issue 02

Funding:

UK Engineering and Physical Sciences Research Council (EPSRC);

Keywords:

Large language models; Vulnerability classification; Formal verification; Software security; Artificial intelligence; Dataset; CHECKING;

DOI:

10.1007/s10664-024-10590-1

Chinese Library Classification (CLC) number:

TP31 [Computer software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

This study compares state-of-the-art Large Language Models (LLMs) on their tendency to generate vulnerabilities when writing C programs using a neutral zero-shot prompt. Tihanyi et al. introduced the FormAI dataset at PROMISE '23, featuring 112,000 C programs generated by GPT-3.5-turbo, with over 51.24% identified as vulnerable. We extended that research with a large-scale study involving 9 state-of-the-art models such as OpenAI's GPT-4o-mini, Google's Gemini Pro 1.0, TII's 180 billion-parameter Falcon, Meta's 13 billion-parameter Code Llama, and several other compact models. Additionally, we introduce the FormAI-v2 dataset, which comprises 331,000 compilable C programs generated by these LLMs. Each program in the dataset is labeled based on the vulnerabilities detected in its source code through formal verification, using the Efficient SMT-based Context-Bounded Model Checker (ESBMC). This technique minimizes false positives by providing a counterexample for the specific vulnerability and reduces false negatives by thoroughly completing the verification process. Our study reveals that at least 62.07% of the generated programs are vulnerable. The differences between the models are minor, as they all show similar coding errors with slight variations. Our research highlights that while LLMs offer promising capabilities for code generation, deploying their output in a production environment requires proper risk assessment and validation.


Pages: 42

50 items in total

[1] Detecting AI-Generated Code Assignments Using Perplexity of Large Language Models
Xu, Zhenyu
Sheng, Victor S.
THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 21, 2024, : 23155 - 23162
[2] Bias of AI-generated content: an examination of news produced by large language models
Fang, Xiao
Che, Shangkun
Mao, Minjia
Zhang, Hongzhe
Zhao, Ming
Zhao, Xiaohang
SCIENTIFIC REPORTS, 2024, 14 (01)
[3] Towards Fair Detection of AI-Generated Essays in Large-Scale Writing Assessments
Jiang, Yang
Hao, Jiangang
Fauss, Michael
Li, Chen
ARTIFICIAL INTELLIGENCE IN EDUCATION: POSTERS AND LATE BREAKING RESULTS, WORKSHOPS AND TUTORIALS, INDUSTRY AND INNOVATION TRACKS, PRACTITIONERS, DOCTORAL CONSORTIUM AND BLUE SKY, AIED 2024, 2024, 2151 : 317 - 324
[4] Survey on AI-Generated Plagiarism Detection: The Impact of Large Language Models on Academic Integrity
Pudasaini, Shushanta
Miralles-Pechuan, Luis
Lillis, David
Llorens Salvador, Marisa
JOURNAL OF ACADEMIC ETHICS, 2024,
[5] AI-Generated Faces in the Real World: A Large-Scale Case Study of Twitter Profile Images
Ricker, Jonas
Assenmacher, Dennis
Holz, Thorsten
Fischer, Asja
Quiring, Erwin
PROCEEDINGS OF 27TH INTERNATIONAL SYMPOSIUM ON RESEARCH IN ATTACKS, INTRUSIONS AND DEFENSES, RAID 2024, 2024, : 513 - 530
[6] Limits of Detecting Text Generated by Large-Scale Language Models
Varshney, Lav R.
Keskar, Nitish Shirish
Socher, Richard
2020 INFORMATION THEORY AND APPLICATIONS WORKSHOP (ITA), 2020,
[7] RepairCAT: Applying Large Language Model to Fix Bugs in AI-Generated Programs
Jiang, Nan
Wu, Yi
2024 ACM/IEEE INTERNATIONAL WORKSHOP ON AUTOMATED PROGRAM REPAIR, APR 2024, 2024, : 58 - 60
[8] Catalyst for large-scale digital twin applications AI-generated 3D digital surface models from digital orthophotos
Voigt, Jakob
Vettermann, Ferdinand
Heller, Johann
GIM INTERNATIONAL-THE WORLDWIDE MAGAZINE FOR GEOMATICS, 2023, 37 (07): : 31 - 33
[9] Deus Ex Machina and Personas from Large Language Models: Investigating the Composition of AI-Generated Persona Descriptions
Salminen, Joni
Liu, Chang
Pian, Wenjing
Chi, Jianxing
Hayhanen, Essi
Jansen, Bernard J.
PROCEEDINGS OF THE 2024 CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS, CHI 2024, 2024,
[10] CAN LARGE-LANGUAGE MODELS ACCURATELY DISCERN AI-GENERATED SEXUAL MEDICINE SCIENTIFIC LITERATURE FROM HUMAN GENERATED?
Singh, D.
Greenberg, J. W.
Shkolnik, B.
Hellstrom, W.
JOURNAL OF SEXUAL MEDICINE, 2024, 21
