How secure is AI-generated code: a large-scale comparison of large language models

Cited by: 1
Authors
Tihanyi, Norbert [1, 2]
Bisztray, Tamas [3]
Ferrag, Mohamed Amine [4]
Jain, Ridhi [2]
Cordeiro, Lucas C. [5, 6]
Affiliations
[1] Eotvos Lorand Univ, Budapest, Hungary
[2] Technol Innovat Inst TII, Abu Dhabi, U Arab Emirates
[3] Univ Oslo, Oslo, Norway
[4] Guelma Univ, Guelma, Algeria
[5] Univ Manchester, Manchester, England
[6] Fed Univ Amazonas Manaus, Manaus, Brazil
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK;
Keywords
Large language models; Vulnerability classification; Formal verification; Software security; Artificial intelligence; Dataset; CHECKING;
DOI
10.1007/s10664-024-10590-1
Chinese Library Classification number
TP31 [Computer software];
Discipline classification code
081202; 0835;
Abstract
This study compares state-of-the-art Large Language Models (LLMs) on their tendency to generate vulnerabilities when writing C programs using a neutral zero-shot prompt. Tihanyi et al. introduced the FormAI dataset at PROMISE '23, featuring 112,000 C programs generated by GPT-3.5-turbo, of which over 51.24% were identified as vulnerable. We extend that research with a large-scale study involving nine state-of-the-art models, including OpenAI's GPT-4o-mini, Google's Gemini Pro 1.0, TII's 180-billion-parameter Falcon, Meta's 13-billion-parameter Code Llama, and several other compact models. Additionally, we introduce the FormAI-v2 dataset, which comprises 331,000 compilable C programs generated by these LLMs. Each program in the dataset is labeled based on the vulnerabilities detected in its source code through formal verification, using the Efficient SMT-based Context-Bounded Model Checker (ESBMC). This technique minimizes false positives by providing a counterexample for each specific vulnerability and reduces false negatives by thoroughly completing the verification process. Our study reveals that at least 62.07% of the generated programs are vulnerable. The differences between the models are minor, as they all show similar coding errors with slight variations. Our research highlights that while LLMs offer promising capabilities for code generation, deploying their output in a production environment requires proper risk assessment and validation.
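One way to read the "at least" in the reported vulnerability rate is as a lower bound: a program counts as vulnerable only when the checker produced a counterexample, while programs whose verification was inconclusive may still hide undetected bugs. The sketch below illustrates that arithmetic in Python. The outcome labels (`failed`, `successful`, `unknown`, `timeout`), the function names, and the sample data are all hypothetical and are not taken from the FormAI-v2 dataset or from ESBMC's output format.

```python
from collections import Counter

# Hypothetical outcome labels, chosen for illustration only.
VULNERABLE = "failed"                    # a counterexample was produced
INCONCLUSIVE = {"unknown", "timeout"}    # verification did not finish


def vulnerable_lower_bound(outcomes):
    """Fraction of programs with a confirmed counterexample."""
    if not outcomes:
        raise ValueError("no outcomes given")
    counts = Counter(outcomes)
    return counts[VULNERABLE] / len(outcomes)


def vulnerable_upper_bound(outcomes):
    """Fraction if every inconclusive program were also vulnerable."""
    if not outcomes:
        raise ValueError("no outcomes given")
    counts = Counter(outcomes)
    inconclusive = sum(counts[label] for label in INCONCLUSIVE)
    return (counts[VULNERABLE] + inconclusive) / len(outcomes)


# Made-up sample: 6 confirmed, 3 verified safe, 1 timed out.
sample = ["failed"] * 6 + ["successful"] * 3 + ["timeout"]
print(f"lower bound: {vulnerable_lower_bound(sample):.2%}")
print(f"upper bound: {vulnerable_upper_bound(sample):.2%}")
```

Under these assumptions the true rate lies between the two bounds, which is why a figure based only on confirmed counterexamples is conservative.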
Pages: 42