Oracle-Guided Program Selection from Large Language Models

Cited by: 0
Authors
Fan, Zhiyu [1 ]
Ruan, Haifeng [1 ]
Mechtaev, Sergey [2 ]
Roychoudhury, Abhik [1 ]
Affiliations
[1] Natl Univ Singapore, Singapore, Singapore
[2] Peking Univ, Beijing, Peoples R China
Funding
National Research Foundation, Singapore;
Keywords
large language model; code generation; oracle inference; differential testing;
DOI
10.1145/3650212.3680308
Abstract
While large language models (LLMs) have shown significant advancements in code generation, their susceptibility to producing incorrect code poses a significant challenge to the adoption of LLM-generated programs. This issue largely stems from the reliance on natural language descriptions as informal oracles in code generation. Current strategies to mitigate this involve selecting the best program from multiple LLM-generated alternatives, judged by criteria such as the consistency of their execution results on an LLM-generated test suite. However, this approach has crucial limitations: (1) LLMs often generate redundant tests or tests that cannot distinguish between correct and incorrect solutions; (2) the consistency criteria used, such as the majority vote, fail to foster developer trust due to the absence of a transparent rationale behind the choices made. In this work, we propose a new perspective on increasing the quality of LLM-generated code via program selection using the LLM as a test oracle. Our method is based on our experimentally confirmed observation that LLMs serve more effectively as oracles when tasked with selecting the correct output from multiple choices. Leveraging this insight, we first generate distinguishing inputs that capture semantic discrepancies among programs sampled from an LLM, and record the outputs produced by the programs on these inputs. An LLM then selects the output most likely to be correct from these, guided by the natural language problem description. We implemented this idea in a tool, LLMCodeChoice, and evaluated its accuracy in generating and selecting standalone programs. Our experiments demonstrated its effectiveness in improving pass@1 by 3.6-7% on the HumanEval and MBPP benchmarks compared to the state-of-the-art CodeT. Most interestingly, the selected input-output specifications helped us uncover incompleteness and ambiguities in task descriptions and also identify incorrect ground-truth implementations in the benchmarks.
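To make the selection loop the abstract describes concrete, below is a minimal Python sketch of the procedure: run candidate programs on LLM-generated distinguishing inputs, ask the LLM-as-oracle to pick the correct output, and prune candidates that disagree. All names here (select_program, run, gen_input, pick_output) are illustrative stand-ins, not the paper's actual API; the two LLM calls are injected as callables so the skeleton runs standalone with toy stand-ins.

```python
"""Hypothetical sketch of oracle-guided program selection.
The LLM components (distinguishing-input generation and output
selection) are parameters; nothing here is the paper's artifact."""
from typing import Any, Callable, List

def run(program: str, test_input: Any) -> Any:
    """Execute a candidate program (source defining solve()) on one
    input, returning its output or an error marker."""
    env: dict = {}
    try:
        exec(program, env)                      # define solve()
        return env["solve"](test_input)
    except Exception as exc:
        return ("<error>", type(exc).__name__)

def select_program(
    description: str,
    candidates: List[str],
    gen_input: Callable[[List[str]], Any],      # LLM: distinguishing input
    pick_output: Callable[[str, Any, List[Any]], Any],  # LLM: output oracle
    rounds: int = 3,
) -> str:
    """Iteratively keep only candidates that agree with the
    oracle-chosen output on each distinguishing input."""
    alive = list(candidates)
    for _ in range(rounds):
        if len(alive) <= 1:
            break
        x = gen_input(alive)                    # input exposing discrepancies
        outputs = [run(p, x) for p in alive]
        distinct = list({repr(o): o for o in outputs}.values())
        if len(distinct) <= 1:
            continue                            # input did not distinguish
        chosen = pick_output(description, x, distinct)
        alive = [p for p, o in zip(alive, outputs) if repr(o) == repr(chosen)]
    return alive[0] if alive else candidates[0] # fall back if all pruned

if __name__ == "__main__":
    # Toy demonstration with hard-coded stand-ins for the LLM calls.
    progs = [
        "def solve(n):\n    return n * 2",      # matches the description
        "def solve(n):\n    return n + 2",      # plausible but wrong
    ]
    best = select_program(
        "Double the given number.",
        progs,
        gen_input=lambda ps: 3,                 # input 3 yields 6 vs 5
        pick_output=lambda d, x, outs: 6,       # oracle picks 6 for input 3
    )
    print(best)                                 # prints the doubling program
```

Note the framing of the second callable: the oracle is asked to pick the correct output for a concrete input rather than to judge whole programs, which is the key observation the abstract reports about when LLMs work well as oracles.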
Pages: 628-640
Number of pages: 13
Related papers
50 records in total
  • [31] Guiding Enumerative Program Synthesis with Large Language Models
    Li, Yixuan
    Parsert, Julian
    Polgreen, Elizabeth
    COMPUTER AIDED VERIFICATION, PT II, CAV 2024, 2024, 14682 : 280 - 301
  • [32] Large language models in textual analysis for gesture selection
    Hensel, Laura B.
    Yongsatianchot, Nutchanon
    Torshizi, Parisa
    Minucci, Elena
    Marsella, Stacy
    PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2023, 2023, : 378 - 387
  • [33] PassGPT: Password Modeling and (Guided) Generation with Large Language Models
    Rando, Javier
    Perez-Cruz, Fernando
    Hitaj, Briland
    COMPUTER SECURITY - ESORICS 2023, PT IV, 2024, 14347 : 164 - 183
  • [34] Supporting the Development of Oracle APEX Low-Code Applications with Large Language Models
    Gorissen, Simon C.
    Sauer, Stefan
    Beckmann, Wolf G.
    PRODUCT-FOCUSED SOFTWARE PROCESS IMPROVEMENT, PROFES 2024, 2025, 15452 : 221 - 237
  • [35] Performance of Large Language Models in a Computer Science Degree Program
    Krueger, Tim
    Gref, Michael
    ARTIFICIAL INTELLIGENCE-ECAI 2023 INTERNATIONAL WORKSHOPS, PT 2, XAI3, TACTIFUL, XI-ML, SEDAMI, RAAIT, AI4S, HYDRA, AI4AI, 2023, 2024, 1948 : 409 - 424
  • [36] Enhancing Program Synthesis with Large Language Models Using Many-Objective Grammar-Guided Genetic Programming
    Tao, Ning
    Ventresque, Anthony
    Nallur, Vivek
    Saber, Takfarinas
    ALGORITHMS, 2024, 17 (07)
  • [37] LPR: Large Language Models-Aided Program Reduction
    Zhang, Mengxiao
    Tian, Yongqiang
    Xu, Zhenyang
    Dong, Yiwen
    Tan, Shin Hwei
    Sun, Chengnian
    PROCEEDINGS OF THE 33RD ACM SIGSOFT INTERNATIONAL SYMPOSIUM ON SOFTWARE TESTING AND ANALYSIS, ISSTA 2024, 2024, : 261 - 273
  • [38] From Large Language Models to Large Multimodal Models: A Literature Review
    Huang, Dawei
    Yan, Chuan
    Li, Qing
    Peng, Xiaojiang
    APPLIED SCIENCES-BASEL, 2024, 14 (12)
  • [39] Causal-Guided Active Learning for Debiasing Large Language Models
    Sun, Zhouhao
    Du, Li
    Ding, Xiao
    Ma, Yixuan
    Zhao, Yang
    Qiu, Kaitao
    Liu, Ting
    Qin, Bing
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 14455 - 14469
  • [40] ELAD: Explanation-Guided Large Language Models Active Distillation
    Zhang, Yifei
    Pan, Bo
    Ling, Chen
    Hu, Yuntong
    Zhao, Liang
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 4463 - 4475