Oracle-Guided Program Selection from Large Language Models

Cited: 0
Authors
Fan, Zhiyu [1 ]
Ruan, Haifeng [1 ]
Mechtaev, Sergey [2 ]
Roychoudhury, Abhik [1 ]
Affiliations
[1] National University of Singapore, Singapore
[2] Peking University, Beijing, People's Republic of China
Funding
National Research Foundation of Singapore
Keywords
large language model; code generation; oracle inference; differential testing
DOI
10.1145/3650212.3680308
Abstract
While large language models (LLMs) have shown significant advances in code generation, their susceptibility to producing incorrect code poses a serious challenge to the adoption of LLM-generated programs. This issue largely stems from the reliance on natural language descriptions as informal oracles in code generation. Current strategies to mitigate this involve selecting the best program from multiple LLM-generated alternatives, judged by criteria such as the consistency of their execution results on an LLM-generated test suite. However, this approach has crucial limitations: (1) LLMs often generate redundant tests or tests that cannot distinguish between correct and incorrect solutions, and (2) the consistency criteria used, such as majority voting, fail to foster developer trust because there is no transparent rationale behind the choices made. In this work, we propose a new perspective on improving the quality of LLM-generated code via program selection using the LLM as a test oracle. Our method is based on our experimentally confirmed observation that LLMs serve more effectively as oracles when tasked with selecting the correct output from multiple choices. Leveraging this insight, we first generate distinguishing inputs that capture semantic discrepancies among programs sampled from an LLM, and record the outputs produced by the programs on these inputs. An LLM then selects the output most likely to be correct, guided by the natural language problem description. We implemented this idea in a tool, LLMCodeChoice, and evaluated its accuracy in generating and selecting standalone programs. Our experiments demonstrated its effectiveness in improving pass@1 by 3.6-7% on the HumanEval and MBPP benchmarks compared to the state-of-the-art CodeT. Most interestingly, the selected input-output specifications helped us uncover incompleteness and ambiguities in task descriptions and also identify incorrect ground-truth implementations in the benchmarks.
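The selection loop below is a minimal sketch of the approach the abstract describes, under stated assumptions: candidates is a list of executable programs sampled from an LLM, distinguishing_inputs comes from a separate input-generation step, and llm_choose_output is a hypothetical callback standing in for the LLM output oracle. These names are illustrative only, not the paper's actual API.

    from collections import Counter

    def run_safely(program, x):
        # Execute one candidate on one input; a crash is treated as its own
        # distinct "output" so that crashing candidates can be voted against.
        try:
            return program(x)
        except Exception as e:
            return ("<error>", type(e).__name__)

    def select_program(candidates, distinguishing_inputs, task_description,
                       llm_choose_output):
        # Score each candidate by how often its output matches the output the
        # LLM oracle picks as correct; return the highest-scoring program.
        scores = Counter()
        for x in distinguishing_inputs:
            outputs = [run_safely(p, x) for p in candidates]
            # Deduplicate outputs (by repr, so unhashable values also work).
            seen, distinct = set(), []
            for out in outputs:
                if repr(out) not in seen:
                    seen.add(repr(out))
                    distinct.append(out)
            if len(distinct) == 1:
                continue  # all candidates agree: input is not distinguishing
            # Hypothetical oracle call: given the task description, the input,
            # and the candidate outputs, the LLM returns the output it judges
            # correct (the multiple-choice setting the abstract argues for).
            chosen = llm_choose_output(task_description, x, distinct)
            for i, out in enumerate(outputs):
                if repr(out) == repr(chosen):
                    scores[i] += 1
        return candidates[max(range(len(candidates)), key=lambda i: scores[i])]

Voting on whole outputs rather than whole programs keeps each oracle query a small multiple-choice question, and the winning input-output pairs double as a human-readable rationale for the selection, which is the transparency benefit the abstract highlights.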
Pages: 628-640
Page count: 13