Oracle-Guided Program Selection from Large Language Models

Cited: 0
Authors
Fan, Zhiyu [1 ]
Ruan, Haifeng [1 ]
Mechtaev, Sergey [2 ]
Roychoudhury, Abhik [1 ]
Affiliations
[1] Natl Univ Singapore, Singapore, Singapore
[2] Peking Univ, Beijing, Peoples R China
Funding
National Research Foundation, Singapore;
Keywords
large language model; code generation; oracle inference; differential testing;
DOI
10.1145/3650212.3680308
Abstract
While large language models (LLMs) have shown significant advancements in code generation, their susceptibility to producing incorrect code poses a major challenge to the adoption of LLM-generated programs. This issue largely stems from the reliance on natural language descriptions as informal oracles in code generation. Current strategies to mitigate this involve selecting the best program from multiple LLM-generated alternatives, judged by criteria such as the consistency of their execution results on an LLM-generated test suite. However, this approach has crucial limitations: (1) LLMs often generate redundant tests, or tests that cannot distinguish between correct and incorrect solutions; (2) the consistency criteria used, such as majority voting, fail to foster developer trust because they offer no transparent rationale for the choices made. In this work, we propose a new perspective on increasing the quality of LLM-generated code via program selection, using the LLM as a test oracle. Our method is based on our experimentally confirmed observation that LLMs serve more effectively as oracles when tasked with selecting the correct output from multiple choices. Leveraging this insight, we first generate distinguishing inputs that capture semantic discrepancies among programs sampled from an LLM, and record the outputs these programs produce on such inputs. An LLM then selects the output most likely to be correct, guided by the natural language problem description. We implemented this idea in a tool, LLMCodeChoice, and evaluated its accuracy in generating and selecting standalone programs. Our experiments demonstrated its effectiveness in improving pass@1 by 3.6-7% on the HumanEval and MBPP benchmarks compared to the state-of-the-art CodeT. Most interestingly, the selected input-output specifications helped us uncover incompleteness and ambiguities in task descriptions, and also identify incorrect ground-truth implementations in the benchmarks.
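The selection pipeline the abstract describes (sample candidate programs, find inputs on which they disagree, record the outputs, let an LLM pick the most plausible output) can be illustrated with a minimal Python sketch, Python being the language of the HumanEval and MBPP benchmarks. This is an illustration under stated assumptions, not the paper's implementation: the names (run_candidate, distinguishing_inputs, select_program, pick_output) are hypothetical, distinguishing inputs are filtered from a given pool rather than generated by an LLM, and the agreement-count ranking is one plausible way to aggregate the oracle's choices.

def run_candidate(src: str, func_name: str, args: tuple) -> str:
    """Run one candidate program on a single input and return a printable
    output. exec() on untrusted code is unsafe; a real system would sandbox
    this step."""
    namespace = {}
    try:
        exec(src, namespace)                        # define the candidate function
        return repr(namespace[func_name](*args))
    except Exception as e:
        return "<error: %s>" % type(e).__name__

def distinguishing_inputs(candidates, func_name, input_pool):
    """Keep only inputs on which the candidates disagree; these expose the
    semantic discrepancies between the sampled programs."""
    kept = []
    for args in input_pool:
        outputs = {run_candidate(c, func_name, args) for c in candidates}
        if len(outputs) > 1:                        # candidates disagree here
            kept.append(args)
    return kept

def select_program(candidates, func_name, input_pool, pick_output):
    """Score each candidate by how often its output matches the one the
    oracle prefers, and return the best-scoring candidate. pick_output(args,
    options) stands in for the LLM choosing among the observed outputs."""
    scores = [0] * len(candidates)
    for args in distinguishing_inputs(candidates, func_name, input_pool):
        outs = [run_candidate(c, func_name, args) for c in candidates]
        preferred = pick_output(args, sorted(set(outs)))
        for i, out in enumerate(outs):
            if out == preferred:
                scores[i] += 1
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]

# Toy run: two candidates for absolute value, one buggy on negative inputs.
candidates = [
    "def f(x):\n    return x if x > 0 else -x",     # correct
    "def f(x):\n    return x",                      # wrong for x < 0
]
oracle = lambda args, options: repr(abs(*args))     # stand-in for the LLM oracle
print(select_program(candidates, "f", [(-3,), (0,), (5,)], oracle))

In this toy run the only distinguishing input is (-3,); the oracle's choice of 3 as the correct output there identifies the first candidate as the program to select.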
Pages: 628-640
Page count: 13