LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation

Cited by: 1
Authors
Fakhoury, Sarah [1]
Naik, Aaditya [2]
Sakkas, Georgios [3]
Chakraborty, Saikat [1]
Lahiri, Shuvendu K. [1]
Affiliations
[1] Microsoft Res, Redmond, WA 98052 USA
[2] Univ Penn, Philadelphia, PA 19104 USA
[3] Univ Calif San Diego, San Diego, CA 92037 USA
Keywords
Intent disambiguation; code generation; LLMs; human factors; cognitive load; test generation
DOI
10.1109/TSE.2024.3428972
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Large language models (LLMs) have shown great potential in automating significant aspects of coding by producing natural code from informal natural language (NL) intent. However, because NL is informal, it does not lend itself easily to checking that the generated code correctly satisfies the user intent. In this paper, we propose a novel interactive workflow, TiCoder, for guided intent clarification (i.e., partial formalization) through tests to support the generation of more accurate code suggestions. Through a mixed-methods user study with 15 programmers, we present an empirical evaluation of the effectiveness of the workflow in improving code generation accuracy. We find that participants using the proposed workflow are significantly more likely to correctly evaluate AI-generated code, and report significantly less task-induced cognitive load. Furthermore, we test the potential of the workflow at scale with four different state-of-the-art LLMs on two Python datasets, using an idealized proxy for user feedback. We observe an average absolute improvement of 45.97% in pass@1 code generation accuracy for both datasets and across all LLMs within 5 user interactions, in addition to the automatic generation of accompanying unit tests.
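The abstract describes clarifying intent through tests; the following is a minimal sketch of that general idea, not the authors' implementation. It assumes candidate programs are plain Python callables, tests are (input, expected output) pairs, and the user is modeled by a yes/no oracle. The names `passes`, `prune_with_feedback`, and `user_approves` are hypothetical, and the rule of discarding candidates that satisfy a rejected test is an assumption made for illustration.

```python
from typing import Callable, List, Tuple

Candidate = Callable[[int], int]  # hypothetical: one generated implementation
Test = Tuple[int, int]            # hypothetical: (input, expected output)


def passes(candidate: Candidate, test: Test) -> bool:
    """Return True if the candidate produces the expected output for the test."""
    arg, expected = test
    try:
        return candidate(arg) == expected
    except Exception:
        return False


def prune_with_feedback(
    candidates: List[Candidate],
    tests: List[Test],
    user_approves: Callable[[Test], bool],
    max_interactions: int = 5,
) -> Tuple[List[Candidate], List[Test]]:
    """Show up to max_interactions tests to the user and prune candidates.

    Assumed rule: an approved test keeps only candidates that pass it;
    a rejected test keeps only candidates that do not pass it.
    """
    remaining = list(candidates)
    approved: List[Test] = []
    for test in tests[:max_interactions]:
        if user_approves(test):
            approved.append(test)
            remaining = [c for c in remaining if passes(c, test)]
        else:
            remaining = [c for c in remaining if not passes(c, test)]
    return remaining, approved


# Toy usage: the (hypothetical) intent is "return the absolute value".
candidates = [lambda x: x, lambda x: abs(x), lambda x: -x]
tests = [(-3, 3), (2, -2)]
remaining, approved = prune_with_feedback(
    candidates, tests, user_approves=lambda t: t[1] == abs(t[0])
)
```

In this toy run the first test is approved and the second is rejected, which leaves a single candidate; the approved tests double as unit tests that accompany the surviving code.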
Pages: 2254-2268
Page count: 15
Related papers
39 in total
  • [31] Evaluation of the potential of LLM-based generative AIs in nutrition education: a comparative study of ChatGPT and Bing for Japanese registered dietitian licensure exam preparation
    Kosai, M.
    Nagamori, Y.
    Kawai, Y.
    Marumo, H.
    Shibuya, M.
    Negishi, T.
    Sawai, A.
    Miyamoto, L.
    DIABETOLOGIA, 2024, 67 : S396 - S396
  • [32] An Empirical Evaluation of Behavioral UML Diagrams Based on the Comprehension of Test Case Generation
    Hashim, Nor Laily
    Ibrahim, Haitham Raed
    Rejab, Mawarny Md.
    Romli, Rohaida
    Mohd, Haslina
    ADVANCED SCIENCE LETTERS, 2018, 24 (10) : 7257 - 7262
  • [33] From Fine-tuning to Output: An Empirical Investigation of Test Smells in Transformer-Based Test Code Generation
    Aljohani, Ahmed
    Do, Hyunsook
    39TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, SAC 2024, 2024, : 1282 - 1291
  • [34] Test-driven simulation modelling: A case study using agent-based maritime search-operation simulation
    Onggo, Bhakti Stephan
    Karatas, Mumtaz
    EUROPEAN JOURNAL OF OPERATIONAL RESEARCH, 2016, 254 (02) : 517 - 531
  • [35] AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators
    Agostini, Nicolas Bohm
    Haris, Jude
    Gibson, Perry
    Jayaweera, Malith
    Rubin, Norm
    Tumeo, Antonino
    Abellan, Jose L.
    Cano, Jose
    Kaeli, David
    2024 IEEE/ACM INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION, CGO, 2024, : 143 - 157
  • [36] Empirical Evaluation of the Quality of Conceptual Models Based on User Perceptions: A Case Study in the Transport Domain
    Cruzes, Daniela S.
    Vennesland, Audun
    Natvig, Marit K.
    CONCEPTUAL MODELING, ER 2013, 2013, 8217 : 414 - 428
  • [37] Optimizing Search-Based Unit Test Generation with Large Language Models: An Empirical Study
    Xiao, Danni
    Guo, Yimeng
    Li, Yanhui
    Chen, Lin
    PROCEEDINGS OF THE 15TH ASIA-PACIFIC SYMPOSIUM ON INTERNETWARE, INTERNETWARE 2024, 2024, : 71 - 80
  • [38] MEdit4CEP-Gam: A model-driven approach for user-friendly gamification design, monitoring and code generation in CEP-based systems
    Calderon, Alejandro
    Boubeta-Puig, Juan
    Ruiz, Mercedes
    INFORMATION AND SOFTWARE TECHNOLOGY, 2018, 95 : 238 - 264
  • [39] An extensive power evaluation of a novel two-sample density-based empirical likelihood ratio test for paired data with an application to a treatment study of attention-deficit/hyperactivity disorder and severe mood dysregulation
    Tsai, Wan-Min
    Vexler, Albert
    Gurevich, Gregory
    JOURNAL OF APPLIED STATISTICS, 2013, 40 (06) : 1189 - 1208