LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation

Cited by: 1

Authors:

Fakhoury, Sarah [1]

Naik, Aaditya [2]

Sakkas, Georgios [3]

Chakraborty, Saikat [1]

Lahiri, Shuvendu K. [1]

Affiliations:

[1] Microsoft Res, Redmond, WA 98052 USA

[2] Univ Penn, Philadelphia, PA 19104 USA

[3] Univ Calif San Diego, San Diego, CA 92037 USA

Source:

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING | 2024 / Vol. 50 / No. 09

Keywords:

Intent disambiguation; code generation; LLMs; human factors; cognitive load; test generation;

DOI:

10.1109/TSE.2024.3428972

Chinese Library Classification:

TP31 [Computer Software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

Large language models (LLMs) have shown great potential in automating significant aspects of coding by producing natural code from informal natural language (NL) intent. However, because NL is informal, it does not lend itself easily to checking that the generated code correctly satisfies the user intent. In this paper, we propose a novel interactive workflow, TiCoder, for guided intent clarification (i.e., partial formalization) through tests to support the generation of more accurate code suggestions. Through a mixed-methods user study with 15 programmers, we present an empirical evaluation of the effectiveness of the workflow to improve code generation accuracy. We find that participants using the proposed workflow are significantly more likely to correctly evaluate AI-generated code, and report significantly less task-induced cognitive load. Furthermore, we test the potential of the workflow at scale with four different state-of-the-art LLMs on two Python datasets, using an idealized proxy for user feedback. We observe an average absolute improvement of 45.97% in the pass@1 code generation accuracy for both datasets and across all LLMs within 5 user interactions, in addition to the automatic generation of accompanying unit tests.

Pages: 2254 - 2268

Page count: 15
