Evaluating Language Models for Generating and Judging Programming Feedback

Cited by: 0
Authors
Koutcheme, Charles [1 ]
Dainese, Nicola [1 ]
Sarsa, Sami [2 ]
Hellas, Arto [1 ]
Leinonen, Juho [1 ]
Ashraf, Syed [1 ]
Denny, Paul [3 ]
Affiliations
[1] Aalto Univ, Espoo, Finland
[2] Univ Jyvaskyla, Jyvaskyla, Finland
[3] Univ Auckland, Auckland, New Zealand
Keywords
open source; large language models; generative AI; automatic feedback; automatic evaluation; programming feedback; LLM-as-a-judge;
DOI
Not available
CLC Number
TP39 [Applications of Computing];
Discipline Code
081203; 0835;
Abstract
The emergence of large language models (LLMs) has transformed research and practice across a wide range of domains. Within the computing education research (CER) domain, LLMs have garnered significant attention, particularly in the context of learning programming. Much of the work on LLMs in CER, however, has focused on applying and evaluating proprietary models. In this article, we evaluate the efficiency of open-source LLMs in generating high-quality feedback for programming assignments and judging the quality of programming feedback, contrasting the results with proprietary models. Our evaluations on a dataset of students' submissions to introductory Python programming exercises suggest that state-of-the-art open-source LLMs are nearly on par with proprietary models in both generating and assessing programming feedback. Additionally, we demonstrate the efficiency of smaller LLMs in these tasks and highlight the wide range of LLMs accessible, even for free, to educators and practitioners.
Pages: 624-630
Page count: 7
Related Papers
50 items in total
  • [21] The Power and the Portability - Evaluating the C Programming Language
    Miller, P
    Watson, JA
    ELECTRONICS, 1984, 57 (08): 152-154
  • [22] Evaluating a Natural Language Interface for Behavioral Programming
    Gordon, Michal
    Harel, David
    2012 IEEE SYMPOSIUM ON VISUAL LANGUAGES AND HUMAN-CENTRIC COMPUTING (VL/HCC), 2012, : 167 - 170
  • [24] Prompting Is Programming: A Query Language for Large Language Models
    Beurer-Kellner, Luca
    Fischer, Marc
    Vechev, Martin
    PROCEEDINGS OF THE ACM ON PROGRAMMING LANGUAGES-PACMPL, 2023, 7 (PLDI):
  • [25] A Survey of Programming Language Memory Models
    Moiseenko, E.
    Podkopaev, A.
    Koznov, D.
    PROGRAMMING AND COMPUTER SOFTWARE, 2021, 47 (06) : 439 - 456
  • [27] Generating Profiles of News Commentators with Language Models
    Power, William
    Obradovic, Zoran
    ARTIFICIAL INTELLIGENCE APPLICATIONS AND INNOVATIONS, PT II, AIAI 2024, 2024, 712 : 47 - 59
  • [28] Generating Benchmarks for Factuality Evaluation of Language Models
    Muhlgay, Dor
    Ram, Ori
    Magar, Inbal
    Levine, Yoav
    Ratner, Nir
    Belinkov, Yonatan
    Abend, Omri
    Leyton-Brown, Kevin
    Shashua, Amnon
    Shoham, Yoav
    PROCEEDINGS OF THE 18TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 49 - 66
  • [29] A New Methodology for Generating Test Cases for a Programming Language Compiler
    Berry, DM
    SIGPLAN NOTICES, 1983, 18 (02): 46-56
  • [30] Conjunction of a Programming Language and Text Formatter for Generating Randomized Questionnaires
    Cohen, AJ
    Foley, JE
    BEHAVIOR RESEARCH METHODS INSTRUMENTS & COMPUTERS, 1984, 16 (06): 545-547