Evaluating Language Models for Generating and Judging Programming Feedback

Cited by: 0
Authors
Koutcheme, Charles [1 ]
Dainese, Nicola [1 ]
Sarsa, Sami [2 ]
Hellas, Arto [1 ]
Leinonen, Juho [1 ]
Ashraf, Syed [1 ]
Denny, Paul [3 ]
Affiliations
[1] Aalto Univ, Espoo, Finland
[2] Univ Jyvaskyla, Jyvaskyla, Finland
[3] Univ Auckland, Auckland, New Zealand
Keywords
open source; large language models; generative AI; automatic feedback; automatic evaluation; programming feedback; LLM-as-a-judge;
DOI
Not available
Chinese Library Classification
TP39 [Applications of Computers]
Discipline classification codes
081203; 0835
Abstract
The emergence of large language models (LLMs) has transformed research and practice across a wide range of domains. Within the computing education research (CER) domain, LLMs have garnered significant attention, particularly in the context of learning programming. Much of the work on LLMs in CER, however, has focused on applying and evaluating proprietary models. In this article, we evaluate the efficiency of open-source LLMs in generating high-quality feedback for programming assignments and judging the quality of programming feedback, contrasting the results with proprietary models. Our evaluations on a dataset of students' submissions to introductory Python programming exercises suggest that state-of-the-art open-source LLMs are nearly on par with proprietary models in both generating and assessing programming feedback. Additionally, we demonstrate the efficiency of smaller LLMs in these tasks and highlight the wide range of LLMs accessible, even for free, to educators and practitioners.
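For context on the setup the abstract describes, the workflow pairs feedback generation with LLM-as-a-judge evaluation. Below is a minimal Python sketch of that two-stage loop; the complete() wrapper, prompt wording, and scoring rubric are hypothetical placeholders for illustration, not the authors' actual prompts, models, or rubric.

# Minimal sketch of the generate-then-judge workflow outlined in the
# abstract. Everything here is illustrative: complete() is a hypothetical
# stand-in for whatever open-source or proprietary model client is used,
# and the prompts are not the authors' actual prompts.

def complete(prompt: str) -> str:
    """Hypothetical wrapper around an LLM completion endpoint."""
    raise NotImplementedError("Plug in your model client here.")

FEEDBACK_PROMPT = """You are a programming tutor. A student submitted the
Python code below for the given exercise. Point out the issues concisely
and encouragingly, without revealing a full corrected solution.

Exercise: {exercise}

Student submission:
{submission}
"""

JUDGE_PROMPT = """You are judging the quality of feedback given to a
programming student. Score it from 1 (poor) to 5 (excellent) on whether
it (a) identifies the real issues, (b) is clear, and (c) does not reveal
the full solution. Answer with the score and one sentence of justification.

Exercise: {exercise}

Student submission:
{submission}

Feedback to judge:
{feedback}
"""

def generate_feedback(exercise: str, submission: str) -> str:
    # Stage 1: an LLM drafts feedback on the student's code.
    return complete(FEEDBACK_PROMPT.format(
        exercise=exercise, submission=submission))

def judge_feedback(exercise: str, submission: str, feedback: str) -> str:
    # Stage 2: a (possibly different) LLM rates that feedback.
    return complete(JUDGE_PROMPT.format(
        exercise=exercise, submission=submission, feedback=feedback))

In the paper's framing, the interesting comparison is running both stages with open-source models and contrasting the outcomes against proprietary ones.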
Pages: 624-630
Page count: 7
Related papers
50 in total (items 31-40 shown)
  • [31] Large language models and humans converge in judging public figures' personalities
    Cao, Xubo
    Kosinski, Michal
    PNAS NEXUS, 2024, 3 (10)
  • [32] AskIt: Unified Programming Interface for Programming with Large Language Models
    Okuda, Katsumi
    Amarasinghe, Saman
    2024 IEEE/ACM INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION, CGO, 2024: 41-54
  • [33] Evaluating GPU Programming Models for the LUMI Supercomputer
    Markomanolis, George S.
    Alpay, Aksel
    Young, Jeffrey
    Klemm, Michael
    Malaya, Nicholas
    Esposito, Aniello
    Heikonen, Jussi
    Bastrakov, Sergei
    Debus, Alexander
    Kluge, Thomas
    Steiniger, Klaus
    Stephan, Jan
    Widera, Rene
    Bussmann, Michael
    SUPERCOMPUTING FRONTIERS, SCFA 2022, 2022, 13214: 79-101
  • [34] Evaluating generative patent language models
    Lee, Jieh-Sheng
    WORLD PATENT INFORMATION, 2023, 72
  • [35] Programming Evaluating 2nd Language CAI
    Tuttle, H. G.
    FOREIGN LANGUAGE ANNALS, 1983, 16 (01): 35-39
  • [36] Evaluating Approaches to Personalizing Language Models
    King, Milton
    Cook, Paul
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020: 2461-2469
  • [37] Evaluating Text GANs as Language Models
    Tevet, Guy
    Habib, Gavriel
    Shwartz, Vered
    Berant, Jonathan
    2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, 2019: 2241-2247
  • [38] LISP, a Programming Language and Its Computational Models
    Hahne, K.
    Mockel, P.
    Thurlings, K. J.
    SIEMENS FORSCHUNGS- UND ENTWICKLUNGSBERICHTE / SIEMENS RESEARCH AND DEVELOPMENT REPORTS, 1988, 17 (02): 52-58
  • [39] Automatic implementation of programming language consistency models
    Sura, Z.
    Wong, C. L.
    Fang, X.
    Lee, J. J.
    Midkiff, S. P.
    Padua, D.
    LANGUAGES AND COMPILERS FOR PARALLEL COMPUTING, 2005, 2481: 172-187
  • [40] The C Language and Models for Systems Programming
    Johnson, S. C.
    Kernighan, B. W.
    BYTE, 1983, 8 (08): 48-&