A Field Guide to Automatic Evaluation of LLM-Generated Summaries

Cited: 0
Authors
van Schaik, Tempest A. [1]
Pugh, Brittany [1]
Affiliations
[1] Microsoft, Redmond, WA 98052 USA
Keywords
Evaluation metrics; LLMs; summarization; offline evaluation
DOI
10.1145/3626772.3661346
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Large Language Models (LLMs) are rapidly being adopted for tasks such as text summarization across a wide range of industries. This has driven the need for scalable, automatic, reliable, and cost-effective methods to evaluate the quality of LLM-generated text. What is meant by evaluating an LLM is not yet well defined, and expectations vary widely about what kind of information evaluation should produce. Evaluation methods developed for traditional Natural Language Processing (NLP) tasks (before the rise of LLMs) remain applicable but are not sufficient for capturing high-level semantic qualities of summaries. Emerging methods that use LLMs to evaluate LLM output appear powerful but lack reliability. New characteristics of LLM-generated text that were not present in previous NLP tasks, such as hallucination artifacts, must also be considered. We outline the different types of LLM evaluation currently used in the literature but focus on offline, system-level evaluation of the text generated by LLMs. Evaluating LLM-generated summaries is a complex and fast-evolving area, and we propose strategies for applying evaluation methods that avoid common pitfalls. Despite having promising strategies for evaluating LLM summaries, we highlight some open challenges that remain.
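The traditional NLP metrics the abstract refers to include n-gram overlap measures such as ROUGE. As a minimal illustrative sketch (not the authors' implementation), ROUGE-1 precision, recall, and F1 between a reference summary and an LLM-generated candidate can be computed with clipped unigram counts:

```python
from collections import Counter

def rouge_1(reference: str, candidate: str) -> dict:
    """ROUGE-1 precision/recall/F1 between a reference summary and a
    candidate summary, using simple whitespace tokenization.
    Illustrative only; production metrics apply stemming, tokenization
    rules, and multi-reference aggregation."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each candidate unigram counts at most as often
    # as it appears in the reference.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge_1("the cat sat on the mat", "the cat lay on the mat")
```

As the abstract notes, such surface-overlap scores cannot capture high-level semantic qualities (faithfulness, coherence), which motivates the LLM-based evaluators the paper surveys.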
Pages: 2832-2836
Page count: 5
Related Papers
(50 items in total)
  • [21] EE-LCE: An Event Extraction Framework Based on LLM-Generated CoT Explanation
    Yu, Yanhua
    Wang, Yuanlong
    Ma, Yunshan
    Li, Jie
    Lu, Kangkang
    Huang, Zhiyong
    Chua, Tat Seng
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, PT I, KSEM 2024, 2024, 14884 : 28 - 40
  • [22] LLM-Generated Multiple Choice Practice Quizzes for PreClinical Medical Students; Use and Validity
    Berman, Jonathan
    McCoy, Use
    Camarata, Troy
    PHYSIOLOGY, 2024, 39
  • [23] Evaluating the Quality of LLM-Generated Explanations for Logical Errors in CS1 Student Programs
    Balse, Rishabh
    Kumar, Viraj
    Prasad, Prajish
    Warriem, Jayakrishnan Madathil
    PROCEEDINGS OF THE 16TH ANNUAL ACM INDIA COMPUTE CONFERENCE, COMPUTE 2023, 2023, : 49 - 54
  • [24] ChatGPT Label: Comparing the Quality of Human-Generated and LLM-Generated Annotations in Low-Resource Language NLP Tasks
    Nasution, Arbi Haza
    Onan, Aytug
    IEEE ACCESS, 2024, 12 : 71876 - 71900
  • [25] Manually-Curated Versus LLM-Generated Explanations for Complex Patient Cases: An Exploratory Study with Physicians
    Michalowski, Martin
    Wilk, Szymon
    Bauer, Jenny M.
    Carrier, Marc
    Delluc, Aurelien
    Le Gal, Gregoire
    Wang, Tzu-Fei
    Siegal, Deborah
    Michalowski, Wojtek
    ARTIFICIAL INTELLIGENCE IN MEDICINE, PT II, AIME 2024, 2024, 14845 : 313 - 323
  • [26] Understanding Regular Expression Denial of Service (ReDoS): Insights from LLM-Generated Regexes and Developer Forums
    Siddiq, Mohammed Latif
    Zhang, Jiahao
    Santos, Joanna C. S.
    PROCEEDINGS 2024 32ND IEEE/ACM INTERNATIONAL CONFERENCE ON PROGRAM COMPREHENSION, ICPC 2024, 2024, : 190 - 201
  • [27] Automatic Evaluation of Video Summaries
    Valdes, Victor
    Martinez, Jose M.
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2012, 8 (03) : 1 - 21
  • [28] LLM-generated tips rival expert-created tips in helping students answer quantum-computing questions
    Lars Krupp
    Jonas Bley
    Isacco Gobbi
    Alexander Geng
    Sabine Müller
    Sungho Suh
    Ali Moghiseh
    Arcesio Castaneda Medina
    Valeria Bartsch
    Artur Widera
    Herwig Ott
    Paul Lukowicz
    Jakob Karolus
    Maximilian Kiefer-Emmanouilidis
    EPJ Quantum Technology, 2025, 12 (1)
  • [29] Evaluation of automatic summaries using QARLA
    Amigo, Enrique
    Gonzalo, Julio
    Peinado, Victor
    Penas, Anselmo
    Verdejo, Felisa
    PROCESAMIENTO DEL LENGUAJE NATURAL, 2005, (35): : 59 - 66
  • [30] Enhancing Automated Scoring of Math Self-Explanation Quality Using LLM-Generated Datasets: A Semi-Supervised Approach
    Nakamoto, Ryosuke
    Flanagan, Brendan
    Yamauchi, Taisei
    Dai, Yiling
    Takami, Kyosuke
    Ogata, Hiroaki
    COMPUTERS, 2023, 12 (11)