A Field Guide to Automatic Evaluation of LLM-Generated Summaries

Cited: 0
Authors
van Schaik, Tempest A. [1]
Pugh, Brittany [1]
Affiliations
[1] Microsoft, Redmond, WA 98052 USA
Keywords
Evaluation metrics; LLMs; summarization; offline evaluation
DOI
10.1145/3626772.3661346
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Large Language Models (LLMs) are rapidly being adopted for tasks such as text summarization across a wide range of industries. This has driven the need for scalable, automatic, reliable, and cost-effective methods to evaluate the quality of LLM-generated text. What is meant by evaluating an LLM is not yet well defined, and expectations vary widely about what kind of information evaluation should produce. Evaluation methods developed for traditional Natural Language Processing (NLP) tasks (before the rise of LLMs) remain applicable but are not sufficient for capturing high-level semantic qualities of summaries. Emerging methods that use LLMs to evaluate LLM output appear powerful but lack reliability. New characteristics of LLM-generated text that were not present in previous NLP tasks, such as hallucination artifacts, must also be considered. We outline the different types of LLM evaluation currently used in the literature but focus on offline, system-level evaluation of the text generated by LLMs. Evaluating LLM-generated summaries is a complex and fast-evolving area, and we propose strategies for applying evaluation methods that avoid common pitfalls. Despite having promising strategies for evaluating LLM summaries, we highlight some open challenges that remain.
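The traditional NLP metrics the abstract refers to include n-gram overlap measures such as ROUGE. As a minimal illustrative sketch (not the authors' implementation), ROUGE-1 precision, recall, and F1 between a reference summary and an LLM-generated candidate can be computed with clipped unigram counts:

```python
from collections import Counter

def rouge_1(reference: str, candidate: str) -> dict:
    """ROUGE-1 precision/recall/F1 between a reference summary and a
    candidate summary, using simple whitespace tokenization.
    Illustrative only; production metrics apply stemming, tokenization
    rules, and multi-reference aggregation."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each candidate unigram counts at most as often
    # as it appears in the reference.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge_1("the cat sat on the mat", "the cat lay on the mat")
```

As the abstract notes, such surface-overlap scores cannot capture high-level semantic qualities (faithfulness, coherence), which motivates the LLM-based evaluators the paper surveys.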
Pages: 2832-2836
Page count: 5
Related Papers
(50 items in total)
  • [21] EE-LCE: An Event Extraction Framework Based on LLM-Generated CoT Explanation
    Yu, Yanhua
    Wang, Yuanlong
    Ma, Yunshan
    Li, Jie
    Lu, Kangkang
    Huang, Zhiyong
    Chua, Tat Seng
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, PT I, KSEM 2024, 2024, 14884 : 28 - 40
  • [22] LLM-Generated Multiple Choice Practice Quizzes for PreClinical Medical Students; Use and Validity
    Berman, Jonathan
    McCoy, Use
    Camarata, Troy
    PHYSIOLOGY, 2024, 39
  • [23] Evaluating the Quality of LLM-Generated Explanations for Logical Errors in CS1 Student Programs
    Balse, Rishabh
    Kumar, Viraj
    Prasad, Prajish
    Warriem, Jayakrishnan Madathil
    PROCEEDINGS OF THE 16TH ANNUAL ACM INDIA COMPUTE CONFERENCE, COMPUTE 2023, 2023, : 49 - 54
  • [24] ChatGPT Label: Comparing the Quality of Human-Generated and LLM-Generated Annotations in Low-Resource Language NLP Tasks
    Nasution, Arbi Haza
    Onan, Aytug
    IEEE ACCESS, 2024, 12 : 71876 - 71900
  • [25] Manually-Curated Versus LLM-Generated Explanations for Complex Patient Cases: An Exploratory Study with Physicians
    Michalowski, Martin
    Wilk, Szymon
    Bauer, Jenny M.
    Carrier, Marc
    Delluc, Aurelien
    Le Gal, Gregoire
    Wang, Tzu-Fei
    Siegal, Deborah
    Michalowski, Wojtek
    ARTIFICIAL INTELLIGENCE IN MEDICINE, PT II, AIME 2024, 2024, 14845 : 313 - 323
  • [26] Understanding Regular Expression Denial of Service (ReDoS): Insights from LLM-Generated Regexes and Developer Forums
    Siddiq, Mohammed Latif
    Zhang, Jiahao
    Santos, Joanna C. S.
    PROCEEDINGS 2024 32ND IEEE/ACM INTERNATIONAL CONFERENCE ON PROGRAM COMPREHENSION, ICPC 2024, 2024, : 190 - 201
  • [27] Automatic Evaluation of Video Summaries
    Valdes, Victor
    Martinez, Jose M.
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2012, 8 (03) : 1 - 21
  • [28] LLM-generated tips rival expert-created tips in helping students answer quantum-computing questions
    Lars Krupp
    Jonas Bley
    Isacco Gobbi
    Alexander Geng
    Sabine Müller
    Sungho Suh
    Ali Moghiseh
    Arcesio Castaneda Medina
    Valeria Bartsch
    Artur Widera
    Herwig Ott
    Paul Lukowicz
    Jakob Karolus
    Maximilian Kiefer-Emmanouilidis
    EPJ Quantum Technology, 2025, 12 (1)
  • [29] Evaluation of automatic summaries using QARLA
    Amigo, Enrique
    Gonzalo, Julio
    Peinado, Victor
    Penas, Anselmo
    Verdejo, Felisa
    PROCESAMIENTO DEL LENGUAJE NATURAL, 2005, (35): : 59 - 66
  • [30] Enhancing Automated Scoring of Math Self-Explanation Quality Using LLM-Generated Datasets: A Semi-Supervised Approach
    Nakamoto, Ryosuke
    Flanagan, Brendan
    Yamauchi, Taisei
    Dai, Yiling
    Takami, Kyosuke
    Ogata, Hiroaki
    COMPUTERS, 2023, 12 (11)