A Field Guide to Automatic Evaluation of LLM-Generated Summaries

Cited by: 0

Authors:

van Schaik, Tempest A. [1]

Pugh, Brittany [1]

Affiliation:

[1] Microsoft, Redmond, WA 98052 USA

Source:

PROCEEDINGS OF THE 47TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2024 | 2024

Keywords:

Evaluation metrics; LLMs; summarization; offline evaluation

DOI:

10.1145/3626772.3661346

Chinese Library Classification:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Large language models (LLMs) are rapidly being adopted for tasks such as text summarization in a wide range of industries. This has driven the need for scalable, automatic, reliable, and cost-effective methods to evaluate the quality of LLM-generated text. What is meant by evaluating an LLM is not yet well defined, and there are widely different expectations about what kind of information evaluation will produce. Evaluation methods that were developed for traditional Natural Language Processing (NLP) tasks (before the rise of LLMs) remain applicable but are not sufficient for capturing high-level semantic qualities of summaries. Emerging evaluation methods that use LLMs to evaluate LLM output appear to be powerful but lack reliability. New elements of LLM-generated text that were not present in previous NLP tasks, such as the artifacts of hallucination, need to be considered. We outline the different types of LLM evaluation currently used in the literature but focus on offline, system-level evaluation of the text generated by LLMs. Evaluating LLM-generated summaries is a complex and fast-evolving area, and we propose strategies for applying evaluation methods to avoid common pitfalls. Although promising strategies exist for evaluating LLM summaries, we highlight some open challenges that remain.

Pages: 2832-2836

Page count: 5

