A Field Guide to Automatic Evaluation of LLM-Generated Summaries

Authors
van Schaik, Tempest A. [1]
Pugh, Brittany [1]
Affiliations
[1] Microsoft, Redmond, WA 98052, USA
Keywords
Evaluation metrics; LLMs; summarization; offline evaluation
DOI
10.1145/3626772.3661346
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Large language models (LLMs) are rapidly being adopted for tasks such as text summarization in a wide range of industries. This has driven the need for scalable, automatic, reliable, and cost-effective methods to evaluate the quality of LLM-generated text. What is meant by evaluating an LLM is not yet well defined, and there are widely different expectations about what kind of information evaluation will produce. Evaluation methods that were developed for traditional natural language processing (NLP) tasks (before the rise of LLMs) remain applicable but are not sufficient for capturing high-level semantic qualities of summaries. Emerging evaluation methods that use LLMs to evaluate LLM output appear to be powerful but lack reliability. New elements of LLM-generated text that were not present in previous NLP tasks, such as the artifacts of hallucination, need to be considered. We outline the different types of LLM evaluation currently used in the literature but focus on offline, system-level evaluation of the text generated by LLMs. Evaluating LLM-generated summaries is a complex and fast-evolving area, and we propose strategies for applying evaluation methods to avoid common pitfalls. Although promising strategies for evaluating LLM summaries exist, we highlight some open challenges that remain.
Pages: 2832-2836
Page count: 5