A Field Guide to Automatic Evaluation of LLM-Generated Summaries

Cited by: 0
Authors
van Schaik, Tempest A. [1 ]
Pugh, Brittany [1 ]
Affiliations
[1] Microsoft, Redmond, WA 98052 USA
Keywords
Evaluation metrics; LLMs; summarization; offline evaluation
DOI
10.1145/3626772.3661346
CLC classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Large Language Models (LLMs) are rapidly being adopted for tasks such as text summarization across a wide range of industries. This has driven the need for scalable, automatic, reliable, and cost-effective methods to evaluate the quality of LLM-generated text. What is meant by evaluating an LLM is not yet well defined, and there are widely differing expectations about what kind of information evaluation will produce. Evaluation methods developed for traditional Natural Language Processing (NLP) tasks (before the rise of LLMs) remain applicable but are not sufficient for capturing high-level semantic qualities of summaries. Emerging evaluation methods that use LLMs to evaluate LLM output appear powerful but lack reliability. New characteristics of LLM-generated text that were not present in earlier NLP tasks, such as hallucination artifacts, also need to be considered. We outline the different types of LLM evaluation currently used in the literature, but focus on offline, system-level evaluation of the text generated by LLMs. Evaluating LLM-generated summaries is a complex and fast-evolving area, and we propose strategies for applying evaluation methods to avoid common pitfalls. Despite these promising strategies for evaluating LLM summaries, we highlight some open challenges that remain.
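To make the two families of methods named in the abstract concrete, the minimal sketch below (not taken from the paper) contrasts a traditional lexical-overlap metric with an LLM-as-judge prompt. The rouge1_f1 helper, the JUDGE_PROMPT wording, and the call_llm hook are illustrative assumptions, not an established API or the authors' method.

```python
# Sketch of two evaluation styles for LLM-generated summaries:
# (1) a traditional lexical-overlap metric (ROUGE-1 style F1), and
# (2) an LLM-as-judge faithfulness prompt. The judge call is a
# hypothetical placeholder, not a specific vendor or library API.
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate summary."""
    ref_tokens = Counter(reference.lower().split())
    cand_tokens = Counter(candidate.lower().split())
    overlap = sum((ref_tokens & cand_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)


JUDGE_PROMPT = """You are grading a summary of a source document.
Source document:
{source}

Candidate summary:
{summary}

Rate the summary's faithfulness to the source on a 1-5 scale,
where 5 means it contains no unsupported (hallucinated) claims.
Reply with only the integer rating."""


def judge_summary(source: str, summary: str, call_llm) -> int:
    """LLM-as-judge scoring; `call_llm` is any prompt -> text function
    supplied by the caller (hypothetical, not a specific model API)."""
    reply = call_llm(JUDGE_PROMPT.format(source=source, summary=summary))
    return int(reply.strip())


if __name__ == "__main__":
    ref = "The product launch was delayed by two weeks due to supply issues."
    cand = "Supply issues delayed the product launch by two weeks."
    print(f"ROUGE-1 F1: {rouge1_f1(ref, cand):.2f}")
```

The lexical metric is cheap and deterministic but blind to meaning, while the judge prompt can probe semantic qualities such as faithfulness at the cost of the reliability concerns the abstract raises.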
Pages: 2832-2836
Page count: 5