FELM: Benchmarking Factuality Evaluation of Large Language Models

Cited by: 0
Authors
Chen, Shiqi [1 ,2 ]
Zhao, Yiran [3 ]
Zhang, Jinghan [2 ]
Chern, I-Chun [4 ]
Gao, Siyang [1 ]
Liu, Pengfei [5 ]
He, Junxian [2 ]
Affiliations
[1] City Univ Hong Kong, Hong Kong, Peoples R China
[2] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[3] Natl Univ Singapore, Singapore, Singapore
[4] Carnegie Mellon Univ, Pittsburgh, PA USA
[5] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
Keywords
DOI
N/A
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Assessing factuality of text generated by large language models (LLMs) is an emerging yet crucial research area, aimed at alerting users to potential errors and guiding the development of more reliable LLMs. Nonetheless, the evaluators assessing factuality necessitate suitable evaluation themselves to gauge progress and foster advancements. This direction remains under-explored, resulting in substantial impediments to the progress of factuality evaluators. To mitigate this issue, we introduce a benchmark for Factuality Evaluation of large Language Models, referred to as FELM. In this benchmark, we collect responses generated from LLMs and annotate factuality labels in a fine-grained manner. Contrary to previous studies that primarily concentrate on the factuality of world knowledge (e.g., information from Wikipedia), FELM focuses on factuality across diverse domains, spanning from world knowledge to math and reasoning. Our annotation is based on text segments, which can help pinpoint specific factual errors. The factuality annotations are further supplemented by predefined error types and reference links that either support or contradict the statement. In our experiments, we investigate the performance of several LLM-based factuality evaluators on FELM, including both vanilla LLMs and those augmented with retrieval mechanisms and chain-of-thought processes. Our findings reveal that while retrieval aids factuality evaluation, current LLMs are far from satisfactory at faithfully detecting factual errors.
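The segment-level evaluation setup the abstract describes — gold factuality labels per text segment, an evaluator predicting a label per segment — can be sketched as follows. This is a minimal illustration, not FELM's actual schema or official metric: the `Segment` record layout, field names, and example data are assumptions made here for clarity.

```python
# Sketch of segment-level factuality scoring: a response is split into
# segments, each segment carries a gold factuality annotation, and an
# evaluator predicts one label per segment. We score F1 over the
# *error* class, since detecting factual errors is the target task.
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    text: str
    factual: bool  # gold annotation: True = factually correct

def f1_on_errors(gold: List[bool], pred: List[bool]) -> float:
    """F1 treating 'non-factual' (False) as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if not g and not p)
    fp = sum(1 for g, p in zip(gold, pred) if g and not p)
    fn = sum(1 for g, p in zip(gold, pred) if not g and p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

segments = [
    Segment("Paris is the capital of France.", True),
    Segment("The Eiffel Tower was built in 1789.", False),
    Segment("2 + 2 = 5.", False),  # a math/reasoning error, not world knowledge
]
gold = [s.factual for s in segments]
pred = [True, False, True]  # evaluator misses the arithmetic error
print(round(f1_on_errors(gold, pred), 3))  # → 0.667
```

Scoring the error class directly (rather than plain accuracy) matters because most segments in a typical response are factual, so an evaluator that labels everything "correct" would otherwise look deceptively strong.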
Pages: 22
Related Papers (50 records)
  • [1] Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators
    Chen, Liang
    Deng, Yang
    Bian, Yatao
    Qin, Zeyu
    Wu, Bingzhe
    Chua, Tat-Seng
    Wong, Kam-Fai
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 6325 - 6341
  • [2] Generating Benchmarks for Factuality Evaluation of Language Models
    Muhlgay, Dor
    Ram, Ori
    Magar, Inbal
    Levine, Yoav
    Ratner, Nir
    Belinkov, Yonatan
    Abend, Omri
    Leyton-Brown, Kevin
    Shashua, Amnon
    Shoham, Yoav
    PROCEEDINGS OF THE 18TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 49 - 66
  • [3] Large Language Models, scientific knowledge and factuality: A framework to streamline human expert evaluation
    Wysocka, Magdalena
    Wysocki, Oskar
    Delmas, Maxime
    Mutel, Vincent
    Freitas, Andre
    JOURNAL OF BIOMEDICAL INFORMATICS, 2024, 158
  • [4] Benchmarking medical large language models
    Bakhshandeh, Sadra
NATURE REVIEWS BIOENGINEERING, 2023, 1 (08): 543 - 543
  • [5] Benchmarking Large Language Models on CFLUE - A Chinese Financial Language Understanding Evaluation Dataset
    Zhu, Jie
    Li, Junhui
    Wen, Yalong
    Guo, Lifan
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 5673 - 5693
  • [6] Benchmarking DNA large language models on quadruplexes
    Cherednichenko, Oleksandr
    Herbert, Alan
    Poptsova, Maria
    COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL, 2025, 27 : 992 - 1000
  • [7] Benchmarking AutoGen with different large language models
    Barbarroxa, Rafael
    Ribeiro, Bruno
    Gomes, Luis
    Vale, Zita
    2024 IEEE CONFERENCE ON ARTIFICIAL INTELLIGENCE, CAI 2024, 2024, : 263 - 264
  • [8] Benchmarking Large Language Models for News Summarization
    Zhang, Tianyi
    Ladhak, Faisal
    Durmus, Esin
    Liang, Percy
    Mckeown, Kathleen
    Hashimoto, Tatsunori B.
    TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2024, 12 : 39 - 57
  • [9] Benchmarking Large Language Models: Opportunities and Challenges
    Hodak, Miro
    Ellison, David
    Van Buren, Chris
    Jiang, Xiaotong
    Dholakia, Ajay
    PERFORMANCE EVALUATION AND BENCHMARKING, TPCTC 2023, 2024, 14247 : 77 - 89
  • [10] Factuality challenges in the era of large language models and opportunities for fact-checking
    Augenstein, Isabelle
    Baldwin, Timothy
    Cha, Meeyoung
    Chakraborty, Tanmoy
    Ciampaglia, Giovanni Luca
    Corney, David
    Diresta, Renee
    Ferrara, Emilio
    Hale, Scott
    Halevy, Alon
    Hovy, Eduard
    Ji, Heng
    Menczer, Filippo
    Miguez, Ruben
    Nakov, Preslav
    Scheufele, Dietram
    Sharma, Shivam
    Zagni, Giovanni
    NATURE MACHINE INTELLIGENCE, 2024, 6 (08) : 852 - 863