FELM: Benchmarking Factuality Evaluation of Large Language Models

Times Cited: 0
Authors
Chen, Shiqi [1 ,2 ]
Zhao, Yiran [3 ]
Zhang, Jinghan [2 ]
Chern, I-Chun [4 ]
Gao, Siyang [1 ]
Liu, Pengfei [5 ]
He, Junxian [2 ]
Affiliations
[1] City University of Hong Kong, Hong Kong, China
[2] Hong Kong University of Science and Technology, Hong Kong, China
[3] National University of Singapore, Singapore
[4] Carnegie Mellon University, Pittsburgh, PA, USA
[5] Shanghai Jiao Tong University, Shanghai, China
Keywords
(none)
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Assessing the factuality of text generated by large language models (LLMs) is an emerging yet crucial research area, aimed at alerting users to potential errors and guiding the development of more reliable LLMs. Nonetheless, the evaluators that assess factuality require suitable evaluation themselves to gauge progress and foster advancements. This direction remains under-explored, posing substantial impediments to the progress of factuality evaluators. To mitigate this issue, we introduce a benchmark for Factuality Evaluation of large Language Models, referred to as FELM. In this benchmark, we collect responses generated by LLMs and annotate factuality labels in a fine-grained manner. In contrast to previous studies that primarily concentrate on the factuality of world knowledge (e.g., information from Wikipedia), FELM covers factuality across diverse domains, spanning from world knowledge to math and reasoning. Our annotation is based on text segments, which helps pinpoint specific factual errors. The factuality annotations are further supplemented by predefined error types and reference links that either support or contradict each statement. In our experiments, we investigate the performance of several LLM-based factuality evaluators on FELM, including both vanilla LLMs and those augmented with retrieval mechanisms and chain-of-thought prompting. Our findings reveal that while retrieval aids factuality evaluation, current LLMs are still far from satisfactory at faithfully detecting factual errors.
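To make the abstract's annotation scheme concrete, below is a minimal Python sketch of what a FELM-style record (segment-level factuality labels, predefined error types, reference links) and an error-class F1 metric for comparing evaluators might look like. The class name `FelmSample`, its field names, the error-type string, and the toy example are illustrative assumptions based on the abstract, not the dataset's actual schema.

```python
# A minimal sketch of a FELM-style annotated record and an error-class F1
# metric. All names here (FelmSample, its fields, "knowledge_error") are
# illustrative assumptions based on the abstract, not the dataset's schema.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FelmSample:
    prompt: str                        # query given to the LLM
    response: str                      # full LLM response
    segments: List[str]                # response split into text segments
    labels: List[bool]                 # True = segment is factually correct
    error_types: List[Optional[str]]   # predefined error type for bad segments
    reference_links: List[List[str]]   # links supporting/contradicting each segment


def error_f1(gold: List[bool], pred: List[bool]) -> float:
    """F1 over the *erroneous* class: how well an evaluator flags bad segments."""
    tp = sum(1 for g, p in zip(gold, pred) if not g and not p)  # both say error
    fp = sum(1 for g, p in zip(gold, pred) if g and not p)      # false alarm
    fn = sum(1 for g, p in zip(gold, pred) if not g and p)      # missed error
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


sample = FelmSample(
    prompt="Who wrote 'The Old Man and the Sea', and what prize did it win?",
    response="It was written by Ernest Hemingway. It won the Pulitzer Prize in 1985.",
    segments=["It was written by Ernest Hemingway.",
              "It won the Pulitzer Prize in 1985."],
    labels=[True, False],  # the prize year is wrong (1953, not 1985)
    error_types=[None, "knowledge_error"],
    reference_links=[["https://en.wikipedia.org/wiki/The_Old_Man_and_the_Sea"], []],
)

predictions = [True, False]  # a hypothetical evaluator's per-segment verdicts
print(f"error-class F1: {error_f1(sample.labels, predictions):.2f}")  # -> 1.00
```

Segment-level labels, rather than a single response-level verdict, let a metric like this localize exactly which claim an evaluator missed.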
Pages: 22