FELM: Benchmarking Factuality Evaluation of Large Language Models

Times Cited: 0
Authors
Chen, Shiqi [1 ,2 ]
Zhao, Yiran [3 ]
Zhang, Jinghan [2 ]
Chern, I-Chun [4 ]
Gao, Siyang [1 ]
Liu, Pengfei [5 ]
He, Junxian [2 ]
Affiliations
[1] City University of Hong Kong, Hong Kong, China
[2] Hong Kong University of Science and Technology, Hong Kong, China
[3] National University of Singapore, Singapore
[4] Carnegie Mellon University, Pittsburgh, PA, USA
[5] Shanghai Jiao Tong University, Shanghai, China
Keywords
(none)
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Assessing the factuality of text generated by large language models (LLMs) is an emerging yet crucial research area, aimed at alerting users to potential errors and guiding the development of more reliable LLMs. Nonetheless, the evaluators that assess factuality require suitable evaluation themselves to gauge progress and foster advancements. This direction remains under-explored, posing substantial impediments to the progress of factuality evaluators. To mitigate this issue, we introduce a benchmark for Factuality Evaluation of large Language Models, referred to as FELM. In this benchmark, we collect responses generated by LLMs and annotate factuality labels in a fine-grained manner. In contrast to previous studies that primarily concentrate on the factuality of world knowledge (e.g., information from Wikipedia), FELM covers factuality across diverse domains, spanning from world knowledge to math and reasoning. Our annotation is based on text segments, which helps pinpoint specific factual errors. The factuality annotations are further supplemented by predefined error types and reference links that either support or contradict each statement. In our experiments, we investigate the performance of several LLM-based factuality evaluators on FELM, including both vanilla LLMs and those augmented with retrieval mechanisms and chain-of-thought prompting. Our findings reveal that while retrieval aids factuality evaluation, current LLMs are still far from satisfactory at faithfully detecting factual errors.
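To make the abstract's annotation scheme concrete, below is a minimal Python sketch of what a FELM-style record (segment-level factuality labels, predefined error types, reference links) and an error-class F1 metric for comparing evaluators might look like. The class name `FelmSample`, its field names, the error-type string, and the toy example are illustrative assumptions based on the abstract, not the dataset's actual schema.

```python
# A minimal sketch of a FELM-style annotated record and an error-class F1
# metric. All names here (FelmSample, its fields, "knowledge_error") are
# illustrative assumptions based on the abstract, not the dataset's schema.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FelmSample:
    prompt: str                        # query given to the LLM
    response: str                      # full LLM response
    segments: List[str]                # response split into text segments
    labels: List[bool]                 # True = segment is factually correct
    error_types: List[Optional[str]]   # predefined error type for bad segments
    reference_links: List[List[str]]   # links supporting/contradicting each segment


def error_f1(gold: List[bool], pred: List[bool]) -> float:
    """F1 over the *erroneous* class: how well an evaluator flags bad segments."""
    tp = sum(1 for g, p in zip(gold, pred) if not g and not p)  # both say error
    fp = sum(1 for g, p in zip(gold, pred) if g and not p)      # false alarm
    fn = sum(1 for g, p in zip(gold, pred) if not g and p)      # missed error
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


sample = FelmSample(
    prompt="Who wrote 'The Old Man and the Sea', and what prize did it win?",
    response="It was written by Ernest Hemingway. It won the Pulitzer Prize in 1985.",
    segments=["It was written by Ernest Hemingway.",
              "It won the Pulitzer Prize in 1985."],
    labels=[True, False],  # the prize year is wrong (1953, not 1985)
    error_types=[None, "knowledge_error"],
    reference_links=[["https://en.wikipedia.org/wiki/The_Old_Man_and_the_Sea"], []],
)

predictions = [True, False]  # a hypothetical evaluator's per-segment verdicts
print(f"error-class F1: {error_f1(sample.labels, predictions):.2f}")  # -> 1.00
```

Segment-level labels, rather than a single response-level verdict, let a metric like this localize exactly which claim an evaluator missed.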
Pages: 22