FELM: Benchmarking Factuality Evaluation of Large Language Models

Cited: 0
Authors
Chen, Shiqi [1 ,2 ]
Zhao, Yiran [3 ]
Zhang, Jinghan [2 ]
Chern, I-Chun [4 ]
Gao, Siyang [1 ]
Liu, Pengfei [5 ]
He, Junxian [2 ]
Affiliations
[1] City University of Hong Kong, Hong Kong, China
[2] Hong Kong University of Science and Technology, Hong Kong, China
[3] National University of Singapore, Singapore
[4] Carnegie Mellon University, Pittsburgh, PA, USA
[5] Shanghai Jiao Tong University, Shanghai, China
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Assessing the factuality of text generated by large language models (LLMs) is an emerging yet crucial research area, aimed at alerting users to potential errors and guiding the development of more reliable LLMs. Nonetheless, the evaluators that assess factuality require suitable evaluation themselves to gauge progress and foster advancement. This direction remains under-explored, posing substantial impediments to the progress of factuality evaluators. To mitigate this issue, we introduce a benchmark for Factuality Evaluation of large Language Models, referred to as FELM. In this benchmark, we collect responses generated by LLMs and annotate factuality labels in a fine-grained manner. In contrast to previous studies that primarily concentrate on the factuality of world knowledge (e.g., information from Wikipedia), FELM covers factuality across diverse domains, spanning from world knowledge to math and reasoning. Our annotation is based on text segments, which helps pinpoint specific factual errors. The factuality annotations are further supplemented by predefined error types and reference links that either support or contradict each statement. In our experiments, we investigate the performance of several LLM-based factuality evaluators on FELM, including both vanilla LLMs and those augmented with retrieval mechanisms and chain-of-thought processes. Our findings reveal that while retrieval aids factuality evaluation, current LLMs are still far from able to faithfully detect factual errors.
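Because FELM's annotations are segment-level binary factuality labels, judging a factuality evaluator reduces to scoring its per-segment predictions against the gold labels. As a minimal sketch, the Python snippet below pools segments across responses and computes F1 with the factual-error class as the positive class; the record layout (SegmentedResponse, gold_labels, pred_labels) and this particular metric are illustrative assumptions here, not the released FELM schema or the paper's official scoring script.

    # Minimal sketch of segment-level factuality scoring in the spirit of FELM.
    # The record layout and metric are illustrative assumptions, not the
    # released FELM schema or the paper's official evaluation script.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class SegmentedResponse:
        segments: List[str]      # an LLM response split into text segments
        gold_labels: List[bool]  # True = segment annotated as factual
        pred_labels: List[bool]  # True = evaluator judged the segment factual

    def error_detection_f1(examples: List[SegmentedResponse]) -> float:
        # F1 with the *non-factual* segment as the positive class,
        # pooled over all segments of all responses.
        tp = fp = fn = 0
        for ex in examples:
            for gold, pred in zip(ex.gold_labels, ex.pred_labels):
                gold_err, pred_err = not gold, not pred
                tp += gold_err and pred_err
                fp += pred_err and not gold_err
                fn += gold_err and not pred_err
        if tp == 0:
            return 0.0
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    example = SegmentedResponse(
        segments=["The Eiffel Tower is in Paris.", "It was built in 1999."],
        gold_labels=[True, False],  # second segment is a factual error
        pred_labels=[True, False],  # the evaluator flags it correctly
    )
    print(f"error-detection F1: {error_detection_f1([example]):.2f}")  # 1.00

Taking the error class as positive reflects the benchmark's intent: a useful evaluator must surface the comparatively rare erroneous segments, not merely agree that most segments are fine.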
Pages: 22
Related Papers
50 records in total
  • [31] A Survey on Evaluation of Large Language Models
    Chang, Yupeng; Wang, Xu; Wang, Jindong; Wu, Yuan; Yang, Linyi; Zhu, Kaijie; Chen, Hao; Yi, Xiaoyuan; Wang, Cunxiang; Wang, Yidong; Ye, Wei; Zhang, Yue; Chang, Yi; Yu, Philip S.; Yang, Qiang; Xie, Xing
    ACM Transactions on Intelligent Systems and Technology, 2024, 15(3)
  • [32] StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models
    Guo, Zhicheng; Cheng, Sijie; Wang, Hao; Liang, Shihao; Qin, Yujia; Li, Peng; Liu, Zhiyuan; Sun, Maosong; Liu, Yang
    Findings of the Association for Computational Linguistics: ACL 2024, 2024: 11143-11156
  • [33] Benchmarking Large Language Models on Communicative Medical Coaching: A Dataset and a Novel System
    Huang, Hengguan; Wang, Songtao; Liu, Hongfu; Wang, Hao; Wang, Ye
    Findings of the Association for Computational Linguistics: ACL 2024, 2024: 1624-1637
  • [34] EchoSwift: An Inference Benchmarking and Configuration Discovery Tool for Large Language Models (LLMs)
    Krishna, Karthik; Bandili, Ramana
    Companion of the 15th ACM/SPEC International Conference on Performance Engineering (ICPE Companion 2024), 2024: 158-162
  • [35] Harnessing Large Language Models for Software Vulnerability Detection: A Comprehensive Benchmarking Study
    Tamberg, Karl; Bahsi, Hayretdin
    IEEE Access, 2025, 13: 29698-29717
  • [36] Large language models and rheumatology: a comparative evaluation
    Venerito, Vincenzo; Puttaswamy, Darshan; Iannone, Florenzo; Gupta, Latika
    Lancet Rheumatology, 2023, 5(10): E574-E578
  • [37] Automatic Evaluation of Attribution by Large Language Models
    Yue, Xiang; Wang, Boshi; Chen, Ziru; Zhang, Kai; Su, Yu; Sun, Huan
    Findings of the Association for Computational Linguistics: EMNLP 2023, 2023: 4615-4635
  • [38] Factuality Enhanced Language Models for Open-Ended Text Generation
    Lee, Nayeon; Ping, Wei; Xu, Peng; Patwary, Mostofa; Fung, Pascale; Shoeybi, Mohammad; Catanzaro, Bryan
    Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
  • [39] UHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation
    Liang, Xun; Song, Shichao; Niu, Simin; Li, Zhiyu; Xiong, Feiyu; Tang, Bo; Wang, Yezhaohui; He, Dawei; Cheng, Peng; Wang, Zhonghao; Deng, Haiying
    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Vol. 1: Long Papers, 2024: 5266-5293
  • [40] Evaluating large language models on geospatial tasks: a multiple geospatial task benchmarking study
    Xu, Liuchang; Zhao, Shuo; Lin, Qingming; Chen, Luyao; Luo, Qianqian; Wu, Sensen; Ye, Xinyue; Feng, Hailin; Du, Zhenhong
    International Journal of Digital Earth, 2025, 18(1)