FELM: Benchmarking Factuality Evaluation of Large Language Models

Citations: 0

Authors:

Chen, Shiqi [1,2]

Zhao, Yiran [3]

Zhang, Jinghan [2]

Chern, I-Chun [4]

Gao, Siyang [1]

Liu, Pengfei [5]

He, Junxian [2]

Affiliations:

[1] City Univ Hong Kong, Hong Kong, Peoples R China

[2] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China

[3] Natl Univ Singapore, Singapore, Singapore

[4] Carnegie Mellon Univ, Pittsburgh, PA USA

[5] Shanghai Jiao Tong Univ, Shanghai, Peoples R China

Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Assessing factuality of text generated by large language models (LLMs) is an emerging yet crucial research area, aimed at alerting users to potential errors and guiding the development of more reliable LLMs. Nonetheless, the evaluators assessing factuality necessitate suitable evaluation themselves to gauge progress and foster advancements. This direction remains under-explored, resulting in substantial impediments to the progress of factuality evaluators. To mitigate this issue, we introduce a benchmark for Factuality Evaluation of large Language Models, referred to as FELM. In this benchmark, we collect responses generated from LLMs and annotate factuality labels in a fine-grained manner. Contrary to previous studies that primarily concentrate on the factuality of world knowledge (e.g. information from Wikipedia), FELM focuses on factuality across diverse domains, spanning from world knowledge to math and reasoning. Our annotation is based on text segments, which can help pinpoint specific factual errors. The factuality annotations are further supplemented by predefined error types and reference links that either support or contradict the statement. In our experiments, we investigate the performance of several LLM-based factuality evaluators on FELM, including both vanilla LLMs and those augmented with retrieval mechanisms and chain-of-thought processes. Our findings reveal that while retrieval aids factuality evaluation, current LLMs are far from satisfactory to faithfully detect factual errors.

Pages: 22

50 items in total

[1] Beyond Factuality: A Comprehensive Evaluation of Large Language Models as Knowledge Generators
Chen, Liang
Deng, Yang
Bian, Yatao
Qin, Zeyu
Wu, Bingzhe
Chua, Tat-Seng
Wong, Kam-Fai
2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 6325 - 6341
[2] Generating Benchmarks for Factuality Evaluation of Language Models
Muhlgay, Dor
Ram, Ori
Magar, Inbal
Levine, Yoav
Ratner, Nir
Belinkov, Yonatan
Abend, Omri
Leyton-Brown, Kevin
Shashua, Amnon
Shoham, Yoav
PROCEEDINGS OF THE 18TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 49 - 66
[3] Large Language Models, scientific knowledge and factuality: A framework to streamline human expert evaluation
Wysocka, Magdalena
Wysocki, Oskar
Delmas, Maxime
Mutel, Vincent
Freitas, Andre
JOURNAL OF BIOMEDICAL INFORMATICS, 2024, 158
[4] Benchmarking medical large language models
Bakhshandeh, Sadra
NATURE REVIEWS BIOENGINEERING, 2023, 1 (08) : 543 - 543
[5] Benchmarking Large Language Models on CFLUE - A Chinese Financial Language Understanding Evaluation Dataset
Zhu, Jie
Li, Junhui
Wen, Yalong
Guo, Lifan
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 5673 - 5693
[6] Benchmarking DNA large language models on quadruplexes
Cherednichenko, Oleksandr
Herbert, Alan
Poptsova, Maria
COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL, 2025, 27 : 992 - 1000
[7] Benchmarking AutoGen with different large language models
Barbarroxa, Rafael
Ribeiro, Bruno
Gomes, Luis
Vale, Zita
2024 IEEE CONFERENCE ON ARTIFICIAL INTELLIGENCE, CAI 2024, 2024, : 263 - 264
[8] Benchmarking Large Language Models for News Summarization
Zhang, Tianyi
Ladhak, Faisal
Durmus, Esin
Liang, Percy
Mckeown, Kathleen
Hashimoto, Tatsunori B.
TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2024, 12 : 39 - 57
[9] Benchmarking Large Language Models: Opportunities and Challenges
Hodak, Miro
Ellison, David
Van Buren, Chris
Jiang, Xiaotong
Dholakia, Ajay
PERFORMANCE EVALUATION AND BENCHMARKING, TPCTC 2023, 2024, 14247 : 77 - 89
[10] Factuality challenges in the era of large language models and opportunities for fact-checking
Augenstein, Isabelle
Baldwin, Timothy
Cha, Meeyoung
Chakraborty, Tanmoy
Ciampaglia, Giovanni Luca
Corney, David
Diresta, Renee
Ferrara, Emilio
Hale, Scott
Halevy, Alon
Hovy, Eduard
Ji, Heng
Menczer, Filippo
Miguez, Ruben
Nakov, Preslav
Scheufele, Dietram
Sharma, Shivam
Zagni, Giovanni
NATURE MACHINE INTELLIGENCE, 2024, 6 (08) : 852 - 863