FELM: Benchmarking Factuality Evaluation of Large Language Models

Cited by: 0

Authors:

Chen, Shiqi [1,2]

Zhao, Yiran [3]

Zhang, Jinghan [2]

Chern, I-Chun [4]

Gao, Siyang [1]

Liu, Pengfei [5]

He, Junxian [2]

Affiliations:

[1] City Univ Hong Kong, Hong Kong, Peoples R China

[2] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China

[3] Natl Univ Singapore, Singapore, Singapore

[4] Carnegie Mellon Univ, Pittsburgh, PA USA

[5] Shanghai Jiao Tong Univ, Shanghai, Peoples R China

Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Assessing factuality of text generated by large language models (LLMs) is an emerging yet crucial research area, aimed at alerting users to potential errors and guiding the development of more reliable LLMs. Nonetheless, the evaluators assessing factuality necessitate suitable evaluation themselves to gauge progress and foster advancements. This direction remains under-explored, resulting in substantial impediments to the progress of factuality evaluators. To mitigate this issue, we introduce a benchmark for Factuality Evaluation of large Language Models, referred to as FELM. In this benchmark, we collect responses generated from LLMs and annotate factuality labels in a fine-grained manner. Contrary to previous studies that primarily concentrate on the factuality of world knowledge (e.g. information from Wikipedia), FELM focuses on factuality across diverse domains, spanning from world knowledge to math and reasoning. Our annotation is based on text segments, which can help pinpoint specific factual errors. The factuality annotations are further supplemented by predefined error types and reference links that either support or contradict the statement. In our experiments, we investigate the performance of several LLM-based factuality evaluators on FELM, including both vanilla LLMs and those augmented with retrieval mechanisms and chain-of-thought processes. Our findings reveal that while retrieval aids factuality evaluation, current LLMs are far from satisfactory to faithfully detect factual errors.


Pages: 22

