FELM: Benchmarking Factuality Evaluation of Large Language Models

Cited: 0
Authors
Chen, Shiqi [1 ,2 ]
Zhao, Yiran [3 ]
Zhang, Jinghan [2 ]
Chern, I-Chun [4 ]
Gao, Siyang [1 ]
Liu, Pengfei [5 ]
He, Junxian [2 ]
Affiliations
[1] City University of Hong Kong, Hong Kong, China
[2] Hong Kong University of Science and Technology, Hong Kong, China
[3] National University of Singapore, Singapore
[4] Carnegie Mellon University, Pittsburgh, PA, USA
[5] Shanghai Jiao Tong University, Shanghai, China
Keywords
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Assessing the factuality of text generated by large language models (LLMs) is an emerging yet crucial research area, aimed at alerting users to potential errors and guiding the development of more reliable LLMs. Nonetheless, the evaluators that assess factuality require suitable evaluation themselves to gauge progress and foster advancement. This direction remains under-explored, posing substantial impediments to the progress of factuality evaluators. To mitigate this issue, we introduce a benchmark for Factuality Evaluation of large Language Models, referred to as FELM. In this benchmark, we collect responses generated by LLMs and annotate factuality labels in a fine-grained manner. In contrast to previous studies that primarily concentrate on the factuality of world knowledge (e.g., information from Wikipedia), FELM covers factuality across diverse domains, spanning from world knowledge to math and reasoning. Our annotation is based on text segments, which helps pinpoint specific factual errors. The factuality annotations are further supplemented by predefined error types and reference links that either support or contradict each statement. In our experiments, we investigate the performance of several LLM-based factuality evaluators on FELM, including both vanilla LLMs and those augmented with retrieval mechanisms and chain-of-thought processes. Our findings reveal that while retrieval aids factuality evaluation, current LLMs are still far from able to faithfully detect factual errors.
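Because FELM's annotations are segment-level binary factuality labels, judging a factuality evaluator reduces to scoring its per-segment predictions against the gold labels. As a minimal sketch, the Python snippet below pools segments across responses and computes F1 with the factual-error class as the positive class; the record layout (SegmentedResponse, gold_labels, pred_labels) and this particular metric are illustrative assumptions here, not the released FELM schema or the paper's official scoring script.

    # Minimal sketch of segment-level factuality scoring in the spirit of FELM.
    # The record layout and metric are illustrative assumptions, not the
    # released FELM schema or the paper's official evaluation script.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class SegmentedResponse:
        segments: List[str]      # an LLM response split into text segments
        gold_labels: List[bool]  # True = segment annotated as factual
        pred_labels: List[bool]  # True = evaluator judged the segment factual

    def error_detection_f1(examples: List[SegmentedResponse]) -> float:
        # F1 with the *non-factual* segment as the positive class,
        # pooled over all segments of all responses.
        tp = fp = fn = 0
        for ex in examples:
            for gold, pred in zip(ex.gold_labels, ex.pred_labels):
                gold_err, pred_err = not gold, not pred
                tp += gold_err and pred_err
                fp += pred_err and not gold_err
                fn += gold_err and not pred_err
        if tp == 0:
            return 0.0
        precision, recall = tp / (tp + fp), tp / (tp + fn)
        return 2 * precision * recall / (precision + recall)

    example = SegmentedResponse(
        segments=["The Eiffel Tower is in Paris.", "It was built in 1999."],
        gold_labels=[True, False],  # second segment is a factual error
        pred_labels=[True, False],  # the evaluator flags it correctly
    )
    print(f"error-detection F1: {error_detection_f1([example]):.2f}")  # 1.00

Taking the error class as positive reflects the benchmark's intent: a useful evaluator must surface the comparatively rare erroneous segments, not merely agree that most segments are fine.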
Pages: 22
Related Papers
50 records in total
  • [31] A Survey on Evaluation of Large Language Models
    Chang, Yupeng; Wang, Xu; Wang, Jindong; Wu, Yuan; Yang, Linyi; Zhu, Kaijie; Chen, Hao; Yi, Xiaoyuan; Wang, Cunxiang; Wang, Yidong; Ye, Wei; Zhang, Yue; Chang, Yi; Yu, Philip S.; Yang, Qiang; Xie, Xing
    ACM Transactions on Intelligent Systems and Technology, 2024, 15(3)
  • [32] StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models
    Guo, Zhicheng; Cheng, Sijie; Wang, Hao; Liang, Shihao; Qin, Yujia; Li, Peng; Liu, Zhiyuan; Sun, Maosong; Liu, Yang
    Findings of the Association for Computational Linguistics: ACL 2024, 2024: 11143-11156
  • [33] Benchmarking Large Language Models on Communicative Medical Coaching: A Dataset and a Novel System
    Huang, Hengguan; Wang, Songtao; Liu, Hongfu; Wang, Hao; Wang, Ye
    Findings of the Association for Computational Linguistics: ACL 2024, 2024: 1624-1637
  • [34] EchoSwift: An Inference Benchmarking and Configuration Discovery Tool for Large Language Models (LLMs)
    Krishna, Karthik; Bandili, Ramana
    Companion of the 15th ACM/SPEC International Conference on Performance Engineering (ICPE Companion 2024), 2024: 158-162
  • [35] Harnessing Large Language Models for Software Vulnerability Detection: A Comprehensive Benchmarking Study
    Tamberg, Karl; Bahsi, Hayretdin
    IEEE Access, 2025, 13: 29698-29717
  • [36] Large language models and rheumatology: a comparative evaluation
    Venerito, Vincenzo; Puttaswamy, Darshan; Iannone, Florenzo; Gupta, Latika
    Lancet Rheumatology, 2023, 5(10): E574-E578
  • [37] Automatic Evaluation of Attribution by Large Language Models
    Yue, Xiang; Wang, Boshi; Chen, Ziru; Zhang, Kai; Su, Yu; Sun, Huan
    Findings of the Association for Computational Linguistics: EMNLP 2023, 2023: 4615-4635
  • [38] Factuality Enhanced Language Models for Open-Ended Text Generation
    Lee, Nayeon; Ping, Wei; Xu, Peng; Patwary, Mostofa; Fung, Pascale; Shoeybi, Mohammad; Catanzaro, Bryan
    Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022
  • [39] UHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation
    Liang, Xun; Song, Shichao; Niu, Simin; Li, Zhiyu; Xiong, Feiyu; Tang, Bo; Wang, Yezhaohui; He, Dawei; Cheng, Peng; Wang, Zhonghao; Deng, Haiying
    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Vol. 1: Long Papers, 2024: 5266-5293
  • [40] Evaluating large language models on geospatial tasks: a multiple geospatial task benchmarking study
    Xu, Liuchang; Zhao, Shuo; Lin, Qingming; Chen, Luyao; Luo, Qianqian; Wu, Sensen; Ye, Xinyue; Feng, Hailin; Du, Zhenhong
    International Journal of Digital Earth, 2025, 18(1)