FELM: Benchmarking Factuality Evaluation of Large Language Models

Cited by: 0
Authors
Chen, Shiqi [1 ,2 ]
Zhao, Yiran [3 ]
Zhang, Jinghan [2 ]
Chern, I-Chun [4 ]
Gao, Siyang [1 ]
Liu, Pengfei [5 ]
He, Junxian [2 ]
Affiliations
[1] City Univ Hong Kong, Hong Kong, Peoples R China
[2] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[3] Natl Univ Singapore, Singapore, Singapore
[4] Carnegie Mellon Univ, Pittsburgh, PA USA
[5] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Assessing the factuality of text generated by large language models (LLMs) is an emerging yet crucial research area, aimed at alerting users to potential errors and guiding the development of more reliable LLMs. Nonetheless, factuality evaluators themselves require suitable evaluation to gauge progress and foster advancement. This direction remains under-explored, posing substantial impediments to the progress of factuality evaluators. To mitigate this issue, we introduce a benchmark for Factuality Evaluation of large Language Models, referred to as FELM. In this benchmark, we collect responses generated by LLMs and annotate factuality labels in a fine-grained manner. Contrary to previous studies that primarily concentrate on the factuality of world knowledge (e.g., information from Wikipedia), FELM focuses on factuality across diverse domains, spanning from world knowledge to math and reasoning. Our annotation is based on text segments, which helps pinpoint specific factual errors. The factuality annotations are further supplemented by predefined error types and reference links that either support or contradict the statement. In our experiments, we investigate the performance of several LLM-based factuality evaluators on FELM, including both vanilla LLMs and those augmented with retrieval mechanisms and chain-of-thought processes. Our findings reveal that while retrieval aids factuality evaluation, current LLMs are far from satisfactory at faithfully detecting factual errors.
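The abstract describes segment-level annotation (a factuality label per text segment, with an error type and reference links attached to erroneous segments) and evaluators judged on how well they flag the erroneous segments. Below is a minimal Python sketch of what such a record and a segment-level error-detection F1 could look like; the class names (`Segment`, `FelmExample`), field names (`error_type`, `references`), the `segment_f1` helper, and the toy data are illustrative assumptions, not the released FELM schema or the paper's official metric.

```python
# Minimal sketch (hypothetical schema, not the official FELM release) of a
# segment-level factuality record and an error-detection F1 over segments.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One annotated span of an LLM response (illustrative field names)."""
    text: str
    is_factual: bool                        # gold label from human annotators
    error_type: Optional[str] = None        # e.g. "knowledge_error" when not factual
    references: Optional[List[str]] = None  # links supporting or contradicting the span


@dataclass
class FelmExample:
    prompt: str
    segments: List[Segment]                 # the response, split into labeled segments


def segment_f1(gold: List[bool], predicted: List[bool]) -> float:
    """F1 over the 'contains a factual error' class, computed per segment.

    gold[i] / predicted[i] are True when segment i is flagged as erroneous.
    """
    tp = sum(g and p for g, p in zip(gold, predicted))
    fp = sum((not g) and p for g, p in zip(gold, predicted))
    fn = sum(g and (not p) for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    example = FelmExample(
        prompt="Who wrote 'On the Origin of Species' and in what year?",
        segments=[
            Segment("Charles Darwin wrote 'On the Origin of Species'.", True),
            Segment("It was first published in 1869.", False,
                    error_type="knowledge_error",
                    references=["https://en.wikipedia.org/wiki/On_the_Origin_of_Species"]),
        ],
    )
    gold = [not s.is_factual for s in example.segments]  # True = erroneous segment
    predicted = [False, True]                            # a toy evaluator's output
    print(f"segment-level error-detection F1: {segment_f1(gold, predicted):.2f}")
```

Scoring per segment rather than per response is what lets the benchmark credit an evaluator for localizing a factual error, not merely for judging the response as a whole.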
Pages: 22
Related Papers
50 in total (items 21-30 shown)
  • [21] SEED-Bench: Benchmarking Multimodal Large Language Models
    Li, Bohao
    Ge, Yuying
    Ge, Yixiao
    Wang, Guangzhi
    Wang, Rui
    Zhang, Ruimao
Shan, Ying
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 13299 - 13308
  • [22] Quantifying Bias in Agentic Large Language Models: A Benchmarking Approach
    Fernando, Riya
    Norton, Isabel
    Dogra, Pranay
    Sarnaik, Rohit
    Wazir, Hasan
    Ren, Zitang
    Gunda, Niveta Sree
    Mukhopadhyay, Anushka
    Lutz, Michael
    2024 5TH INFORMATION COMMUNICATION TECHNOLOGIES CONFERENCE, ICTC 2024, 2024, : 349 - 353
  • [23] Benchmarking Large Language Models for Log Analysis, Security, and Interpretation
    Karlsen, Egil
    Luo, Xiao
    Zincir-Heywood, Nur
    Heywood, Malcolm
    JOURNAL OF NETWORK AND SYSTEMS MANAGEMENT, 2024, 32 (03)
  • [24] Using Large Language Models for Robot-Assisted Therapeutic Role-Play: Factuality is not enough!
    Hohn, Sviatlana
    Nasir, Jauwairia
    Paikan, Ali
    Ziafati, Pouyan
    Andre, Elisabeth
    PROCEEDINGS OF THE 6TH CONFERENCE ON ACM CONVERSATIONAL USER INTERFACES, CUI 2024, 2024,
  • [25] Towards Benchmarking and Improving the Temporal Reasoning Capability of Large Language Models
    Tan, Qingyu
    Ng, Hwee Tou
    Bing, Lidong
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 14820 - 14835
  • [26] MedExpQA: Multilingual benchmarking of Large Language Models for Medical Question Answering
    Alonso, Inigo
    Oronoz, Maite
    Agerri, Rodrigo
    ARTIFICIAL INTELLIGENCE IN MEDICINE, 2024, 155
  • [27] Benchmarking Vision Capabilities of Large Language Models in Surgical Examination Questions
    Bereuter, Jean-Paul
    Geissler, Mark Enrik
    Klimova, Anna
    Steiner, Robert-Patrick
    Pfeiffer, Kevin
    Kolbinger, Fiona R.
    Wiest, Isabella C.
    Muti, Hannah Sophie
    Kather, Jakob Nikolas
    JOURNAL OF SURGICAL EDUCATION, 2025, 82 (04)
  • [28] Benchmarking Large Language Models for Automated Verilog RTL Code Generation
    Thakur, Shailja
    Ahmad, Baleegh
    Fan, Zhenxing
    Pearce, Hammond
    Tan, Benjamin
    Karri, Ramesh
    Dolan-Gavitt, Brendan
    Garg, Siddharth
    2023 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION, DATE, 2023,
  • [29] Benchmarking Large Language Models on Controllable Generation under Diversified Instructions
    Chen, Yihan
    Xu, Benfeng
    Wang, Quan
    Liu, Yi
    Mao, Zhendong
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 16, 2024, : 17808 - 17816
  • [30] Benchmarking Causal Study to Interpret Large Language Models for Source Code
    Rodriguez-Cardenas, Daniel
    Palacio, David N.
    Khati, Dipin
    Burke, Henry
    Poshyvanyk, Denys
    2023 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE MAINTENANCE AND EVOLUTION, ICSME, 2023, : 329 - 334