Drowzee: Metamorphic Testing for Fact-Conflicting Hallucination Detection in Large Language Models

Cited by: 0

Authors
Li, Ningke [1 ]
Li, Yuekang [2 ]
Liu, Yi [3 ]
Shi, Ling [3 ]
Wang, Kailong [1 ]
Wang, Haoyu [1 ]
Affiliations
[1] Huazhong University of Science and Technology, Wuhan, China
[2] The University of New South Wales, Sydney, Australia
[3] Nanyang Technological University, Singapore, Singapore
Abstract
Large language models (LLMs) have revolutionized language processing, but face critical challenges with security, privacy, and generating hallucinations, i.e., coherent but factually inaccurate outputs. A major issue is fact-conflicting hallucination (FCH), where LLMs produce content contradicting ground-truth facts. Addressing FCH is difficult due to two key challenges: 1) automatically constructing and updating benchmark datasets is hard, as existing methods rely on manually curated static benchmarks that cannot cover the broad, evolving spectrum of FCH cases; 2) validating the reasoning behind LLM outputs is inherently difficult, especially for complex logical relations. To tackle these challenges, we introduce a novel logic-programming-aided metamorphic testing technique for FCH detection. We develop an extensive and extensible framework that constructs a comprehensive factual knowledge base by crawling sources like Wikipedia, seamlessly integrated into Drowzee. Using logical reasoning rules, we transform and augment this knowledge into a large set of test cases with ground-truth answers. We test LLMs on these cases through template-based prompts, requiring them to provide reasoned answers. To validate their reasoning, we propose two semantic-aware oracles that assess the similarity between the semantic structures of the LLM answers and the ground truth. Our approach automatically generates useful test cases and identifies hallucinations across six LLMs within nine domains, with hallucination rates ranging from 24.7% to 59.8%. Key findings include LLMs struggling with temporal concepts and out-of-distribution knowledge, and lacking logical reasoning capabilities. The results show that logic-based test cases generated by Drowzee effectively trigger and detect hallucinations. To further mitigate the identified FCHs, we explored model editing techniques, which proved effective on a small scale (with edits to fewer than 1,000 knowledge pieces). Our findings emphasize the need for continued community efforts to detect and mitigate model hallucinations. © 2024 Copyright held by the owner/author(s).
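The pipeline the abstract describes (knowledge base, reasoning rules, templated test cases with known answers) can be pictured with a minimal sketch. The sketch below assumes a single toy transitivity rule and hypothetical names (Fact, derive_transitive, make_prompt); it illustrates the general logic-rule-based metamorphic testing idea, not Drowzee's actual implementation, and omits the semantic-aware oracle step.

```python
# A minimal, hypothetical sketch of logic-rule-driven test generation: seed facts
# are combined under one reasoning rule to derive new facts, and each derived fact
# becomes a prompt whose ground-truth answer is known by construction.
# Names (Fact, derive_transitive, make_prompt) are illustrative, not Drowzee's API.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str
    obj: str

# Seed facts, e.g. extracted from a crawled source such as Wikipedia.
facts = {
    Fact("Mozart", "born_in", "Salzburg"),
    Fact("Salzburg", "located_in", "Austria"),
}

def derive_transitive(known):
    """Apply one reasoning rule:
    born_in(x, y) AND located_in(y, z)  =>  born_in_country(x, z)."""
    return {
        Fact(a.subject, "born_in_country", b.obj)
        for a, b in product(known, known)
        if a.relation == "born_in"
        and b.relation == "located_in"
        and a.obj == b.subject
    }

def make_prompt(fact):
    """Render a derived fact as a template-based question demanding reasoning."""
    return (f"Is the following statement true? {fact.subject} "
            f"{fact.relation.replace('_', ' ')} {fact.obj}. "
            f"Answer yes or no, then explain your reasoning step by step.")

# Each derived fact is true by construction, so the ground-truth answer is "yes";
# an LLM response contradicting it signals a fact-conflicting hallucination.
for case in derive_transitive(facts):
    print(make_prompt(case))
```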
DOI
10.1145/3689776
Related Papers (50 total)
  • [41] Comparison of Static Application Security Testing Tools and Large Language Models for Repo-level Vulnerability Detection
    Singapore Management University, Singapore
    arXiv,
  • [42] Towards Autonomous Testing Agents via Conversational Large Language Models
    Feldt, Robert
    Kang, Sungmin
    Yoon, Juyeon
    Yoo, Shin
    2023 38TH IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING, ASE, 2023, : 1688 - 1693
  • [43] Comprehensive testing of large language models for extraction of structured data in pathology
    Bastian Grothey
    Jan Odenkirchen
    Adnan Brkic
    Birgid Schömig-Markiefka
    Alexander Quaas
    Reinhard Büttner
    Yuri Tolkach
    Communications Medicine, 5 (1):
  • [44] Getting pwn'd by AI: Penetration Testing with Large Language Models
    Happe, Andreas
    Cito, Juergen
    PROCEEDINGS OF THE 31ST ACM JOINT MEETING EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING, ESEC/FSE 2023, 2023, : 2082 - 2086
  • [45] Weak supervision for Question Type Detection with large language models
    Martinek, Jiri
    Cerisara, Christophe
    Kral, Pavel
    Lenc, Ladislav
    Baloun, Josef
    INTERSPEECH 2022, 2022, : 3283 - 3287
  • [46] Investigating the Efficacy of Large Language Models for Code Clone Detection
    Khajezade, Mohamad
    Wu, Jie J. W.
    Fard, Fatemeh Hendijani
    Rodriguez-Perez, Gema
    Shehata, Mohamed Sami
    PROCEEDINGS 2024 32ND IEEE/ACM INTERNATIONAL CONFERENCE ON PROGRAM COMPREHENSION, ICPC 2024, 2024, : 161 - 165
  • [47] Assessing the Code Clone Detection Capability of Large Language Models
    Zhang, Zixian
    Saber, Takfarinas
    PROCEEDINGS OF THE 4TH INTERNATIONAL CONFERENCE ON CODE QUALITY, ICCQ 2024, 2024,
  • [48] Code Detection for Hardware Acceleration Using Large Language Models
    Martinez, Pablo Antonio
    Bernabe, Gregorio
    Garcia, Jose Manuel
    IEEE ACCESS, 2024, 12 : 35271 - 35281
  • [49] Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge
    Maloyan, Narek
    Verma, Ekansh
    Nutfullin, Bulat
    Ashinov, Bislan
    arXiv,
  • [50] An Empirical Study on How Large Language Models Impact Software Testing Learning
    Mezzaro, Simone
    Gambi, Alessio
    Fraser, Gordon
    PROCEEDINGS OF 2024 28TH INTERNATIONAL CONFERENCE ON EVALUATION AND ASSESSMENT IN SOFTWARE ENGINEERING, EASE 2024, 2024, : 555 - 564