Drowzee: Metamorphic Testing for Fact-Conflicting Hallucination Detection in Large Language Models

Cited by: 0

Authors
Li, Ningke [1 ]
Li, Yuekang [2 ]
Liu, Yi [3 ]
Shi, Ling [3 ]
Wang, Kailong [1 ]
Wang, Haoyu [1 ]
Affiliations
[1] Huazhong University of Science and Technology, Wuhan, China
[2] The University of New South Wales, Sydney, Australia
[3] Nanyang Technological University, Singapore, Singapore
Abstract
Large language models (LLMs) have revolutionized language processing, but face critical challenges with security, privacy, and generating hallucinations, i.e., coherent but factually inaccurate outputs. A major issue is fact-conflicting hallucination (FCH), where LLMs produce content contradicting ground-truth facts. Addressing FCH is difficult due to two key challenges: 1) automatically constructing and updating benchmark datasets is hard, as existing methods rely on manually curated static benchmarks that cannot cover the broad, evolving spectrum of FCH cases; 2) validating the reasoning behind LLM outputs is inherently difficult, especially for complex logical relations. To tackle these challenges, we introduce a novel logic-programming-aided metamorphic testing technique for FCH detection. We develop an extensive and extensible framework that constructs a comprehensive factual knowledge base by crawling sources like Wikipedia, seamlessly integrated into Drowzee. Using logical reasoning rules, we transform and augment this knowledge into a large set of test cases with ground-truth answers. We test LLMs on these cases through template-based prompts, requiring them to provide reasoned answers. To validate their reasoning, we propose two semantic-aware oracles that assess the similarity between the semantic structures of the LLM answers and the ground truth. Our approach automatically generates useful test cases and identifies hallucinations across six LLMs within nine domains, with hallucination rates ranging from 24.7% to 59.8%. Key findings include LLMs struggling with temporal concepts and out-of-distribution knowledge, and lacking logical reasoning capabilities. The results show that logic-based test cases generated by Drowzee effectively trigger and detect hallucinations. To further mitigate the identified FCHs, we explored model editing techniques, which proved effective on a small scale (with edits to fewer than 1,000 knowledge pieces). Our findings emphasize the need for continued community efforts to detect and mitigate model hallucinations. © 2024 Copyright held by the owner/author(s).
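The pipeline the abstract describes (knowledge base, reasoning rules, templated test cases with known answers) can be pictured with a minimal sketch. The sketch below assumes a single toy transitivity rule and hypothetical names (Fact, derive_transitive, make_prompt); it illustrates the general logic-rule-based metamorphic testing idea, not Drowzee's actual implementation, and omits the semantic-aware oracle step.

```python
# A minimal, hypothetical sketch of logic-rule-driven test generation: seed facts
# are combined under one reasoning rule to derive new facts, and each derived fact
# becomes a prompt whose ground-truth answer is known by construction.
# Names (Fact, derive_transitive, make_prompt) are illustrative, not Drowzee's API.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Fact:
    subject: str
    relation: str
    obj: str

# Seed facts, e.g. extracted from a crawled source such as Wikipedia.
facts = {
    Fact("Mozart", "born_in", "Salzburg"),
    Fact("Salzburg", "located_in", "Austria"),
}

def derive_transitive(known):
    """Apply one reasoning rule:
    born_in(x, y) AND located_in(y, z)  =>  born_in_country(x, z)."""
    return {
        Fact(a.subject, "born_in_country", b.obj)
        for a, b in product(known, known)
        if a.relation == "born_in"
        and b.relation == "located_in"
        and a.obj == b.subject
    }

def make_prompt(fact):
    """Render a derived fact as a template-based question demanding reasoning."""
    return (f"Is the following statement true? {fact.subject} "
            f"{fact.relation.replace('_', ' ')} {fact.obj}. "
            f"Answer yes or no, then explain your reasoning step by step.")

# Each derived fact is true by construction, so the ground-truth answer is "yes";
# an LLM response contradicting it signals a fact-conflicting hallucination.
for case in derive_transitive(facts):
    print(make_prompt(case))
```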
DOI
10.1145/3689776
Related Papers (50 total)
  • [41] Comparison of Static Application Security Testing Tools and Large Language Models for Repo-level Vulnerability Detection
    Singapore Management University, Singapore
    arXiv,
  • [42] Towards Autonomous Testing Agents via Conversational Large Language Models
    Feldt, Robert
    Kang, Sungmin
    Yoon, Juyeon
    Yoo, Shin
    2023 38TH IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING, ASE, 2023, : 1688 - 1693
  • [43] Comprehensive testing of large language models for extraction of structured data in pathology
    Bastian Grothey
    Jan Odenkirchen
    Adnan Brkic
    Birgid Schömig-Markiefka
    Alexander Quaas
    Reinhard Büttner
    Yuri Tolkach
    Communications Medicine, 5 (1):
  • [44] Getting pwn'd by AI: Penetration Testing with Large Language Models
    Happe, Andreas
    Cito, Juergen
    PROCEEDINGS OF THE 31ST ACM JOINT MEETING EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING, ESEC/FSE 2023, 2023, : 2082 - 2086
  • [45] Weak supervision for Question Type Detection with large language models
    Martinek, Jiri
    Cerisara, Christophe
    Kral, Pavel
    Lenc, Ladislav
    Baloun, Josef
    INTERSPEECH 2022, 2022, : 3283 - 3287
  • [46] Investigating the Efficacy of Large Language Models for Code Clone Detection
    Khajezade, Mohamad
    Wu, Jie J. W.
    Fard, Fatemeh Hendijani
    Rodriguez-Perez, Gema
    Shehata, Mohamed Sami
    PROCEEDINGS 2024 32ND IEEE/ACM INTERNATIONAL CONFERENCE ON PROGRAM COMPREHENSION, ICPC 2024, 2024, : 161 - 165
  • [47] Assessing the Code Clone Detection Capability of Large Language Models
    Zhang, Zixian
    Saber, Takfarinas
    PROCEEDINGS OF THE 4TH INTERNATIONAL CONFERENCE ON CODE QUALITY, ICCQ 2024, 2024,
  • [48] Code Detection for Hardware Acceleration Using Large Language Models
    Martinez, Pablo Antonio
    Bernabe, Gregorio
    Garcia, Jose Manuel
    IEEE ACCESS, 2024, 12 : 35271 - 35281
  • [49] Trojan Detection in Large Language Models: Insights from The Trojan Detection Challenge
    Maloyan, Narek
    Verma, Ekansh
    Nutfullin, Bulat
    Ashinov, Bislan
    arXiv,
  • [50] An Empirical Study on How Large Language Models Impact Software Testing Learning
    Mezzaro, Simone
    Gambi, Alessio
    Fraser, Gordon
    PROCEEDINGS OF 2024 28TH INTERNATIONAL CONFERENCE ON EVALUATION AND ASSESSMENT IN SOFTWARE ENGINEERING, EASE 2024, 2024, : 555 - 564