GPT-4;
Organ Toxicity;
Public Health;
Prompt Engineering;
Artificial Intelligence (AI);
Large Language Models (LLMs);
Drug Toxicity and Safety;
Liver;
Heart;
Kidney;
DOI:
10.1016/j.drudis.2025.104297
CLC number:
R9 [Pharmacy];
Discipline code:
1007;
Abstract:
The growing impact of large language models (LLMs), such as ChatGPT, prompts questions about the reliability of their application in public health. We compared drug toxicity assessments by GPT-4 for the liver, heart, and kidney against expert assessments based on US Food and Drug Administration (FDA) drug-labeling documents. Two approaches were assessed: a 'General prompt', mimicking the conversational style used by the general public, and an 'Expert prompt' engineered to represent the approach of an expert. The Expert prompt achieved higher accuracy (64-75%) than the General prompt (48-72%), but overall performance was moderate, indicating that caution is needed when using GPT-4 for public health. To improve reliability, an advanced framework, such as Retrieval-Augmented Generation (RAG), might be required to leverage the knowledge embedded in GPT-4.
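The evaluation design described above can be sketched as follows. This is a minimal illustration only: the prompt wordings, the one-word output format, and the example labels are assumptions for demonstration, not the study's actual prompts or data.

```python
# Hypothetical sketch of the two prompting styles and the accuracy metric
# described in the abstract. Prompt wording here is an assumption, not the
# authors' actual prompts.

def general_prompt(drug: str, organ: str) -> str:
    """Conversational style, as a member of the general public might ask."""
    return f"Is {drug} bad for my {organ}?"

def expert_prompt(drug: str, organ: str) -> str:
    """Engineered style: fixes the role, evidence source, and output format."""
    return (
        "You are a drug-safety expert. Based on FDA drug-labeling "
        f"documents, classify {drug} as toxic or non-toxic to the {organ}. "
        "Answer with exactly one word: 'toxic' or 'non-toxic'."
    )

def accuracy(predictions: list[str], expert_labels: list[str]) -> float:
    """Fraction of model answers that match the expert reference labels."""
    matches = sum(p == e for p, e in zip(predictions, expert_labels))
    return matches / len(expert_labels)

# Toy example: 3 of 4 hypothetical model answers match the expert labels.
model_answers = ["toxic", "non-toxic", "toxic", "toxic"]
reference = ["toxic", "non-toxic", "non-toxic", "toxic"]
print(accuracy(model_answers, reference))  # → 0.75
```

In practice, each prompt would be sent to the GPT-4 API for every drug-organ pair, and the parsed answers scored against the FDA-derived expert labels as shown.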