Assessment of decision-making with locally run and web-based large language models versus human board recommendations in otorhinolaryngology, head and neck surgery

Cited by: 0

Authors:

Buhr, Christoph Raphael [1,2]

Ernst, Benjamin Philipp [3]

Blaikie, Andrew [2]

Smith, Harry [4]

Kelsey, Tom [4]

Matthias, Christoph [1]

Fleischmann, Maximilian [5]

Jungmann, Florian [6]

Alt, Juergen [7]

Brandts, Christian [8]

Kaemmerer, Peer W. [9]

Foersch, Sebastian [10]

Kuhn, Sebastian [11,12]

Eckrich, Jonas [1]

Affiliations:

[1] Johannes Gutenberg Univ Mainz, Univ Med Ctr, Dept Otorhinolaryngol, Langenbeckstr 1, D-55131 Mainz, Germany

[2] Univ St Andrews, Sch Med, St Andrews, Fife, Scotland

[3] Goethe Univ Frankfurt, Med Ctr, Theodor Stern Kai 7, D-60596 Frankfurt, Germany

[4] Univ St Andrews, Sch Comp Sci, St Andrews, Scotland

[5] Goethe Univ Frankfurt, Med Ctr, Theodor Stern Kai 7, D-60596 Frankfurt, Germany

[6] Marienhaus Hosp Saarlouis, Outpatient Dept Radiol & Nucl Med, Kapuzinerstr 4, D-66740 Saarlouis, Germany

[7] Johannes Gutenberg Univ Mainz, Dept Hematol & Med Oncol, Univ Med Ctr Mainz, Langenbeckstr 1, D-55131 Mainz, Germany

[8] Univ Med Ctr Frankfurt, Dept Hematol & Med Oncol, Theodor Stern Kai 7, D-60596 Frankfurt, Germany

[9] Johannes Gutenberg Univ Mainz, Univ Med Ctr Mainz, Dept Oral & Maxillofacial Surg Plast Surg, Langenbeckstr 1, D-55131 Mainz, Germany

[10] Johannes Gutenberg Univ Mainz, Univ Med Ctr, Inst Pathol, Langenbeckstr 1, D-55131 Mainz, Germany

[11] Philipps Univ Marburg, Inst Digital Med, Marburg, Germany

[12] Univ Hosp Giessen & Marburg, Marburg, Germany

Source:

EUROPEAN ARCHIVES OF OTO-RHINO-LARYNGOLOGY | 2025 / Vol. 282 / Issue 03

Keywords:

Large language models; LLM; Artificial intelligence; AI; ChatGPT; Llama; Otorhinolaryngology; ORL; Head and neck; Digital health; Chatbot; Language model; BENEFITS;

DOI:

10.1007/s00405-024-09153-3

Chinese Library Classification (CLC) number:

R76 [Otorhinolaryngology];

Discipline classification code:

100213 ;

Abstract:

Introduction: Tumor boards are a cornerstone of modern cancer treatment. Given their advanced capabilities, the role of Large Language Models (LLMs) in generating tumor board decisions for otorhinolaryngology (ORL) head and neck surgery is gaining increasing attention. However, concerns over data protection and the use of confidential patient information in web-based LLMs have restricted their widespread adoption and hindered the exploration of their full potential. In this first study of its kind, we compared the recommendations of a standard human multidisciplinary tumor board (MDT) against those of a web-based LLM (ChatGPT-4o) and a locally run LLM (Llama 3), the latter addressing data protection concerns.

Material and methods: Twenty-five simulated tumor board cases were presented to an MDT composed of specialists from otorhinolaryngology, craniomaxillofacial surgery, medical oncology, radiology, radiation oncology, and pathology. This multidisciplinary team provided a comprehensive analysis of the cases. The same cases were input into ChatGPT-4o and Llama 3 using structured prompts, and the concordance between the LLMs' and the MDT's recommendations was assessed. Four MDT members evaluated the LLMs' recommendations in terms of medical adequacy (using a six-point Likert scale) and whether the information provided could have influenced the MDT's original recommendations.

Results: ChatGPT-4o showed 84% concordance (21 out of 25 cases) and Llama 3 showed 92% concordance (23 out of 25 cases) with the MDT in distinguishing between curative and palliative treatment strategies. ChatGPT-4o in 64% of cases (16/25) and Llama 3 in 60% of cases (15/25) identified all first-line therapy options considered by the MDT, though with varying priority. ChatGPT-4o presented all the MDT's first-line therapies in 52% of cases (13/25), while Llama 3 offered a homologous treatment strategy in 48% of cases (12/25). Additionally, both models proposed at least one of the MDT's first-line therapies as their top recommendation in 28% of cases (7/25). The ratings for medical adequacy yielded a mean score of 4.7 (IQR: 4-6) for ChatGPT-4o and 4.3 (IQR: 3-5) for Llama 3. In 17% of the assessments (33/200), MDT members indicated that the LLM recommendations could potentially enhance the MDT's decisions.

Discussion: This study demonstrates the capability of both LLMs to provide viable therapeutic recommendations in ORL head and neck surgery. Llama 3, operating locally, bypasses many data protection issues and shows promise as a clinical tool to support MDT decisions. However, at present, LLMs should augment rather than replace human decision-making.

Pages: 1593-1607

Page count: 15
