Can GPT-3.5 generate and code discharge summaries?

Cited by: 1
Authors
Falis, Matus [1 ]
Gema, Aryo Pradipta [1 ]
Dong, Hang [2 ]
Daines, Luke [3 ]
Basetti, Siddharth [4 ]
Holder, Michael [5 ]
Penfold, Rose S. [6 ,7 ]
Birch, Alexandra [1 ]
Alex, Beatrice [8 ,9 ]
Affiliations
[1] Univ Edinburgh, Sch Informat, 10 Crichton St, Edinburgh EH8 9AB, Scotland
[2] Univ Exeter, Dept Comp Sci, Exeter EX4 4QF, England
[3] Univ Edinburgh, Usher Inst, Ctr Med Informat, Edinburgh EH16 4UX, Scotland
[4] Natl Hlth Serv Highland, Dept Res Dev & Innovat, Inverness IV2 3JH, Scotland
[5] Univ Edinburgh, Usher Inst, Ctr Populat Hlth Sci, Edinburgh EH16 4UX, Scotland
[6] Univ Edinburgh, Usher Inst, Ageing & Hlth, Edinburgh EH16 4UX, Scotland
[7] Univ Edinburgh, Adv Care Res Ctr, Edinburgh EH16 4UX, Scotland
[8] Univ Edinburgh, Edinburgh Futures Inst, Edinburgh EH3 9EF, Scotland
[9] Univ Edinburgh, Sch Literatures Languages & Cultures, Edinburgh EH8 9LH, Scotland
Funding
Wellcome Trust (UK); UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
ICD coding; data augmentation; large language model; clinical text generation; evaluation by clinicians;
DOI
10.1093/jamia/ocae132
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812 ;
Abstract
Objectives: The aim of this study was to investigate GPT-3.5 for generating and coding medical documents with International Classification of Diseases (ICD)-10 codes, for data augmentation on low-resource labels.
Materials and Methods: Employing GPT-3.5, we generated and coded 9606 discharge summaries based on lists of ICD-10 code descriptions of patients with infrequent ("generation") codes within the MIMIC-IV dataset. Combined with the baseline training set, this formed an augmented training set. Neural coding models were trained on the baseline and augmented data and evaluated on a MIMIC-IV test set. We report micro- and macro-F1 scores on the full codeset, on the generation codes, and on their code families. Weak Hierarchical Confusion Matrices determined within-family and outside-of-family coding errors in the latter codesets. The coding performance of GPT-3.5 was evaluated on prompt-guided self-generated data and on real MIMIC-IV data. Clinicians evaluated the clinical acceptability of the generated documents.
Results: Data augmentation results in slightly lower overall model performance but improves performance for the generation candidate codes and their families, including one family absent from the baseline training data. Augmented models display lower out-of-family error rates. GPT-3.5 identifies ICD-10 codes from their prompted descriptions but underperforms on real data. Evaluators highlight the correctness of generated concepts while noting shortcomings in variety, supporting information, and narrative.
Discussion and Conclusion: While GPT-3.5 alone, given our prompt setting, is unsuitable for ICD-10 coding, it supports data augmentation for training neural models. Augmentation positively affects generation code families but mainly benefits codes with existing examples. Augmentation reduces out-of-family errors. Documents generated by GPT-3.5 state the prompted concepts correctly but lack variety and authenticity in their narratives.
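The abstract reports both micro- and macro-averaged F1 over a multi-label ICD-10 codeset; macro-F1 weights each code equally and is therefore the score that low-resource "generation" codes can move. As a minimal illustrative sketch (not the authors' evaluation code; document codes are represented here as simple sets of hypothetical ICD-10 labels), the two averages can be computed as:

```python
from collections import Counter

def micro_macro_f1(gold, pred, labels):
    """Micro- and macro-averaged F1 for multi-label coding.

    gold, pred: lists of per-document code sets.
    labels: the codeset to score over (e.g. the full codeset,
    or only the low-resource "generation" codes).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for c in labels:
            if c in p and c in g:
                tp[c] += 1          # correctly assigned code
            elif c in p:
                fp[c] += 1          # spurious code
            elif c in g:
                fn[c] += 1          # missed code

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    # micro: pool counts over all codes, then score once
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # macro: score each code separately, then average equally
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro
```

Because macro-F1 averages per-code scores, a rare code that the model never predicts drags macro-F1 down by a full 1/|labels| while barely affecting micro-F1, which is why augmentation targeted at infrequent codes can raise macro scores even as overall micro performance dips slightly.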
Pages: 2284-2293
Page count: 10