Can GPT-3.5 generate and code discharge summaries?

Cited by: 1
Authors
Falis, Matus [1 ]
Gema, Aryo Pradipta [1 ]
Dong, Hang [2 ]
Daines, Luke [3 ]
Basetti, Siddharth [4 ]
Holder, Michael [5 ]
Penfold, Rose S. [6 ,7 ]
Birch, Alexandra [1 ]
Alex, Beatrice [8 ,9 ]
Affiliations
[1] Univ Edinburgh, Sch Informat, 10 Crichton St, Edinburgh EH8 9AB, Scotland
[2] Univ Exeter, Dept Comp Sci, Exeter EX4 4QF, England
[3] Univ Edinburgh, Usher Inst, Ctr Med Informat, Edinburgh EH16 4UX, Scotland
[4] Natl Hlth Serv Highland, Dept Res Dev & Innovat, Inverness IV2 3JH, Scotland
[5] Univ Edinburgh, Usher Inst, Ctr Populat Hlth Sci, Edinburgh EH16 4UX, Scotland
[6] Univ Edinburgh, Usher Inst, Ageing & Hlth, Edinburgh EH16 4UX, Scotland
[7] Univ Edinburgh, Adv Care Res Ctr, Edinburgh EH16 4UX, Scotland
[8] Univ Edinburgh, Edinburgh Futures Inst, Edinburgh EH3 9EF, Scotland
[9] Univ Edinburgh, Sch Literatures Languages & Cultures, Edinburgh EH8 9LH, Scotland
Funding
Wellcome Trust (UK); Engineering and Physical Sciences Research Council (UK);
Keywords
ICD coding; data augmentation; large language model; clinical text generation; evaluation by clinicians;
DOI
10.1093/jamia/ocae132
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Objectives: The aim of this study was to investigate GPT-3.5 in generating and coding medical documents with International Classification of Diseases (ICD)-10 codes for data augmentation on low-resource labels.
Materials and Methods: Employing GPT-3.5, we generated and coded 9606 discharge summaries based on lists of ICD-10 code descriptions of patients with infrequent (or generation) codes within the MIMIC-IV dataset. Combined with the baseline training set, this formed an augmented training set. Neural coding models were trained on the baseline and augmented data and evaluated on a MIMIC-IV test set. We report micro- and macro-F1 scores on the full code set, the generation codes, and their families. Weak Hierarchical Confusion Matrices were used to determine within-family and out-of-family coding errors in the latter code sets. The coding performance of GPT-3.5 was evaluated on prompt-guided self-generated data and on real MIMIC-IV data. Clinicians evaluated the clinical acceptability of the generated documents.
Results: Data augmentation results in slightly lower overall model performance but improves performance for the generation candidate codes and their families, including one code absent from the baseline training data. Augmented models display lower out-of-family error rates. GPT-3.5 identifies ICD-10 codes by their prompted descriptions but underperforms on real data. Evaluators highlight the correctness of the generated concepts while noting shortcomings in variety, supporting information, and narrative.
Discussion and Conclusion: While GPT-3.5 alone, given our prompt setting, is unsuitable for ICD-10 coding, it supports data augmentation for training neural coding models. Augmentation positively affects the generation code families but mainly benefits codes with existing examples. Augmentation reduces out-of-family errors. Documents generated by GPT-3.5 state the prompted concepts correctly but lack variety and authenticity in their narratives.
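A minimal sketch of the generation step described in Materials and Methods, assuming the OpenAI Python client (v1.x) and the gpt-3.5-turbo chat model; the prompt wording, sampling temperature, and example ICD-10 descriptions are illustrative assumptions, not the authors' exact setup.

    # Illustrative only: prompt GPT-3.5 to write a discharge summary covering
    # a given list of ICD-10 code descriptions (assumed prompt wording).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def generate_summary(code_descriptions: list[str]) -> str:
        prompt = (
            "Write a realistic hospital discharge summary for a patient whose "
            "diagnoses correspond to these ICD-10 code descriptions:\n"
            + "\n".join(f"- {d}" for d in code_descriptions)
            + "\nAfter the summary, list the ICD-10 codes you covered."
        )
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",  # GPT-3.5 family studied in the article
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,        # assumption; settings are not given in this record
        )
        return response.choices[0].message.content

    # Hypothetical low-frequency ("generation") code descriptions.
    print(generate_summary([
        "Other pulmonary embolism with acute cor pulmonale (I26.09)",
        "Type 2 diabetes mellitus with diabetic nephropathy (E11.21)",
    ]))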
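The reported metrics can likewise be sketched with scikit-learn: micro-F1 pools all code decisions across documents, while macro-F1 averages per-code F1 so that low-resource codes weigh equally. The label space and predictions below are toy values, not MIMIC-IV data.

    # Toy illustration of micro- vs macro-averaged F1 for multi-label ICD coding.
    import numpy as np
    from sklearn.metrics import f1_score

    codes = ["I26.09", "E11.21", "J18.9"]      # hypothetical label space
    y_true = np.array([[1, 0, 1],
                       [0, 1, 1]])             # gold code assignments per document
    y_pred = np.array([[1, 0, 1],
                       [0, 1, 0]])             # model predictions per document

    micro_f1 = f1_score(y_true, y_pred, average="micro")  # pools all code decisions
    macro_f1 = f1_score(y_true, y_pred, average="macro")  # equal weight per code
    print(f"micro-F1={micro_f1:.3f}  macro-F1={macro_f1:.3f}")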
Pages: 2284-2293
Number of pages: 10