Can GPT-3.5 generate and code discharge summaries?

Cited by: 1
Authors
Falis, Matus [1 ]
Gema, Aryo Pradipta [1 ]
Dong, Hang [2 ]
Daines, Luke [3 ]
Basetti, Siddharth [4 ]
Holder, Michael [5 ]
Penfold, Rose S. [6 ,7 ]
Birch, Alexandra [1 ]
Alex, Beatrice [8 ,9 ]
Affiliations
[1] Univ Edinburgh, Sch Informat, 10 Crichton St, Edinburgh EH8 9AB, Scotland
[2] Univ Exeter, Dept Comp Sci, Exeter EX4 4QF, England
[3] Univ Edinburgh, Usher Inst, Ctr Med Informat, Edinburgh EH16 4UX, Scotland
[4] Natl Hlth Serv Highland, Dept Res Dev & Innovat, Inverness IV2 3JH, Scotland
[5] Univ Edinburgh, Usher Inst, Ctr Populat Hlth Sci, Edinburgh EH16 4UX, Scotland
[6] Univ Edinburgh, Usher Inst, Ageing & Hlth, Edinburgh EH16 4UX, Scotland
[7] Univ Edinburgh, Adv Care Res Ctr, Edinburgh EH16 4UX, Scotland
[8] Univ Edinburgh, Edinburgh Futures Inst, Edinburgh EH3 9EF, Scotland
[9] Univ Edinburgh, Sch Literatures Languages & Cultures, Edinburgh EH8 9LH, Scotland
Funding
Wellcome Trust (UK); Engineering and Physical Sciences Research Council (EPSRC, UK);
Keywords
ICD coding; data augmentation; large language model; clinical text generation; evaluation by clinicians;
DOI
10.1093/jamia/ocae132
Chinese Library Classification
TP [automation technology; computer technology];
Discipline code
0812;
Abstract
Objectives: The aim of this study was to investigate GPT-3.5 in generating and coding medical documents with International Classification of Diseases, 10th Revision (ICD-10) codes for data augmentation on low-resource labels.
Materials and Methods: Using GPT-3.5, we generated and coded 9606 discharge summaries based on lists of ICD-10 code descriptions of patients with infrequent (generation) codes within the MIMIC-IV dataset. Combined with the baseline training set, this formed an augmented training set. Neural coding models were trained on the baseline and augmented data and evaluated on a MIMIC-IV test set. We report micro- and macro-F1 scores on the full codeset, on the generation codes, and on their code families. Weak Hierarchical Confusion Matrices determined within-family and out-of-family coding errors in the latter codesets. The coding performance of GPT-3.5 itself was evaluated on prompt-guided self-generated data and on real MIMIC-IV data. Clinicians evaluated the clinical acceptability of the generated documents.
Results: Data augmentation results in slightly lower overall model performance but improves performance on the generation candidate codes and their families, including one code absent from the baseline training data. Augmented models display lower out-of-family error rates. GPT-3.5 can identify ICD-10 codes by their prompted descriptions but underperforms on real data. Evaluators highlight the correctness of the generated concepts but note weaknesses in variety, supporting information, and narrative.
Discussion and Conclusion: While GPT-3.5 alone, given our prompt setting, is unsuitable for ICD-10 coding, it supports data augmentation for training neural coding models. Augmentation positively affects the generation code families but mainly benefits codes with existing examples, and it reduces out-of-family errors. Documents generated by GPT-3.5 state the prompted concepts correctly but lack variety and authenticity in their narratives.
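The micro- and macro-F1 scores reported in the abstract can be illustrated with a minimal multi-label sketch. This is a stdlib-only Python illustration with hypothetical ICD-10 code sets, not the authors' evaluation pipeline:

```python
from collections import Counter

def micro_macro_f1(gold, pred, labels):
    """Micro-/macro-F1 for multi-label code assignment.

    gold, pred: parallel lists of per-document code sets.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for lab in labels:
            if lab in p and lab in g:
                tp[lab] += 1  # correctly assigned
            elif lab in p:
                fp[lab] += 1  # assigned but not in gold
            elif lab in g:
                fn[lab] += 1  # in gold but missed

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    # micro: pool counts across all labels; macro: average per-label F1
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro

# Hypothetical code sets for two documents
gold = [{"I10", "E11.9"}, {"I10"}]
pred = [{"I10"}, {"I10", "E11.9"}]
micro, macro = micro_macro_f1(gold, pred, ["I10", "E11.9"])
```

Micro-F1 pools counts and is dominated by frequent codes, while macro-F1 weights every code equally and so exposes performance on rare codes, which is why both are reported in a study targeting low-resource labels.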
Pages: 2284-2293
Page count: 10
Related articles
50 in total
  • [41] Comparative performance of humans versus GPT-4.0 and GPT-3.5 in the self-assessment program of American Academy of Ophthalmology
    Taloni, Andrea
    Borselli, Massimiliano
    Scarsi, Valentina
    Rossi, Costanza
    Coco, Giulia
    Scorcia, Vincenzo
    Giannaccare, Giuseppe
    SCIENTIFIC REPORTS, 2023, 13 (01)
  • [42] AI-Enhanced Auto-Correction of Programming Exercises: How Effective is GPT-3.5?
    Azaiz, Imen
    Deckarm, Oliver
    Strickroth, Sven
    INTERNATIONAL JOURNAL OF ENGINEERING PEDAGOGY, 2023, 13 (08): 67 - 83
  • [43] LLMs Still Can't Avoid Instanceof: An Investigation Into GPT-3.5, GPT-4 and Bard's Capacity to Handle Object-Oriented Programming Assignments
    Cipriano, Bruno Pereira
    Alves, Pedro
    2024 ACM/IEEE 44TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING: SOFTWARE ENGINEERING EDUCATION AND TRAINING, ICSE-SEET 2024, 2024, : 162 - 169
  • [44] Learning to Generate Structured Code Summaries From Hybrid Code Context
    Zhou, Ziyi
    Li, Mingchen
    Yu, Huiqun
    Fan, Guisheng
    Yang, Penghui
    Huang, Zijie
    IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2024, 50 (10) : 2512 - 2528
  • [45] Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study
    Meyer, Annika
    Riese, Janik
    Streichert, Thomas
    JMIR MEDICAL EDUCATION, 2024, 10
  • [46] Comparative analysis of GPT-3.5 and GPT-4.0 in Taiwan's medical technologist certification: A study in artificial intelligence advancements
    Yang, Wan-Hua
    Chan, Yun-Hsiang
    Huang, Cheng-Pin
    Chen, Tzeng-Ji
    JOURNAL OF THE CHINESE MEDICAL ASSOCIATION, 2024, 87 (05) : 525 - 530
  • [47] Evaluating Large Language Models for the National Premedical Exam in India: Comparative Analysis of GPT-3.5, GPT-4, and Bard
    Farhat, Faiza
    Chaudhry, Beenish Moalla
    Nadeem, Mohammad
    Sohail, Shahab Saquib
    Madsen, Dag Oivind
    JMIR MEDICAL EDUCATION, 2024, 10
  • [48] Inconsistently Accurate: Repeatability of GPT-3.5 and GPT-4 in Answering Radiology Board-style Multiple Choice Questions
    Ballard, David H.
    RADIOLOGY, 2024, 311 (02)
  • [49] ChatGPT as a Source of Information for Bariatric Surgery Patients: a Comparative Analysis of Accuracy and Comprehensiveness Between GPT-4 and GPT-3.5
    Samaan, Jamil S.
    Rajeev, Nithya
    Ng, Wee Han
    Srinivasan, Nitin
    Busam, Jonathan A.
    Yeo, Yee Hui
    Samakar, Kamran
    OBESITY SURGERY, 2024, 34 (05) : 1987 - 1989