Can GPT-3.5 generate and code discharge summaries?

Cited by: 1
Authors
Falis, Matus [1]
Gema, Aryo Pradipta [1]
Dong, Hang [2]
Daines, Luke [3]
Basetti, Siddharth [4]
Holder, Michael [5]
Penfold, Rose S. [6, 7]
Birch, Alexandra [1]
Alex, Beatrice [8, 9]
Affiliations
[1] Univ Edinburgh, Sch Informat, 10 Crichton St, Edinburgh EH8 9AB, Scotland
[2] Univ Exeter, Dept Comp Sci, Exeter EX4 4QF, England
[3] Univ Edinburgh, Usher Inst, Ctr Med Informat, Edinburgh EH16 4UX, Scotland
[4] Natl Hlth Serv Highland, Dept Res Dev & Innovat, Inverness IV2 3JH, Scotland
[5] Univ Edinburgh, Usher Inst, Ctr Populat Hlth Sci, Edinburgh EH16 4UX, Scotland
[6] Univ Edinburgh, Usher Inst, Ageing & Hlth, Edinburgh EH16 4UX, Scotland
[7] Univ Edinburgh, Adv Care Res Ctr, Edinburgh EH16 4UX, Scotland
[8] Univ Edinburgh, Edinburgh Futures Inst, Edinburgh EH3 9EF, Scotland
[9] Univ Edinburgh, Sch Literatures Languages & Cultures, Edinburgh EH8 9LH, Scotland
Funding
Wellcome Trust (UK); Engineering and Physical Sciences Research Council (EPSRC, UK)
Keywords
ICD coding; data augmentation; large language model; clinical text generation; evaluation by clinicians;
DOI
10.1093/jamia/ocae132
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Objectives: The aim of this study was to investigate GPT-3.5 in generating and coding medical documents with International Classification of Diseases (ICD)-10 codes for data augmentation on low-resource labels.

Materials and Methods: Using GPT-3.5, we generated and coded 9606 discharge summaries based on lists of ICD-10 code descriptions of patients with infrequent codes (referred to as generation codes) within the MIMIC-IV dataset. Combined with the baseline training set, this formed an augmented training set. Neural coding models were trained on baseline and augmented data and evaluated on a MIMIC-IV test set. We report micro- and macro-F1 scores on the full codeset, generation codes, and their families. Weak Hierarchical Confusion Matrices determined within-family and outside-of-family coding errors in the latter codesets. The coding performance of GPT-3.5 was evaluated on prompt-guided self-generated data and real MIMIC-IV data. Clinicians evaluated the clinical acceptability of the generated documents.

Results: Data augmentation results in slightly lower overall model performance but improves performance for the generation candidate codes and their families, including one code absent from the baseline training data. Augmented models display lower out-of-family error rates. GPT-3.5 identifies ICD-10 codes by their prompted descriptions but underperforms on real data. Evaluators highlight the correctness of generated concepts but note that the documents suffer in variety, supporting information, and narrative.

Discussion and Conclusion: While GPT-3.5 alone, given our prompt setting, is unsuitable for ICD-10 coding, it supports data augmentation for training neural models. Augmentation positively affects generation code families but mainly benefits codes with existing examples. Augmentation reduces out-of-family errors. Documents generated by GPT-3.5 state prompted concepts correctly but lack variety and authenticity in narratives.
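Note on the metrics: the abstract reports both micro- and macro-F1, and the difference matters for infrequent codes. The following is a minimal sketch, not the authors' evaluation code, of how the two averages can be computed for multi-label coding. The function names, the toy documents, and the example codes are illustrative assumptions, and the macro average here is taken over every code seen in the gold or predicted sets, which may differ from the codeset definition used in the paper.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """gold, pred: lists of sets of codes, one set per document (hypothetical input format)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for code in g & p:
            tp[code] += 1  # code correctly assigned
        for code in p - g:
            fp[code] += 1  # code assigned but not in gold
        for code in g - p:
            fn[code] += 1  # gold code that was missed
    labels = set(tp) | set(fp) | set(fn)
    # Micro: pool counts over all codes, so frequent codes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: average per-code F1, so each code counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels) if labels else 0.0
    return micro, macro


# Toy data (illustrative only): one frequent code, one rare code that is missed.
gold = [{"I10", "E11.9"}, {"I10"}, {"I10", "Q87.1"}]
pred = [{"I10", "E11.9"}, {"I10"}, {"I10"}]
micro, macro = micro_macro_f1(gold, pred)
print(f"micro-F1={micro:.3f} macro-F1={macro:.3f}")
```

In this toy case, missing the single rare code lowers macro-F1 much more than micro-F1, which is why both scores are informative when assessing augmentation aimed at low-resource labels.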
Pages: 2284-2293
Number of pages: 10