The Accuracy and Potential Racial and Ethnic Biases of GPT-4 in the Diagnosis and Triage of Health Conditions: Evaluation Study

Cited by: 21
Authors
Ito, Naoki [1, 2]
Kadomatsu, Sakina [1, 3]
Fujisawa, Mineto [1, 2]
Fukaguchi, Kiyomitsu [1, 4]
Ishizawa, Ryo [1, 5]
Kanda, Naoki [1, 6]
Kasugai, Daisuke [1, 7]
Nakajima, Mikio [1, 8]
Goto, Tadahiro [1]
Tsugawa, Yusuke [9, 10]
Affiliations
[1] TXP Med Co Ltd, 41-1 HO Kanda 706, Tokyo 1010042, Japan
[2] Univ Tokyo, Fac Med, Tokyo, Japan
[3] Int Univ Hlth & Welf, Fac Med, Chiba, Japan
[4] Shonan Kamakura Gen Hosp, Dept Emergency Med, Kanagawa, Japan
[5] Tokyo Med Ctr Natl Hosp Org, Dept Emergency & Crit Care Med, Tokyo, Japan
[6] Jichi Med Univ Hosp, Div Gen Internal Med, Tochigi, Japan
[7] Nagoya Univ, Grad Sch Med, Dept Emergency & Crit Care Med, Aichi, Japan
[8] Tokyo Fdn Ambulance Serv Dev, Emergency Life Saving Tech Acad, Tokyo, Japan
[9] Univ Calif Los Angeles, David Geffen Sch Med, Div Gen Internal Med & Hlth Serv Res, Los Angeles, CA USA
[10] UCLA, Fielding Sch Publ Hlth, Dept Hlth Policy & Management, Los Angeles, CA USA
Source
JMIR MEDICAL EDUCATION | 2023 / Volume 9
Funding
US National Institutes of Health;
Keywords
GPT-4; racial and ethnic bias; typical clinical vignettes; diagnosis; triage; artificial intelligence; AI; race; clinical vignettes; physician; efficiency; decision-making; bias; GPT;
DOI
10.2196/47532
Chinese Library Classification number
G40 [Education];
Discipline classification codes
040101 ; 120403 ;
Abstract
Background: Whether GPT-4, the conversational artificial intelligence, can accurately diagnose and triage health conditions and whether it presents racial and ethnic biases in its decisions remain unclear.
Objective: We aim to assess the accuracy of GPT-4 in the diagnosis and triage of health conditions and whether its performance varies by patient race and ethnicity.
Methods: We compared the performance of GPT-4 and physicians, using 45 typical clinical vignettes, each with a correct diagnosis and triage level, in February and March 2023. For each of the 45 clinical vignettes, GPT-4 and 3 board-certified physicians provided the most likely primary diagnosis and triage level (emergency, nonemergency, or self-care). Independent reviewers evaluated the diagnoses as "correct" or "incorrect." Physician diagnosis was defined as the consensus of the 3 physicians. We evaluated whether the performance of GPT-4 varies by patient race and ethnicity, by adding the information on patient race and ethnicity to the clinical vignettes.
Results: The accuracy of diagnosis was comparable between GPT-4 and physicians (the percentage of correct diagnosis was 97.8% (44/45; 95% CI 88.2%-99.9%) for GPT-4 and 91.1% (41/45; 95% CI 78.8%-97.5%) for physicians; P=.38). GPT-4 provided appropriate reasoning for 97.8% (44/45) of the vignettes. The appropriateness of triage was comparable between GPT-4 and physicians (GPT-4: 30/45, 66.7%; 95% CI 51.0%-80.0%; physicians: 30/45, 66.7%; 95% CI 51.0%-80.0%; P=.99). The performance of GPT-4 in diagnosing health conditions did not vary among different races and ethnicities (Black, White, Asian, and Hispanic), with an accuracy of 100% (95% CI 78.2%-100%). P values, compared to the GPT-4 output without incorporating race and ethnicity information, were all .99. The accuracy of triage was not significantly different even if patients' race and ethnicity information was added. The accuracy of triage was 62.2% (95% CI 46.5%-76.2%; P=.50) for Black patients; 66.7% (95% CI 51.0%-80.0%; P=.99) for White patients; 66.7% (95% CI 51.0%-80.0%; P=.99) for Asian patients, and 62.2% (95% CI 46.5%-76.2%; P=.69) for Hispanic patients. P values were calculated by comparing the outputs with and without conditioning on race and ethnicity.
Conclusions: GPT-4's ability to diagnose and triage typical clinical vignettes was comparable to that of board-certified physicians. The performance of GPT-4 did not vary by patient race and ethnicity. These findings should be informative for health systems looking to introduce conversational artificial intelligence to improve the efficiency of patient diagnosis and triage.
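The abstract does not say how its 95% confidence intervals were computed. As a rough illustration only, the sketch below assumes exact binomial (Clopper-Pearson) intervals and applies them to the counts reported above; the function names `binom_cdf` and `clopper_pearson` are illustrative and not taken from the paper. Under that assumption, 44/45 gives roughly 88.2% to 99.9%, in line with the interval reported for GPT-4's diagnostic accuracy.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials.

    Assumption: the paper's CI method is not stated in the abstract;
    Clopper-Pearson is used here purely as an illustration.
    """

    def solve(g):
        # Bisection for the root of g on [0, 1], where g is increasing in p.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if g(mid) < 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(X >= k | p) = alpha / 2 (defined as 0 when k == 0).
    lower = 0.0 if k == 0 else solve(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2
    )
    # Upper bound: P(X <= k | p) = alpha / 2 (defined as 1 when k == n).
    upper = 1.0 if k == n else solve(
        lambda p: alpha / 2 - binom_cdf(k, n, p)
    )
    return lower, upper


# Counts as reported in the abstract.
reported = [
    ("GPT-4 diagnosis", 44, 45),
    ("Physician diagnosis", 41, 45),
    ("Triage (GPT-4 and physicians)", 30, 45),
]
for label, k, n in reported:
    lo, hi = clopper_pearson(k, n)
    print(f"{label}: {k}/{n} = {k / n:.1%} (95% CI {lo:.1%}-{hi:.1%})")
```

Bisection on the binomial CDF keeps the sketch within the standard library; a statistics package would normally be used instead.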
Pages: 10