The Accuracy and Potential Racial and Ethnic Biases of GPT-4 in the Diagnosis and Triage of Health Conditions: Evaluation Study

Cited by: 21
Authors
Ito, Naoki [1, 2]
Kadomatsu, Sakina [1, 3]
Fujisawa, Mineto [1, 2]
Fukaguchi, Kiyomitsu [1, 4]
Ishizawa, Ryo [1, 5]
Kanda, Naoki [1, 6]
Kasugai, Daisuke [1, 7]
Nakajima, Mikio [1, 8]
Goto, Tadahiro [1]
Tsugawa, Yusuke [9, 10]
Affiliations
[1] TXP Med Co Ltd, 41-1 HO Kanda 706, Tokyo 1010042, Japan
[2] Univ Tokyo, Fac Med, Tokyo, Japan
[3] Int Univ Hlth & Welf, Fac Med, Chiba, Japan
[4] Shonan Kamakura Gen Hosp, Dept Emergency Med, Kanagawa, Japan
[5] Tokyo Med Ctr Natl Hosp Org, Dept Emergency & Crit Care Med, Tokyo, Japan
[6] Jichi Med Univ Hosp, Div Gen Internal Med, Tochigi, Japan
[7] Nagoya Univ, Grad Sch Med, Dept Emergency & Crit Care Med, Aichi, Japan
[8] Tokyo Fdn Ambulance Serv Dev, Emergency Life Saving Tech Acad, Tokyo, Japan
[9] Univ Calif Los Angeles, David Geffen Sch Med, Div Gen Internal Med & Hlth Serv Res, Los Angeles, CA USA
[10] UCLA, Fielding Sch Publ Hlth, Dept Hlth Policy & Management, Los Angeles, CA USA
Source
JMIR MEDICAL EDUCATION | 2023 / Vol. 9
Funding
National Institutes of Health (USA);
Keywords
GPT-4; racial and ethnic bias; typical clinical vignettes; diagnosis; triage; artificial intelligence; AI; race; clinical vignettes; physician; efficiency; decision-making; bias; GPT;
DOI
10.2196/47532
Chinese Library Classification number
G40 [Education];
Discipline classification code
040101 ; 120403 ;
Abstract
Background: Whether GPT-4, a conversational artificial intelligence, can accurately diagnose and triage health conditions and whether it presents racial and ethnic biases in its decisions remain unclear.
Objective: We aim to assess the accuracy of GPT-4 in the diagnosis and triage of health conditions and whether its performance varies by patient race and ethnicity.
Methods: We compared the performance of GPT-4 and physicians, using 45 typical clinical vignettes, each with a correct diagnosis and triage level, in February and March 2023. For each of the 45 clinical vignettes, GPT-4 and 3 board-certified physicians provided the most likely primary diagnosis and triage level (emergency, nonemergency, or self-care). Independent reviewers evaluated the diagnoses as "correct" or "incorrect." Physician diagnosis was defined as the consensus of the 3 physicians. We evaluated whether the performance of GPT-4 varies by patient race and ethnicity, by adding the information on patient race and ethnicity to the clinical vignettes.
Results: The accuracy of diagnosis was comparable between GPT-4 and physicians (the percentage of correct diagnosis was 97.8% (44/45; 95% CI 88.2%-99.9%) for GPT-4 and 91.1% (41/45; 95% CI 78.8%-97.5%) for physicians; P=.38). GPT-4 provided appropriate reasoning for 97.8% (44/45) of the vignettes. The appropriateness of triage was comparable between GPT-4 and physicians (GPT-4: 30/45, 66.7%; 95% CI 51.0%-80.0%; physicians: 30/45, 66.7%; 95% CI 51.0%-80.0%; P=.99). The performance of GPT-4 in diagnosing health conditions did not vary among different races and ethnicities (Black, White, Asian, and Hispanic), with an accuracy of 100% (95% CI 78.2%-100%). P values, compared to the GPT-4 output without incorporating race and ethnicity information, were all .99. The accuracy of triage was not significantly different even if patients' race and ethnicity information was added. The accuracy of triage was 62.2% (95% CI 46.5%-76.2%; P=.50) for Black patients; 66.7% (95% CI 51.0%-80.0%; P=.99) for White patients; 66.7% (95% CI 51.0%-80.0%; P=.99) for Asian patients; and 62.2% (95% CI 46.5%-76.2%; P=.69) for Hispanic patients. P values were calculated by comparing the outputs with and without conditioning on race and ethnicity.
Conclusions: GPT-4's ability to diagnose and triage typical clinical vignettes was comparable to that of board-certified physicians. The performance of GPT-4 did not vary by patient race and ethnicity. These findings should be informative for health systems looking to introduce conversational artificial intelligence to improve the efficiency of patient diagnosis and triage.
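The abstract reports counts, 95% CIs, and P values but does not name the interval method or the statistical test. The sketch below shows one way such figures could be computed, under two assumptions that are not stated in the abstract: an exact (Clopper-Pearson) binomial interval, and an exact McNemar test for the paired comparison on the same 45 vignettes. The function names are illustrative, and the discordant-pair counts passed to `exact_mcnemar` are hypothetical because the abstract does not report them.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve(f, target, increasing):
    """Bisection on [0, 1] for f(p) = target, with f monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided CI for a proportion x/n (assumed method, see lead-in)."""
    # Lower bound: p at which P(X >= x) equals alpha/2 (increasing in p).
    lower = 0.0 if x == 0 else _solve(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2, True)
    # Upper bound: p at which P(X <= x) equals alpha/2 (decreasing in p).
    upper = 1.0 if x == n else _solve(
        lambda p: binom_cdf(x, n, p), alpha / 2, False)
    return lower, upper


def exact_mcnemar(b, c):
    """Two-sided exact McNemar P value from discordant pair counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    return min(1.0, 2 * binom_cdf(min(b, c), n, 0.5))


if __name__ == "__main__":
    # 44 of 45 diagnoses correct, as reported for GPT-4.
    low, high = clopper_pearson(44, 45)
    print(f"44/45: {low:.1%} to {high:.1%}")
    # Hypothetical discordant counts; the abstract does not report them.
    print(f"exact McNemar P = {exact_mcnemar(4, 1):.3f}")
```

For 44/45 the exact interval comes out at roughly 88.2% to 99.9%, which agrees with the bounds reported above. That agreement is consistent with an exact interval but does not confirm which method the authors used.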
Pages: 10