Putting ChatGPT's Medical Advice to the (Turing) Test: Survey Study

Cited by: 61
Authors
Nov, Oded [1 ,4 ]
Singh, Nina [2 ]
Mann, Devin [2 ,3 ]
Affiliations
[1] NYU, Tandon Sch Engn, Dept Technol Management, New York, NY USA
[2] NYU, Grossman Sch Med, Dept Populat Hlth, New York, NY USA
[3] NYU, Med Ctr Informat Technol, Langone Hlth, New York, NY USA
[4] NYU, Tandon Sch Engn, Dept Technol Management, 5 Metrotech, New York, NY 11201 USA
Source
JMIR MEDICAL EDUCATION | 2023, Vol. 9
Funding
US National Science Foundation
Keywords
artificial intelligence; AI; ChatGPT; Chat Generative Pre-trained Transformer; large language model; patient-provider interaction; chatbot; feasibility; ethics; privacy; language model; machine learning; PATIENT; IMPACT;
DOI
10.2196/46939
Chinese Library Classification (CLC)
G40 [Education];
Subject classification codes
040101; 120403;
Abstract
Background: Chatbots are being piloted to draft responses to patient questions, but patients' ability to distinguish between provider and chatbot responses, and patients' trust in chatbots' functions, are not well established.

Objective: This study aimed to assess the feasibility of using ChatGPT (Chat Generative Pre-trained Transformer) or a similar artificial intelligence-based chatbot for patient-provider communication.

Methods: A survey study was conducted in January 2023. Ten representative, nonadministrative patient-provider interactions were extracted from the electronic health record. Patients' questions were entered into ChatGPT with a request for the chatbot to respond using approximately the same word count as the human provider's response. In the survey, each patient question was followed by a provider- or ChatGPT-generated response. Participants were informed that 5 responses were provider generated and 5 were chatbot generated. Participants were asked, and financially incentivized, to correctly identify the response source. Participants were also asked about their trust in chatbots' functions in patient-provider communication, using a Likert scale from 1 to 5.

Results: A US-representative sample of 430 participants aged 18 years and older was recruited on Prolific, a crowdsourcing platform for academic studies. In all, 426 participants completed the full survey. After removing participants who spent less than 3 minutes on the survey, 392 respondents remained. Overall, 53.3% (209/392) of the respondents analyzed were women, and the average age was 47.1 (range 18-91) years. Correct classification of responses ranged from 49% (192/392) to 85.7% (336/392) across questions. On average, chatbot responses were identified correctly in 65.5% (1284/1960) of cases, and human provider responses were identified correctly in 65.1% (1276/1960) of cases. On average, patients' trust in chatbots' functions was weakly positive (mean Likert score 3.4 out of 5), with lower trust as the health-related complexity of the task increased.

Conclusions: ChatGPT responses to patient questions were only weakly distinguishable from provider responses. Laypeople appear to trust the use of chatbots to answer lower-risk health questions. It is important to continue studying patient-chatbot interaction as chatbots move from administrative to more clinical roles in health care.
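The denominators in the Results follow from the study design: each of the 392 analyzed respondents classified 10 responses (5 chatbot generated, 5 provider generated), giving 392 × 5 = 1960 classifications per source type. The short Python sketch below reproduces that arithmetic and the reported percentages; it is illustrative only, and the variable names are ours, not the authors'.

```python
# Sketch reproducing the arithmetic behind the reported accuracy figures.
# Assumption: each of the 392 analyzed respondents classified all 10
# responses (5 chatbot generated, 5 provider generated).

n_respondents = 392                            # respondents after quality filtering
per_source = 5                                 # responses of each type per respondent
per_source_total = n_respondents * per_source  # 392 * 5 = 1960 classifications per type

chatbot_correct = 1284    # correct identifications of chatbot responses
provider_correct = 1276   # correct identifications of provider responses

print(f"chatbot accuracy:  {chatbot_correct / per_source_total:.1%}")   # 65.5%
print(f"provider accuracy: {provider_correct / per_source_total:.1%}")  # 65.1%

# Per-question accuracy range (each question was seen by all 392 respondents)
print(f"hardest question:  {192 / n_respondents:.1%}")  # 49.0%
print(f"easiest question:  {336 / n_respondents:.1%}")  # 85.7%
```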
Pages: 7