Assessing the Risk of Bias in Randomized Clinical Trials With Large Language Models

Cited by: 12
Authors
Lai, Honghao [1 ,2 ]
Ge, Long [1 ,2 ,3 ]
Sun, Mingyao [4 ]
Pan, Bei [5 ]
Huang, Jiajie [6 ]
Hou, Liangying [5 ,7 ]
Yang, Qiuyu [1 ,2 ]
Liu, Jiayi [1 ,2 ]
Liu, Jianing [6 ]
Ye, Ziying [1 ,2 ]
Xia, Danni [1 ,2 ]
Zhao, Weilong [1 ,2 ]
Wang, Xiaoman [5 ]
Liu, Ming [5 ,7 ]
Talukdar, Jhalok Ronjan [7 ]
Tian, Jinhui [3 ,5 ]
Yang, Kehu [3 ,5 ]
Estill, Janne [5 ,8 ]
Affiliations
[1] Lanzhou Univ, Sch Publ Hlth, Dept Hlth Policy & Management, Lanzhou, Peoples R China
[2] Lanzhou Univ, Evidence Based Social Sci Res Ctr, Sch Publ Hlth, 199 Donggang West Rd, Lanzhou 730000, Peoples R China
[3] Key Lab Evidence Based Med & Knowledge Translat Gansu Prov, Lanzhou, Peoples R China
[4] Lanzhou Univ, Evidence Based Nursing Ctr, Sch Nursing, Lanzhou, Peoples R China
[5] Lanzhou Univ, Sch Basic Med Sci, Evidence Based Med Ctr, Lanzhou, Peoples R China
[6] Gansu Univ Chinese Med, Coll Nursing, Lanzhou, Peoples R China
[7] McMaster Univ, Dept Hlth Res Methods Evidence & Impact, Hamilton, ON, Canada
[8] Univ Geneva, Inst Global Hlth, Geneva, Switzerland
Keywords
DOUBLE-BLIND; PRIMARY INSOMNIA; INTERRATER RELIABILITY; REBOUND INSOMNIA; WEIGHT-LOSS; LONG-TERM; RED MEAT; EFFICACY; SAFETY; DIET;
DOI
10.1001/jamanetworkopen.2024.12687
Chinese Library Classification (CLC)
R5 [Internal Medicine];
Subject Classification Codes
1002; 100201;
Abstract
Importance: Large language models (LLMs) may facilitate the labor-intensive process of systematic reviews, but the exact methods and their reliability remain uncertain.

Objective: To explore the feasibility and reliability of using LLMs to assess risk of bias (ROB) in randomized clinical trials (RCTs).

Design, Setting, and Participants: A survey study was conducted between August 10, 2023, and October 30, 2023. Thirty RCTs were selected from published systematic reviews.

Main Outcomes and Measures: A structured prompt was developed to guide ChatGPT (LLM 1) and Claude (LLM 2) in assessing the ROB in these RCTs using a modified version of the Cochrane ROB tool developed by the CLARITY group at McMaster University. Each RCT was assessed twice by both models, and the results were documented. The results were compared with an assessment by 3 experts, which was considered the criterion standard. Correct assessment rates, sensitivity, specificity, and F1 scores were calculated to reflect accuracy, both overall and for each domain of the Cochrane ROB tool; consistent assessment rates and Cohen kappa were calculated to gauge consistency; and assessment time was calculated to measure efficiency. Performance between the 2 models was compared using risk differences.

Results: Both models demonstrated high correct assessment rates. LLM 1 reached a mean correct assessment rate of 84.5% (95% CI, 81.5%-87.3%), and LLM 2 reached a significantly higher rate of 89.5% (95% CI, 87.0%-91.8%); the risk difference between the 2 models was 0.05 (95% CI, 0.01-0.09). In most domains, domain-specific correct rates were around 80% to 90%; however, sensitivity below 0.80 was observed in domains 1 (random sequence generation), 2 (allocation concealment), and 6 (other concerns), and domains 4 (missing outcome data), 5 (selective outcome reporting), and 6 had F1 scores below 0.50. The consistent assessment rates between the 2 rounds were 84.0% for LLM 1 and 87.3% for LLM 2. LLM 1's kappa exceeded 0.80 in 7 domains and LLM 2's in 8. The mean (SD) assessment time was 77 (16) seconds for LLM 1 and 53 (12) seconds for LLM 2.

Conclusions: In this survey study applying LLMs to ROB assessment, LLM 1 and LLM 2 demonstrated substantial accuracy and consistency in evaluating RCTs, suggesting their potential as supportive tools in systematic review processes.
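The accuracy and consistency measures named in the abstract are standard classification and agreement statistics. The sketch below is illustrative only, not the authors' analysis code: the binarized ROB judgments and all variable names are hypothetical. It shows how correct assessment rate, sensitivity, specificity, F1 score, and Cohen kappa could be computed for one ROB domain from paired model-vs-expert judgments.

```python
# Illustrative sketch (not from the study): agreement metrics for one
# ROB domain. Judgments are binarized as 1 = high/probably high risk,
# 0 = low/probably low risk; the sample labels below are made up.

def metrics(model, expert):
    """Accuracy, sensitivity, specificity, and F1 of model judgments
    against the expert criterion standard."""
    tp = sum(m == 1 and e == 1 for m, e in zip(model, expert))
    fp = sum(m == 1 and e == 0 for m, e in zip(model, expert))
    fn = sum(m == 0 and e == 1 for m, e in zip(model, expert))
    tn = sum(m == 0 and e == 0 for m, e in zip(model, expert))
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # "correct assessment rate"
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return accuracy, sensitivity, specificity, f1

def cohen_kappa(run_a, run_b):
    """Chance-corrected agreement between two assessment rounds."""
    n = len(run_a)
    p_obs = sum(a == b for a, b in zip(run_a, run_b)) / n
    p_a1, p_b1 = sum(run_a) / n, sum(run_b) / n
    p_exp = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)  # expected agreement
    return (p_obs - p_exp) / (1 - p_exp) if p_exp != 1 else 1.0

# Hypothetical judgments for 8 trials in a single domain.
expert   = [1, 0, 0, 1, 0, 1, 0, 0]
llm_run1 = [1, 0, 1, 1, 0, 0, 0, 0]
llm_run2 = [1, 0, 1, 1, 0, 1, 0, 0]

acc, sens, spec, f1 = metrics(llm_run1, expert)
print(f"accuracy={acc:.2f} sensitivity={sens:.2f} "
      f"specificity={spec:.2f} F1={f1:.2f}")
print(f"kappa(run1, run2)={cohen_kappa(llm_run1, llm_run2):.2f}")
```

In the study's design, such statistics would presumably be computed separately for each domain of the modified Cochrane ROB tool across the 30 RCTs, with kappa gauging agreement between a model's two assessment rounds.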
Pages: 12
Related Papers
50 records in total
  • [1] Assessing risk of bias in randomized controlled trials
    Purgato, Marianna
    Barbui, Corrado
    Cipriani, Andrea
    EPIDEMIOLOGIA E PSICHIATRIA SOCIALE-AN INTERNATIONAL JOURNAL FOR EPIDEMIOLOGY AND PSYCHIATRIC SCIENCES, 2010, 19 (04): : 296 - 297
  • [2] Assessing political bias in large language models
    Rettenberger, Luca
    Reischl, Markus
    Schutera, Mark
    JOURNAL OF COMPUTATIONAL SOCIAL SCIENCE, 2025, 8 (02)
  • [3] A revised tool for assessing risk of bias in randomized trials
    Higgins, Julian P. T.
    Sterne, Jonathan A. C.
    Savovic, Jelena
    Page, Matthew J.
    Hrobjartsson, Asbjorn
    Boutron, Isabelle
    Reeves, Barney
    Eldridge, Sandra
    COCHRANE DATABASE OF SYSTEMATIC REVIEWS, 2016, 10 : 29 - 31
  • [4] Assessment of risk of bias in randomized clinical trials in surgery
    Gurusamy, K. S.
    Gluud, C.
    Nikolova, D.
    Davidson, B. R.
    BRITISH JOURNAL OF SURGERY, 2009, 96 (04) : 342 - 349
  • [5] Assessing Risk of Bias in Randomized Controlled Trials for Autism Spectrum Disorder
    Martins Okuda, Paola Matiko
    Klaiman, Cheryl
    Bradshaw, Jessica
    Reid, Morganne
    Cogo-Moreira, Hugo
    FRONTIERS IN PSYCHIATRY, 2017, 8
  • [6] A revised tool for assessing risk of bias in randomized trials (RoB 2.0)
    Savovic, Jelena
    Page, Matthew
    Elbers, Roy
    Hrobjartsson, Asbjorn
    Boutron, Isabelle
    Reeves, Barney
    Sterne, Jonathan
    Higgins, Julian
    TRIALS, 2017, 18
  • [7] Instruments assessing risk of bias of randomized trials frequently included items that are not addressing risk of bias issues
    Wang, Ying
    Ghadimi, Maryam
    Wang, Qi
    Hou, Liangying
    Zeraatkar, Dena
    Iqbal, Atiya
    Ho, Cameron
    Yao, Liang
    Hu, Malini
    Ye, Zhikang
    Couban, Rachel
    Armijo-Olivo, Susan
    Bassler, Dirk
    Briel, Matthias
    Gluud, Lise Lotte
    Glasziou, Paul
    Jackson, Rod
    Keitz, Sheri A.
    Letelier, Luz M.
    Ravaud, Philippe
    Schulz, Kenneth F.
    Siemieniuk, Reed A. C.
    Brignardello-Petersen, Romina
    Guyatt, Gordon H.
    JOURNAL OF CLINICAL EPIDEMIOLOGY, 2022, 152 : 218 - 225
  • [8] Matching patients to clinical trials with large language models
    Jin, Qiao
    Wang, Zifeng
    Floudas, Charalampos S.
    Chen, Fangyuan
    Gong, Changlin
    Bracken-Clarke, Dara
    Xue, Elisabetta
    Yang, Yifan
    Sun, Jimeng
    Lu, Zhiyong
    NATURE COMMUNICATIONS, 2024, 15 (01)
  • [9] Conversational Complexity for Assessing Risk in Large Language Models
    Burden, John
    Cebrian, Manuel
    Hernandez-Orallo, Jose
    arXiv
  • [10] Zero- and few-shot prompting of generative large language models provides weak assessment of risk of bias in clinical trials
    Suster, Simon
    Baldwin, Timothy
    Verspoor, Karin
    RESEARCH SYNTHESIS METHODS, 2024, 15 (06) : 988 - 1000