Can large language models replace humans in systematic reviews? Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages

Cited by: 11
Authors
Khraisha, Qusai [1 ,2 ]
Put, Sophie [3 ]
Kappenberg, Johanna [2 ]
Warraitch, Azza [1 ,2 ]
Hadfield, Kristin [1 ,2 ]
Affiliations
[1] Trinity Coll Dublin, Trinity Ctr Global Hlth, Dublin, Ireland
[2] Trinity Coll Dublin, Sch Psychol, Dublin, Ireland
[3] Univ York, Dept Educ, York, England
Keywords
artificial intelligence (AI); GPT; large language models (LLMs); machine learning; natural language processing (NLP); systematic reviews;
DOI
10.1002/jrsm.1715
Chinese Library Classification (CLC)
Q [Biological Sciences];
Subject Classification Codes
07; 0710; 09
Abstract
Systematic reviews are vital for guiding practice, research and policy, although they are often slow and labour-intensive. Large language models (LLMs) could speed up and automate systematic reviews, but their performance in such tasks has yet to be comprehensively evaluated against humans, and no study has tested Generative Pre-Trained Transformer (GPT)-4, the biggest LLM so far. This pre-registered study uses a "human-out-of-the-loop" approach to evaluate GPT-4's capability in title/abstract screening, full-text review and data extraction across various literature types and languages. Although GPT-4 had accuracy on par with human performance in some tasks, results were skewed by chance agreement and dataset imbalance. Adjusting for these caused performance scores to drop across all stages: for data extraction, performance was moderate, and for screening, it ranged from none in highly balanced literature datasets (~1:1) to moderate in datasets where the ratio of included to excluded studies was imbalanced (~1:3). When screening full-text literature using highly reliable prompts, GPT-4's performance was more robust, reaching "human-like" levels. Although our findings indicate that substantial caution should currently be exercised if LLMs are being used to conduct systematic reviews, they also offer preliminary evidence that, for certain review tasks delivered under specific conditions, LLMs can rival human performance.
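The abstract's central caveat, that raw accuracy overstates screening performance once chance agreement and class imbalance are accounted for, can be illustrated with Cohen's kappa, a standard chance-corrected agreement metric. The record does not name the paper's exact adjustment, so the metric choice and the toy labels below are assumptions for illustration only, not the authors' data or method. A minimal Python sketch:

```python
# Toy illustration (invented labels, not the paper's data) of why raw
# accuracy overstates screening performance on imbalanced datasets.

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance, binary labels."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n   # observed agreement
    # Chance agreement expected from each rater's include/exclude marginals.
    p1_true = sum(y_true) / n
    p1_pred = sum(y_pred) / n
    p_e = p1_true * p1_pred + (1 - p1_true) * (1 - p1_pred)
    return (p_o - p_e) / (1 - p_e)

# Gold screening decisions, 1 = include, 0 = exclude, imbalanced ~1:3.
gold = [1] * 25 + [0] * 75
# A screener that excludes almost everything still scores well on accuracy:
# it finds only 5 of the 25 includes but rides the 75% exclusion base rate.
pred = ([1] * 5 + [0] * 20) + ([0] * 73 + [1] * 2)

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(f"raw accuracy:  {accuracy:.2f}")                  # 0.78, looks human-like
print(f"Cohen's kappa: {cohen_kappa(gold, pred):.2f}")   # ~0.23, only fair agreement
```

The same mechanism explains the abstract's contrast between the ~1:1 and ~1:3 datasets: the more imbalanced the inclusion ratio, the more a majority-vote screener can score on accuracy alone, and the larger the drop once chance agreement is removed.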
Pages: 616-626
Page count: 11