Benchmarking open-source large language models on Portuguese Revalida multiple-choice questions

Cited: 0
Authors
Severino, Joao Victor Bruneti [1 ,2 ]
de Paula, Pedro Angelo Basei
Berger, Matheus Nespolo [1 ]
Loures, Filipe Silveira [3 ]
Todeschini, Solano Amadori [3 ]
Roeder, Eduardo Augusto [1 ,3 ]
Veiga, Maria Han [4 ]
Guedes, Murilo [2 ]
Marques, Gustavo Lenci [1 ,2 ,3 ]
Affiliations
[1] Univ Fed Parana, Curitiba, Brazil
[2] Pontificia Univ Catolica Parana, Curitiba, Brazil
[3] Voa Hlth, Belo Horizonte, Brazil
[4] Ohio State Univ, Math, Columbus, OH USA
Keywords
Artificial Intelligence; Health Equity; Machine Learning; Medical Informatics Applications; Universal Health Care
DOI
10.1136/bmjhci-2024-101195
Chinese Library Classification
R19 [Health care organization and services (health service administration)]
Abstract
Objective The study aimed to evaluate the top large language models (LLMs) on validated medical knowledge tests in Portuguese. Methods This study compared 31 LLMs in the context of solving the national Brazilian medical examination (Revalida), comparing the performance of 23 open-source and 8 proprietary models across 399 multiple-choice questions. Results Among the smaller models, Llama 3 8B exhibited the highest success rate at 53.9%, while the medium-sized model Mixtral 8x7B attained 63.7%. Larger models such as Llama 3 70B reached 77.5%. Among the proprietary models, GPT-4o and Claude Opus demonstrated the highest accuracy, scoring 86.8% and 83.8%, respectively. Conclusions 10 of the 31 LLMs attained better-than-human performance on the Revalida benchmark, while 9 failed to provide coherent answers to the task. Larger models exhibited superior performance overall; however, certain medium-sized LLMs surpassed some of the larger LLMs.
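The evaluation the abstract describes reduces to prompting each model with a multiple-choice question and scoring the extracted option letter against the answer key. A minimal sketch of such a harness follows; it is not the authors' published code, and the prompt template, the A-D answer format, the regex-based letter extraction, and the ask_model callable are all illustrative assumptions.

import re
from typing import Callable

# Hypothetical model interface: any callable mapping a prompt string to the
# model's raw text completion (e.g., a local Llama 3 8B wrapper or an API
# client). The paper does not publish its harness; this is only a sketch.
AskModel = Callable[[str], str]

# Assumed Portuguese prompt asking for only the correct option letter.
PROMPT_TEMPLATE = (
    "Responda a questão de múltipla escolha com apenas a letra da "
    "alternativa correta (A, B, C ou D).\n\n{question}\n\nResposta:"
)

def extract_choice(completion: str) -> str | None:
    """Pull the first standalone option letter out of a free-text answer."""
    match = re.search(r"\b([A-D])\b", completion.upper())
    return match.group(1) if match else None

def benchmark(ask_model: AskModel, questions: list[dict]) -> float:
    """Return accuracy over items shaped like {"question": str, "answer": "A"}."""
    correct = 0
    incoherent = 0  # completions with no recognizable option letter
    for item in questions:
        completion = ask_model(PROMPT_TEMPLATE.format(question=item["question"]))
        choice = extract_choice(completion)
        if choice is None:
            incoherent += 1
        elif choice == item["answer"]:
            correct += 1
    print(f"{incoherent} incoherent answers out of {len(questions)}")
    return correct / len(questions)

Counting unparseable completions separately from wrong answers is one simple way to flag models that, like the 9 reported in the abstract, fail to produce coherent answers rather than merely incorrect ones.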
Pages: 4