MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models

Cited by: 0
Authors
Cai, Yan [1 ]
Wang, Linlin [1 ,2 ]
Wang, Ye [1 ]
de Melo, Gerard [3 ,4 ]
Zhang, Ya [2 ,5 ]
Wang, Yanfeng [2 ,5 ]
He, Liang [1 ]
Affiliations
[1] East China Normal Univ, Shanghai, Peoples R China
[2] Shanghai Artificial Intelligence Lab, Shanghai, Peoples R China
[3] Hasso Plattner Inst, Potsdam, Germany
[4] Univ Potsdam, Potsdam, Germany
[5] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
CLC Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
The emergence of various medical large language models (LLMs) has highlighted the need for unified evaluation standards, as manual evaluation of LLMs proves to be time-consuming and labor-intensive. To address this issue, we introduce MedBench, a comprehensive benchmark for the Chinese medical domain, comprising 40,041 questions sourced from authentic examination exercises and medical reports of diverse branches of medicine. In particular, this benchmark is composed of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, the Doctor In-Charge Qualification Examination, and real-world clinic cases encompassing examinations, diagnoses, and treatments. MedBench replicates the educational progression and clinical practice experiences of doctors in Mainland China, thereby establishing itself as a credible benchmark for assessing the mastery of knowledge and reasoning abilities of medical LLMs. We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in the following findings: (1) Chinese medical LLMs underperform on this benchmark, highlighting the need for significant advances in clinical knowledge and diagnostic precision. (2) Several general-domain LLMs surprisingly possess considerable medical knowledge. These findings elucidate both the capabilities and limitations of LLMs within the context of MedBench, with the ultimate goal of aiding the medical research community.
Pages: 17709-17717
Page count: 9