50 entries in total
- [21] MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models. BIG DATA MINING AND ANALYTICS, 2024, 7(04): 1116-1128
- [24] Harnessing Large Language Models for Software Vulnerability Detection: A Comprehensive Benchmarking Study. IEEE ACCESS, 2025, 13: 29698-29717
- [25] FELM: Benchmarking Factuality Evaluation of Large Language Models. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023
- [26] Benchmarking Biomedical Relation Knowledge in Large Language Models. BIOINFORMATICS RESEARCH AND APPLICATIONS, PT II, ISBRA 2024, 2024, 14955: 482-495
- [27] Benchmarking Cognitive Biases in Large Language Models as Evaluators. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024: 517-545
- [28] TOMBENCH: Benchmarking Theory of Mind in Large Language Models. PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024: 15959-15983