50 items in total
- [1] MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, Vol 38, No 16, 2024: 17709-17717.
- [2] PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023.
- [8] Evaluating Intelligence and Knowledge in Large Language Models [J]. TOPOI-AN INTERNATIONAL REVIEW OF PHILOSOPHY, 2024.
- [9] ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code [J]. arXiv, 2024.
- [10] VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, Vol 162, 2022.