50 entries in total
- [2] DebugBench: Evaluating Debugging Capability of Large Language Models. Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2024: 4173-4198.
- [3] Establishing vocabulary tests as a benchmark for evaluating large language models. PLOS ONE, 2024, 19(12).
- [4] MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models. Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024, 38(16): 17709-17717.
- [5] FEEL: A Framework for Evaluating Emotional Support Capability with Large Language Models. Advanced Intelligent Computing Technology and Applications (ICIC 2024), Part XIII, 2024, 14874: 96-107.
- [8] JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models. Proceedings of the 39th ACM/IEEE International Conference on Automated Software Engineering (ASE 2024), 2024: 870-882.
- [9] PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change. Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023.
- [10] Evaluating the Application of Large Language Models to Generate Feedback in Programming Education. 2024 IEEE Global Engineering Education Conference (EDUCON 2024), 2024.