共 50 条
- [1] SafeLLMs: A Benchmark for Secure Bilingual Evaluation of Large Language Models NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT II, NLPCC 2024, 2025, 15360 : 437 - 448
- [2] MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 16, 2024, : 17709 - 17717
- [3] HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models 2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 6449 - 6464
- [4] Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation PROCEEDINGS OF THE VLDB ENDOWMENT, 2024, 17 (05): : 1132 - 1145
- [5] MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models COMPUTER VISION - ECCV 2024, PT LVI, 2025, 15114 : 386 - 403
- [8] ANALOGICAL - A Novel Benchmark for Long Text Analogy Evaluation in Large Language Models FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 3534 - 3549
- [10] CLiMP: A Benchmark for Chinese Language Model Evaluation 16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021, : 2784 - 2790