50 entries in total
- [2] DebugBench: Evaluating Debugging Capability of Large Language Models. Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2024: 4173-4198.
- [3] Establishing vocabulary tests as a benchmark for evaluating large language models. PLOS ONE, 2024, 19(12).
- [4] MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models. Thirty-Eighth AAAI Conference on Artificial Intelligence, 2024, 38(16): 17709-17717.
- [5] FEEL: A Framework for Evaluating Emotional Support Capability with Large Language Models. Advanced Intelligent Computing Technology and Applications (ICIC 2024), Part XIII, 2024, 14874: 96-107.
- [8] JavaBench: A Benchmark of Object-Oriented Code Generation for Evaluating Large Language Models. Proceedings of the 39th ACM/IEEE International Conference on Automated Software Engineering (ASE 2024), 2024: 870-882.
- [9] PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change. Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023.
- [10] Evaluating the Application of Large Language Models to Generate Feedback in Programming Education. 2024 IEEE Global Engineering Education Conference (EDUCON 2024), 2024.