Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models

Cited: 0
Authors
Chen, Yuyan [1 ]
Wu, Chenwei [2 ]
Yan, Songzhou [1 ]
Liu, Panjun [3 ]
Zhou, Haoyu
Xiao, Yanghua [1 ]
Affiliations
[1] Fudan Univ, Shanghai Key Lab Data Sci, Sch Comp Sci, Shanghai, Peoples R China
[2] Univ Michigan, Elect Engn & Comp Sci Dept, Ann Arbor, MI 48109 USA
[3] Beijing Inst Technol, Sch Comp Sci, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Teachers play an important role in imparting knowledge and guiding learners, and the role of large language models (LLMs) as potential educators is emerging as an important area of study. Recognizing LLMs' capability to generate educational content can lead to advances in automated and personalized learning. While LLMs have been tested for their comprehension and problem-solving skills, their capability in teaching remains largely unexplored. In teaching, questioning is a key skill that guides students to analyze, evaluate, and synthesize core concepts and principles. Our research therefore introduces a benchmark that evaluates the questioning capability of LLMs as educators by assessing the educational questions they generate, using Anderson and Krathwohl's taxonomy across general, monodisciplinary, and interdisciplinary domains. We shift the focus from LLMs as learners to LLMs as educators, assessing their teaching capability by guiding them to generate questions. We apply four metrics, namely relevance, coverage, representativeness, and consistency, to evaluate the educational quality of LLMs' outputs. Our results indicate that GPT-4 demonstrates significant potential in teaching general, humanities, and science courses, while Claude2 appears more apt as an interdisciplinary teacher. Furthermore, the automatic scores align with human judgments.
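To make the evaluation setup more concrete, the minimal Python sketch below illustrates one way such a pipeline could look: prompting a language model to generate one question per level of Anderson and Krathwohl's taxonomy and attaching a toy relevance proxy based on keyword overlap. The llm callable, the prompt templates, and the keyword-overlap heuristic are illustrative assumptions, not the benchmark's released implementation or its actual metric definitions, which are described in the paper itself.

```python
# Hypothetical sketch of a taxonomy-guided question-generation and scoring loop.
# Everything here (prompts, scoring heuristic, the `llm` callable) is illustrative.

from typing import Callable, Dict, List

# The six cognitive-process levels of Anderson and Krathwohl's taxonomy.
TAXONOMY_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]


def build_prompts(concept: str, context: str) -> Dict[str, str]:
    """One question-generation prompt per taxonomy level for a given concept."""
    return {
        level: (
            f"You are a teacher preparing a lesson on '{concept}'.\n"
            f"Context: {context}\n"
            f"Write one question that asks students to {level} this concept."
        )
        for level in TAXONOMY_LEVELS
    }


def keyword_relevance(question: str, context: str) -> float:
    """Toy relevance proxy: fraction of context keywords that appear in the question.
    (Illustrative only; not the benchmark's actual relevance metric.)"""
    keywords = {w.lower().strip(".,?") for w in context.split() if len(w) > 4}
    hits = sum(1 for w in keywords if w in question.lower())
    return hits / max(len(keywords), 1)


def evaluate_teacher(llm: Callable[[str], str], concept: str, context: str) -> List[dict]:
    """Generate one question per taxonomy level and attach the toy relevance score."""
    results = []
    for level, prompt in build_prompts(concept, context).items():
        question = llm(prompt)
        results.append({
            "level": level,
            "question": question,
            "relevance": round(keyword_relevance(question, context), 3),
        })
    return results


if __name__ == "__main__":
    # Stand-in model so the sketch runs without any API access.
    mock_llm = lambda prompt: "How would you explain photosynthesis to a classmate?"
    rows = evaluate_teacher(
        mock_llm,
        "photosynthesis",
        "Photosynthesis converts sunlight into chemical energy in plants.",
    )
    for row in rows:
        print(row)
```

In practice the mock_llm stand-in would be replaced by a call to the model under evaluation, and the per-level questions would be scored along all four dimensions (relevance, coverage, representativeness, consistency) rather than the single overlap heuristic shown here.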
Pages: 3138-3167
Number of pages: 30