Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models

Cited by: 0
Authors
Chen, Yuyan [1 ]
Wu, Chenwei [2 ]
Yan, Songzhou [1 ]
Liu, Panjun [3 ]
Zhou, Haoyu
Xiao, Yanghua [1 ]
Affiliations
[1] Fudan Univ, Shanghai Key Lab Data Sci, Sch Comp Sci, Shanghai, Peoples R China
[2] Univ Michigan, Elect Engn & Comp Sci Dept, Ann Arbor, MI 48109 USA
[3] Beijing Inst Technol, Sch Comp Sci, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Teachers are essential for imparting knowledge and guiding learners, and the role of large language models (LLMs) as potential educators is emerging as an important area of study. Recognizing LLMs' capability to generate educational content can lead to advances in automated and personalized learning. While LLMs have been tested for their comprehension and problem-solving skills, their capability in teaching remains largely unexplored. In teaching, questioning is a key skill that guides students to analyze, evaluate, and synthesize core concepts and principles. Therefore, our research introduces a benchmark to evaluate LLMs' questioning capability as educators by assessing the educational questions they generate, drawing on Anderson and Krathwohl's taxonomy across general, monodisciplinary, and interdisciplinary domains. We shift the focus from LLMs as learners to LLMs as educators, assessing their teaching capability by guiding them to generate questions. We apply four metrics, namely relevance, coverage, representativeness, and consistency, to evaluate the educational quality of LLMs' outputs. Our results indicate that GPT-4 demonstrates significant potential in teaching general, humanities, and science courses, while Claude2 appears more apt as an interdisciplinary teacher. Furthermore, the automatic scores align with human judgments.
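The record does not define how the four metrics are computed. The Python sketch below is a toy illustration of one plausible scoring pipeline over a set of generated questions, using simple token-overlap and label-distribution proxies; every function, threshold, and variable name here is a hypothetical assumption, not the paper's actual implementation.

# Illustrative sketch only: simplified stand-ins for the four metrics
# named in the abstract. The paper's real definitions are not given in
# this record; each function below is a hypothetical proxy.
from collections import Counter

# Anderson and Krathwohl's revised taxonomy levels (named in the abstract).
TAXONOMY = ["remember", "understand", "apply", "analyze", "evaluate", "create"]

def _tokens(text):
    # Lowercase, strip basic punctuation, split on whitespace.
    return set(text.lower().replace("?", " ").replace(".", " ").split())

def relevance(question, source):
    # Proxy: fraction of question tokens that also occur in the source text.
    q = _tokens(question)
    return len(q & _tokens(source)) / len(q) if q else 0.0

def coverage(levels):
    # Proxy: fraction of the six taxonomy levels the question set touches.
    return len(set(levels) & set(TAXONOMY)) / len(TAXONOMY)

def representativeness(levels):
    # Proxy: 1 minus the share of the most frequent level, penalizing skew.
    if not levels:
        return 0.0
    return 1.0 - Counter(levels).most_common(1)[0][1] / len(levels)

def consistency(judged, intended):
    # Proxy: agreement rate between intended and judged taxonomy levels.
    pairs = list(zip(judged, intended))
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else 0.0

# Toy usage with two generated questions about a short source passage.
source = "Photosynthesis converts light energy into chemical energy in plants."
questions = [
    ("What does photosynthesis convert light energy into?", "remember"),
    ("Evaluate why photosynthesis matters for an ecosystem.", "evaluate"),
]
judged = ["remember", "evaluate"]  # e.g., labels assigned by a judge model

print(sum(relevance(q, source) for q, _ in questions) / len(questions))
print(coverage(judged))
print(representativeness(judged))
print(consistency(judged, [lvl for _, lvl in questions]))

In the paper itself, scores of this kind are reported to align with human evaluation; the proxies above merely show the shape of such a pipeline.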
Pages: 3138-3167
Page count: 30
Related papers
50 records in total
  • [21] Evaluating the Performance of Different Large Language Models on Health Consultation and Patient Education in Urolithiasis
    Song, Haifeng
    Xia, Yi
    Song, Yan
    Li, Jianxing
    Zhang, Guangyuan
    Xiao, Bo
    JOURNAL OF UROLOGY, 2024, 211 (05): E391-E392
  • [22] Evaluating the Performance of Different Large Language Models on Health Consultation and Patient Education in Urolithiasis
    Song, Haifeng
    Xia, Yi
    Luo, Zhichao
    Liu, Hui
    Song, Yan
    Zeng, Xue
    Li, Tianjie
    Zhong, Guangxin
    Li, Jianxing
    Chen, Ming
    Zhang, Guangyuan
    Xiao, Bo
    JOURNAL OF MEDICAL SYSTEMS, 2023, 47 (01)
  • [24] HI-TOM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models
    He, Yinghui
    Wu, Yufan
    Jia, Yilin
    Mihalcea, Rada
    Chen, Yulong
    Deng, Naihao
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023: 10691-10706
  • [25] CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models
    Yu, Linhao
    Leng, Yongqi
    Huang, Yufei
    Wu, Shang
    Liu, Haixin
    Ji, Xinmeng
    Zhao, Jiahui
    Song, Jinwang
    Cui, Tingting
    Cheng, Xiaoqing
    Liu, Tao
    Xiong, Deyi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024: 11817-11837
  • [26] ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models
    Li, Chunyuan
    Liu, Haotian
    Li, Liunian Harold
    Zhang, Pengchuan
    Aneja, Jyoti
    Yang, Jianwei
    Jin, Ping
    Hu, Houdong
    Liu, Zicheng
    Lee, Yong Jae
    Gao, Jianfeng
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [27] Discursive Socratic Questioning: Evaluating the Faithfulness of Language Models' Understanding of Discourse Relations
    Miao, Yisong
    Liu, Hongfu
    Lei, Wenqiang
    Chen, Nancy F.
    Kan, Min-Yen
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024: 6277-6295
  • [28] The Use of Large Language Models in Education
    Xing, Wanli
    Nixon, Nia
    Crossley, Scott
    Denny, Paul
    Lan, Andrew
    Stamper, John
    Yu, Zhou
    INTERNATIONAL JOURNAL OF ARTIFICIAL INTELLIGENCE IN EDUCATION, 2025
  • [29] SafeLLMs: A Benchmark for Secure Bilingual Evaluation of Large Language Models
    Liang, Wenhan
    Wu, Huijia
    Gao, Jun
    Shang, Yuhu
    He, Zhaofeng
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT II, NLPCC 2024, 2025, 15360: 437-448
  • [30] Evaluating large language models for annotating proteins
    Vitale, Rosario
    Bugnon, Leandro A.
    Fenoy, Emilio Luis
    Milone, Diego H.
    Stegmayer, Georgina
    BRIEFINGS IN BIOINFORMATICS, 2024, 25 (03)