Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models

Cited by: 0

Authors:

Chen, Yuyan [1]

Wu, Chenwei [2]

Yan, Songzhou [1]

Liu, Panjun [3]

Zhou, Haoyu

Xiao, Yanghua [1]

Affiliations:

[1] Fudan Univ, Shanghai Key Lab Data Sci, Sch Comp Sci, Shanghai, Peoples R China

[2] Univ Michigan, Elect Engn & Comp Sci Dept, Ann Arbor, MI 48109 USA

[3] Beijing Inst Technol, Sch Comp Sci, Beijing, Peoples R China

Source:

PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS | 2024

Funding:

National Natural Science Foundation of China

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Teachers are important to imparting knowledge and guiding learners, and the role of large language models (LLMs) as potential educators is emerging as an important area of study. Recognizing LLMs' capability to generate educational content can lead to advances in automated and personalized learning. While LLMs have been tested for their comprehension and problem-solving skills, their capability in teaching remains largely unexplored. In teaching, questioning is a key skill that guides students to analyze, evaluate, and synthesize core concepts and principles. Therefore, our research introduces a benchmark to evaluate the questioning capability in education as a teacher of LLMs through evaluating their generated educational questions, utilizing Anderson and Krathwohl's taxonomy across general, monodisciplinary, and interdisciplinary domains. We shift the focus from LLMs as learners to LLMs as educators, assessing their teaching capability through guiding them to generate questions. We apply four metrics, including relevance, coverage, representativeness, and consistency, to evaluate the educational quality of LLMs' outputs. Our results indicate that GPT-4 demonstrates significant potential in teaching general, humanities, and science courses; Claude2 appears more apt as an interdisciplinary teacher. Furthermore, the automatic scores align with human perspectives.


Pages: 3138-3167

Page count: 30

