CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models

Cited by: 0
Authors
Yu, Linhao [1 ]
Leng, Yongqi [1 ]
Huang, Yufei [1 ]
Wu, Shang [2 ]
Liu, Haixin [3 ]
Ji, Xinmeng [3 ]
Zhao, Jiahui [1 ]
Song, Jinwang [3 ]
Cui, Tingting [3 ]
Cheng, Xiaoqing [3 ]
Liu, Tao [3 ]
Xiong, Deyi [1 ]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China
[2] Kunming Univ Sci & Technol, Fac Informat Engn & Automat, Kunming, Yunnan, Peoples R China
[3] Zhengzhou Univ, Sch Comp & Artificial Intelligence, Zhengzhou, Henan, Peoples R China
Keywords
AI;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
How would a large language model (LLM) respond in ethically relevant contexts? In this paper, we curate CMoralEval, a large benchmark for the morality evaluation of Chinese LLMs. The data sources of CMoralEval are two-fold: 1) a Chinese TV program discussing Chinese moral norms through stories from society and 2) a collection of Chinese moral anomies drawn from various newspapers and academic papers on morality. With these sources, we aim to create a moral evaluation dataset characterized by diversity and authenticity. We develop a morality taxonomy and a set of fundamental moral principles that are not only rooted in traditional Chinese culture but also consistent with contemporary societal norms. To facilitate efficient construction and annotation of instances in CMoralEval, we establish a platform with AI-assisted instance generation to streamline the annotation process. These resources help us curate CMoralEval, which encompasses both explicit moral scenarios (14,964 instances) and moral dilemma scenarios (15,424 instances), each drawing instances from the different data sources. We conduct extensive experiments with CMoralEval to examine a variety of Chinese LLMs. Experimental results demonstrate that CMoralEval is a challenging benchmark for Chinese LLMs. The dataset is publicly available at https://github.com/tjunlp-lab/CMoralEval.
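As a rough sketch of how a benchmark of this kind is typically scored, the Python snippet below computes a model's accuracy over a set of multiple-choice-style instances. The file name and the "scenario"/"options"/"answer" field names are hypothetical placeholders, not the released schema; consult the repository at https://github.com/tjunlp-lab/CMoralEval for the actual data format.

# Minimal sketch: scoring an LLM on CMoralEval-style multiple-choice
# instances. The file name and field names below are assumptions for
# illustration only; see the repository for the released schema.
import json

def accuracy(instances, predict):
    """Fraction of instances on which the model picks the annotated option."""
    correct = sum(
        int(predict(inst["scenario"], inst["options"]) == inst["answer"])
        for inst in instances
    )
    return correct / len(instances)

if __name__ == "__main__":
    # Hypothetical file of explicit moral scenario instances.
    with open("cmoraleval_explicit.json", encoding="utf-8") as f:
        data = json.load(f)
    # Trivial baseline standing in for a real LLM call: always pick
    # the first option.
    first_option = lambda scenario, options: options[0]
    print(f"First-option baseline accuracy: {accuracy(data, first_option):.3f}")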
Pages: 11817-11837 (21 pages)