CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models

Cited by: 0
Authors
Yu, Linhao [1 ]
Leng, Yongqi [1 ]
Huang, Yufei [1 ]
Wu, Shang [2 ]
Liu, Haixin [3 ]
Ji, Xinmeng [3 ]
Zhao, Jiahui [1 ]
Song, Jinwang [3 ]
Cui, Tingting [3 ]
Cheng, Xiaoqing [3 ]
Liu, Tao [3 ]
Xiong, Deyi [1 ]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China
[2] Kunming Univ Sci & Technol, Fac Informat Engn & Automat, Kunming, Yunnan, Peoples R China
[3] Zhengzhou Univ, Sch Comp & Artificial Intelligence, Zhengzhou, Henan, Peoples R China
Keywords
AI;
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
How would a large language model (LLM) respond in ethically relevant contexts? In this paper, we curate CMoralEval, a large benchmark for the morality evaluation of Chinese LLMs. The data sources of CMoralEval are two-fold: 1) a Chinese TV program discussing Chinese moral norms through stories from society and 2) a collection of Chinese moral anomies drawn from various newspapers and academic papers on morality. With these sources, we aim to create a moral evaluation dataset characterized by diversity and authenticity. We develop a morality taxonomy and a set of fundamental moral principles that are not only rooted in traditional Chinese culture but also consistent with contemporary societal norms. To facilitate efficient construction and annotation of instances in CMoralEval, we establish a platform with AI-assisted instance generation to streamline the annotation process. These resources help us curate CMoralEval, which encompasses both explicit moral scenarios (14,964 instances) and moral dilemma scenarios (15,424 instances), each drawing instances from the different data sources. We conduct extensive experiments with CMoralEval to examine a variety of Chinese LLMs. Experimental results demonstrate that CMoralEval is a challenging benchmark for Chinese LLMs. The dataset is publicly available at https://github.com/tjunlp-lab/CMoralEval.
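As a rough sketch of how a benchmark of this kind is typically scored, the Python snippet below computes a model's accuracy over a set of multiple-choice-style instances. The file name and the "scenario"/"options"/"answer" field names are hypothetical placeholders, not the released schema; consult the repository at https://github.com/tjunlp-lab/CMoralEval for the actual data format.

# Minimal sketch: scoring an LLM on CMoralEval-style multiple-choice
# instances. The file name and field names below are assumptions for
# illustration only; see the repository for the released schema.
import json

def accuracy(instances, predict):
    """Fraction of instances on which the model picks the annotated option."""
    correct = sum(
        int(predict(inst["scenario"], inst["options"]) == inst["answer"])
        for inst in instances
    )
    return correct / len(instances)

if __name__ == "__main__":
    # Hypothetical file of explicit moral scenario instances.
    with open("cmoraleval_explicit.json", encoding="utf-8") as f:
        data = json.load(f)
    # Trivial baseline standing in for a real LLM call: always pick
    # the first option.
    first_option = lambda scenario, options: options[0]
    print(f"First-option baseline accuracy: {accuracy(data, first_option):.3f}")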
Pages: 11817-11837 (21 pages)