50 items in total
- [1] MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, Vol 38, No 16, 2024: 17709-17717.
- [2] PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023.
- [8] Evaluating Intelligence and Knowledge in Large Language Models [J]. TOPOI-AN INTERNATIONAL REVIEW OF PHILOSOPHY, 2024.
- [9] ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code [J]. arXiv, 2024.
- [10] VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models [J]. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, Vol 162, 2022.