A bilingual benchmark for evaluating large language models

Cited by: 0
Author
Alkaoud, Mohamed [1]
Affiliation
[1] King Saud University, College of Computer and Information Sciences, Department of Computer Science, Riyadh, Saudi Arabia
Keywords
Natural language processing; Large language models; Multilingual NLP; LLM evaluation; Arabic NLP; ChatGPT
DOI
10.7717/peerj-cs.1893
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
This work introduces a new benchmark for the bilingual evaluation of large language models (LLMs) in English and Arabic. While LLMs have transformed various fields, their evaluation in Arabic remains limited. This work addresses that gap by proposing a novel evaluation method for LLMs in both Arabic and English, allowing a direct comparison of performance across the two languages. We build a new evaluation dataset based on the General Aptitude Test (GAT), a standardized test widely used for university admissions in the Arab world, and use it to measure the linguistic capabilities of LLMs. We conduct several experiments to examine the linguistic capabilities of ChatGPT and to quantify how much better it is at English than at Arabic. We also examine the effect of changing task descriptions from Arabic to English and vice versa. In addition, we find that fastText can surpass ChatGPT at finding Arabic word analogies. We conclude by showing that GPT-4's Arabic linguistic capabilities are much better than ChatGPT's and are close to ChatGPT's English capabilities.
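The record does not show how the fastText baseline is implemented; the sketch below is an illustrative assumption, not the paper's code. It applies the standard vector-offset method for word analogies to pre-trained fastText Arabic vectors (cc.ar.300.vec from fasttext.cc) loaded through gensim's KeyedVectors API; the file path and example words are invented and are not actual GAT items.

from gensim.models import KeyedVectors

# Load pre-trained fastText Arabic vectors (word2vec text format); limit caps
# memory use by reading only the most frequent words.
ar_vectors = KeyedVectors.load_word2vec_format("cc.ar.300.vec", binary=False, limit=500_000)

def solve_analogy(a, b, c, topn=5):
    # Vector-offset analogy: find d such that a : b :: c : d, i.e. d ≈ b - a + c.
    return ar_vectors.most_similar(positive=[b, c], negative=[a], topn=topn)

# Illustrative item only (king : queen :: man : ?); "امرأة" (woman) should rank highly.
print(solve_analogy("ملك", "ملكة", "رجل"))

ChatGPT, by contrast, would be prompted with the analogy question and its answer choices directly, with the task description written in either Arabic or English.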
Pages: 22