Establishing vocabulary tests as a benchmark for evaluating large language models

Cited by: 0
Authors
Martinez, Gonzalo [1 ]
Conde, Javier [2 ]
Merino-Gomez, Elena [3 ]
Bermudez-Margaretto, Beatriz [4 ]
Hernandez, Jose Alberto [1 ]
Reviriego, Pedro [2 ]
Brysbaert, Marc [5 ]
Affiliations
[1] Univ Carlos III Madrid, Dept Ingn Telemat, Leganes, Spain
[2] Univ Politecn Madrid, ETSI Telecomunicac, Madrid, Spain
[3] Univ Valladolid, Escuela Ingn Ind, Valladolid, Spain
[4] Univ Salamanca, Dept Psicol Basica Psicobiol & Metodol Las CC Com, Salamanca, Spain
[5] Univ Ghent, Dept Expt Psychol, Ghent, Belgium
Source
PLOS ONE | 2024 / Vol. 19 / Issue 12
Keywords
WORD RECOGNITION; ACQUISITION; LEXTALE;
DOI
10.1371/journal.pone.0308259
Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline Codes
07; 0710; 09
Abstract
Vocabulary tests, once a cornerstone of language modeling evaluation, have been largely overlooked in the current landscape of Large Language Models (LLMs) like Llama 2, Mistral, and GPT. While most LLM evaluation benchmarks focus on specific tasks or domain-specific knowledge, they often neglect the fundamental linguistic aspects of language understanding. In this paper, we advocate for the revival of vocabulary tests as a valuable tool for assessing LLM performance. We evaluate seven LLMs using two vocabulary test formats across two languages and uncover surprising gaps in their lexical knowledge. These findings shed light on the intricacies of LLM word representations, their learning mechanisms, and performance variations across models and languages. Moreover, the ability to automatically generate and perform vocabulary tests offers new opportunities to expand the approach and provide a more complete picture of LLMs' language skills.
Pages: 17
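To make the evaluation format described in the abstract concrete, the sketch below shows how a yes/no vocabulary test in the LexTALE style could be administered to a chat-based LLM. This is not the authors' code: the item list, prompt wording, model name, and the use of the OpenAI Python client are illustrative assumptions, and the plain accuracy computed here simplifies LexTALE's published scoring procedure.

```python
# Minimal sketch of administering a yes/no (LexTALE-style) vocabulary test
# to a chat-based LLM. Items, prompt, and model name are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical test items: (letter string, is_real_word)
ITEMS = [
    ("scornful", True),
    ("stoutly", True),
    ("alberation", False),  # nonword
    ("kilp", False),        # nonword
]

PROMPT = (
    "You will see a string of letters. Answer only YES if it is an existing "
    "English word, or NO if it is not.\n\nString: {item}\nAnswer:"
)

def ask(item: str, model: str = "gpt-4o-mini") -> str:
    """Query the model for a single yes/no lexical decision."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(item=item)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper()

def score(items=ITEMS) -> float:
    """Proportion of items answered correctly (a simplification of LexTALE scoring)."""
    correct = 0
    for string, is_word in items:
        correct += (ask(string).startswith("YES") == is_word)
    return correct / len(items)

if __name__ == "__main__":
    print(f"Accuracy: {score():.2f}")
```

Because the items and prompt template are plain data, the same loop can be repeated across models and languages, which is how a vocabulary benchmark of this kind could be scaled automatically.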